ABC

The archive ABC.tar.gz contains the source code, executables (Linux 64-bit) and scripts used to run the ABC analysis of the study Nabholz et al. 2014, "Transcriptome population genomics reveals severe bottleneck and domestication cost in the African rice (O. glaberrima)".

The program seq_stat_2pop_from_MS.cpp uses the Bio++ library and should therefore be used and modified under the CeCILL free software licence (GPL compatible).

FILES AND USAGE:

The ABC analysis is a pipeline of several programs called by the script run_ABC.bash.

The core of the analysis consists of simulations performed with the ms software: http://home.uchicago.edu/rhudson1/source/mksamples.html

ms is called by the Perl script ms_simul.pl. This script also generates the parameter values drawn from the priors (stored in the par.*.txt files).
In addition, ms_simul.pl converts the ms output into DNA sequences, from which summary statistics are computed by the program
seq_stat_2pop_from_MS.

Simulations are run in sets of 100. The range of sets is controlled by the arguments passed to the run_ABC.bash script: the first argument
gives the number of the first set and the second the number of the last set.

For example, the command:
./run_ABC.bash 100 300
runs 200 sets of 100 simulations (20000 simulations in total), numbered 100 to 300.
This parameterization makes it possible to run several batches of simulations in parallel.
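
As an illustration of the parallel use, the sketch below splits a range of set numbers into consecutive chunks and prints one run_ABC.bash command per chunk. The helper name print_chunks is hypothetical, and the sketch assumes that both arguments of run_ABC.bash are inclusive set numbers (as the EXAMPLE section suggests, where ./run_ABC.bash 1 2 gives 200 simulations); adjust the arithmetic if the script treats the end differently.

```shell
#!/usr/bin/env bash
# Sketch: split a range of set numbers into consecutive chunks, one per job.
# Assumption: run_ABC.bash treats FIRST and LAST as inclusive set numbers.

# print_chunks FIRST LAST JOBS -> one "./run_ABC.bash START END" line per job
print_chunks() {
    local first=$1 last=$2 jobs=$3
    local total=$(( last - first + 1 ))
    local size=$(( (total + jobs - 1) / jobs ))   # ceiling division
    local start=$first end
    while [ "$start" -le "$last" ]; do
        end=$(( start + size - 1 ))
        if [ "$end" -gt "$last" ]; then
            end=$last
        fi
        echo "./run_ABC.bash $start $end"
        start=$(( end + 1 ))
    done
}

# Dry run: show the commands for sets 1 to 300 split across 3 jobs.
print_chunks 1 300 3
```

The function only prints the commands, so they can be checked first and then launched in separate terminals or submitted to a job scheduler.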

All the summary statistics and parameters are stored in the folder sim1.

Once the simulations are performed, one can retrieve the results using the script ./retrieve_statistics.bash. This script automatically stores
the summary statistics in the file Statistics.Sim1.csv and the parameters in the file Paramaters.Sim1.csv.
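
A quick sanity check after retrieval could look like the sketch below, which compares the number of lines in the two files. The helper name same_line_count is hypothetical, and the sketch assumes that both files hold one line per simulation (plus the same number of header lines, if any), so that their line counts should match.

```shell
#!/usr/bin/env bash
# Sketch: check that two retrieved files have the same number of lines.
# Assumption: one line per simulation in each file, so the counts should match.

# same_line_count FILE_A FILE_B -> exit status 0 if the line counts match
same_line_count() {
    local a b
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    [ "$a" -eq "$b" ]
}

# File names as spelled above; the check runs only if both files are present.
stats=Statistics.Sim1.csv
params=Paramaters.Sim1.csv
if [ -f "$stats" ] && [ -f "$params" ]; then
    if same_line_count "$stats" "$params"; then
        echo "OK: line counts match"
    else
        echo "WARNING: line counts differ"
    fi
fi
```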

Finally, the Size.txt file contains the number of sites per window and is used to scale the simulations to the number of sites available.

Please see the manuscript for details regarding the demographic model.

COMPILATION:
Once you have installed Bio++, you can compile the program using the command:

g++ -g ./seq_stat_2pop_from_MS.cpp -o seq_stat_2pop_from_MS \
 -lbpp-popgen -lbpp-phyl -lbpp-seq -lbpp-core
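
If Bio++ is installed outside the default compiler search paths, the include and library directories probably need to be passed explicitly. The sketch below builds such a command for a given installation prefix; the helper name bpp_compile_cmd and the prefix are hypothetical, and the sketch assumes the headers are in PREFIX/include and the libraries in PREFIX/lib.

```shell
#!/usr/bin/env bash
# Sketch: build the compile command for a Bio++ installation under a custom prefix.
# Assumption: headers are in PREFIX/include and libraries in PREFIX/lib.

# bpp_compile_cmd PREFIX -> prints the g++ command line
bpp_compile_cmd() {
    local prefix=$1
    echo "g++ -g ./seq_stat_2pop_from_MS.cpp -o seq_stat_2pop_from_MS" \
         "-I$prefix/include -L$prefix/lib" \
         "-lbpp-popgen -lbpp-phyl -lbpp-seq -lbpp-core"
}

# Hypothetical prefix; the command is printed so it can be checked before use.
bpp_compile_cmd "$HOME/local"
```

With shared libraries in a custom location, the dynamic linker may also need to be told where to find them at run time, for example through the LD_LIBRARY_PATH environment variable.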
 
EXAMPLE:
To run 200 simulations (2 sets of 100) and retrieve the results, type the commands:

./run_ABC.bash 1 2
./retrieve_statistics.bash

The file Statistics.Sim1.csv stores the following summary statistics:

ID : ID number of the simulation
alpha : Bottleneck parameter (alpha = N_bott/N_0); same as alpha in the file Paramaters.Sim1.csv
f : Not relevant here (should be discarded)
S_Cu : Number of SNPs in the cultivated population
S_Sa : Number of SNPs in the wild population
var_S_Cu : Variance in S_Cu
var_S_Sa : Variance in S_Sa
W_Cu : Watterson's theta in the cultivated population
W_Sa : Watterson's theta in the wild population
P_Cu : Mean nucleotide diversity in the cultivated population
P_Sa : Mean nucleotide diversity in the wild population
D_Cu : Tajima's D in the cultivated population
D_Sa : Tajima's D in the wild population
var_D_Cu : Variance in Tajima's D in the cultivated population
var_D_Sa : Variance in Tajima's D in the wild population
Fst : Fst of Hudson et al. 1992
H_Cu : ThetaH (eq. 3 in Fay and Wu 2000) computed for the cultivated population
H_Sa : ThetaH (eq. 3 in Fay and Wu 2000) computed for the wild population
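
To make one of these definitions concrete, the sketch below computes Watterson's estimator from a number of segregating sites S and a sample size n, as theta_W = S / a_n with a_n = 1 + 1/2 + ... + 1/(n-1). It is only an illustration of the textbook formula, not the code of seq_stat_2pop_from_MS; the helper name watterson_theta and the input values are hypothetical, and the value returned is per locus (divide by the number of sites for a per-site value).

```shell
#!/usr/bin/env bash
# Sketch of Watterson's estimator: theta_W = S / a_n,
# where a_n = 1 + 1/2 + ... + 1/(n-1) and n is the number of sampled sequences.
# Assumption: per-locus value; how the pipeline scales it is not shown here.

# watterson_theta S N -> prints theta_W with 4 decimals (NA if N < 2)
watterson_theta() {
    awk -v s="$1" -v n="$2" 'BEGIN {
        if (n < 2) { print "NA"; exit }
        a = 0
        for (i = 1; i < n; i++) a += 1 / i   # harmonic sum a_n
        printf "%.4f\n", s / a
    }'
}

# Hypothetical input: 11 segregating sites in a sample of 4 sequences.
watterson_theta 11 4   # prints 6.0000
```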

The file Parameters.Sim1.csv stores the following parameters:

alpha : Bottleneck parameter (alpha = N_bott/N_0)
t_split : Time of the split between the wild and the cultivated populations (in 2Ne generations; here, always 3000 generations)
t_bott : Length of the bottleneck (in 2Ne generations; here, always 1000 generations)
mu : Mutation rate (fixed to 10^-8)
N : Ancestral population size (parameter N_0 in the manuscript)