fastsimcoal2 through strataG
The program fastsimcoal2 is an implementation of a fast and powerful coalescent simulator. The program comes as a command line executable (current version is fsc26) which operates by reading a formatted text file. Output is then written to a set of formatted text files in a folder for the run.
The R package strataG contains a set of functions designed to ease interaction with fastsimcoal2, sampling of scenario parameters, and downstream analyses in R. The fastsimcoal2 manual is an excellent resource for details on input file formatting, command line options, and simulator usage. This vignette is not designed to replace that manual, but rather to be a guide for how to run the simulator through strataG to both produce data and set up models for demographic parameter estimation.
1. The latest version of the R package strataG can be installed from GitHub:
if (!require(devtools)) install.packages("devtools")
devtools::install_github("ericarcher/strataG", dependencies = TRUE)
2. You will also need to download and install the latest version (2.6.0.2 as of this writing) of fastsimcoal2 from: http://cmpg.unibe.ch/software/fastsimcoal2/
Once the file has been downloaded and uncompressed, fastsimcoal2 will need to be placed somewhere in your system’s PATH so that strataG can access it from the command line regardless of the current working directory. There are different methods for editing the PATH depending on your operating system. Here are a couple of links describing the process:
Windows PCs: https://www.howtogeek.com/118594/how-to-edit-your-system-path-for-easy-command-line-access/
Mac/UNIX: https://www.architectryan.com/2012/10/02/add-to-the-path-on-mac-os-x-mountain-lion/
3. Test that it has been correctly installed in the PATH by opening a terminal window (in RStudio, you can use the menu option Tools|Terminal|New Terminal). At the prompt, type fsc26 and press Enter. If it has been correctly installed, you should see a list of command line options for fastsimcoal2.
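The same check can be sketched from within R. This is only an illustration using the base R function Sys.which(); it assumes the executable is named fsc26 as above, and the variable names are made up for the example.

```r
# Look up the executable on the system PATH.
# Sys.which() returns an empty string when the program cannot be found.
fsc.path <- Sys.which("fsc26")
fsc.found <- unname(nzchar(fsc.path))

if (fsc.found) {
  message("fsc26 found at: ", fsc.path)
} else {
  message("fsc26 not found; check that its folder is on the PATH")
}
```

If the message reports that the program was not found, revisit the PATH instructions linked above before continuing.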
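There are four main steps to carry out for every fastsimcoal2 run:
1. Specify the simulation parameters with the fscSettingsXXX() functions.
2. Write the input file with fscWrite().
3. Run the simulation with fscRun().
4. Read the output with fscRead().
First we clear the workspace and load the strataG package.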
rm(list = ls())
library(strataG)
Registered S3 method overwritten by 'spdep':
method from
plot.mst ape
For our initial example, we will walk through the steps to set up and run a simple simulation of a population with 1000 individuals from which we draw 4 samples represented by 5 microsatellite loci with a mutation rate of \(10^{-3}\).
The first step is to specify the simulation parameter settings. There are four categories of parameters that can be specified: deme information, migration rates, historical events, and genetic information. To create the input for each category, there is a fscSettingsX() function, where “X” is either “Demes”, “Migration”, “Events”, or “Genetics”. The minimum parameters necessary to start a simulation are descriptions of the deme and the genetic data.
Here we use the fscDeme() function to specify the deme size and sample size. We will leave the other possible arguments, sample.time, inbreeding, and growth, at their defaults. Specifically, we define a deme with 1000 randomly mating individuals, of which we want to sample 4 individuals.
deme0 <- fscDeme(deme.size = 1000, sample.size = 4)
The deme input is then specified by supplying each defined deme to the fscSettingsDemes() function:
demes <- fscSettingsDemes(deme0)
Note that because fastsimcoal2 generates haploid genes, we have to specify the ploidy of the desired output via the ploidy argument to fscSettingsDemes(). This value is then used when writing the input files as a multiplier for deme and sample size to ensure that we are simulating enough data for the requested number of individuals. By default, ploidy = 2, so we only need to change it if we specifically want to simulate a haploid or polyploid population.
Along with the deme size and sample size, we also have the option of specifying a time in the past when the samples are to be taken (sample.time), which defaults to 0, meaning the present time, an inbreeding coefficient (inbreeding), and a growth rate for the deme (growth). The latter two arguments default to 0 as well, so we don’t have to explicitly define them unless necessary.
Next, we specify the microsatellites that we want to simulate. There are four marker types that we could use: DNA sequences, SNPs, microsatellites, and “standard” data, which is a simple infinite alleles marker. Each is specified with its own fscBlock_xxx() function, each of which has arguments specific to that marker type. Here is the microsatellite specification:
msats <- fscBlock_microsat(num.loci = 1, mut.rate = 1e-3)
We have specified this as a single locus (num.loci = 1) because each function call creates a “block” of linked loci. The recombination rate among loci within a block can be specified with the recomb.rate argument. However, we want 5 unlinked loci, so we will indicate this by putting them on 5 separate “chromosomes”. We create the input for genetic data with a call to fscSettingsGenetics():
genetics <- fscSettingsGenetics(msats, num.chrom = 5)
Now that we have our parameters specified, we will write the input file using the fscWrite() function. This function both writes an input file and returns a list of parameters.
ex1.params <- fscWrite(demes = demes, genetics = genetics, label = "ex1")
This writes a parameter input file in the current working directory, called ex1.par. That file looks like this:
//Number of population samples (demes)
1
//Population effective sizes (number of genes)
2000
//Sample sizes, times, inbreeding
8 0 0
//Growth rates: negative growth implies population expansion
0
//Number of migration matrices: 0 implies no migration between demes
0
//Historical events: time, source, sink, migrants, new size, growth rate, migr. matrix
0
//Number of independent loci [chromosomes]
5 0
//Per chromosome: Number of linkage blocks
1
//Per block: data type, num loci, rec. rate and mut rate + optional parameters
MICROSAT 1 0 0.001 0 0
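The values 2000 and 8 in the file above follow from the ploidy multiplier described earlier. A minimal sketch of that arithmetic in base R (the variable names are illustrative and are not part of strataG):

```r
# Settings used in this example
ploidy <- 2
deme.size <- 1000    # diploid individuals in the deme
sample.size <- 4     # diploid individuals sampled

# fastsimcoal2 works in haploid genes, so both sizes are scaled by ploidy
genes.in.deme <- deme.size * ploidy
genes.sampled <- sample.size * ploidy

c(genes.in.deme = genes.in.deme, genes.sampled = genes.sampled)
```

These correspond to the "Population effective sizes" and "Sample sizes" entries of ex1.par.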
Here are the contents of the parameter list that is returned:
print(str(ex1.params))
List of 4
$ label : chr "ex1"
$ folder : chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz"
$ files :List of 1
..$ input: chr "ex1.par"
$ settings:List of 6
..$ demes :Classes 'fscSettingsDemes', 'fscDeme' and 'data.frame': 1 obs. of 6 variables:
.. ..$ deme.name : chr "Deme.1"
.. ..$ deme.size : num 1000
.. ..$ sample.size: num 4
.. ..$ sample.time: num 0
.. ..$ inbreeding : num 0
.. ..$ growth : num 0
.. ..- attr(*, "ploidy")= num 2
..$ migration: NULL
..$ events : NULL
..$ genetics :Classes 'fscSettingsGenetics', 'fscBlock' and 'data.frame': 1 obs. of 8 variables:
.. ..$ chromosome : int 1
.. ..$ actual.type: chr "MICROSAT"
.. ..$ fsc.type : chr "MICROSAT"
.. ..$ num.markers: int 1
.. ..$ recomb.rate: num 0
.. ..$ mut.rate : num 0.001
.. ..$ param.5 : num 0
.. ..$ param.6 : num 0
.. ..- attr(*, "num.chrom")= num 5
.. ..- attr(*, "chrom.diff")= logi FALSE
..$ est : NULL
..$ def : NULL
NULL
It has four elements:
label: the run label that we specified at the beginning.
folder: the folder in which the files are written.
files: a list of files written and output.
settings: a list of the settings used to write the input file.
This parameter object is used to run the simulation, will be updated with more information about the run, and will be used to extract and parse the data.
We are now ready to run the simulation, which is accomplished with the fscRun() function. For example purposes, we will only run a single replicate. Note that the function returns an updated params object that we will need later.
ex1.params <- fscRun(ex1.params, num.sims = 1)
2020-02-04 06:33:05 running fastsimcoal2...
2020-02-04 06:33:06 run complete
This created a folder for the run and made three files in it:
dir(ex1.params$label)
character(0)
The file ending in “.arp” is an Arlequin formatted input file that contains the simulated genetic data. There will be one of these files for every replicate run. The ex1.params object has also been updated with two lists: one contains the run parameters ($run.params), and the other contains information used to map the loci in the output “.arp” file to the specified loci ($locus.info):
str(ex1.params[c("run.params", "locus.info")])
List of 2
$ run.params:List of 17
..$ num.sims : num 1
..$ dna.to.snp : logi FALSE
..$ max.snps : num 0
..$ sfs.type : chr "maf"
..$ nonpar.boot : NULL
..$ all.sites : logi TRUE
..$ inf.sites : logi FALSE
..$ no.arl.output: logi FALSE
..$ num.loops : num 20
..$ min.num.loops: num 20
..$ brentol : num 0.01
..$ trees : logi FALSE
..$ num.cores : NULL
..$ seed : NULL
..$ quiet : logi TRUE
..$ exec : chr "fsc26"
..$ args : chr "--ifile ex1.par --numsims 1 --allsites --quiet"
$ locus.info:'data.frame': 5 obs. of 16 variables:
..$ name : chr [1:5] "C1B1_MICROSAT" "C2B1_MICROSAT" "C3B1_MICROSAT" "C4B1_MICROSAT" ...
..$ chromosome : int [1:5] 1 2 3 4 5
..$ block : int [1:5] 1 1 1 1 1
..$ actual.type : chr [1:5] "MICROSAT" "MICROSAT" "MICROSAT" "MICROSAT" ...
..$ fsc.type : chr [1:5] "MICROSAT" "MICROSAT" "MICROSAT" "MICROSAT" ...
..$ num.markers : int [1:5] 1 1 1 1 1
..$ recomb.rate : num [1:5] 0 0 0 0 0
..$ mut.rate : num [1:5] 0.001 0.001 0.001 0.001 0.001
..$ param.5 : num [1:5] 0 0 0 0 0
..$ param.6 : num [1:5] 0 0 0 0 0
..$ chrom.pos.start: num [1:5] 1 1 1 1 1
..$ chrom.pos.end : int [1:5] 1 1 1 1 1
..$ mat.col.start : num [1:5] 3 4 5 6 7
..$ mat.col.end : num [1:5] 3 4 5 6 7
..$ dna.start : logi [1:5] NA NA NA NA NA
..$ dna.end : logi [1:5] NA NA NA NA NA
Depending on which parameters were used to run the simulation, and the kind of genetic data being simulated, different files will be written to the output folder. There are separate functions for reading these different kinds of output. For Arlequin files, we use the fscReadArp() function.
arp.file <- fscReadArp(ex1.params)
2020-02-04 06:33:06 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex1/ex1_1_1.arp
2020-02-04 06:33:06 parsing genetic data...
str(arp.file)
'data.frame': 4 obs. of 12 variables:
$ id : chr "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
$ deme : chr "Deme.1" "Deme.1" "Deme.1" "Deme.1"
$ C1B1_MICROSAT.1: chr "498" "497" "501" "498"
$ C1B1_MICROSAT.2: chr "498" "499" "498" "500"
$ C2B1_MICROSAT.1: chr "498" "500" "502" "502"
$ C2B1_MICROSAT.2: chr "501" "502" "502" "501"
$ C3B1_MICROSAT.1: chr "501" "501" "500" "501"
$ C3B1_MICROSAT.2: chr "498" "499" "500" "500"
$ C4B1_MICROSAT.1: chr "502" "502" "502" "502"
$ C4B1_MICROSAT.2: chr "502" "500" "501" "502"
$ C5B1_MICROSAT.1: chr "500" "499" "498" "499"
$ C5B1_MICROSAT.2: chr "500" "499" "498" "500"
- attr(*, "file")= chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex1/ex1_1_1.arp"
# The first 6 columns
arp.file[, 1:6]
id deme C1B1_MICROSAT.1 C1B1_MICROSAT.2 C2B1_MICROSAT.1
1 1_1/1_2 Deme.1 498 498 498
2 1_3/1_4 Deme.1 497 499 500
3 1_5/1_6 Deme.1 501 498 502
4 1_7/1_8 Deme.1 498 500 502
C2B1_MICROSAT.2
1 501
2 502
3 502
4 501
This function produces a data frame with one row for each individual genotype. The first two columns list the individual ids and their deme membership, and every column thereafter contains the genetic data. You’ll notice that the id column is composed of two different ids. This is because the actual output of the .arp file contains one row per haploid individual. We make diploid individuals by combining every two haplotypes.
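To illustrate how the paired ids arise, here is a small base R sketch that combines consecutive haploid ids into diploid ids. It only mimics the labeling seen in the output above and is not the code strataG uses internally; the haploid id format is taken from that output.

```r
# Eight haploid gene copies sampled from deme 1
hap.ids <- paste0("1_", 1:8)

# Pair consecutive haplotypes: (1, 2), (3, 4), ...
first <- hap.ids[seq(1, length(hap.ids), by = 2)]
second <- hap.ids[seq(2, length(hap.ids), by = 2)]
dip.ids <- paste(first, second, sep = "/")

dip.ids
```

The result has four diploid ids, matching the id column of arp.file.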
Let’s simulate a more complex combination of genetic markers to see how to select specific markers from each .arp file.
# create 3 independent chromosomes with the same structure of four markers
complex.chroms <- fscSettingsGenetics(
fscBlock_microsat(2, 1e-4),
fscBlock_dna(4, 1e-5),
fscBlock_dna(6, 1e-3),
fscBlock_microsat(2, 1e-5),
num.chrom = 3
)
complex.params <- fscWrite(
demes = demes,
genetics = complex.chroms,
label = "complex_chroms"
)
complex.params <- fscRun(complex.params, num.sims = 1)
2020-02-04 06:33:06 running fastsimcoal2...
2020-02-04 06:33:07 run complete
arp <- fscReadArp(complex.params)
2020-02-04 06:33:07 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp
2020-02-04 06:33:07 parsing genetic data...
str(arp)
'data.frame': 4 obs. of 38 variables:
$ id : chr "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
$ deme : chr "Deme.1" "Deme.1" "Deme.1" "Deme.1"
$ C1B1_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C1B1_MICROSAT_L1.2: chr "501" "500" "501" "500"
$ C1B1_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C1B1_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C1B2_DNA.1 : chr "TAAG" "CACT" "CACT" "CACT"
$ C1B2_DNA.2 : chr "TACG" "TAAG" "TACG" "TAAG"
$ C1B3_DNA.1 : chr "GAATAA" "GAAACA" "GAGACG" "AATAGG"
$ C1B3_DNA.2 : chr "CTAGGG" "TTTAAA" "TCAAAC" "TCTTGC"
$ C1B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C1B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C1B4_MICROSAT_L2.1: chr "500" "499" "499" "499"
$ C1B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C2B1_MICROSAT_L1.1: chr "499" "500" "500" "500"
$ C2B1_MICROSAT_L1.2: chr "499" "500" "499" "500"
$ C2B1_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C2B1_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C2B2_DNA.1 : chr "GAAA" "TAAA" "TAAA" "TAAA"
$ C2B2_DNA.2 : chr "GAAA" "TAAA" "GAAA" "TAAA"
$ C2B3_DNA.1 : chr "CAATAC" "GAATTG" "ATATTC" "ATCATC"
$ C2B3_DNA.2 : chr "CTATAC" "ATCATC" "CAATAC" "ATCATC"
$ C2B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C3B1_MICROSAT_L1.1: chr "499" "499" "499" "500"
$ C3B1_MICROSAT_L1.2: chr "499" "499" "500" "499"
$ C3B1_MICROSAT_L2.1: chr "499" "500" "500" "500"
$ C3B1_MICROSAT_L2.2: chr "500" "499" "500" "500"
$ C3B2_DNA.1 : chr "AAAA" "AAAA" "AAAA" "AAAA"
$ C3B2_DNA.2 : chr "ACAA" "AAAA" "AAAA" "AAAA"
$ C3B3_DNA.1 : chr "CCATGG" "CGGATC" "CGGATA" "ACACCC"
$ C3B3_DNA.2 : chr "GCATAA" "TCGTGG" "ACGGCA" "GTTACT"
$ C3B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
- attr(*, "file")= chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp"
Note that we get all marker types and loci simulated. By default, each locus is separated into its own column. Column names are assigned denoting the chromosome number (e.g., C2), the block number (e.g., B1), and what kind of marker it is (e.g., MICROSAT). If there are several loci for that marker, then the locus is numbered as well (e.g., L1).
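Because the column names follow a regular pattern, they can be split back into their components when needed. The following is a rough base R sketch; the regular expression is an assumption based only on the names shown in this vignette (including the trailing allele number), and the object names are illustrative.

```r
col.names <- c("C1B1_MICROSAT_L1.1", "C1B2_DNA.2", "C3B4_MICROSAT_L2.1")

# Chromosome, block, marker type, optional locus number, allele number
pattern <- "^C([0-9]+)B([0-9]+)_([A-Z]+)(_L([0-9]+))?\\.([0-9]+)$"
parts <- regmatches(col.names, regexec(pattern, col.names))

parsed <- do.call(rbind, lapply(parts, function(x) {
  data.frame(
    chromosome = as.integer(x[2]),
    block = as.integer(x[3]),
    marker = x[4],
    # the locus number is absent when a block holds a single locus
    locus = if (nzchar(x[6])) as.integer(x[6]) else NA_integer_,
    allele = as.integer(x[7]),
    stringsAsFactors = FALSE
  )
}))

parsed
```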
If multiple markers have been simulated, specific markers can be selected by using the marker argument. Below we read only the microsatellite loci that were generated.
str(fscReadArp(complex.params, marker = "microsat"))
2020-02-04 06:33:07 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp
2020-02-04 06:33:07 parsing genetic data...
'data.frame': 4 obs. of 26 variables:
$ id : chr "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
$ deme : chr "Deme.1" "Deme.1" "Deme.1" "Deme.1"
$ C1B1_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C1B1_MICROSAT_L1.2: chr "501" "500" "501" "500"
$ C1B1_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C1B1_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C1B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C1B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C1B4_MICROSAT_L2.1: chr "500" "499" "499" "499"
$ C1B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C2B1_MICROSAT_L1.1: chr "499" "500" "500" "500"
$ C2B1_MICROSAT_L1.2: chr "499" "500" "499" "500"
$ C2B1_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C2B1_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C2B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
$ C3B1_MICROSAT_L1.1: chr "499" "499" "499" "500"
$ C3B1_MICROSAT_L1.2: chr "499" "499" "500" "499"
$ C3B1_MICROSAT_L2.1: chr "499" "500" "500" "500"
$ C3B1_MICROSAT_L2.2: chr "500" "499" "500" "500"
$ C3B4_MICROSAT_L1.1: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L1.2: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L2.1: chr "500" "500" "500" "500"
$ C3B4_MICROSAT_L2.2: chr "500" "500" "500" "500"
- attr(*, "file")= chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp"
Genotypes can also be formatted to combine alleles into one column per locus by setting one.col = TRUE:
fscReadArp(complex.params, marker = "microsat", one.col = TRUE)[, 1:6]
2020-02-04 06:33:07 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp
2020-02-04 06:33:07 parsing genetic data...
id deme C1B1_MICROSAT_L1 C1B1_MICROSAT_L2 C1B4_MICROSAT_L1
1 1_1/1_2 Deme.1 500/501 500/500 500/500
2 1_3/1_4 Deme.1 500/500 500/500 500/500
3 1_5/1_6 Deme.1 500/501 500/500 500/500
4 1_7/1_8 Deme.1 500/500 500/500 500/500
C1B4_MICROSAT_L2
1 500/500
2 499/500
3 499/500
4 499/500
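The relationship between the two layouts can be sketched in base R by pasting the two allele columns of a locus together. This is only an illustration of the format on a toy data frame, not how fscReadArp() is implemented.

```r
# Two-column layout for one locus and two individuals
two.col <- data.frame(
  id = c("1_1/1_2", "1_3/1_4"),
  C1B1_MICROSAT_L1.1 = c("500", "500"),
  C1B1_MICROSAT_L1.2 = c("501", "500"),
  stringsAsFactors = FALSE
)

# One-column layout: alleles joined with "/"
one.col <- data.frame(
  id = two.col$id,
  C1B1_MICROSAT_L1 = paste(
    two.col$C1B1_MICROSAT_L1.1, two.col$C1B1_MICROSAT_L1.2, sep = "/"
  ),
  stringsAsFactors = FALSE
)

one.col
```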
Specific chromosomes can be selected by providing their numbers to the chrom argument:
arp <- fscReadArp(complex.params, chrom = c(1, 3), one.col = TRUE)
2020-02-04 06:33:07 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp
2020-02-04 06:33:07 parsing genetic data...
str(arp)
'data.frame': 4 obs. of 14 variables:
$ id : chr "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
$ deme : chr "Deme.1" "Deme.1" "Deme.1" "Deme.1"
$ C1B1_MICROSAT_L1: chr "500/501" "500/500" "500/501" "500/500"
$ C1B1_MICROSAT_L2: chr "500/500" "500/500" "500/500" "500/500"
$ C1B2_DNA : chr "TAAG/TACG" "CACT/TAAG" "CACT/TACG" "CACT/TAAG"
$ C1B3_DNA : chr "GAATAA/CTAGGG" "GAAACA/TTTAAA" "GAGACG/TCAAAC" "AATAGG/TCTTGC"
$ C1B4_MICROSAT_L1: chr "500/500" "500/500" "500/500" "500/500"
$ C1B4_MICROSAT_L2: chr "500/500" "499/500" "499/500" "499/500"
$ C3B1_MICROSAT_L1: chr "499/499" "499/499" "499/500" "500/499"
$ C3B1_MICROSAT_L2: chr "499/500" "500/499" "500/500" "500/500"
$ C3B2_DNA : chr "AAAA/ACAA" "AAAA/AAAA" "AAAA/AAAA" "AAAA/AAAA"
$ C3B3_DNA : chr "CCATGG/GCATAA" "CGGATC/TCGTGG" "CGGATA/ACGGCA" "ACACCC/GTTACT"
$ C3B4_MICROSAT_L1: chr "500/500" "500/500" "500/500" "500/500"
$ C3B4_MICROSAT_L2: chr "500/500" "500/500" "500/500" "500/500"
- attr(*, "file")= chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp"
Each chromosome can also be separated into its own list element by specifying sep.chrom = TRUE:
arp <- fscReadArp(complex.params, sep.chrom = TRUE)
2020-02-04 06:33:07 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp
2020-02-04 06:33:07 parsing genetic data...
str(arp)
List of 4
$ C1 :'data.frame': 4 obs. of 14 variables:
..$ id : chr [1:4] "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
..$ deme : chr [1:4] "1" "1" "1" "1"
..$ C1B1_MICROSAT_L1.1: chr [1:4] "500" "500" "500" "500"
..$ C1B1_MICROSAT_L1.2: chr [1:4] "501" "500" "501" "500"
..$ C1B1_MICROSAT_L2.1: chr [1:4] "500" "500" "500" "500"
..$ C1B1_MICROSAT_L2.2: chr [1:4] "500" "500" "500" "500"
..$ C1B2_DNA.1 : chr [1:4] "TAAG" "CACT" "CACT" "CACT"
..$ C1B2_DNA.2 : chr [1:4] "TACG" "TAAG" "TACG" "TAAG"
..$ C1B3_DNA.1 : chr [1:4] "GAATAA" "GAAACA" "GAGACG" "AATAGG"
..$ C1B3_DNA.2 : chr [1:4] "CTAGGG" "TTTAAA" "TCAAAC" "TCTTGC"
..$ C1B4_MICROSAT_L1.1: chr [1:4] "500" "500" "500" "500"
..$ C1B4_MICROSAT_L1.2: chr [1:4] "500" "500" "500" "500"
..$ C1B4_MICROSAT_L2.1: chr [1:4] "500" "499" "499" "499"
..$ C1B4_MICROSAT_L2.2: chr [1:4] "500" "500" "500" "500"
$ C2 :'data.frame': 4 obs. of 14 variables:
..$ id : chr [1:4] "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
..$ deme : chr [1:4] "1" "1" "1" "1"
..$ C2B1_MICROSAT_L1.1: chr [1:4] "499" "500" "500" "500"
..$ C2B1_MICROSAT_L1.2: chr [1:4] "499" "500" "499" "500"
..$ C2B1_MICROSAT_L2.1: chr [1:4] "500" "500" "500" "500"
..$ C2B1_MICROSAT_L2.2: chr [1:4] "500" "500" "500" "500"
..$ C2B2_DNA.1 : chr [1:4] "GAAA" "TAAA" "TAAA" "TAAA"
..$ C2B2_DNA.2 : chr [1:4] "GAAA" "TAAA" "GAAA" "TAAA"
..$ C2B3_DNA.1 : chr [1:4] "CAATAC" "GAATTG" "ATATTC" "ATCATC"
..$ C2B3_DNA.2 : chr [1:4] "CTATAC" "ATCATC" "CAATAC" "ATCATC"
..$ C2B4_MICROSAT_L1.1: chr [1:4] "500" "500" "500" "500"
..$ C2B4_MICROSAT_L1.2: chr [1:4] "500" "500" "500" "500"
..$ C2B4_MICROSAT_L2.1: chr [1:4] "500" "500" "500" "500"
..$ C2B4_MICROSAT_L2.2: chr [1:4] "500" "500" "500" "500"
$ C3 :'data.frame': 4 obs. of 14 variables:
..$ id : chr [1:4] "1_1/1_2" "1_3/1_4" "1_5/1_6" "1_7/1_8"
..$ deme : chr [1:4] "1" "1" "1" "1"
..$ C3B1_MICROSAT_L1.1: chr [1:4] "499" "499" "499" "500"
..$ C3B1_MICROSAT_L1.2: chr [1:4] "499" "499" "500" "499"
..$ C3B1_MICROSAT_L2.1: chr [1:4] "499" "500" "500" "500"
..$ C3B1_MICROSAT_L2.2: chr [1:4] "500" "499" "500" "500"
..$ C3B2_DNA.1 : chr [1:4] "AAAA" "AAAA" "AAAA" "AAAA"
..$ C3B2_DNA.2 : chr [1:4] "ACAA" "AAAA" "AAAA" "AAAA"
..$ C3B3_DNA.1 : chr [1:4] "CCATGG" "CGGATC" "CGGATA" "ACACCC"
..$ C3B3_DNA.2 : chr [1:4] "GCATAA" "TCGTGG" "ACGGCA" "GTTACT"
..$ C3B4_MICROSAT_L1.1: chr [1:4] "500" "500" "500" "500"
..$ C3B4_MICROSAT_L1.2: chr [1:4] "500" "500" "500" "500"
..$ C3B4_MICROSAT_L2.1: chr [1:4] "500" "500" "500" "500"
..$ C3B4_MICROSAT_L2.2: chr [1:4] "500" "500" "500" "500"
$ deme: chr(0)
- attr(*, "file")= chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/complex_chroms/complex_chroms_1_1.arp"
Although fastsimcoal2 has a “SNP” marker type, the manual explicitly suggests that it not be used to actually simulate SNPs due to biases in the site frequency spectrum that can arise from the way they are generated. In the strataG wrapper, SNPs are simulated based on the current recommended procedure by Excoffier as short DNA sequences that are only allowed to mutate via transitions (transition.rate = 1).
The general strategy is to simulate a short sequence of base pairs over a large number of chromosomes such that they are all considered to be independent, unlinked loci. The mutation rate can then be adjusted to produce the desired number of SNPs. When running the simulation, we specify the all.sites = F argument, which returns only polymorphic sites and saves the time otherwise spent parsing the data and filtering out monomorphic sites.
Here we set up a simple model to demonstrate:
rm(list = ls())
library(strataG)
demes <- fscSettingsDemes(fscDeme(deme.size = 1000, sample.size = 10))
genetics <- fscSettingsGenetics(fscBlock_snp(10, 1e-6), num.chrom = 1000)
p <- fscWrite(demes = demes, genetics = genetics, label = "ex2.snps.1k")
p <- fscRun(p, all.sites = F)
2020-02-04 06:33:07 running fastsimcoal2...
2020-02-04 06:33:09 run complete
snp.df <- fscReadArp(p)
2020-02-04 06:33:09 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex2.snps.1k/ex2.snps.1k_1_1.arp
2020-02-04 06:33:09 parsing genetic data...
# an example of the data generated
snp.df[1:6, 1:6]
id deme C0005B1_SNP.1 C0005B1_SNP.2 C0007B1_SNP.1 C0007B1_SNP.2
1 1_1/1_2 Deme.1 G G C C
2 1_3/1_4 Deme.1 A G C C
3 1_5/1_6 Deme.1 G G C C
4 1_7/1_8 Deme.1 A G T C
5 1_9/1_10 Deme.1 G G C T
6 1_11/1_12 Deme.1 G G C C
This particular run generated 130 SNPs (260 allele columns, two per diploid SNP) out of a possible 10,000 simulated sites.
Some of these SNPs are linked on the same 10 base pair chromosome. Let’s see what percent of chromosomes (loci) contain 1, 2, 3, ... SNPs.
snpOccurFreq <- function(mat) {
# Extract the SNP names from the matrix column names
snp.name <- colnames(mat[, -(1:2)])
# Extract the chromosome name (starts with "C" and is followed by numbers)
# from the SNP names
chrom.names <- regmatches(snp.name, regexpr("^C[[:digit:]]+", snp.name))
# Count number of occurrences of each chromosome
chrom.freq <- table(chrom.names)
# Get frequencies of number of occurrences (how many 1s, 2s, 3s...)
table(chrom.freq, dnn = NULL)
}
# The occurrence frequencies
snp.occ.freq <- snpOccurFreq(snp.df)
snp.occ.freq
2 4
112 9
# Convert to proportions
snp.occ.prop <- prop.table(snp.occ.freq)
round(snp.occ.prop, 3)
2 4
0.926 0.074
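This shows that most loci with a SNP (about 93%) contain only one, but there are some loci with more than one SNP. The table is labeled 2 and 4 rather than 1 and 2 because each diploid SNP occupies two allele columns. If we want a data set of only unlinked loci, we will have to randomly select one SNP from each locus.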
sampleOnePerLocus <- function(mat) {
# Extract the SNP names from the matrix column names
snp.name <- colnames(mat[, -(1:2)])
# Extract the chromosome name (starts with "C" and is followed by numbers)
# from the SNP names
chrom.names <- regmatches(snp.name, regexpr("^C[[:digit:]]+", snp.name))
# Choose one SNP per chromosome
one.per.loc <- tapply(colnames(mat[, -(1:2)]), chrom.names, sample, size = 1)
# Return data frame of ids, demes, and the one sampled SNP column per locus
mat[, c("id", "deme", one.per.loc)]
}
unlinked.snps <- sampleOnePerLocus(snp.df)
# number of unlinked SNPs
ncol(unlinked.snps) - 2
[1] 121
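Because each diploid SNP is stored in two allele columns, column counts need to be halved to obtain SNP counts. The sketch below shows this on a toy set of column names in base R. The names are hypothetical: in particular, the _L1 and _L2 suffixes for two SNPs on the same chromosome are assumed by analogy with the microsatellite column names shown earlier.

```r
# Hypothetical column names: one SNP on chromosome 5, two on chromosome 7
col.names <- c(
  "id", "deme",
  "C0005B1_SNP.1", "C0005B1_SNP.2",
  "C0007B1_SNP_L1.1", "C0007B1_SNP_L1.2",
  "C0007B1_SNP_L2.1", "C0007B1_SNP_L2.2"
)

allele.cols <- col.names[-(1:2)]
chrom <- regmatches(allele.cols, regexpr("^C[[:digit:]]+", allele.cols))

# Allele columns per chromosome, then SNPs per chromosome
cols.per.chrom <- table(chrom)
snps.per.chrom <- cols.per.chrom / 2

# Total number of SNPs in the data set
num.snps <- length(allele.cols) / 2

snps.per.chrom
num.snps
```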
To get more SNPs, we could increase the mutation rate; however, that will also increase the number of hits on multiple sites at each locus. If we decrease the number of sites at each locus, we do not increase mutations at other loci because each locus is simulated independently. The only way to increase the number of independent SNPs is to increase the number of independent loci (chromosomes). Note that this will also increase the time each simulation takes.
Here we simulate 10,000 chromosomes rather than the 1,000 we simulated previously.
genetics <- fscSettingsGenetics(fscBlock_snp(10, 1e-6), num.chrom = 10000)
p <- fscWrite(demes = demes, genetics = genetics, label = "ex2.snps.10k")
p <- fscRun(p, all.sites = F)
2020-02-04 06:33:09 running fastsimcoal2...
2020-02-04 06:33:10 run complete
snp.df <- fscReadArp(p)
2020-02-04 06:33:10 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex2.snps.10k/ex2.snps.10k_1_1.arp
2020-02-04 06:33:10 parsing genetic data...
# number of SNP allele columns (two per SNP)
ncol(snp.df) - 2
[1] 2774
# proportion of n SNPs per locus
round(prop.table(snpOccurFreq(snp.df)), 3)
2 4 6
0.935 0.062 0.003
By increasing the number of loci by an order of magnitude we get an order of magnitude more SNPs, but keep the same proportion of unlinked SNPs. For the sake of comparison, here is the result when we increase mutation rate by an order of magnitude, but keep the number of chromosomes at 1,000:
genetics <- fscSettingsGenetics(fscBlock_snp(10, 1e-5), num.chrom = 1000)
p <- fscWrite(demes = demes, genetics = genetics, label = "ex2.snps.mut")
p <- fscRun(p, all.sites = F, num.sims = 1)
2020-02-04 06:33:11 running fastsimcoal2...
2020-02-04 06:33:12 run complete
snp.df <- fscReadArp(p)
2020-02-04 06:33:12 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex2.snps.mut/ex2.snps.mut_1_1.arp
2020-02-04 06:33:12 parsing genetic data...
# number of SNP allele columns (two per SNP)
ncol(snp.df) - 2
[1] 2638
# proportion of n SNPs per locus
snp.occ.freq <- snpOccurFreq(snp.df)
round(prop.table(snp.occ.freq), 3)
2 4 6 8 10 12 14
0.484 0.303 0.143 0.052 0.012 0.003 0.001
We still get an order of magnitude more SNPs, but a much lower percentage of them are unlinked. The full number of unlinked SNPs in this run is 725, which is more than we had with the lower mutation rate and the same number of loci, but fewer than we had with the lower mutation rate and ten times as many loci.
The final twist to this issue is that we are currently not using an infinite sites model. Because SNPs are being simulated as DNA base pairs that only mutate by transitions, there is a chance of having multiple mutations at the same site that would go unobserved. Thus, the number of SNPs without infinite sites turned on will always be an underestimate of the actual number of segregating sites. The expected number of SNPs is \(2N\mu s\sum_{i=1}^{n-1}(\frac{1}{i})\) where N is the number of haploid individuals, \(\mu\) is the mutation rate, n is the number of haploid genes sampled, and s is the total number of sites.
Below is a model that we’ll use to demonstrate the effect. We have a diploid population size of 5000 (10,000 haploid genes), and we’re sampling 5 individuals (10 haploid genes). The mutation rate is \(10^{-5}\) and we’ll be simulating 1000 unlinked loci, each one 10 base pairs long. Thus, the expected number of SNPs is 5658.
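The expected value quoted above can be reproduced directly from the formula. This is a short base R sketch with variable names chosen to mirror the symbols in the text:

```r
N <- 10000        # haploid genes in the population (5000 diploids)
mu <- 1e-5        # mutation rate per site
s <- 1000 * 10    # total sites: 1000 loci of 10 base pairs
n <- 10           # haploid genes sampled (5 diploids)

# 2 * N * mu * s * sum_{i = 1}^{n - 1} (1 / i)
exp.snps <- 2 * N * mu * s * sum(1 / (1:(n - 1)))

round(exp.snps)
```

This rounds to 5658.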
demes <- fscSettingsDemes(fscDeme(deme.size = 5000, sample.size = 5))
genetics <- fscSettingsGenetics(fscBlock_snp(10, 1e-5), num.chrom = 1000)
p <- fscWrite(demes = demes, genetics = genetics, label = "ex2.inf.sites")
p <- fscRun(p, all.sites = F)
2020-02-04 06:33:12 running fastsimcoal2...
2020-02-04 06:33:14 run complete
snp.df <- fscReadArp(p)
2020-02-04 06:33:14 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex2.inf.sites/ex2.inf.sites_1_1.arp
2020-02-04 06:33:14 parsing genetic data...
# number of SNP allele columns recovered (two per SNP)
ncol(snp.df) - 2
[1] 8346
Here’s the same model with infinite sites turned on (inf.sites = T), meaning every mutation is recorded as a new allele, even if it is at the same site.
p <- fscRun(p, all.sites = F, inf.sites = T, num.sims = 1)
2020-02-04 06:33:14 running fastsimcoal2...
2020-02-04 06:33:15 run complete
snp.df <- fscReadArp(p)
2020-02-04 06:33:15 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex2.inf.sites/ex2.inf.sites_1_1.arp
2020-02-04 06:33:15 parsing genetic data...
# number of SNP allele columns recovered (two per SNP)
ncol(snp.df) - 2
[1] 11240
Each diploid SNP occupies two allele columns, so these column counts correspond to 8346 / 2 = 4173 SNPs without infinite sites and 11240 / 2 = 5620 SNPs with it. With infinite sites turned on, the observed number of SNPs is much closer to the expected number of 5658. The decision of whether or not to use an infinite sites model depends on whether you are trying to simulate loci for comparison to an empirical data set where you are not able to observe multiple mutations at the same site (inf.sites = FALSE) or to a theoretical model (inf.sites = TRUE), where it is assumed that these mutations have been observed. Note that reported mutation rates are frequently underestimates of the true mutation rate unless they actively try to account for these hidden mutations.
Migration between demes is specified by supplying matrices of migration rates for all pairs of demes through the fscSettingsMigration() function. The values in a migration matrix are read as the probability that an individual will migrate from one deme (rows) to another deme (columns). Migration rates (m) between any pair of demes do not have to be the same (e.g., \(m_{i,j}\) does not have to equal \(m_{j,i}\)). Also, values along the diagonal are ignored, but do have to be numeric.
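As an illustration of the row and column convention, here is a base R sketch of an asymmetric matrix in which deme 1 sends migrants to deme 2 at a higher rate than it receives them. The rates and dimension names are made up for the example, and the names are only there to make the orientation readable.

```r
m.12 <- 0.001    # hypothetical rate from deme 1 to deme 2
m.21 <- 0.0001   # hypothetical rate from deme 2 to deme 1

# byrow = TRUE so that each row lists the rates out of one source deme
asym.mat <- matrix(
  c(0, m.12,
    m.21, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(from = c("deme1", "deme2"), to = c("deme1", "deme2"))
)

asym.mat
```

Note that matrix() fills by column unless byrow = TRUE is given, which matters as soon as the matrix is not symmetric.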
Here’s a simple migration matrix where two demes exchange individuals at a rate of 0.00001 per generation.
m <- 0.00001
mig.mat <- matrix(c(0, m, m, 0), nrow = 2)
mig.mat
[,1] [,2]
[1,] 0.00000 0.00001
[2,] 0.00001 0.00000
We then set up and run a simulation of two demes with 1000 individuals each. We take 10 samples from each deme and simulate 1000 SNPs:
demes <- fscSettingsDemes(fscDeme(1000, 10), fscDeme(1000, 10))
genetics <- fscSettingsGenetics(fscBlock_snp(1, 1e-6), num.chrom = 1000)
p <- fscWrite(
demes = demes,
migration = fscSettingsMigration(mig.mat),
genetics = genetics,
label = "ex3.mig.ex"
)
p <- fscRun(p, all.sites = F)
2020-02-04 06:33:16 running fastsimcoal2...
2020-02-04 06:33:17 run complete
The expected number of migrants per generation (Nm) is 0.01 which shouldn’t be enough to homogenize the populations. We can confirm with an Fst test:
snp.df <- fscReadArp(p, one.col = F)
2020-02-04 06:33:17 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex3.mig.ex/ex3.mig.ex_1_1.arp
2020-02-04 06:33:17 parsing genetic data...
snp.g <- df2gtypes(snp.df, ploidy = 2)
overallTest(snp.g, stat = "fst")
<<< gtypes created on 2020-02-04 06:33:17 >>>
2020-02-04 06:33:17 : Overall test : 1000 permutations
N
Deme.1 10
Deme.2 10
Population structure results:
estimate p.val
Fst 0.9075205 0.000999001
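For reference, the Nm value for this run and the Fst expected under the approximation 1 / (4Nm + 1), which is the same approximation used for the expFst column of the comparison table in this vignette, can be computed as follows. This is a sketch in base R, and the approximation is a rough guide rather than an exact expectation for this two-deme model.

```r
N <- 1000       # diploid individuals per deme
m <- 0.00001    # migration rate per generation

Nm <- N * m                      # expected migrants per generation
exp.fst <- 1 / (4 * Nm + 1)      # approximate expected Fst

c(Nm = Nm, expFst = round(exp.fst, 4))
```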
As expected, Fst is relatively high and shows significant differentiation. Let’s see what happens if we increase migration by several orders of magnitude.
m.vec <- c(0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05)
fst <- sapply(m.vec, function(m) {
mig.mat <- matrix(c(0, m, m, 0), nrow = 2)
p <- fscWrite(
demes = demes,
migration = fscSettingsMigration(mig.mat),
genetics = genetics,
label = "mig.test"
)
p <- fscRun(p, all.sites = F, inf.sites = T)
snp.df <- fscReadArp(p, one.col = F)
snp.g <- df2gtypes(snp.df, ploidy = 2)
overallTest(snp.g, stat = c("fst"), quietly = T)$result[1, ]
})
cbind(
m = m.vec,
Nm = 1000 * m.vec,
expFst = 1 / ((4 * 1000 * m.vec) + 1),
t(fst)
)
m Nm expFst estimate p.val
[1,] 0.0001 0.1 0.714285714 0.525632437 0.000999001
[2,] 0.0005 0.5 0.333333333 0.153808474 0.000999001
[3,] 0.0010 1.0 0.200000000 0.067022935 0.002997003
[4,] 0.0050 5.0 0.047619048 0.030007344 0.033966034
[5,] 0.0100 10.0 0.024390244 0.007674341 0.265734266
[6,] 0.0500 50.0 0.004975124 0.004444444 0.343656344
As expected, Fst decreases as m and Nm increase. In this replicate it is no longer significant at the two largest values (Nm = 10 and 50), suggesting that somewhere between 5 and 10 migrants per generation is sufficient to homogenize these two demes.
We can set up a variety of migration rate matrices to examine various forms of connectivity among demes. Two classical forms are island and stepping stone models. In an island model, migration occurs between all populations, and in the simplest form all migration rates are equal.
Here is an island model for 5 populations with m set to 0.0005:
num.demes <- 5
m <- 0.0005
mig.rate <- m / (num.demes - 1)
island.mat <- matrix(rep(mig.rate, num.demes ^ 2), nrow = num.demes)
diag(island.mat) <- 0
island.mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0.000000 0.000125 0.000125 0.000125 0.000125
[2,] 0.000125 0.000000 0.000125 0.000125 0.000125
[3,] 0.000125 0.000125 0.000000 0.000125 0.000125
[4,] 0.000125 0.000125 0.000125 0.000000 0.000125
[5,] 0.000125 0.000125 0.000125 0.000125 0.000000
qgraph::qgraph(island.mat)
Registered S3 methods overwritten by 'huge':
method from
plot.sim BDgraph
print.sim BDgraph
Note that m is the total migration rate out of each deme, so we have to apportion it among destinations by dividing it by the number of other demes (k - 1, where k is the number of demes) for every entry. As mentioned before, fastsimcoal2 ignores the diagonal, so we set it to 0 so that the between-deme rates are highlighted in the figure.
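The same construction can be wrapped in a small function for any number of demes. This is a sketch only: makeIslandMat() is a hypothetical helper and not part of strataG. The row sums confirm that each deme's total emigration rate equals m.

```r
# Hypothetical helper (not a strataG function): equal-rate island model
makeIslandMat <- function(num.demes, m) {
  stopifnot(num.demes >= 2, m >= 0, m <= 1)
  # split the total rate m evenly among the (num.demes - 1) other demes
  mat <- matrix(m / (num.demes - 1), nrow = num.demes, ncol = num.demes)
  diag(mat) <- 0
  mat
}

island.check <- makeIslandMat(5, 0.0005)
rowSums(island.check)  # every row should sum to m
```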
In a stepping stone model, migration is only between neighboring demes, which sets up a form of isolation by distance. Here is some code to set up a matrix representing the same 5 demes in a stepping stone configuration.
mig.rate <- m / 2
step.mat <- matrix(0, nrow = num.demes, ncol = num.demes)
# set rate for neighbors
for (k in 1:(num.demes - 1)) {
step.mat[k, k + 1] <- step.mat[k + 1, k] <- mig.rate
}
# demes at ends
step.mat[1, num.demes] <- step.mat[num.demes, 1] <- mig.rate
diag(step.mat) <- 0
step.mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0.00000 0.00025 0.00000 0.00000 0.00025
[2,] 0.00025 0.00000 0.00025 0.00000 0.00000
[3,] 0.00000 0.00025 0.00000 0.00025 0.00000
[4,] 0.00000 0.00000 0.00025 0.00000 0.00025
[5,] 0.00025 0.00000 0.00000 0.00025 0.00000
qgraph::qgraph(step.mat)
In this configuration, demes are linked in a ring, with each exchanging 0.00025 of its members with each neighbor. This is a closed stepping stone model because the end demes are linked to each other. For an open, linear model with equivalent total migration rates for the end demes, one would double their rates with their single neighbor (e.g., step.mat[1, 2] <- 0.0005 and step.mat[1, 5] <- 0).
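A sketch of the open, linear variant described above, written from scratch so it does not modify step.mat. The names k.demes, m.total, and linear.mat are introduced here for illustration.

```r
k.demes <- 5
m.total <- 0.0005
linear.mat <- matrix(0, nrow = k.demes, ncol = k.demes)
# interior links: half of the total rate goes to each neighbor
for (k in 1:(k.demes - 1)) {
  linear.mat[k, k + 1] <- linear.mat[k + 1, k] <- m.total / 2
}
# end demes have a single neighbor, so all of their emigration goes there
linear.mat[1, 2] <- m.total
linear.mat[k.demes, k.demes - 1] <- m.total
rowSums(linear.mat)  # every row should sum to m.total
```

Note that the resulting matrix is not symmetric: an end deme sends migrants to its neighbor at twice the rate it receives them.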
Here’s the simulation with the island model.
demes <- fscSettingsDemes(
fscDeme(1000, 10), fscDeme(1000, 10), fscDeme(1000, 10),
fscDeme(1000, 10), fscDeme(1000, 10)
)
genetics <- fscSettingsGenetics(fscBlock_snp(1, 1e-5), num.chrom = 1000)
p.island <- fscWrite(
demes = demes,
migration = fscSettingsMigration(island.mat),
genetics = genetics,
label = "ex3.island"
)
p.island <- fscRun(p.island, all.sites = F)
2020-02-04 06:33:29 running fastsimcoal2...
2020-02-04 06:33:31 run complete
…and the one with the stepping stone model.
p.step <- fscWrite(
demes = demes,
migration = fscSettingsMigration(step.mat),
genetics = genetics,
label = "ex3.stepping.stone"
)
p.step <- fscRun(p.step, all.sites = F)
2020-02-04 06:33:31 running fastsimcoal2...
2020-02-04 06:33:32 run complete
Here’s Fst for both models:
# expected Fst
1 / ((4 * 1000 * m) + 1)
[1] 0.3333333
island.g <- df2gtypes(fscReadArp(p.island, one.col = F), ploidy = 2)
2020-02-04 06:33:32 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex3.island/ex3.island_1_1.arp
2020-02-04 06:33:32 parsing genetic data...
statFst(island.g)$result["estimate"]
estimate
0.2753774
step.g <- df2gtypes(fscReadArp(p.step, one.col = F), ploidy = 2)
2020-02-04 06:33:33 reading /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/ex3.stepping.stone/ex3.stepping.stone_1_1.arp
2020-02-04 06:33:33 parsing genetic data...
statFst(step.g)$result["estimate"]
estimate
0.3379983
Finally, we can make migration matrices spatially explicit by using pairwise spatial distances among demes to weight migration rates. For example, here are five randomly chosen demes in two-dimensional space:
# choose 5 random points in two dimensions
set.seed(50)
deme.pos <- data.frame(
x = runif(num.demes, 0, 0.5),
y = runif(num.demes, 0, 0.5)
)
rownames(deme.pos) <- 1:num.demes
plot(deme.pos, type = "n")
text(deme.pos, labels = rownames(deme.pos))
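We then compute the distance among pairs and scale it to the smallest distance. Note that deme.pos[, -1] drops the x column, so the distances below are measured along the y axis only; passing deme.pos itself to dist() would give Euclidean distances in both dimensions: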
euc.dist <- dist(deme.pos[, -1], diag = FALSE, upper = TRUE)
scaled.dist <- as.matrix(euc.dist / min(euc.dist))
scaled.dist
1 2 3 4 5
1 0.00000 245.00586 224.97445 1.00000 23.49712
2 245.00586 0.00000 20.03141 246.00586 221.50875
3 224.97445 20.03141 0.00000 225.97445 201.47734
4 1.00000 246.00586 225.97445 0.00000 24.49712
5 23.49712 221.50875 201.47734 24.49712 0.00000
Every other distance is a multiple of this smallest distance. Dividing the reference migration rate by this matrix makes each migration rate inversely proportional to distance, which gives a very strict form of isolation by distance.
ibd.mat <- 0.05 / scaled.dist
diag(ibd.mat) <- 0
ibd.mat
1 2 3 4 5
1 0.0000000000 0.0002040767 0.0002222475 0.0500000000 0.0021279208
2 0.0002040767 0.0000000000 0.0024960795 0.0002032472 0.0002257247
3 0.0002222475 0.0024960795 0.0000000000 0.0002212640 0.0002481669
4 0.0500000000 0.0002032472 0.0002212640 0.0000000000 0.0020410566
5 0.0021279208 0.0002257247 0.0002481669 0.0020410566 0.0000000000
qgraph::qgraph(ibd.mat)
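For comparison, here is a sketch of the same weighting using distances over both coordinates. It assumes the same seed and draw order as above; the names pos, euc2d, scaled2d, and ibd2d are introduced here so that none of the objects used in the simulation below are changed.

```r
set.seed(50)
pos <- data.frame(x = runif(5, 0, 0.5), y = runif(5, 0, 0.5))
# dist() on the full data frame uses both the x and y columns
euc2d <- as.matrix(dist(pos))
# scale to the smallest off-diagonal distance
scaled2d <- euc2d / min(euc2d[upper.tri(euc2d)])
# closest pair gets the reference rate of 0.05; others are proportionally lower
ibd2d <- 0.05 / scaled2d
diag(ibd2d) <- 0
round(ibd2d, 5)
```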
Here’s one replicate of a simulation with this matrix.
demes <- fscSettingsDemes(
fscDeme(1000, 10), fscDeme(1000, 10), fscDeme(1000, 10),
fscDeme(1000, 10), fscDeme(1000, 10)
)
p.ibd <- fscWrite(
demes = demes,
migration = fscSettingsMigration(ibd.mat),
genetics = genetics,
label = "ex3.ibd"
)
p.ibd <- fscRun(p.ibd, all.sites = F)
2020-02-04 06:33:34 running fastsimcoal2...
2020-02-04 06:33:36 run complete
Here are pairwise Fst tests:
ibd.g <- df2gtypes(fscReadArp(p.ibd, one.col = F), ploidy = 2)
pws <- pairwiseTest(ibd.g, stat = "fst", nrep = 100)$pair.mat$Fst
pws
Deme.1 Deme.2 Deme.3 Deme.4 Deme.5
Deme.1 NA 0.00990099 0.00990099 0.17821782 0.00990099
Deme.2 0.104715402 NA 0.00990099 0.00990099 0.00990099
Deme.3 0.100666644 0.04711621 NA 0.00990099 0.00990099
Deme.4 0.003248863 0.09677419 0.09303789 NA 0.00990099
Deme.5 0.038339841 0.10906901 0.11275554 0.04397365 NA
The lower left triangle of the matrix contains pairwise Fst values and the upper right contains permutation p-values. This shows that there is less differentiation among demes that are relatively close, while those that are farther apart show higher Fst values and are significantly different.
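A combined matrix like this can be unpacked into one row per pair with base R. The sketch below rebuilds the matrix from the values printed above (pws.ex, deme.names, idx, and pair.df are names introduced here) rather than relying on the simulation output.

```r
deme.names <- paste0("Deme.", 1:5)
pws.ex <- matrix(NA_real_, 5, 5, dimnames = list(deme.names, deme.names))
# lower triangle (filled column by column): Fst values from the output above
pws.ex[lower.tri(pws.ex)] <- c(
  0.104715402, 0.100666644, 0.003248863, 0.038339841,
  0.04711621, 0.09677419, 0.10906901,
  0.09303789, 0.11275554,
  0.04397365
)
# upper triangle: permutation p-values from the output above
pws.ex[upper.tri(pws.ex)] <- 0.00990099
pws.ex["Deme.1", "Deme.4"] <- 0.17821782

idx <- which(lower.tri(pws.ex), arr.ind = TRUE)
pair.df <- data.frame(
  deme.a = deme.names[idx[, "col"]],
  deme.b = deme.names[idx[, "row"]],
  fst = pws.ex[idx],                       # entry below the diagonal
  p.val = pws.ex[idx[, c("col", "row")]]   # mirrored entry above the diagonal
)
pair.df <- pair.df[order(pair.df$fst), ]
pair.df
```

Sorting by Fst puts the Deme.1 and Deme.4 pair first, which is the only pair that is not significantly differentiated in this replicate.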
So far we have run simulations where the only interaction between demes was migration. In the simulation, migration occurs throughout the coalescence process at a regular rate. fastsimcoal2 also allows us to specify a more complex demographic history of the demes by describing events where demes exchange individuals at different rates or change sizes at specific points in time. These historical events are created with the fscEvent() function and are loaded as comma-separated items with the fscSettingsEvents() function.
Below, we will duplicate the “3popDNASFS.par” model on page 32 of the fastsimcoal2 manual. Here is the historical event specification for that model:
events <- fscSettingsEvents(
fscEvent(
event.time = 2000,
source = 1,
sink = 2,
prop.migrants = 0.05,
new.size = 1,
new.growth = 0,
migr.mat = 0
),
fscEvent(
event.time = 2980,
source = 1,
sink = 1,
prop.migrants = 0,
new.size = 0.04
),
fscEvent(3000, 1, 0, 1, 1),
fscEvent(15000, 0, 2, 1, 3)
)
The “source” deme is the deme from which the individuals come, and the “sink” is the deme to which they go. If both source and sink are the same deme, then the event describes a change within that deme. prop.migrants is the proportion of the source deme that moves to the sink deme. new.size is a multiplier describing the factor by which the sink deme grows or shrinks going back in time. new.growth gives a new growth rate for the sink deme, and migr.mat specifies the number of the migration matrix (starting at index 0) to use further back in time from this event. These latter two parameters are left at the default of 0 for this model.
The above specification describes four events going backwards in time:
Generations | Description |
---|---|
-2000 | 5% of the genes in deme 1 (the second deme) move to deme 2. Deme 2 (the sink) stays at its current size. |
-2980 | deme 1 reduces to 4% of its size. |
-3000 | 100% of the genes in deme 1 move to deme 0. |
-15000 | 100% of the genes in deme 0 move to deme 2 and it grows to 3x its size. |
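The size changes in this table can be traced with a few lines of base R. This is a sketch under two stated assumptions: new.size is applied to the sink deme, and growth rates stay at 0 so sizes change only at events. The events.df data frame and rel.size vector are introduced here for illustration and are not strataG objects.

```r
# the four events from the specification above, as a plain data frame
events.df <- data.frame(
  event.time = c(2000, 2980, 3000, 15000),
  source = c(1, 1, 1, 0),
  sink = c(2, 1, 0, 2),
  prop.migrants = c(0.05, 0, 1, 1),
  new.size = c(1, 0.04, 1, 3)
)

# size of each deme relative to its size at time 0, updated backward in time
rel.size <- c("0" = 1, "1" = 1, "2" = 1)
for (i in order(events.df$event.time)) {
  sink <- as.character(events.df$sink[i])
  # assumption: the multiplier applies to the sink deme
  rel.size[sink] <- rel.size[sink] * events.df$new.size[i]
}
rel.size
```

Under these assumptions deme 0 keeps its size, deme 1 is at 4% of its size before its lineages merge into deme 0, and deme 2 ends at 3 times its size.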
Multiple iterations of a fastsimcoal2 model with different setting values can be run by programmatically defining settings and re-running the code. Here’s an example where we loop through a parameter space of population sizes and migration rates. Note that we change the label each iteration so as not to overwrite files:
param.df <- data.frame(
N = 10 ^ runif(5, 2, 4),
MIG = 10 ^ runif(5, -8, -5)
)
param.p <- lapply(1:nrow(param.df), function(i) {
N <- param.df$N[i]
MIG <- param.df$MIG[i]
p <- fscWrite(
demes = fscSettingsDemes(fscDeme(N, 5), fscDeme(N, 5)),
migration = fscSettingsMigration(matrix(c(0, MIG, MIG, 0), nrow = 2)),
genetics = fscSettingsGenetics(fscBlock_snp(100, 1e-6), num.chrom = 1000),
label = paste0("param.sim.", i)
)
fscRun(p)
})
2020-02-04 06:33:44 running fastsimcoal2...
2020-02-04 06:33:46 run complete
2020-02-04 06:33:46 running fastsimcoal2...
2020-02-04 06:33:47 run complete
2020-02-04 06:33:47 running fastsimcoal2...
2020-02-04 06:33:49 run complete
2020-02-04 06:33:49 running fastsimcoal2...
2020-02-04 06:33:50 run complete
2020-02-04 06:33:50 running fastsimcoal2...
2020-02-04 06:33:52 run complete
dir(p$folder, pattern = "param.sim")
[1] "param.sim.1" "param.sim.1.log" "param.sim.1.par" "param.sim.2"
[5] "param.sim.2.log" "param.sim.2.par" "param.sim.3" "param.sim.3.log"
[9] "param.sim.3.par" "param.sim.4" "param.sim.4.log" "param.sim.4.par"
[13] "param.sim.5" "param.sim.5.log" "param.sim.5.par"
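The expression 10 ^ runif(n, a, b) used above draws values that are uniform on a log10 scale, so each order of magnitude is equally likely. The sketch below makes such draws reproducible and attaches a unique label to each row. The seed, the rounding of N to whole individuals, and the name param.grid are assumptions added here.

```r
set.seed(1)  # assumption: fixed seed so the draws can be repeated
n.sims <- 5
param.grid <- data.frame(
  N = round(10 ^ runif(n.sims, 2, 4)),  # 100 to 10000, log-uniform
  MIG = 10 ^ runif(n.sims, -8, -5)      # 1e-8 to 1e-5, log-uniform
)
# one unique label per parameter set, so that runs do not overwrite each other
param.grid$label <- paste0("param.sim.", seq_len(n.sims))
param.grid
```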
We can accomplish the same task by substituting character names for parameters in the settings and providing a matrix of parameter definition values. This is the same procedure described in the section entitled, “USING PREDEFINED VALUES FOR A PARTICULAR EVOLUTIONARY MODEL” on page 37 in the fastsimcoal2 manual. Below is an example using the parameter data frame from above:
p <- fscWrite(
demes = fscSettingsDemes(fscDeme("N", 5), fscDeme("N", 5)),
migration = fscSettingsMigration(matrix(c(0, "MIG", "MIG", 0), nrow = 2)),
genetics = fscSettingsGenetics(fscBlock_snp(100, 1e-6), num.chrom = 1000),
def = fscSettingsDef(param.df),
label = "param.sim"
)
p <- fscRun(p)
2020-02-04 06:33:52 running fastsimcoal2...
2020-02-04 06:33:53 run complete
dir(p$folder, pattern = "param.sim.def")
[1] "param.sim.def"
fastsimcoal2 also provides the functionality to estimate demographic parameters of a given coalescence model given empirical data. The empirical data are in the form of the site frequency spectrum (SFS) of SNP loci. To demonstrate, we first specify a model with known parameters that generates an SFS. We use the 1PopBot20Mb example on page 38 of the fastsimcoal2 manual.
obs.p <- fscWrite(
demes = fscSettingsDemes(fscDeme(7300, 20)),
events = fscSettingsEvents(
fscEvent(9800, 0, 0, 0, 3.5),
fscEvent(9900, 0, 0, 0, 1)
),
genetics = fscSettingsGenetics(fscBlock_snp(10, 2.5e-6), num.chrom = 200000),
label = "known.1PopBot20Mb"
)
obs.p <- fscRun(obs.p, dna.to.snp = TRUE, no.arl.output = TRUE, num.cores = 3)
2020-02-04 06:33:53 running fastsimcoal2...
2020-02-04 06:33:55 run complete
obs.sfs <- fscReadSFS(obs.p)
2020-02-04 06:33:55 reading files in /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/known.1PopBot20Mb/known.1PopBot20Mb_1
str(obs.sfs)
List of 3
$ sfs :List of 2
..$ marginal:List of 1
.. ..$ /var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/known.1PopBot20Mb/known.1PopBot20Mb_1/known.1PopBot20Mb_MAFpop0.obs: Named int [1:41] 1236340 107323 66910 52013 45307 40477 37823 36063 33832 32539 ...
.. .. ..- attr(*, "names")= chr [1:41] "d0_0" "d0_1" "d0_2" "d0_3" ...
..$ joint : NULL
$ polym.sites: Named num [1:5] 2000000 832225 0 37715 30850
..- attr(*, "names")= chr [1:5] "num.sim" "num.polym" "num.gt2.alleles" "num.fix.anc" ...
$ lhood : NULL
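To make the structure of this vector concrete, here is a sketch of how a minor allele frequency (folded) SFS is tabulated from per-SNP allele counts. The counts are hypothetical, and the assumption of 40 gene copies is inferred from the 41 classes (d0_0 to d0_40) in the output above.

```r
n.copies <- 40  # assumption: 20 diploid samples
# hypothetical derived-allele counts at 12 SNPs
derived <- c(1, 1, 2, 39, 38, 5, 20, 20, 35, 1, 3, 37)
# fold: count whichever allele is rarer at each SNP
minor <- pmin(derived, n.copies - derived)
# tabulate classes 0..n.copies (tabulate() bins start at 1, hence the + 1)
maf.sfs <- setNames(
  tabulate(minor + 1, nbins = n.copies + 1),
  paste0("d0_", 0:n.copies)
)
maf.sfs[1:21]
```

Because the minor allele can never be present in more than half of the copies, all classes above d0_20 are zero, as in the observed SFS printed above.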
We specify the parameter estimation model as before, substituting character names for the parameters to be estimated in each setting.
demes <- fscSettingsDemes(fscDeme("NCUR", 20))
events <- fscSettingsEvents(
fscEvent("TBOT", 0, 0, 0, "RESBOT"),
fscEvent("TENDBOT", 0, 0, 0, "RESENDBOT")
)
We then create the parameter estimation settings, and load the observed SFS:
est <- fscSettingsEst(
fscEstParam("NCUR", is.int = TRUE, distr = "unif", min = 10, max = 100000),
# is.int = TRUE and distr = "unif" are the defaults, so they are omitted below
fscEstParam("NANC", min = 10, max = 100000),
fscEstParam("NBOT", min = 10, max = 100000),
fscEstParam("TBOT", min = 10, max = 10000),
# these are "complex parameters" (only name and value are given)
fscEstParam("RESBOT", is.int = FALSE, value = "NBOT/NCUR", output = FALSE),
fscEstParam("RESENDBOT", is.int = FALSE, value = "NANC/NBOT", output = FALSE),
fscEstParam("TENDBOT", value = "TBOT+100", output = FALSE),
obs.sfs = obs.sfs$sfs$marginal[[1]]
)
est
$params
is.int name dist min max value output bounded reference
1 1 NCUR unif 10 100000 <NA> output
2 1 NANC unif 10 100000 <NA> output
3 1 NBOT unif 10 100000 <NA> output
4 1 TBOT unif 10 10000 <NA> output
5 0 RESBOT <NA> <NA> <NA> NBOT/NCUR hide
6 0 RESENDBOT <NA> <NA> <NA> NANC/NBOT hide
7 1 TENDBOT <NA> <NA> <NA> TBOT+100 hide
$rules
NULL
$sfs
d0_0 d0_1 d0_2 d0_3 d0_4 d0_5 d0_6 d0_7 d0_8 d0_9
1236340 107323 66910 52013 45307 40477 37823 36063 33832 32539
d0_10 d0_11 d0_12 d0_13 d0_14 d0_15 d0_16 d0_17 d0_18 d0_19
31781 30434 30370 30248 29788 29486 28944 28911 28670 28297
d0_20 d0_21 d0_22 d0_23 d0_24 d0_25 d0_26 d0_27 d0_28 d0_29
14444 0 0 0 0 0 0 0 0 0
d0_30 d0_31 d0_32 d0_33 d0_34 d0_35 d0_36 d0_37 d0_38 d0_39
0 0 0 0 0 0 0 0 0 0
d0_40
0
attr(,"sfs.type")
[1] "MAF"
attr(,"class")
[1] "fscSettingsEst" "list"
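The complex parameters are simple functions of the estimated ones. The sketch below evaluates the three expressions for one hypothetical draw; fastsimcoal2 does this internally, so the code is purely illustrative. NCUR is taken from the known model above, and the other values are made up.

```r
# one hypothetical draw of the simple parameters
draw <- list(NCUR = 7300, NANC = 20000, NBOT = 100, TBOT = 9800)
# complex parameter definitions, as given to fscEstParam(value = ...)
complex.def <- c(
  RESBOT = "NBOT/NCUR",
  RESENDBOT = "NANC/NBOT",
  TENDBOT = "TBOT+100"
)
# evaluate each expression using the drawn values
complex.val <- sapply(complex.def, function(expr) {
  eval(parse(text = expr), envir = draw)
})
complex.val
```

For this draw the population shrinks to NBOT/NCUR of its current size at TBOT and then expands by a factor of NANC/NBOT 100 generations earlier.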
We then write and run the model. Note that below we only run it for 10000 iterations; to produce stable estimates, however, many more (> 1000000) should be run.
est.p <- fscWrite(
demes = demes,
events = events,
genetics = fscSettingsGenetics(fscBlock_freq(2.5e-6)),
est = est,
label = "est.1PopBot20Mb"
)
est.p <- fscRun(est.p, num.sims = 10000, num.cores = 3)
2020-02-04 06:33:55 running fastsimcoal2...
2020-02-04 06:34:04 run complete
This run writes several files to the simulation folder:
dir(est.p$folder, pattern = est.p$label)
[1] "est.1PopBot20Mb" "est.1PopBot20Mb_MAFpop0.obs"
[3] "est.1PopBot20Mb.est" "est.1PopBot20Mb.log"
[5] "est.1PopBot20Mb.par" "est.1PopBot20Mb.tpl"
We read them into R with fscReadParamEst():
param.est <- fscReadParamEst(est.p)
str(param.est)
List of 3
$ sfs :List of 2
..$ marginal: num [1:41, 1] 0 0.1224 0.078 0.0657 0.0581 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:41] "d0_0" "d0_1" "d0_2" "d0_3" ...
.. .. ..$ : chr "/var/folders/f6/2l43x9sd3wb_y4tjvltz1pfc0000gp/T//Rtmp4pR9Jz/est.1PopBot20Mb/est.1PopBot20Mb_MAFpop0.txt"
..$ joint : NULL
$ max.lhoods: NULL
$ ecm.lhoods:'data.frame': 80 obs. of 6 variables:
..$ Param# : num [1:80] 0 1 2 3 0 1 2 3 0 1 ...
..$ NCUR : num [1:80] 6813 6813 6813 6813 6813 ...
..$ NANC : num [1:80] 86738 84277 84277 84277 84277 ...
..$ NBOT : num [1:80] 49082 49082 83172 83172 83172 ...
..$ TBOT : num [1:80] 8252 8252 8252 8592 8592 ...
..$ MaxEstLhood: num [1:80] -1541719 -1541633 -1541462 -1541615 -1541699 ...
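One common next step is to pull out the parameter set with the highest likelihood. The sketch below uses a stand-in data frame (ecm.ex) built from the first four rows visible in the str() output above, so the values are rounded as printed; with a real run one would index param.est$ecm.lhoods the same way.

```r
# stand-in for the first rows of param.est$ecm.lhoods (values as printed above)
ecm.ex <- data.frame(
  NCUR = c(6813, 6813, 6813, 6813),
  NANC = c(86738, 84277, 84277, 84277),
  NBOT = c(49082, 49082, 83172, 83172),
  TBOT = c(8252, 8252, 8252, 8592),
  MaxEstLhood = c(-1541719, -1541633, -1541462, -1541615)
)
# the least negative log-likelihood marks the best-fitting parameter set
best <- ecm.ex[which.max(ecm.ex$MaxEstLhood), ]
best
```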
If the model contains more than one deme, then the only difference is that a list of the observed joint SFS with fastsimcoal2 formatted row and column names must be provided to the obs.sfs argument of fscSettingsEst().