Raw genotype data is easiest to load from disk when it is stored as a text file, usually in comma-delimited (.csv) format. The standard R functions read.table() or read.csv() can do this. For .csv files, strataG also provides readGenData(), a wrapper for data.table::fread() (visible in the warning below) that sets commonly used values for missing data and removes blank lines.
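A minimal sketch of such a call, assuming strataG is installed. The file contents and locus names are made up for illustration, and the file is written to a temporary location so the example runs on its own; it is not the data file behind the output shown here.

```r
library(strataG)

# Write a tiny, made-up genotype table to a temporary .csv file.
# Empty cells and "NA" stand in for missing genotypes.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "ids,strata,LocA.1,LocA.2",
  "ind1,Pop1,133,137",
  "ind2,Pop1,,",
  "ind3,Pop2,135,NA"
), tmp)

# readGenData() applies its default missing-data codes while reading
gen.data <- readGenData(tmp)
str(gen.data)
```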
## Warning in data.table::fread(file = file, header = TRUE, na.strings =
## na.strings, : na.strings[4]==" " consists only of whitespace, ignoring.
## strip.white==TRUE (default) and "" is present in na.strings, so any number of
## spaces in string columns will already be read as <NA>.
## 'data.frame': 126 obs. of 12 variables:
## $ ids : chr "4495" "4496" "4498" "5814" ...
## $ strata : chr "Offshore.North" "Offshore.North" "Offshore.North" "Offshore.North" ...
## $ D11t.1 : chr "133" "137" "133" "135" ...
## $ D11t.2 : chr "133" "141" "133" "139" ...
## $ EV37.1 : chr NA NA "220" "210" ...
## $ EV37.2 : chr NA NA "220" "226" ...
## $ EV94.1 : chr NA "243" "245" "251" ...
## $ EV94.2 : chr NA "251" "265" "265" ...
## $ Ttr11.1: chr "211" NA "209" "209" ...
## $ Ttr11.2: chr "211" NA "211" "209" ...
## $ Ttr34.1: chr "185" "189" "189" "185" ...
## $ Ttr34.2: chr "191" "189" "189" "187" ...
## - attr(*, ".internal.selfref")=<externalptr>
For sequence data stored in FASTA format, strataG provides the read.fasta() function, a wrapper for read.dna() in the ape package with standard FASTA arguments set. It returns a DNAbin object:
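As a rough illustration, the sketch below writes two short, invented sequences to a temporary FASTA file and reads them back. The sequence names and bases are assumptions made for the example only.

```r
library(strataG)

# Write a small, made-up FASTA file to a temporary location
tmp <- tempfile(fileext = ".fasta")
writeLines(c(
  ">seq1", "ACGTACGTAC",
  ">seq2", "ACGTACGTAA"
), tmp)

# Read the file into a DNAbin object and print its summary
seqs <- read.fasta(tmp)
seqs
```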
## 126 DNA sequences in binary format stored in a list.
##
## All sequences of same length: 402
##
## Labels:
## 4495
## 4496
## 4498
## 5814
## 5815
## 5816
## ...
##
## Base composition:
## a c g t
## 0.301 0.229 0.129 0.341
## (Total: 50.65 kb)
For sequences stored in other formats, read.dna should be used directly.
For most functions in strataG, you will need to load your data into a gtypes object. A gtypes object is an instance of an S4 class with several slots that are fully described in ?gtypes.
The easiest way to create a gtypes object is with the df2gtypes() function. This function assumes that you have a matrix or data.frame with columns for individual ids, stratification, and locus data. You then specify the columns in the data.frame where this information can be found. df2gtypes() can be used for data with multiple alleles per locus, like this:
data(dolph.strata); data(dolph.msats)
# create a single data.frame with the msat data and stratification
# (merge() joins the two tables on their shared "id" column)
msats.merge <- merge(dolph.strata, dolph.msats, all.y = TRUE)
str(msats.merge)
## 'data.frame': 126 obs. of 14 variables:
## $ id : chr "18650" "18651" "18652" "18653" ...
## $ dLoop : chr "Hap.06" "Hap.14" "Hap.23" "Hap.24" ...
## $ broad : chr "Offshore" "Offshore" "Offshore" "Offshore" ...
## $ fine : chr "Offshore.South" "Offshore.South" "Offshore.South" "Offshore.South" ...
## $ D11t.1 : chr "131" "137" "131" "133" ...
## $ D11t.2 : chr "133" "143" "133" "139" ...
## $ EV37.1 : chr "212" "200" "212" "202" ...
## $ EV37.2 : chr "222" "228" "218" "220" ...
## $ EV94.1 : chr "263" "249" "249" "229" ...
## $ EV94.2 : chr "265" "251" "251" "245" ...
## $ Ttr11.1: chr "197" "197" "211" "197" ...
## $ Ttr11.2: chr "209" "197" "213" "215" ...
## $ Ttr34.1: chr "185" "185" "185" "189" ...
## $ Ttr34.2: chr "187" "187" "191" "195" ...
# create the gtypes object, stratified by the "fine" column (4); loci start at column 5
msats.fine <- df2gtypes(msats.merge, ploidy = 2, id.col = 1, strata.col = 4, loc.col = 5)
…or for haploid data, like this:
data(dolph.seqs)
seq.df <- dolph.strata[ c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)
dl.g
##
## <<< gtypes created on 2020-02-23 16:30:25 >>>
##
## Contents: 126 samples, 1 locus, 2 strata
##
## Strata summary:
## stratum num.ind num.missing num.haplotypes
## 1 Coastal 68 0 68
## 2 Offshore 58 0 58
##
## Sequence summary:
## locus num.seqs mean.length mean.num.ns mean.num.indels
## 1 D-loop 126 402 0 0
Note that since each sequence in dolph.seqs is for a given individual, the num.ind and num.haplotypes values are the same for both strata. In order to convert the sequences to unique haplotypes, use the labelHaplotypes() function:
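A sketch of that step, assuming a strataG version in which labelHaplotypes() returns a gtypes object when given one (as the printed result suggests). The object names dl.g and dl.haps are only illustrative.

```r
library(strataG)
data(dolph.strata)
data(dolph.seqs)

# Rebuild the haploid gtypes object: one sequence per individual
seq.df <- dolph.strata[c("id", "broad", "id")]
colnames(seq.df)[3] <- "D-loop"
dl.g <- df2gtypes(seq.df, ploidy = 1, sequences = dolph.seqs)

# Collapse identical sequences into shared haplotype labels
dl.haps <- labelHaplotypes(dl.g)
dl.haps
```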
##
## <<< gtypes created on 2020-02-23 16:30:25 >>>
##
## Contents: 126 samples, 1 locus, 2 strata
## Other info: haps.unassigned
##
## Strata summary:
## stratum num.ind num.missing num.haplotypes
## 1 Coastal 68 0 5
## 2 Offshore 58 0 29
##
## Sequence summary:
## locus num.seqs mean.length mean.num.ns mean.num.indels
## 1 D-loop 33 402 0 0
The sequence2gtypes() function creates an unstratified gtypes object from just a set of DNA sequences:
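A short sketch, assuming the dolph.haps example data set (a DNAbin object of unique haplotypes) that ships with strataG. The name haps.g is illustrative.

```r
library(strataG)
data(dolph.haps)

# With no strata supplied, all sequences fall into a single default stratum
haps.g <- sequence2gtypes(dolph.haps)
haps.g
```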
##
## <<< gtypes created on 2020-02-23 16:30:26 >>>
##
## Contents: 33 samples, 1 locus, 1 stratum
##
## Strata summary:
## stratum num.ind num.missing num.haplotypes
## 1 Default 33 0 33
##
## Sequence summary:
## locus num.seqs mean.length mean.num.ns mean.num.indels
## 1 gene1 33 402 0 0
If you have a vector that identifies strata designations for the sequences, that can be supplied as well:
# extract and name the stratification scheme
strata <- dolph.strata$fine
names(strata) <- dolph.strata$id
# create the gtypes object
dloop.fine <- sequence2gtypes(dolph.seqs, strata, seq.names = "dLoop",
description = "dLoop: fine-scale stratification")
dloop.fine
##
## <<< dLoop: fine-scale stratification >>>
##
## Contents: 126 samples, 1 locus, 3 strata
##
## Strata summary:
## stratum num.ind num.missing num.haplotypes
## 1 Coastal 68 0 68
## 2 Offshore.North 40 0 40
## 3 Offshore.South 18 0 18
##
## Sequence summary:
## locus num.seqs mean.length mean.num.ns mean.num.indels
## 1 dLoop 126 402 0 0
Note that stratification is generally provided for individuals. If you want to stratify the gtypes object returned by sequence2gtypes(), supply one sequence for each individual rather than just a set of unique haplotypes.
There are conversion functions for data objects from several other popular R packages, such as adegenet (genind), pegas (loci), and phangorn (phyDat).
## Loading required package: ade4
##
## /// adegenet 2.1.2 is loaded ////////////
##
## > overview: '?adegenet'
## > tutorials/doc/questions: 'adegenetWeb()'
## > bug reports/feature requests: adegenetIssues()
# from example(df2genind)
df <- data.frame(locusA = c("11", "11", "12", "32"),
                 locusB = c(NA, "34", "55", "15"),
                 locusC = c("22", "22", "21", "22"))
row.names(df) <- .genlab("genotype",4)
obj <- df2genind(df, ploidy=2, ncode=1)
obj
## /// GENIND OBJECT /////////
##
## // 4 individuals; 3 loci; 9 alleles; size: 6.5 Kb
##
## // Basic content
## @tab: 4 x 9 matrix of allele counts
## @loc.n.all: number of alleles per locus (range: 2-4)
## @loc.fac: locus factor for the 9 columns of @tab
## @all.names: list of allele names for each locus
## @ploidy: ploidy of each individual (range: 2-2)
## @type: codom
## @call: df2genind(X = df, ncode = 1, ploidy = 2)
##
## // Optional content
## - empty -
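The conversion itself can be sketched as follows, assuming both adegenet and strataG are installed and using genind2gtypes() as the genind converter. The name gi.g is illustrative.

```r
library(adegenet)
library(strataG)

# Rebuild the small genind object from example(df2genind)
df <- data.frame(locusA = c("11", "11", "12", "32"),
                 locusB = c(NA, "34", "55", "15"),
                 locusC = c("22", "22", "21", "22"))
row.names(df) <- .genlab("genotype", 4)
obj <- df2genind(df, ploidy = 2, ncode = 1)

# Convert the genind object to a gtypes object
gi.g <- genind2gtypes(obj)
gi.g
```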
##
## <<< gtypes created on 2020-02-23 16:30:26 >>>
##
## Contents: 4 samples, 3 loci, 1 stratum
## Other info: genind
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Default 4 0.3333333 3
There are several functions for getting basic information from a gtypes object (see ?accessors):
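For instance, a few of the accessors applied to the msats.g example object. This is only a sample; ?accessors lists the full set.

```r
library(strataG)
data(msats.g)

getNumInd(msats.g)       # number of individuals
getNumLoci(msats.g)      # number of loci
getNumStrata(msats.g)    # number of strata in the current scheme
getPloidy(msats.g)       # ploidy of the data
getLociNames(msats.g)    # locus names
getStrataNames(msats.g)  # stratum names
head(getIndNames(msats.g))  # first few individual ids
```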
Some functions are available for modifying values in the object as well, such as:
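One possible example, assuming the setDescription<- replacement function documented under ?accessors. The new description text is arbitrary, and the change is made on a copy so that msats.g stays untouched.

```r
library(strataG)
data(msats.g)

# Work on a copy and replace its description
msats.copy <- msats.g
setDescription(msats.copy) <- "dolphin msats: working copy"
getDescription(msats.copy)
```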
A gtypes object can be subset using the standard R ‘[’ indexing operation with three indices: [i, j, k]. The first (i) selects individuals, the second (j) selects loci, and the third (k) selects strata. Each index accepts the usual numeric, character, or logical vectors. For example, to return 10 random individuals:
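A sketch of that subset on the msats.g example object. The individuals drawn differ from run to run, and the name sub.inds is illustrative.

```r
library(strataG)
data(msats.g)

# Pick 10 individual ids at random; leave the locus and stratum indices empty
sub.inds <- msats.g[sample(getIndNames(msats.g), 10), , ]
sub.inds
```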
##
## <<< gtypes created on 2020-02-23 16:30:25 >>>
##
## Contents: 10 samples, 5 loci, 2 strata
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 5 0 3.4
## 2 Offshore 5 0 6.4
…or to return specific loci:
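For example, selecting two loci by name from msats.g. The choice of D11t and EV94 is arbitrary, and the name sub.loci is illustrative.

```r
library(strataG)
data(msats.g)

# Keep all individuals and strata, but only two named loci
sub.loci <- msats.g[, c("D11t", "EV94"), ]
sub.loci
```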
##
## <<< gtypes created on 2020-02-23 16:30:25 >>>
##
## Contents: 10 samples, 2 loci, 2 strata
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 5 0 2.5
## 2 Offshore 5 0 6.5
…or some loci in a specific stratum:
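A sketch combining the locus and stratum indices, again with an arbitrary choice of loci. The name sub.coastal is illustrative.

```r
library(strataG)
data(msats.g)

# Two loci, restricted to individuals in the "Coastal" stratum
sub.coastal <- msats.g[, c("Ttr11", "D11t"), "Coastal"]
sub.coastal
```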
##
## <<< gtypes created on 2020-02-23 16:30:25 >>>
##
## Contents: 68 samples, 2 loci, 1 stratum
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 68 0.5 3.5
Several functions have been defined for gtypes objects that provide summaries for individuals (summarizeInds()), loci (summarizeLoci()), and sequences (summarizeSeqs()):
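A sketch of the two calls that apply to microsatellite data, using msats.g. summarizeSeqs() is left out because this object holds no sequences, and head() is used here only to keep the per-individual table short.

```r
library(strataG)
data(msats.g)

# One row per locus
loc.smry <- summarizeLoci(msats.g)
loc.smry

# One row per individual; show only the first few
ind.smry <- summarizeInds(msats.g)
head(ind.smry)
```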
## locus num.genotyped num.missing prop.genotyped num.alleles allelic.richness
## 1 D11t 125 1 0.9920635 12 0.09600000
## 2 EV37 119 7 0.9444444 22 0.18487395
## 3 EV94 125 1 0.9920635 15 0.12000000
## 4 Ttr11 125 1 0.9920635 9 0.07200000
## 5 Ttr34 126 0 1.0000000 10 0.07936508
## prop.unique.alleles exptd.het obsvd.het
## 1 0.03200000 0.7468273 0.6984127
## 2 0.02521008 0.8270751 0.6587302
## 3 0.01600000 0.8304900 0.7698413
## 4 0.02400000 0.7953414 0.6984127
## 5 0.01587302 0.8144248 0.6984127
## id stratum num.loci.genotyped num.loci.missing.genotypes
## 1 23945 Coastal 5 0
## 2 25503 Coastal 5 0
## 3 25509 Coastal 5 0
## 4 40915 Coastal 5 0
## 5 40916 Coastal 4 1
## 6 41538 Coastal 5 0
## 7 41539 Coastal 5 0
## 8 41540 Coastal 5 0
## 9 41578 Coastal 5 0
## 10 41579 Coastal 4 1
## 11 41819 Coastal 5 0
## 12 41820 Coastal 5 0
## 13 41821 Coastal 5 0
## 14 41822 Coastal 5 0
## 15 42192 Coastal 4 1
## 16 42193 Coastal 4 1
## 17 44718 Coastal 5 0
## 18 44719 Coastal 5 0
## 19 44720 Coastal 5 0
## 20 44721 Coastal 5 0
## 21 45229 Coastal 5 0
## 22 45230 Coastal 5 0
## 23 45231 Coastal 5 0
## 24 45232 Coastal 5 0
## 25 45233 Coastal 5 0
## 26 45234 Coastal 5 0
## 27 45236 Coastal 5 0
## 28 45237 Coastal 4 1
## 29 49095 Coastal 5 0
## 30 51981 Coastal 4 1
## 31 51982 Coastal 5 0
## 32 78033 Coastal 5 0
## 33 78034 Coastal 5 0
## 34 78035 Coastal 5 0
## 35 78036 Coastal 5 0
## 36 78037 Coastal 5 0
## 37 78038 Coastal 5 0
## 38 78039 Coastal 5 0
## 39 78040 Coastal 5 0
## 40 78041 Coastal 5 0
## 41 78042 Coastal 5 0
## 42 78043 Coastal 5 0
## 43 78044 Coastal 5 0
## 44 78045 Coastal 5 0
## 45 78046 Coastal 5 0
## 46 78047 Coastal 5 0
## 47 78048 Coastal 5 0
## 48 78049 Coastal 5 0
## 49 78050 Coastal 5 0
## 50 78051 Coastal 5 0
## 51 78052 Coastal 5 0
## 52 78053 Coastal 5 0
## 53 78054 Coastal 5 0
## 54 78055 Coastal 5 0
## 55 78056 Coastal 5 0
## 56 78057 Coastal 5 0
## 57 78058 Coastal 5 0
## 58 78059 Coastal 5 0
## 59 78060 Coastal 5 0
## 60 78061 Coastal 5 0
## 61 78062 Coastal 5 0
## 62 78063 Coastal 5 0
## 63 78064 Coastal 5 0
## 64 78065 Coastal 5 0
## 65 78066 Coastal 5 0
## 66 78067 Coastal 5 0
## 67 78068 Coastal 5 0
## 68 78069 Coastal 5 0
## 69 18650 Offshore 5 0
## 70 18651 Offshore 5 0
## 71 18652 Offshore 5 0
## 72 18653 Offshore 5 0
## 73 18654 Offshore 5 0
## 74 18655 Offshore 5 0
## 75 23792 Offshore 5 0
## 76 23793 Offshore 5 0
## 77 23794 Offshore 5 0
## 78 23801 Offshore 5 0
## 79 25182 Offshore 5 0
## 80 25184 Offshore 5 0
## 81 25185 Offshore 5 0
## 82 25186 Offshore 5 0
## 83 25469 Offshore 5 0
## 84 25470 Offshore 5 0
## 85 25471 Offshore 5 0
## 86 26304 Offshore 5 0
## 87 26305 Offshore 5 0
## 88 26310 Offshore 5 0
## 89 26316 Offshore 5 0
## 90 26317 Offshore 5 0
## 91 26318 Offshore 5 0
## 92 26320 Offshore 5 0
## 93 31888 Offshore 5 0
## 94 41757 Offshore 5 0
## 95 41758 Offshore 5 0
## 96 41759 Offshore 5 0
## 97 4495 Offshore 3 2
## 98 4496 Offshore 3 2
## 99 4498 Offshore 5 0
## 100 50742 Offshore 5 0
## 101 50743 Offshore 5 0
## 102 50744 Offshore 5 0
## 103 50745 Offshore 5 0
## 104 50746 Offshore 5 0
## 105 51382 Offshore 5 0
## 106 51383 Offshore 5 0
## 107 51384 Offshore 5 0
## 108 5814 Offshore 5 0
## 109 5815 Offshore 5 0
## 110 5816 Offshore 5 0
## 111 5817 Offshore 5 0
## 112 5818 Offshore 5 0
## 113 6151 Offshore 5 0
## 114 6153 Offshore 5 0
## 115 6290 Offshore 5 0
## 116 74959 Offshore 5 0
## 117 74960 Offshore 5 0
## 118 74961 Offshore 5 0
## 119 74962 Offshore 5 0
## 120 74963 Offshore 5 0
## 121 74964 Offshore 5 0
## 122 74965 Offshore 5 0
## 123 74966 Offshore 5 0
## 124 78530 Offshore 5 0
## 125 78531 Offshore 5 0
## 126 78532 Offshore 5 0
## pct.loci.missing.genotypes pct.loci.homozygous
## 1 0.0 0.4
## 2 0.0 0.6
## 3 0.0 0.8
## 4 0.0 0.6
## 5 0.2 0.0
## 6 0.0 0.4
## 7 0.0 0.6
## 8 0.0 0.0
## 9 0.0 0.0
## 10 0.2 0.2
## 11 0.0 0.0
## 12 0.0 0.2
## 13 0.0 0.8
## 14 0.0 0.6
## 15 0.2 0.4
## 16 0.2 0.0
## 17 0.0 0.6
## 18 0.0 0.2
## 19 0.0 0.2
## 20 0.0 0.2
## 21 0.0 0.4
## 22 0.0 0.0
## 23 0.0 0.6
## 24 0.0 0.6
## 25 0.0 0.2
## 26 0.0 0.2
## 27 0.0 0.6
## 28 0.2 0.2
## 29 0.0 0.4
## 30 0.2 0.6
## 31 0.0 0.4
## 32 0.0 0.0
## 33 0.0 0.0
## 34 0.0 0.2
## 35 0.0 1.0
## 36 0.0 0.4
## 37 0.0 0.2
## 38 0.0 0.8
## 39 0.0 0.0
## 40 0.0 0.4
## 41 0.0 0.0
## 42 0.0 0.2
## 43 0.0 0.6
## 44 0.0 0.2
## 45 0.0 0.4
## 46 0.0 0.4
## 47 0.0 0.0
## 48 0.0 0.2
## 49 0.0 0.4
## 50 0.0 0.4
## 51 0.0 0.6
## 52 0.0 0.4
## 53 0.0 0.2
## 54 0.0 0.2
## 55 0.0 0.8
## 56 0.0 0.4
## 57 0.0 0.2
## 58 0.0 0.2
## 59 0.0 0.8
## 60 0.0 0.6
## 61 0.0 0.4
## 62 0.0 0.4
## 63 0.0 0.2
## 64 0.0 0.6
## 65 0.0 0.4
## 66 0.0 0.6
## 67 0.0 0.4
## 68 0.0 0.4
## 69 0.0 0.0
## 70 0.0 0.2
## 71 0.0 0.0
## 72 0.0 0.0
## 73 0.0 0.2
## 74 0.0 0.0
## 75 0.0 0.0
## 76 0.0 0.2
## 77 0.0 0.2
## 78 0.0 0.0
## 79 0.0 0.0
## 80 0.0 0.4
## 81 0.0 0.0
## 82 0.0 0.4
## 83 0.0 0.0
## 84 0.0 0.4
## 85 0.0 0.0
## 86 0.0 0.2
## 87 0.0 0.6
## 88 0.0 0.2
## 89 0.0 0.2
## 90 0.0 0.2
## 91 0.0 0.2
## 92 0.0 0.0
## 93 0.0 0.0
## 94 0.0 0.4
## 95 0.0 0.2
## 96 0.0 0.2
## 97 0.4 0.4
## 98 0.4 0.2
## 99 0.0 0.6
## 100 0.0 0.6
## 101 0.0 0.2
## 102 0.0 0.0
## 103 0.0 0.2
## 104 0.0 0.2
## 105 0.0 0.6
## 106 0.0 0.0
## 107 0.0 0.6
## 108 0.0 0.2
## 109 0.0 0.6
## 110 0.0 0.0
## 111 0.0 0.2
## 112 0.0 0.0
## 113 0.0 0.0
## 114 0.0 0.0
## 115 0.0 0.0
## 116 0.0 0.0
## 117 0.0 0.4
## 118 0.0 0.0
## 119 0.0 0.0
## 120 0.0 0.4
## 121 0.0 0.0
## 122 0.0 0.0
## 123 0.0 0.2
## 124 0.0 0.2
## 125 0.0 0.4
## 126 0.0 0.0
You can specify the stratification scheme when creating a gtypes object, as in the examples above. Once the object exists, you can change its stratification either by supplying a new vector for the @strata slot:
# randomly stratify individuals to two populations
msats <- msats.g
new.strata <- sample(c("Pop1", "Pop2"), getNumInd(msats), replace = TRUE)
names(new.strata) <- getIndNames(msats)
setStrata(msats) <- new.strata
msats
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 2 strata
## Stratification schemes: broad, fine
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Pop1 65 1 12.2
## 2 Pop2 61 1 11.6
…or, if a data.frame of stratification schemes is stored in the @schemes slot, by using the stratify() function to choose one of them:
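A sketch of that call on msats.g, whose schemes include "broad" and "fine". The name msats.broad is illustrative.

```r
library(strataG)
data(msats.g)

# Switch to the "broad" scheme stored with the object
msats.broad <- stratify(msats.g, "broad")
msats.broad
```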
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 2 strata
## Stratification schemes: broad, fine
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 68 1.2 4.8
## 2 Offshore 58 0.8 13.6
You can update the @schemes slot with a data.frame like this:
new.schemes <- getSchemes(msats)
new.schemes$ran.pop <- sample(c("Pop5", "Pop6"), getNumInd(msats), replace = TRUE)
setSchemes(msats) <- new.schemes
NOTE: Filling or changing the @schemes slot does not affect the current stratification of the samples. You must then select a new stratification scheme or fill the @strata slot as above.
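Put together, adding a scheme and then selecting it might look like the sketch below. The assignments are random, so stratum sizes change from run to run, and the name msats.ran is illustrative.

```r
library(strataG)
data(msats.g)

# Add a random two-population scheme to a copy of the example object
msats.ran <- msats.g
new.schemes <- getSchemes(msats.ran)
new.schemes$ran.pop <- sample(c("Pop5", "Pop6"), getNumInd(msats.ran), replace = TRUE)
setSchemes(msats.ran) <- new.schemes

# The new scheme takes effect only once it is selected
msats.ran <- stratify(msats.ran, "ran.pop")
msats.ran
```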
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 2 strata
## Stratification schemes: broad, fine, ran.pop
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Pop5 59 0.8 11.6
## 2 Pop6 67 1.2 12.2
If some samples should be unstratified (excluded from any stratified analyses), they should have NAs in the appropriate position in the @strata slot. For example:
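# unstratify a random 10 samples
x <- getStrata(msats)
x[sample(getIndNames(msats), 10)] <- NA
# x is a modified copy: apply it with `setStrata(msats) <- x`.
# Until that is done, msats keeps its current stratification, as printed here.
msats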
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 2 strata
## Stratification schemes: broad, fine, ran.pop
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 68 1.2 4.8
## 2 Offshore 58 0.8 13.6
You can also randomly permute the current stratification scheme using the permuteStrata() function like this:
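A sketch of the comparison, first selecting the "fine" scheme and then permuting it. The names msats.fine.g and ran.msats are illustrative.

```r
library(strataG)
data(msats.g)

# Original fine-scale stratification
msats.fine.g <- stratify(msats.g, "fine")
msats.fine.g

# Same stratum sizes, with individuals shuffled among strata
ran.msats <- permuteStrata(msats.fine.g)
ran.msats
```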
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 3 strata
## Stratification schemes: broad, fine, ran.pop
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 68 1.2 4.8
## 2 Offshore.North 40 0.8 12.6
## 3 Offshore.South 18 0.0 11.0
##
## <<< dolphin msats >>>
##
## Contents: 126 samples, 5 loci, 3 strata
## Stratification schemes: broad, fine, ran.pop
##
## Strata summary:
## stratum num.ind num.missing num.alleles
## 1 Coastal 68 1.4 12.0
## 2 Offshore.North 40 0.4 11.0
## 3 Offshore.South 18 0.2 7.2
NOTE: permuteStrata() permutes only the samples that are assigned to strata. Unassigned samples (NAs) remain unassigned.
The allelic data in a gtypes object can be converted back to a matrix or data frame with as.matrix() and as.data.frame():
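For example, using msats.g and showing only the first rows. The names gen.mat and gen.df are illustrative.

```r
library(strataG)
data(msats.g)

# Character matrix with one column per allele
gen.mat <- as.matrix(msats.g)
head(gen.mat)

# The same information as a data.frame
gen.df <- as.data.frame(msats.g)
head(gen.df)
```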
## id stratum D11t.1 D11t.2 EV37.1 EV37.2 EV94.1 EV94.2 Ttr11.1
## [1,] "18650" "Offshore.South" "131" "133" "212" "222" "263" "265" "197"
## [2,] "18651" "Offshore.South" "137" "143" "200" "228" "249" "251" "197"
## [3,] "18652" "Offshore.South" "131" "133" "212" "218" "249" "251" "211"
## [4,] "18653" "Offshore.South" "133" "139" "202" "220" "229" "245" "197"
## [5,] "18654" "Offshore.South" "131" "135" "214" "216" "249" "249" "197"
## [6,] "18655" "Offshore.South" "131" "137" "222" "254" "243" "255" "207"
## Ttr11.2 Ttr34.1 Ttr34.2
## [1,] "209" "185" "187"
## [2,] "197" "185" "187"
## [3,] "213" "185" "191"
## [4,] "215" "189" "195"
## [5,] "211" "185" "187"
## [6,] "211" "183" "193"
By default, these functions split each allele into its own column. To get one column per locus, with the alleles separated by a specified character, set the one.col argument to TRUE:
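A sketch of the one-column-per-locus form, using the default separator. The name gen.mat1 is illustrative.

```r
library(strataG)
data(msats.g)

# One column per locus, with both alleles joined into a single string
gen.mat1 <- as.matrix(msats.g, one.col = TRUE)
head(gen.mat1)
```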
## id stratum D11t EV37 EV94 Ttr11 Ttr34
## [1,] "18650" "Offshore.South" "131/133" "212/222" "263/265" "197/209" "185/187"
## [2,] "18651" "Offshore.South" "137/143" "200/228" "249/251" "197/197" "185/187"
## [3,] "18652" "Offshore.South" "131/133" "212/218" "249/251" "211/213" "185/191"
## [4,] "18653" "Offshore.South" "133/139" "202/220" "229/245" "197/215" "189/195"
## [5,] "18654" "Offshore.South" "131/135" "214/216" "249/249" "197/211" "185/187"
## [6,] "18655" "Offshore.South" "131/137" "222/254" "243/255" "207/211" "183/193"
The contents of a gtypes object can be written to disk with the writeGtypes() function. This writes a .csv file with the allelic information and, if the object contains sequence data, a .fasta file as well.
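A cautious sketch that writes msats.g into a scratch directory. It relies only on the default arguments of writeGtypes(), so the working directory is switched temporarily rather than passing an output path. The directory name is an assumption made for the example.

```r
library(strataG)
data(msats.g)

# Create a scratch directory and write the files there
out.dir <- file.path(tempdir(), "gtypes-out")
dir.create(out.dir, showWarnings = FALSE)
old.wd <- setwd(out.dir)
writeGtypes(msats.g)
setwd(old.wd)

# Inspect what was written
list.files(out.dir)
```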