keremRand {SHARE} | R Documentation |
This dataset contains the pseudo-subjects created from the cystic fibrosis data in Kerem et al. (1989).
data(keremRand)
Here is the list of the 23 loci:
locus_01
locus_02
locus_03
locus_04
locus_05
locus_06
locus_07
locus_08
locus_09
locus_10
locus_11
locus_12
locus_13
locus_14
locus_15
locus_16
locus_17
locus_18
locus_19
locus_20
locus_21
locus_22
locus_23
The SHARE algorithm requires subject-level information, i.e., it needs to know the haplotype/genotype sequences of every subject in both the case and control groups. However, the original data in Kerem et al. (1989) provide only sequence-level information, meaning that we know only which group (case/control) each haplotype sequence belongs to. We therefore need to simulate subject-level information to demonstrate the SHARE algorithm. Two haplotypes with the same clinical status (having cystic fibrosis or not) are randomly paired to form a pseudo-subject with that status.
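The pairing step described above can be sketched on toy data. This is a minimal illustration, not the code used to build keremRand: the matrix haps, the vector isCase, and the helper shuffleRows are hypothetical names, the haplotype values are made up, and each group is assumed to hold an even number of haplotypes.

```r
## Toy data: 8 haplotypes over 3 loci (values are made up for illustration).
set.seed(1)
haps <- matrix(sample(1:2, 8 * 3, replace = TRUE), nrow = 8,
               dimnames = list(NULL, paste0("locus_", 1:3)))
isCase <- rep(c(TRUE, FALSE), each = 4)   # group label of each haplotype

## Shuffle the haplotype indices within each group, cases first.
shuffleRows <- function(idx) idx[sample.int(length(idx))]
ord <- c(shuffleRows(which(isCase)), shuffleRows(which(!isCase)))
shuffled <- haps[ord, , drop = FALSE]

## Consecutive rows (1-2, 3-4, ...) form one pseudo-subject.
subjId <- rep(seq_len(nrow(shuffled) / 2), each = 2)

## One status per pseudo-subject: 1 = case, 0 = control.
subjStatus <- as.integer(isCase[ord][seq(1, length(ord), by = 2)])
```

Because the shuffle happens within each group before pairing, both haplotypes of a pseudo-subject always share the same status.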
Three objects will be attached after loading the dataset keremRand:
The data.frame object keremRandSeq
contains 186 sequences with
23 SNPs. The row names show the subject id and the sequence id within
this subject. The SNPs are coded as 1 referring to the large allele
of the RFLP, and 2 referring to the smaller allele.
The vector object keremRandStatus
provides the CF/control
status of each subject. 1 indicates subjects in case group (i.e.,
CF), and 0 indicates control group. There are 47 subjects in CF group
and 46 in control group.
The data.frame object keremRandAllele
contains allelic data for
23 SNPs, coded as 0, 1, 2 as the number of minor alleles.
How these three objects were created is shown in the example section.
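The relationship between the sequence-level and allele-count objects can be illustrated with a small sketch. It assumes the same row-name convention as keremRandSeq (subject id followed by _seq_1 or _seq_2) and counts copies of allele 2, as the package example does; the matrix seqs and its values are hypothetical.

```r
## Toy sequence data: 2 pseudo-subjects x 2 haplotypes, 3 loci (made-up values).
seqs <- matrix(c(1, 2, 2,
                 2, 2, 1,
                 1, 1, 1,
                 1, 2, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0(rep(c("subj_1", "subj_2"), each = 2),
                                      c("_seq_1", "_seq_2")),
                               paste0("locus_", 1:3)))

## Recover the subject id by dropping the "_seq_<n>" suffix from the row names.
subj <- sub("_seq_[0-9]+$", "", rownames(seqs))

## Count copies of allele 2 per subject and locus (each entry is 0, 1, or 2).
alleleCount <- rowsum((seqs == 2) * 1L, group = subj)
alleleCount
```

Each row of alleleCount corresponds to one pseudo-subject, so it has half as many rows as seqs.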
This dataset was originally released in Kerem et al. (1989) and was converted to R objects by Browning (2006). Browning's dataset can be found in the HapVLMC package (http://www.stat.auckland.ac.nz/~browning/HapVLMC/index.htm).
S. R. Browning. Multilocus association mapping using variable-length Markov chains. American Journal of Human Genetics, 78(6):903-913, Jun 2006.
B. Kerem, J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K. Cox, A. Chakravarti, M. Buchwald, and L. C. Tsui. Identification of the cystic fibrosis gene: genetic analysis. Science (New York, N.Y.), 245(4922):1073-1080, Sep 8 1989.
## Not run:
## How the pseudo-subjects were simulated

## Load the HapVLMC package and its dataset
library(HapVLMC)
data(Kerem)

set.seed(20090313)
randOrder <- runif(nrow(kerem.snps.data))
keremRandSeq <- rbind(
  ## randomly order the TRUE (case) part
  kerem.snps.data[kerem.status, ][order(randOrder[kerem.status]), ],
  ## randomly order the FALSE (control) part
  kerem.snps.data[!kerem.status, ][order(randOrder[!kerem.status]), ]
)

## Zero-padded locus labels: locus_01, ..., locus_23
nLoci <- ncol(keremRandSeq)
lociNum <- unlist(sapply(1:nLoci, function(x) {
  paste(paste(rep("0", ceiling(log10(nLoci)) - nchar(as.character(x))),
              collapse = ""),
        x, sep = "", collapse = "")
}))
colnames(keremRandSeq) <- paste("locus_", lociNum, sep = "")

## Zero-padded subject labels; every two consecutive rows form one subject
nSubj <- nrow(keremRandSeq) / 2
subjNum <- unlist(sapply(1:nSubj, function(x) {
  paste(paste(rep("0", ceiling(log10(nSubj)) - nchar(as.character(x))),
              collapse = ""),
        x, sep = "", collapse = "")
}))
subjLabel <- paste("subj_", subjNum, sep = "")
seqLabel <- paste("seq", 1:2, sep = "_")
rownames(keremRandSeq) <- paste(rep(subjLabel, each = 2), seqLabel, sep = "_")

## Status per subject: 1 = CF, 0 = control
keremRandStatus <- c(rep(1, sum(kerem.status) / 2),
                     rep(0, sum(!kerem.status) / 2))

## Allele counts per subject
keremRandAllele <- NULL
for (i in seq(1, nrow(keremRandSeq), by = 2)) {
  keremRandAllele <- rbind(keremRandAllele,
                           apply(keremRandSeq[c(i, i + 1), ], 2, function(x) {
                             ## count how many small alleles (coded 2)
                             sum(x == 2)
                           }))
}
## "\\1" is the backreference to the subject id captured by the first group
rownames(keremRandAllele) <- unique(gsub("^(subj_.*)_seq_(.*)$", "\\1",
                                         rownames(keremRandSeq)))
## End(Not run)

## load keremRand
data(keremRand)
## check which objects are attached
ls()
## dimensions of the pseudo-subject data
dim(keremRandSeq)
## number of CF (1) and control (0) subjects
table(keremRandStatus)