pSegment {ADaCGH}R Documentation

Parallelized versions of several aCGH segementation algorithms/methods

Description

These functions parallelize several segmentation algorithms. is a parallelized version (using papply and Rmpi) of segment.

Usage


pSegmentDNAcopy(x, chrom.numeric, parall = "arr", ...)

pSegmentACE(x, chrom.numeric, parall = "auto", ...)

pSegmentHMM(x, chrom.numeric, parall = "auto",...)

pSegmentGLAD(x, chrom.numeric, ...)

pSegmentBioHMM(x, chrom.numeric, Pos, parall = "auto", ...)

pSegmentCGHseg(x, chrom.numeric, CGHseg.thres, ...)

pSegmentPSW(x, chrom.numeric, common.data,
            sign = -1,
            nIter = 1000, prec = 100,  p.crit = 0.10,
            name = NULL, ...)

pSegmentWavelets(x, chrom.numeric, mergeSegs = TRUE,
                minDiff = 0.25,
                minMergeDiff = 0.05,
                thrLvl = 3, initClusterLevels = 10, ...)


Arguments

x The aCGH data; genes in rows, subjects or arrays in columns. Data should be ordered by chromosome and position within chromosome.
chrom.numeric The (numeric) vector with the chromosome indicator
Pos The location (e.g., position in kbases) of each probe in the chromosome. A vector of the same length as the number of rows in x
parall Where available whether to parallelize over arrays (all except ACE) or chromosomes (ACE) or array by chromosome (ACE, DNAcopy, HMM, BioHMM.
CGHseg.thres The threshold for the adaptive penalization in Picard et al.'s CGHseg. See p. 13 of the original paper. Must be a negative number.
mergeSegs Merge (genome-wide) the resulting segments using mergeLevels?
common.data A data frame with columns that contain the common information about the clones/genes. For now, this is limited to five columns of data. We do nothing with this, except it is part of the output. It is common practice that this data frame contains columns ID, Chrosome (need not be numeric, can be a factor), Start (the initial position of a clone or gene), End (the final position of a clone or gene) and MidPoint (often, the mid point or whatever you regard as appropriate representative of the "center" of the clone). But you can use something else.
sign See sw.threshold
nIter See sw.perm.test
prec See sw.rob
p.crit The largest p-value for which we want a region to be shown, in red, in the plot.
name Sample name or ID
minDiff Minimum (absolute) difference between the medians of two adjacent clusters for them to be considered truly different. Clusters "closer" together than this are collapsed together to form a single cluster.
minMergeDiff Used only when doing merging in the wavelet method. The finall call as to which segments go together is done by a mergeLevels approach, but an initial collapsing of very close values is performed (otherwise, we could end up passing to mergeLevels as many initial levels as there are points).
thrLvl The level used for the wavelet thresholding.
initClusterLevels The initial number of clusters to form.
... Additional arguments; not used.

Details

In most cases, these are wrappers to the original code, with modifications for parallelization. For BioHMM and HMM the results are merged, by the mergeLevels algorithm, as recommended in Willenbrock and Fridlyand, 2005. Merging is also done in GLAD (with GLAD's own merging algorithm). Merging is the default for DNAcopy (but this can be set by the user). Merging can also be used with the wavelet-based method; please note that the later is an experimental feature implemented by us, and there is no evidence of its performance.

For PSW, the method implemented here is a small departure from the one described originally in Price et al.: here we use all the data for a subject/array for the thresholding, but subsequent analyses (finding islands and robustness analysis and permutation tests) are carried out chromosome by chromosome. This should allow to detect cases where a complete chromosome is gained or lost, as well as cases where smaller regions are gained or lost. Note that this approach is probably not advisable for the X chromosome, at least for males. You probably want to analyze the X separately, since you only expect one copy of the genomic DNA in males. Finally, the function for calculating the threshold (the default used by T. Price originally, which we also use —median + 0.2 * MAD—) was validated by Price and collaborators on a 75-probe dataset that spanned the telomeric 2MB of chromosome 16p, using subjects with well-characterized deletions in the region and healthy controls. But it has not been validated with whole-genome chips or other DNA sources. (If you want to play around with the code, the sources are available from http://www.well.ox.ac.uk/~tprice/cgh/ and as the R package cgh). (I thank Tom Price for his discussion on these issues; some sentences here are copied almost verbatim from his emails.)

Beware that, with DNAcopy, parallelization over arrays by chromosome can lead to slightly different results than parallelization over arrays (and, thus, to the original function). This is because of the way the smoothing is carried out, which affects the timmed.SD used.

Value

All the functions return an object of class "adacgh.generic.out". In addition, depending on the function, the returned object can also be of class "CGH.wave.merged","CGH.wave", "mergedHMM", "adacghGLAD", "mergedBioHMM", "CGHseg", "CGH.PSW".
For methods DNAcopy, HMM, BioHMM, GLAD, CGHseg, and wavelet smoothing the output object is a list with two components:

segm The output from the segmentation
chrom.numeric The same value as chrom.numeric in the function call.


The segm component is a list, with as many components as subjects or arrays. Each of this lists is a matrix with three columns. The first are the observed data, the second the predicted/smoothed/merged data, and the third the state/alteration. When the method performs a merging step, we call the third column "Alteration", and it can only take three values, -1 (loss of genomic DNA), 0, +1 (gain of genomic DNA).
Method PSW's output contains two components. The second, plotData with data that is used for plotting (and the details are not docummented, since you are not supposed to use them, as this is subject to change). The first component, Data, is a list with as many components as arrays or subjects. Each of this is an array with columns with original data, the sign, the value from the robustness test, and the p-value from the permutation test. See details in sw.threshold, sw.perm.test, sw.rob.
Method ACE's output is a list. None of the elements of this list are to be used directly by a user, and are later used by the function summary.ACE.

Author(s)

The code for DNAcopy, HMM, BioHMM, PSW, and GLAD are basically wrappers around the original functions by their corresponding authors, with some modiffications for parallelization (packages DNAcopy, aCGH, snapCGH, cgh, GLAD, respectively). The CGHseg method uses package tilingArray.

For the wavelet-based method we have only wrapped the code that was kindly provided by L. Hsu and D. Grove, and parallelized a few calls. Their original code is included in the sources of the package.

The code for ACE is based on the original Java code for the CGHExplorer package. It was turned into C and R code by Oscar M. Rueda omrueda@cnio.es.

Parallelization and other modifications and additions are by Ramon Diaz-Uriarte rdiaz02@gmail.com

References

Fridlyand, Jane and Snijders, Antoine M. and Pinkel, Dan and Albertson, Donna G. (2004). Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis, 90: 132–153.

Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6:211-26.

Hupe, P. and Stransky, N. and Thiery, J. P. and Radvanyi, F. and Barillot, E. (2004). Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20: 3413–3422.

Lingjaerde OC, Baumbusch LO, Liestol K, Glad I, Borresen-Dale AL. (2005). CGH-Explorer: a program for analysis of CGH-data. Bioinformatics, 21: 821–822.

Marioni, J. C. and Thorne, N. P. and Tavare, S. (2006). BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics, 22: 1144–1146.

Olshen, A. B. and Venkatraman, E. S. and Lucito, R. and Wigler, M. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 4, 557–572. http://www.mskcc.org/biostat/~olshena/research.

Picard, F. and Robin, S. and Lavielle, M. and Vaisse, C. and Daudin, J. J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6, 27. http://dx.doi.org/10.1186/1471-2105-6-27.

Price TS, Regan R, Mott R, Hedman A, Honey B, Daniels RJ, Smith L, Greenfield A, Tiganescu A, Buckle V, Ventress N, Ayyub H, Salhan A, Pedraza-Diaz S, Broxholme J, Ragoussis J, Higgs DR, Flint J, Knight SJ. (2005) SW-ARRAY: a dynamic programming solution for the identification of copy-number changes in genomic DNA using array comparative genome hybridization data. Nucleic Acids Res., 33:3455-64.

Willenbrock, H. and Fridlyand, J. (2005). A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 21, 4084–4091.

Diaz-Uriarte, R. and Rueda, O.M. (2006). ADaCGH: an R package and web-based application for the analysis of aCGH data. Tech. report. http://www.ligarto.org/rdiaz/Papers/adacgh.pdf, http://adacgh.bioinfo.cnio.es.

See Also

segmentPlot

Examples

## Not run: 
setwd("/tmp/o3") ## all slaves need a common dir to read and write.
mpiInit()
## End(Not run)

## library(Rmpi)
## try(mpi.spawn.Rslaves(nslaves = 1), silent = TRUE)

data(cghE1)
tmpchr <- sub("chr", "", cghE1$Chromosome)
chrom.numeric <- as.numeric(as.character(tmpchr))
chrom.numeric[tmpchr == "X"] <- 23
chrom.numeric[tmpchr == "Y"] <- 24
rm(tmpchr)
### we need the data ordered
reorder <- order(chrom.numeric,
                 cghE1$UG.Start,
                 cghE1$UG.End,
                 cghE1$Name)
cghE1 <- cghE1[reorder, ]
chrom.numeric <- chrom.numeric[reorder]

## run all methods
hmm.out <- pSegmentHMM(cghE1[, 5:7], chrom.numeric)
glad.out <- pSegmentGLAD(cghE1[, 5:7], chrom.numeric)
cghseg.out <- pSegmentCGHseg(cghE1[, 5:7], chrom.numeric, CGHseg.thres = -0.05)
ace.out <- pSegmentACE(cghE1[, 5:7], chrom.numeric)
wave.out <- pSegmentWavelets(cghE1[, 5:7], chrom.numeric)
wave.nm.out <- pSegmentWavelets(cghE1[, 5:7], chrom.numeric, mergeSegs = FALSE)
cbs.out <- pSegmentDNAcopy(cghE1[, 5:7], chrom.numeric)
cbs.ns.out <- pSegmentDNAcopy(cghE1[, 5:7], chrom.numeric, smooth = FALSE)
## Next onew we try, as it could fail when it should not.
try(psw.pos.out <- pSegmentPSW(cghE1[, 5:7], chrom.numeric, sign = 1))
try(psw.neg.out <- pSegmentPSW(cghE1[, 5:7], chrom.numeric, sign = -1))

## BioHMM is the only one that uses distances
## it is the slowest, so do only two
biohmm.out <- pSegmentBioHMM(cghE1[, 5:6], chrom.numeric, cghE1$UG.Start)

## With ACE we need to choose fdr
ace.out.sum <- summary(ace.out)

#### Writing
common <- cghE1[, 1:4]
writeResults(hmm.out, cghE1[, 5:7], common)
writeResults(glad.out, cghE1[, 5:7], common)        
writeResults(cghseg.out, cghE1[, 5:7], common)   
writeResults(ace.out.sum, cghE1[, 5:7], common)
writeResults(wave.out, cghE1[, 5:7], common)
writeResults(wave.nm.out, cghE1[, 5:7], common)
writeResults(cbs.out, cghE1[, 5:7], common)
try(writeResults(psw.pos.out, common))
try(writeResults(psw.neg.out, common))
writeResults(biohmm.out, cghE1[, 5:6], common)



### plot all

## Not run: 

## This will not work unless you have a working Python installation

segmentPlot(hmm.out, 
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(glad.out,
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(cghseg.out,
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(wave.out, 
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(wave.nm.out,
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(cbs.out, 
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(psw.pos.out,
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(psw.neg.out,
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

segmentPlot(biohmm.out, 
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:6],
            idtype = "ug",
            organism = "Hs")
segmentPlot(ace.out.sum, 
            geneNames = cghE1[, 1],
            chrom.numeric = chrom.numeric,
            cghdata = cghE1[, 5:7],
            idtype = "ug",
            organism = "Hs")

## End(Not run)

## To verify we are doing the right thing with Picard et al.'s method,
## compare with their figure 1. Need to get the txt files with the
## original data.
## We got the data from the GLAD BioC package, and then exported the txt.
## For convenience, it is in the "/inst/txt-data" directory of our package.

## This verifies the output (look at ADaCGH.R for further details and other
## comparisons):

## coriel.data <- read.table("gm03563.txt", header = TRUE)
## cd3 <- coriel.data[coriel.data$Chromosome == 3, 3]
    
## out.our <- CGHsegWrapper(cd3, rep(1, length(cd3)), s = -0.5)  ## k = 2
## out.our

## so our implementation seems correct. Look at where the
## breakpoint is located, etc, and it is like figure 1 of
## Picard's paper.




[Package ADaCGH version 1.3-10 Index]