simulator {boost}R Documentation

simulator

Description

Simulation of (microarray) data according to correlation and mean structures from real datasets.

Usage

simulator(x, y, respmod = c("none", "resp1", "resp2", "resp3"),
nos = 1200, gene = NULL, signs = NULL)

Arguments

x A (n x p)-matrix, whose correlation and mean structure is to be used for simulating data. Its rows correspond to training instances and columns contain the predictor variables.
y A vector of length n containing the class labels, which need to be coded by 0 and 1.
respmod A character string. Either "none" where the simulated gene expression labels are determined model-free depending which class mean and correlation structure had been used for their determination. The choice of "resp1", "resp2" and "resp3" means that a response model is applied. For "resp1", 10 genes are selected and determine conditional proabilities via a logistic model with equal weights. The class labels are then regarded as having a Bernoulli distribution with probability p. For "resp2", 25 genes are plugged into the logistic model with non-equal weights. With "resp3", 25 genes are chosen for a logistic model with second and third order interactions.
nos An integer, giving the number of instances which are simulated.
gene A vector giving the index of the genes which shall be used for model based class label simulation. Defaults to NULL. This argument should only be used for specially designed simulation studies, where it is important that the same predictor variables are repeatedly used for simulating class label.
signs A vector containing entries of +1 and -1. Defaults to NULL and is only of importance in specially designed simulation studies, where it is important that the same predictor variables are repeatedly used for simulating class label.

Details

The new instances are simulated according to a multivariate normal distribution with means and correlation structure taken from a real (gene expression) dataset. This structure is obtained by transforming a standard multivariate normal distribution, which requires a eigenvalue decomposition of the provided real dataset. For datasets with many predictors (>500), this can be fairly time consuming. Simulating data without applying a response model is fine for most purposes, only special analysis tasks usually require it.

Value

Returns a list containing

x An (nos x p)-matrix, containing the simulated data
y A vector of length nos, containing the class labels of the simulated data.
probab A vector of length nos, containing the conditional probabilities of the simulated data. Is empty if respmod="none".
bayes An integer, giving the Bayes error (theoretically minimal misclassification risk) for the simulated data. Is empty if respmod="none".
gene A vector, containing the indices of the variables which had been used in the logistic model for either "resp1", "resp2" or "resp3". Is empty if respmod="none".
signs A vector, containing -1 and +1. Indicates with what polarization a predictor variable had been used in the logistic model. Is empty if respmod="none".

b

References

o
"BagBoosting for Tumor Classification with Gene Expression Data", Marcel Dettling. To appear in Bioinformatics (2005).
o
Further information is available from the webpage http://stat.ethz.ch/~dettling

Examples

set.seed(21)
data(leukemia)

## Simulation of gene expression data
simu <- simulator(leukemia.x, leukemia.y, nos=200)

## Defining training and test data
xlearn <- simu$x[1:150,]
ylearn <- simu$y[1:150]
xtest  <- simu$x[151:200,]
ytest  <- simu$y[151:200]

## Classification with logitboost
fit <- logitboost(xlearn, ylearn, xtest, mfinal=20, presel=50)
summarize(fit, ytest)

[Package boost version 1.0-0 Index]