sim.data {penalizedSVM}        R Documentation

Simulation of microarray data

Description

Simulation of 'n' samples. Each sample has 'ng' genes, of which only 'nsg' are called significant and influence the class labels. The remaining '(ng - nsg)' genes are called balanced. All gene ratios are drawn from a multivariate normal distribution. Optionally, blocks of highly correlated genes can be created.

Usage

sim.data(n = 256, ng = 1000, nsg = 100,
                 p.n.ratio = 0.5, 
                 sg.pos.factor= 1, sg.neg.factor= -1,
                 # correlation info:
                 corr = FALSE, corr.factor = 0.8,
                 # block info:
                 blocks = FALSE, n.blocks = 6, nsg.block = 1, ng.block = 5, 
                 seed = 12345, ...)

Arguments

n number of samples; logistic regression works well if n > 200
ng number of genes
nsg number of significant genes
p.n.ratio proportion of positive genes among the significant genes (default 0.5)
sg.pos.factor impact factor of _positive_ significant genes on the classification (default 1)
sg.neg.factor impact factor of _negative_ significant genes on the classification (default -1)
corr are the genes correlated with each other? (default FALSE). See Details
corr.factor correlation factor for genes, between 0 and 1 (default 0.8)
blocks are blocks of highly correlated genes allowed? (default FALSE)
n.blocks number of blocks (default 6)
nsg.block number of significant genes per block (default 1)
ng.block number of genes per block (default 5)
seed seed for the random number generator (default 12345)
... additional argument(s)

Details

If no blocks are defined (n.blocks=0 or blocks=FALSE) and corr=TRUE, a covariance matrix is created for all genes, with the correlation decreasing as the distance between gene indices grows: cov(i,j) = cov(j,i) = corr.factor^|i-j|
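
The decaying covariance described above can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper name `decaying_cov` is hypothetical, and the exponent is assumed to be the absolute index distance |i - j|, which is what makes cov(i,j) = cov(j,i) hold.

```r
# Hypothetical helper (not part of penalizedSVM): builds a p x p covariance
# matrix whose entries decay geometrically with the index distance.
decaying_cov <- function(p, corr.factor = 0.8) {
  idx <- seq_len(p)
  # outer() gives all pairwise differences i - j; abs() makes the
  # matrix symmetric, so cov(i, j) == cov(j, i).
  corr.factor^abs(outer(idx, idx, "-"))
}

# Small example with 4 genes and the default corr.factor of 0.8
sigma <- decaying_cov(4, corr.factor = 0.8)
print(round(sigma, 3))
```

Neighbouring genes get covariance corr.factor, genes two positions apart get corr.factor^2 (0.64 for the default), and so on; the diagonal is 1.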

Value

x matrix of simulated data. Genes in rows and samples in columns
y named vector of class labels
seed the seed used for the simulation

Author(s)

Wiebke Werft, Natalia Becker

See Also

mvrnorm

Examples


my.seed<-123

# 1. simulate 20 samples, with 100 genes in each. Only the first three genes have an impact on the class labels.
# All genes are assumed to be i.i.d.
train<-sim.data(n = 20, ng = 100, nsg = 3, corr=FALSE, seed=my.seed )
str(train)

# 2. change the proportion between positive and negative significant genes (from 0.5 to 0.8)
train<-sim.data(n = 20, ng = 100, nsg = 10, p.n.ratio = 0.8,  seed=my.seed )
rownames(train$x)[1:15]
# [1] "pos1" "pos2" "pos3" "pos4" "pos5" "pos6" "pos7" "pos8"
# [9] "neg1" "neg2" "bal1" "bal2" "bal3" "bal4" "bal5"

# 3. assume correlation within the positive significant genes, the negative significant genes and the 'balanced' genes separately.
train<-sim.data(n = 20, ng = 100, nsg = 10, corr=TRUE, seed=my.seed )
cor(t(train$x[1:15,]))

# 4. add 6 blocks of 5 genes each, with only one significant gene per block.
# all genes in a block are correlated with the constant correlation factor corr.factor=0.8
train<-sim.data(n = 20, ng = 100, nsg = 6, corr=TRUE, corr.factor=0.8,
                         blocks=TRUE, n.blocks=6, nsg.block=1, ng.block=5, seed=my.seed )
str(train)
# first block
cor(t(train$x[1:5,]))
# second block
cor(t(train$x[6:10,]))


[Package penalizedSVM version 1.0 Index]