msc.features.select {caMassClass} | R Documentation |
Selects a subset of individual features that are potentially most useful for classification.
msc.features.select( x, y, RemCorrCol=0.98, KeepCol=0.6)
x |
A matrix or data frame with training data. Rows contain samples and columns contain features/variables |
y |
Class labels for the training data samples.
A response vector with one label for each row of x.
Can be a factor, a character vector or a numeric vector. |
RemCorrCol |
If non-zero, then some of the highly correlated columns are
removed using the msc.features.remove function with
ccMin=RemCorrCol. |
KeepCol |
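If non-zero, then columns with low AUC are removed. In the example at the end of this page, a value between 0.5 and 1 (KeepCol=0.8) acts as a minimum AUC, and a value greater than 1 (KeepCol=400) acts as the number of columns to keep.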
|
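This function reduces the number of features in the data prior to classification, using the following steps:
rating each feature individually by its AUC, using colAUC;
removing some of the highly correlated columns, using the msc.features.remove
function (see RemCorrCol); and removing columns with low AUC (see KeepCol).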
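The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: auc_one and select_sketch are hypothetical names, the AUC is computed from the rank-sum statistic for a two-class label, and the correlation step assumes that only neighbouring columns are compared and that the column with the higher AUC is the one kept.

```r
# Hypothetical helper: AUC of one numeric feature for a two-class label,
# computed from the rank-sum (Mann-Whitney) statistic.
auc_one <- function(v, y) {
  y   <- as.factor(y)
  pos <- y == levels(y)[2]
  n1  <- sum(pos)
  n0  <- sum(!pos)
  a   <- (sum(rank(v)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(a, 1 - a)  # ignore direction, so the value lies in [0.5, 1]
}

# Hypothetical stand-in for the selection steps (not the package code).
select_sketch <- function(x, y, cor_max = 0.98, auc_min = 0.6) {
  x    <- as.matrix(x)
  auc  <- apply(x, 2, auc_one, y = y)   # rate each column individually
  keep <- rep(TRUE, ncol(x))
  last <- 1                             # most recent column still kept
  for (j in seq_len(ncol(x))[-1]) {
    r <- suppressWarnings(cor(x[, last], x[, j]))
    if (!is.na(r) && abs(r) > cor_max) {
      # assumed rule: of two highly correlated columns, keep the higher AUC
      if (auc[j] > auc[last]) {
        keep[last] <- FALSE
        last <- j
      } else {
        keep[j] <- FALSE
      }
    } else {
      last <- j
    }
  }
  which(keep & auc >= auc_min)          # drop columns with low AUC
}

# Deterministic toy data: 40 samples, two classes of 20.
n  <- 40
y  <- rep(c("a", "b"), each = n / 2)
f1 <- seq_len(n)                  # separates the classes perfectly
f2 <- f1 + 0.1 * sin(seq_len(n))  # near-duplicate of f1
f3 <- rep(1:4, times = n / 4)     # same distribution in both classes
x  <- cbind(f1, f2, f3)

idx <- select_sketch(x, y)
print(idx)
```

In this toy data f2 is dropped as a near-duplicate of f1 and f3 is dropped for its low AUC, so only column 1 remains.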
This function finds a subset of individual features that are potentially most
useful for classification; each feature is rated individually.
However, a set of two or more features that are individually very poor
can often produce a superior classifier, so this
function should be used with care. I found it very useful when classifying
raw protein mass spectra (SELDI) data, for reducing the dimensionality of the
data from tens of thousands of features to hundreds prior to classification,
instead of peak-finding (see msc.peaks.find).
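The caveat about individually poor features can be illustrated with a small XOR-style example. This is a sketch on made-up data, with a hypothetical auc_one helper standing in for a per-column AUC: each feature alone has an AUC of 0.5, yet the two together determine the class exactly.

```r
# Hypothetical helper: AUC of one numeric feature for a two-class label,
# computed from the rank-sum (Mann-Whitney) statistic.
auc_one <- function(v, y) {
  y   <- as.factor(y)
  pos <- y == levels(y)[2]
  n1  <- sum(pos)
  n0  <- sum(!pos)
  a   <- (sum(rank(v)[pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(a, 1 - a)
}

# Made-up data: two sign features covering all four sign combinations equally.
x1 <- rep(c(-1, 1), times = 20)
x2 <- rep(c(-1, -1, 1, 1), times = 10)
# The class depends only on whether the two signs agree (XOR pattern).
y  <- factor(ifelse(x1 * x2 > 0, "same", "diff"))

auc_x1   <- auc_one(x1, y)       # each feature alone is uninformative
auc_x2   <- auc_one(x2, y)
auc_pair <- auc_one(x1 * x2, y)  # the pair, combined, separates the classes

cat("AUC x1:", auc_x1, " AUC x2:", auc_x2, " AUC x1*x2:", auc_pair, "\n")
```

A selection rule that rates columns one at a time would discard both x1 and x2 here, which is why the result of this function is best checked against the final classifier's performance.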
Vector of column indexes to be kept.
Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com
msc.classifier.test function.
colAUC, msc.features.remove and msc.features.scale functions.
# load "Data_IMAC.Rdata" file containing raw MS spectra 'X'
if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
load("Data_IMAC.Rdata")
X = t(X[,,1])

# KeepCol=0.8
cidx = msc.features.select(X, SampleLabels, KeepCol=0.8)
auc = colAUC(X[,cidx], SampleLabels)
cat(length(cidx),"features were selected out of",ncol(X),
  "; min(auc)=",min(auc),"; mean(auc)=",mean(auc),"\n")
stopifnot( length(cidx)==612, min(auc)>0.8 )

# KeepCol=400
cidx = msc.features.select(X, SampleLabels, KeepCol=400)
auc = colAUC(X[,cidx], SampleLabels)
cat(length(cidx),"features were selected out of",ncol(X),
  "; min(auc)=",min(auc),"; mean(auc)=",mean(auc),"\n")
stopifnot( length(cidx)==400, min(auc)>0.8 )