msc.features.select {caMassClass}    R Documentation

Reduce Number of Features Prior to Classification

Description

Select a subset of individual features that are potentially most useful for classification.

Usage

msc.features.select( x, y, RemCorrCol=0.98, KeepCol=0.6)

Arguments

x A matrix or data frame with training data. Rows contain samples and columns contain features/variables.
y Class labels for the training data samples. A response vector with one label for each row/component of x. Can be a factor, a string vector, or a numeric vector.
RemCorrCol If non-zero, then some of the highly correlated columns are removed using the msc.features.remove function with ccMin=RemCorrCol.
KeepCol If non-zero, then columns with low AUC are removed:
  • if KeepCol is smaller than 0.5 - do nothing
  • if KeepCol is in the range [0.5, 1] - keep columns with AUC bigger than KeepCol
  • if KeepCol is bigger than one - keep the top KeepCol columns
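
The three KeepCol cases can be illustrated with a small base R sketch. The helper keep_by_auc below is hypothetical and is not the package's implementation; it only mirrors the rules listed above, and returning the selected indices in original column order is an assumption.

```r
# Hypothetical helper (not part of caMassClass): apply the KeepCol rules
# to a vector of per-feature AUC values and return the indices to keep.
keep_by_auc <- function(auc, KeepCol) {
  if (KeepCol < 0.5) {
    # smaller than 0.5: no AUC-based filtering, keep every column
    return(seq_along(auc))
  }
  if (KeepCol <= 1) {
    # in [0.5, 1]: treat KeepCol as an AUC threshold
    return(which(auc > KeepCol))
  }
  # bigger than one: keep the top KeepCol columns ranked by AUC
  # (assumption: indices are returned in original column order)
  n <- min(length(auc), as.integer(KeepCol))
  sort(order(auc, decreasing = TRUE)[seq_len(n)])
}

auc <- c(0.55, 0.91, 0.72, 0.85, 0.60)
keep_by_auc(auc, 0.3)   # 1 2 3 4 5
keep_by_auc(auc, 0.8)   # 2 4
keep_by_auc(auc, 3)     # 2 3 4
```

Note that the same argument switches meaning at 1: a value such as 0.8 is a threshold, while a value such as 400 is a count.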

Details

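This function reduces the number of features in the data prior to classification. As the arguments above describe, it can remove some of the highly correlated columns (RemCorrCol) and remove columns with low AUC (KeepCol).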

The function finds a subset of individual features that are potentially most useful for classification; each feature is rated individually. However, a set of two or more features that are individually very poor can often produce a superior classifier, so this function should be used with care. I found it very useful when classifying raw protein mass spectra (SELDI) data, where it reduced the dimensionality of the data from tens of thousands to hundreds of features prior to classification, as an alternative to peak finding (see msc.peaks.find).
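
The caveat about individually poor features can be seen on a toy example. The sketch below uses only base R; auc1 is a hypothetical helper that computes a rank-based (Mann-Whitney) AUC for one feature and folds it into [0.5, 1], and the XOR-like data are made up for illustration.

```r
# Hypothetical helper: rank-based AUC of one numeric feature for a
# two-class label (Mann-Whitney identity), folded into [0.5, 1].
auc1 <- function(v, y) {
  y   <- as.factor(y)
  in2 <- y == levels(y)[2]
  n2  <- sum(in2)
  n1  <- length(y) - n2
  u   <- sum(rank(v)[in2]) - n2 * (n2 + 1) / 2
  a   <- u / (n1 * n2)
  max(a, 1 - a)
}

# XOR-like toy data: the class depends only on whether a and b share a sign.
g <- expand.grid(a = c(-2, -1, 1, 2), b = c(-2, -1, 1, 2))
y <- factor(ifelse(g$a * g$b > 0, "pos", "neg"))

auc1(g$a, y)         # 0.5: feature a alone is uninformative
auc1(g$b, y)         # 0.5: feature b alone is uninformative
auc1(g$a * g$b, y)   # 1: the two features together separate the classes
```

A per-feature AUC filter would rate both a and b as useless here, even though a classifier that sees both can separate the classes perfectly.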

Value

Vector of column indexes to be kept.
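
Because the result is a vector of column indexes, the same selection can be applied to any matrix that shares the training column layout. A minimal sketch with made-up data follows; train, test, and the literal cidx are stand-ins, and in practice cidx would come from msc.features.select.

```r
# Toy stand-ins for training and test matrices with identical columns
train <- matrix(1:20,  nrow = 4, dimnames = list(NULL, paste0("f", 1:5)))
test  <- matrix(21:30, nrow = 2, dimnames = list(NULL, paste0("f", 1:5)))
cidx  <- c(2, 5)   # stand-in for the returned column indexes

# drop = FALSE keeps the result a matrix even if only one column is selected
train_sel <- train[, cidx, drop = FALSE]
test_sel  <- test[,  cidx, drop = FALSE]

dim(train_sel)       # 4 2
colnames(test_sel)   # "f2" "f5"
```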

Author(s)

Jarek Tuszynski (SAIC) jaroslaw.w.tuszynski@saic.com

See Also

Examples

  # load "Data_IMAC.Rdata" file containing raw MS spectra 'X'  
  if (!file.exists("Data_IMAC.Rdata")) example("msc.project.read")
  load("Data_IMAC.Rdata")
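  # take the first slice of the 3D array and transpose so that rows are samples
  X = t(X[,,1])
  
  # KeepCol in [0.5, 1]: keep columns with AUC bigger than 0.8
  cidx = msc.features.select(X, SampleLabels, KeepCol=0.8)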
  auc  = colAUC(X[,cidx], SampleLabels)
  cat(length(cidx),"features were selected out of",ncol(X),
      "; min(auc)=",min(auc),"; mean(auc)=",mean(auc),"\n")
  stopifnot( length(cidx)==612, min(auc)>0.8 )
  
  # KeepCol bigger than one: keep the top 400 columns
  cidx = msc.features.select(X, SampleLabels, KeepCol=400)
  auc  = colAUC(X[,cidx], SampleLabels)
  cat(length(cidx),"features were selected out of",ncol(X),
      "; min(auc)=",min(auc),"; mean(auc)=",mean(auc),"\n")
  stopifnot( length(cidx)==400, min(auc)>0.8 )

[Package caMassClass version 1.6 Index]