NMF-utils {NMF}    R Documentation

Class and Utility Methods for NMF objects

Description

Defines generic interface methods for class NMF, which is the virtual base class for the results of any NMF algorithm implemented within the framework of package NMF.

Usage


## S4 method for signature 'NMF':
connectivity(x, ...)

cophcor(object, ...)

dispersion(object, ...)

## S4 method for signature 'NMF, factor':
entropy(x, class, ...)

## S4 method for signature 'NMFfit':
residuals(object, track=FALSE)

rss(object, ...)

## S4 method for signature 'NMF':
rss(object, target)

## S4 method for signature 'NMF':
featureScore(object, method=c('kim', 'max'))

## S4 method for signature 'NMF':
extractFeatures(object, method=c('kim', 'max'), format=c('list', 'combine', 'subset'))

## S4 method for signature 'NMF':
metaHeatmap(object, what=c('samples', 'features'), filter=FALSE, ...)

## S4 method for signature 'NMF':
nmfApply(object, MARGIN, FUN, ...)

## S4 method for signature 'NMF':
predict(object, what = c('samples', 'features'), prob=FALSE)

## S4 method for signature 'NMF, factor':
purity(x, class, ...)

## S4 method for signature 'NMF':
sparseness(x, what = c('features', 'samples'), ...)

syntheticNMF(n, r, p, offset=NULL, noise=FALSE, return.factors=FALSE)

Arguments

class A factor giving a known class membership for each sample.
filter if TRUE, only the features that are basis-specific are used. Those features are those returned by function extractFeatures.
format the output format of the extracted features. Possible values are:

  • list (default) a list with one element per basis vector, each containing the indices of the basis-specific features.
  • combine a single integer vector containing the indices of the basis-specific features for all basis vectors.
  • subset the input object, subset to contain only the basis-specific features.

FUN the function to be applied: see 'Details'. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted. See apply (package base) for more details.
MARGIN a vector giving the subscripts which the function will be applied over. 1 indicates rows, 2 indicates columns, c(1,2) indicates rows and columns. See apply (package base) for more details.
method Method used to compute the feature scores and to select the features.
Possible values are:

  • kim (default) to use the scoring scheme and feature selection method of Kim and Park (2007). The features are first scored using the function featureScore. Then only the features that fulfil both of the following criteria are retained:
    - score greater than \hat{\mu} + 3 \hat{\sigma}, where \hat{\mu} and \hat{\sigma} are the median and the median absolute deviation (MAD) of the scores respectively;
    - the maximum contribution to a basis component is greater than the median of all contributions (i.e. of all elements of W)
    See Kim and Park (2007).
  • max where the score is the maximum contribution of each feature to the basis vectors and the selection method is the one described in Carmona-Saez (2006). Briefly, for each basis vector, the features are first sorted in descending order by their contribution to the basis vector. Then, one selects only the first consecutive features from the sorted list whose highest contribution in the basis matrix is found in the considered basis (see section References).
n Number of rows of the synthetic target matrix.
noise if TRUE, random noise is added to the target matrix.
object A matrix or an object that inherits from class NMF or NMFfit – depending on the method.
offset a vector giving the offset to add to the synthetic target matrix. Its length should be equal to the number of rows n.
prob Should the probability associated with each cluster prediction be computed and returned.
p Number of columns of the synthetic target matrix. Not used if parameter r is a vector (see description of argument r).
r Underlying factorization rank. If a single numeric is given, the classes are randomly generated from a multinomial distribution. If a numerical vector is given, then it should contain the counts in the different classes (i.e. integers). In such a case argument p is not used and the number of columns is forced to be the sum of the counts.
return.factors If TRUE, the underlying matrices W and H are also returned.
target the target object estimated by model object. It can be a matrix or an ExpressionSet.
track if TRUE, the whole residuals track is returned. Otherwise only the last computed residual is returned.
what Specifies on which matrix (basis components or mixture coefficients) the computation should be performed.
x An object that inherits from class NMF.
... Used to pass extra parameters to subsequent calls:
  • in metaHeatmap: Graphical parameters passed to function heatmap.2
  • in nmfApply: optional arguments to function FUN.

Details

connectivity
Computes the connectivity matrix for the samples based on their mixture coefficients.

The connectivity matrix of a clustering is a matrix C containing only 0 or 1 entries such that:

C_{ij} = 1 if sample i belongs to the same cluster as sample j, 0 otherwise
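
The definition above can be sketched in base R as follows. This is only an illustration: connectivity_sketch is a hypothetical helper, not the package's implementation, and it assumes that each sample is assigned to the cluster of its largest mixture coefficient.

```r
# Hypothetical helper (assumption): cluster = row of H with the largest entry.
connectivity_sketch <- function(H) {
  # cluster index of each sample (one sample per column of H)
  cl <- apply(H, 2, which.max)
  # C[i, j] is 1 when samples i and j share a cluster, 0 otherwise
  outer(cl, cl, "==") * 1
}

# toy mixture coefficient matrix: 2 basis components, 3 samples
H <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.3, 0.7), nrow = 2)
C <- connectivity_sketch(H)
C
```

Averaging such matrices over several runs gives the consensus matrix used by cophcor and dispersion.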

cophcor
Computes the cophenetic correlation coefficient of consensus matrix object, generally obtained from multiple NMF runs.

The cophenetic correlation coefficient is based on the consensus matrix (i.e. the average of connectivity matrices) and was proposed by Brunet et al. (2004) to measure the stability of the clusters obtained from NMF.

It is defined as the Pearson correlation between the samples' distances induced by the consensus matrix (seen as a similarity matrix) and their cophenetic distances from a hierarchical clustering based on these very distances (by default an average linkage is used). See Brunet et al. (2004).

Note that argument ... is not used.
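
As a rough illustration of this definition, the computation can be written with functions from package stats only. cophcor_sketch is a hypothetical name, and taking 1 - consensus as the induced distance is an assumption of this sketch.

```r
# Hypothetical helper (assumption): distance = 1 - consensus similarity.
cophcor_sketch <- function(consensus, linkage = "average") {
  d <- as.dist(1 - consensus)              # distances induced by the consensus
  hc <- hclust(d, method = linkage)        # hierarchical clustering on them
  # Pearson correlation between induced and cophenetic distances
  cor(as.numeric(d), as.numeric(cophenetic(hc)))
}

# consensus matrix of 4 samples averaged over several (imaginary) runs
consensus <- matrix(c(1.0, 0.9, 0.1, 0.2,
                      0.9, 1.0, 0.2, 0.1,
                      0.1, 0.2, 1.0, 0.8,
                      0.2, 0.1, 0.8, 1.0), 4, 4)
cophcor_sketch(consensus)
```

A value close to 1 suggests that the cluster assignments are stable across runs.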

dispersion
Computes the dispersion coefficient of consensus matrix object, generally obtained from multiple NMF runs.

The dispersion coefficient is based on the consensus matrix (i.e. the average of connectivity matrices) and was proposed by Kim and Park (2007) to measure the reproducibility of the clusters obtained from NMF. It is defined as:

\rho = \frac{1}{n^2} \sum_{i,j=1}^n 4 (C_{ij} - \frac{1}{2})^2,

where n is the total number of samples.

We have 0 \leq \rho \leq 1, and \rho = 1 only for a perfect consensus matrix, where all entries are 0 or 1. A perfect consensus matrix is obtained only when all the connectivity matrices are the same, meaning that the algorithm gave the same clusters at each run. See Kim and Park (2007).

Note that argument ... is not used.
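
A minimal sketch of the coefficient, assuming the sum is normalised by n^2 as in Kim and Park (2007), which is what keeps the value within [0, 1]. dispersion_sketch is a hypothetical helper, not the package's method.

```r
# Hypothetical helper (assumption): normalisation by n^2.
dispersion_sketch <- function(consensus) {
  n <- nrow(consensus)
  sum(4 * (consensus - 0.5)^2) / n^2
}

# perfect consensus: every entry is 0 or 1
perfect <- matrix(c(1, 1, 0,
                    1, 1, 0,
                    0, 0, 1), 3, 3)
# fully ambiguous consensus: every entry is 0.5
ambiguous <- matrix(0.5, 3, 3)

dispersion_sketch(perfect)
dispersion_sketch(ambiguous)
```

The two toy matrices correspond to the two extremes of the coefficient.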

entropy
The entropy measures how well a clustering method recovers the classes defined by a factor known a priori (i.e. one knows the true class labels). Suppose we are given l categories, while the clustering method generates k clusters. Entropy is given by:

Entropy = - \frac{1}{n \log_2 l} \sum_{q=1}^k \sum_{j=1}^l n_q^j \log_2 \frac{n_q^j}{n_q}

, where:

- n is the total number of samples;

- n_q is the total number of samples in cluster q;

- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).

The smaller the entropy, the better the clustering performance.

See Kim and Park (2007).
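
The formula can be evaluated from a contingency table of predicted clusters against known classes. The sketch below is illustrative only: entropy_sketch is a hypothetical helper that takes a vector of predicted clusters rather than an NMF object, and it uses the convention 0 log 0 = 0.

```r
# Hypothetical helper; pred holds the predicted cluster of each sample.
entropy_sketch <- function(pred, class) {
  tab <- table(pred, class)          # tab[q, j] = n_q^j
  n <- sum(tab)                      # total number of samples
  l <- ncol(tab)                     # number of known classes
  n_q <- rowSums(tab)                # cluster sizes
  ratio <- tab / n_q                 # n_q^j / n_q, row by row
  # convention: empty cells contribute 0 to the sum
  terms <- ifelse(tab > 0, tab * log2(ratio), 0)
  -sum(terms) / (n * log2(l))
}

class <- factor(c("a", "a", "b", "b"))
entropy_sketch(c(1, 1, 2, 2), class)   # clusters match the classes
entropy_sketch(c(1, 2, 1, 2), class)   # clusters mix the classes evenly
```

Note that the sketch requires at least two known classes, since log2(l) appears in the denominator.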

extractFeatures
Identify the most basis-specific features, using different methods. See details of argument method.
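
As an illustration of the selection step of method 'max' described for argument method, a sketch working directly on a basis matrix W could look like this. extract_max_sketch is a hypothetical helper, not the package's code, and it returns the 'list' format only.

```r
# Hypothetical helper operating on a basis matrix W (features x basis vectors).
extract_max_sketch <- function(W) {
  # basis vector to which each feature contributes the most
  dominant <- apply(W, 1, which.max)
  lapply(seq_len(ncol(W)), function(k) {
    # features sorted by decreasing contribution to basis k
    ord <- order(W[, k], decreasing = TRUE)
    ok <- dominant[ord] == k
    # keep only the leading run of features whose dominant basis is k
    n_keep <- if (all(ok)) length(ok) else which(!ok)[1] - 1
    ord[seq_len(n_keep)]
  })
}

W <- rbind(c(5, 1),
           c(4, 0),
           c(3, 6),
           c(2, 1),
           c(0, 3))
extract_max_sketch(W)
```

In this toy matrix, feature 4 is dominated by the first basis vector but is not selected, because feature 3 precedes it in the sorted list and belongs to the second basis vector.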

featureScore
Computes the feature scores as suggested in Kim and Park (2007).

The score for feature i is defined as:

S_i = 1 + \frac{1}{\log_2 k} \sum_{q=1}^k p(i,q) \log_2 p(i,q),

where p(i,q) is the probability that the i-th feature contributes to basis q:

p(i,q) = \frac{W(i,q)}{\sum_{r=1}^k W(i,r)}

The feature scores are real values within the range [0,1]. The higher the feature score the more basis-specific the corresponding feature.
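
The score can be computed directly from a basis matrix W, as in the following sketch. feature_score_sketch is a hypothetical helper; it assumes that every feature has at least one positive entry in W and uses the convention 0 log 0 = 0.

```r
# Hypothetical helper operating on a basis matrix W (features x basis vectors).
feature_score_sketch <- function(W) {
  k <- ncol(W)
  p <- W / rowSums(W)                      # p(i, q), each row sums to 1
  plogp <- ifelse(p > 0, p * log2(p), 0)   # convention: 0 * log2(0) = 0
  1 + rowSums(plogp) / log2(k)
}

W <- rbind(c(1, 0, 0),    # contributes to a single basis vector
           c(1, 1, 1))    # contributes equally to all basis vectors
feature_score_sketch(W)
```

The first feature is as basis-specific as possible, while the second is spread evenly over the three basis vectors.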

metaHeatmap
Produces a heatmap of the basis components or mixture coefficients using a heatmap-like custom function, with parameters tuned for displaying NMF results.

The function used to draw the heatmap is a mixture of the function heatmap.2 from the gplots package and the function heatmap.plus from the heatmap.plus package. It allows extra annotation rows to be added using the ColSideColor argument. See heatmap.2 and heatmap.plus.

nmfApply
apply-like method for objects of class NMF.

When argument MARGIN=1, it calls the base method apply to apply function FUN to the rows of the basis component matrix.

When MARGIN=2, it calls the base method apply to apply function FUN on the columns of the mixture coefficient matrix.

See apply for more details on the output format.

predict
Computes the dominant basis component for each sample (resp. feature) based on its associated entries in the mixture coefficient matrix (i.e. in H) (resp. basis component matrix (i.e. in W)).

When what='samples' the computation is performed on the mixture coefficient matrix, or on the transposed basis matrix when what='features'.

For each column, the dominant basis component is computed as the row index for which the entry is the maximum within the column.

If argument prob=FALSE (default), the result is a factor. Otherwise it returns a list with two elements: element predict contains the computed indexes (as a factor) and element prob contains the vector of the associated probabilities, that is, the relative contribution of the maximum entry within each column.
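
The computation described above can be sketched on a plain matrix. predict_sketch is a hypothetical helper; the choice of factor levels (one per basis component) is an assumption of this sketch.

```r
# Hypothetical helper; M is H for samples, or t(W) for features.
predict_sketch <- function(M, prob = FALSE) {
  # row index of the maximum entry within each column
  idx <- apply(M, 2, which.max)
  res <- factor(idx, levels = seq_len(nrow(M)))
  if (!prob) return(res)
  # relative contribution of the maximum entry within each column
  list(predict = res, prob = apply(M, 2, max) / colSums(M))
}

H <- matrix(c(0.9, 0.1,
              0.8, 0.2,
              0.3, 0.7), nrow = 2)
predict_sketch(H)
predict_sketch(H, prob = TRUE)
```

For features, the same helper would be applied to the transposed basis matrix.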

purity
Computes the purity of a clustering given a known factor.

The purity measures how well a clustering method recovers the classes defined by a factor known a priori (i.e. one knows the true class labels). Suppose we are given l categories, while the clustering method generates k clusters. Purity is given by:

Purity = \frac{1}{n} \sum_{q=1}^k \max_{1 \leq j \leq l} n_q^j

, where:

- n is the total number of samples;

- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).

The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.

See Kim and Park (2007).
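
A minimal sketch of the purity formula, using a contingency table of predicted clusters against known classes. purity_sketch is a hypothetical helper that takes a vector of predicted clusters rather than an NMF object.

```r
# Hypothetical helper; pred holds the predicted cluster of each sample.
purity_sketch <- function(pred, class) {
  tab <- table(pred, class)            # tab[q, j] = n_q^j
  # size of the majority class within each cluster, summed over clusters
  sum(apply(tab, 1, max)) / sum(tab)
}

class <- factor(c("a", "a", "b", "b"))
purity_sketch(c(1, 1, 2, 2), class)   # clusters match the classes
purity_sketch(c(1, 2, 1, 2), class)   # clusters mix the classes evenly
```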

residuals
Returns the final residuals between the target matrix and the NMF result object. They are computed using the objective function associated with the NMF algorithm that returned object. When called with track=TRUE, the whole residuals track is returned, if available. Note that method nmf does not compute the residuals track unless explicitly requested.

It is an S4 method defined for the associated generic function from package stats (see residuals).

See nmf and NMFfit.

rss
Returns the Residual Sum of Squares (RSS) between the target object target and its estimate by object. Hutchins et al. (2008) used the variation of the RSS in combination with Lee and Seung's algorithm to estimate the correct number of basis vectors. The optimal rank is chosen where the graph of the RSS first shows an inflexion point. See references.

Note that this way of estimation may not be suitable for all models. Indeed, if the NMF optimization problem is not based on the Frobenius norm, the RSS is not directly linked to the quality of approximation of the NMF model.
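
For reference, the quantity itself is simple to write down for plain matrices. rss_sketch is a hypothetical helper that takes the factors W and H explicitly instead of an NMF object.

```r
# Hypothetical helper: RSS between a target V and its estimate W %*% H.
rss_sketch <- function(V, W, H) {
  sum((V - W %*% H)^2)
}

W <- matrix(1:6, 3, 2)     # 3 features, 2 basis vectors
H <- matrix(1:8, 2, 4)     # 2 basis vectors, 4 samples
V <- W %*% H               # target reproduced exactly by the model

rss_sketch(V, W, H)        # exact fit
rss_sketch(V + 1, W, H)    # every entry is off by 1
```

To look for an inflexion point, one would compute this value for fits of increasing rank and plot it against the rank.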

sparseness
Computes the sparseness of a vector or matrix as defined in Hoyer (2004).

This sparseness measure quantifies how much energy of a vector is packed into only few components. It is defined by:

Sparseness(x) = \frac{\sqrt{n} - \frac{\sum |x_i|}{\sqrt{\sum x_i^2}}}{\sqrt{n}-1}

, where n is the length of x.

The sparseness is a real number in [0,1]. It is equal to 1 if and only if x contains a single nonzero component, and is equal to 0 if and only if all components of x are equal. It interpolates smoothly between these two extreme values. The closer the sparseness is to 1, the sparser the vector.
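
The formula for a single vector translates directly into base R. sparseness_sketch is a hypothetical helper; it assumes a vector of length at least 2 with at least one nonzero component, and it does not cover the matrix case.

```r
# Hypothetical helper for a single numeric vector x.
sparseness_sketch <- function(x) {
  n <- length(x)
  (sqrt(n) - sum(abs(x)) / sqrt(sum(x^2))) / (sqrt(n) - 1)
}

sparseness_sketch(c(0, 0, 3, 0))   # a single nonzero component
sparseness_sketch(c(1, 1, 1, 1))   # all components equal
sparseness_sketch(c(4, 1, 0, 0))   # somewhere in between
```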

syntheticNMF
Generate a synthetic matrix according to an underlying NMF model. It can be used to quickly test NMF algorithms.
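
The following is one plausible way to build such a matrix, given only as a sketch. It is not the generator used by syntheticNMF: the uniform basis matrix, the 0/1 class-indicator coefficients, the Gaussian noise with standard deviation 0.1 and the truncation at zero are all assumptions, and synthetic_sketch is a hypothetical name.

```r
# Hypothetical generator (assumptions: uniform W, indicator H, Gaussian noise).
synthetic_sketch <- function(n, counts, noise = FALSE) {
  r <- length(counts)                    # rank = number of classes
  p <- sum(counts)                       # number of samples (columns)
  W <- matrix(runif(n * r), n, r)        # random non-negative basis matrix
  classes <- rep(seq_len(r), counts)     # class of each sample
  H <- matrix(0, r, p)
  H[cbind(classes, seq_len(p))] <- 1     # each sample loads on its own class
  V <- W %*% H
  # optional noise, truncated at zero to keep the matrix non-negative
  if (noise) V <- pmax(V + rnorm(n * p, sd = 0.1), 0)
  V
}

set.seed(1)
V <- synthetic_sketch(50, c(5, 5, 8), noise = TRUE)
dim(V)
```

Because the classes are known by construction, the result can be compared with predicted clusters using entropy and purity.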

Author(s)

Renaud Gaujoux renaud@cbio.uct.ac.za

References

Metagenes and molecular pattern discovery using matrix factorization. Brunet, J. P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004) Proc Natl Acad Sci U S A 101(12), 4164–4169.

Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Kim, H. and Park, H. (2007) Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btm134.

Non-negative Matrix Factorization with Sparseness Constraints. Hoyer, P. O. (2004) Journal of Machine Learning Research 5, 1457–1469.

Biclustering of gene expression data by non-smooth non-negative matrix factorization. Carmona-Saez, P., Pascual-Marqui, R., Tirado, F., Carazo, J., and Pascual-Montano, A. (2006) BMC Bioinformatics 7(1), 78.

See Also

NMF, summary

Examples


# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50; counts <- c(5, 5, 8)
V <- syntheticNMF(n, counts, noise=TRUE)
## Not run: metaHeatmap(V)

# build the class factor: class i is repeated counts[i] times
groups <- factor(rep(seq_along(counts), counts))

# perform default NMF
res <- nmf(V, 2)
res

## Not run: metaHeatmap(res, class=groups)
## Not run: metaHeatmap(res, 'features')
# see the predicted clusters of samples
predict(res)
# compute entropy and purity
entropy(res, class=groups)
purity(res, class=groups)

# perform NMF with the right number of basis components
res <- nmf(V, 3)

## Not run: metaHeatmap(res)
## Not run: metaHeatmap(res, 'features')
entropy(res, class=groups)
purity(res, class=groups)


[Package NMF version 0.2.4 Index]