NMF-utils {NMF} | R Documentation |
Define generic interface methods for class NMF, which is the base (virtual) class of the results from any NMF algorithm implemented within package NMF's framework.
## S4 method for signature 'NMF':
connectivity(x, ...)
cophcor(object, ...)
dispersion(object, ...)

## S4 method for signature 'NMF, factor':
entropy(x, class, ...)

## S4 method for signature 'NMFfit':
residuals(object, track=FALSE)
rss(object, ...)

## S4 method for signature 'NMF':
rss(object, target)

## S4 method for signature 'NMF':
featureScore(object, method=c('kim', 'max'))

## S4 method for signature 'NMF':
extractFeatures(object, method=c('kim', 'max'), format=c('list', 'combine', 'subset'))

## S4 method for signature 'NMF':
metaHeatmap(object, what=c('samples', 'features'), filter=FALSE, ...)

## S4 method for signature 'NMF':
nmfApply(object, MARGIN, FUN, ...)

## S4 method for signature 'NMF':
predict(object, what = c('samples', 'features'), prob=FALSE)

## S4 method for signature 'NMF, factor':
purity(x, class, ...)

## S4 method for signature 'NMF':
sparseness(x, what = c('features', 'samples'), ...)

syntheticNMF(n, r, p, offset=NULL, noise=FALSE, return.factors=FALSE)
class |
A factor giving a known class membership for each sample. |
filter |
if TRUE, only the features that are basis-specific are used.
These features are the ones returned by function extractFeatures. |
format |
the output format of the extracted features.
Possible values are 'list', 'combine' and 'subset'.
|
FUN |
the function to be applied: see 'Details'. In the case of
functions like +, %*%, etc., the function name must be
backquoted or quoted.
See apply for more details. |
MARGIN |
a vector giving the subscripts which the function will be
applied over. 1 indicates rows, 2 indicates columns,
c(1,2) indicates rows and columns.
See apply for more details.
|
method |
Method used to compute the feature scores and select the features.
Possible values are 'kim' and 'max'.
|
n |
Number of rows of the synthetic target matrix. |
noise |
if TRUE, random noise is added to the target matrix. |
object |
A matrix or an object that inherits from class
NMF or NMFfit, depending on the method. |
offset |
a vector giving the offset to add to the synthetic target matrix.
Its length should be equal to the number of rows n. |
prob |
Whether the probability associated with each cluster prediction should be computed and returned. |
p |
Number of columns of the synthetic target matrix. Not used if parameter
r is a vector (see description of argument r). |
r |
Underlying factorization rank. If a single numeric is given,
the classes are randomly generated from a multinomial distribution.
If a numerical vector is given, then it should contain the counts in the different
classes (i.e. integers). In that case argument p is not used and the number of columns
is forced to be the sum of the counts. |
return.factors |
If TRUE, the underlying matrices W and
H are also returned. |
target |
the target object estimated by model object. It can be
a matrix or an ExpressionSet.
|
track |
if TRUE, the whole residuals track is returned.
Otherwise only the last computed residual value is returned. |
what |
Specifies on which matrix (basis components or mixture coefficients) the computation should be performed. |
x |
An object that inherits from class NMF. |
... |
Used to pass extra parameters to subsequent calls:
|
The connectivity matrix of a clustering is a matrix C containing only 0 or 1 entries such that:
C_{ij} = 1 if sample i belongs to the same cluster as sample j, 0 otherwise
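The definition above can be sketched in base R without calling the package. The cluster labels below are hypothetical illustration data, and this is not necessarily how connectivity is implemented internally.

```r
# Hypothetical cluster labels for 5 samples (illustration only).
clusters <- c(1, 1, 2, 2, 1)

# C[i, j] is 1 when samples i and j share a cluster, 0 otherwise.
conn <- outer(clusters, clusters, "==") * 1

conn
```

Averaging such matrices over several runs gives the consensus matrix used by the cophenetic correlation and dispersion coefficients.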
Argument object is generally obtained from multiple NMF runs.
The cophenetic correlation coefficient is based on the consensus matrix (i.e. the average of connectivity matrices) and was proposed by Brunet et al. (2004) to measure the stability of the clusters obtained from NMF.
It is defined as the Pearson correlation between the samples' distances induced by the consensus matrix (seen as a similarity matrix) and their cophenetic distances from a hierarchical clustering based on these very distances (by default an average linkage is used). See Brunet et al. (2004).
Note that argument ... is not used.
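A minimal base R sketch of this definition follows. It assumes the distance induced by the consensus matrix is 1 minus the consensus value, which is a common choice but is not stated above; the consensus matrix itself is hypothetical illustration data.

```r
# Hypothetical consensus matrix for 4 samples: each entry is the fraction
# of runs in which the two samples were clustered together.
consensus <- matrix(c(1.0, 0.9, 0.1, 0.0,
                      0.9, 1.0, 0.2, 0.1,
                      0.1, 0.2, 1.0, 0.8,
                      0.0, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)

# Assumption: similarity s is turned into distance 1 - s.
d <- as.dist(1 - consensus)

# Average-linkage hierarchical clustering on these distances.
hc <- hclust(d, method = "average")

# Pearson correlation between the induced and the cophenetic distances.
coph <- cor(as.vector(d), as.vector(cophenetic(hc)))
coph
```

A value close to 1 indicates that the hierarchical clustering reproduces the consensus distances well, i.e. that the clusters are stable across runs.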
Argument object is generally obtained from multiple NMF runs.
The dispersion coefficient is based on the consensus matrix (i.e. the average of connectivity matrices) and was proposed by Kim and Park (2007) to measure the reproducibility of the clusters obtained from NMF. It is defined as:
\rho = \frac{1}{n^2} \sum_{i,j=1}^n 4 (C_{ij} - \frac{1}{2})^2,
where n is the total number of samples.
We have 0 \leq \rho \leq 1 and \rho = 1 only for a perfect consensus matrix, where all entries are 0 or 1. A perfect consensus matrix is obtained only when all the connectivity matrices are the same, meaning that the algorithm gave the same clusters at each run. See Kim and Park (2007).
Note that argument ... is not used.
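The coefficient can be sketched in base R as below. The sketch assumes the sum is normalised by n^2, as in Kim and Park (2007), so that the value stays within [0,1]; the function name dispersion_sketch and the example matrices are hypothetical.

```r
# Sketch of the dispersion coefficient (hypothetical helper, not the package method).
dispersion_sketch <- function(consensus) {
  n <- nrow(consensus)
  # Assumption: the sum of 4 * (C_ij - 1/2)^2 is divided by n^2.
  sum(4 * (consensus - 0.5)^2) / n^2
}

# Perfect consensus: every run produced the same two clusters.
perfect <- outer(c(1, 1, 2, 2), c(1, 1, 2, 2), "==") * 1

# Completely ambiguous consensus: every pair co-clustered in half of the runs.
ambiguous <- matrix(0.5, nrow = 4, ncol = 4)

dispersion_sketch(perfect)
dispersion_sketch(ambiguous)
```

The two calls illustrate the extreme cases: the perfect consensus matrix reaches the upper bound, the ambiguous one the lower bound.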
Entropy = - \frac{1}{n \log_2 l} \sum_{q=1}^k \sum_{j=1}^l n_q^j \log_2 \frac{n_q^j}{n_q},
where:
- n is the total number of samples;
- n_q is the total number of samples in cluster q;
- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).
The smaller the entropy, the better the clustering performance.
See Kim and Park (2007).
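The formula can be checked on small inputs with the base R sketch below. The helper entropy_sketch and the label vectors are hypothetical, and empty cells are treated with the usual convention 0 log 0 = 0, which is an assumption not stated above.

```r
# Sketch of the clustering entropy (hypothetical helper, not the package method).
entropy_sketch <- function(clusters, classes) {
  tab <- table(clusters, classes)  # n_q^j: rows are clusters q, columns are classes j
  n <- sum(tab)                    # total number of samples
  l <- ncol(tab)                   # number of original classes
  n_q <- rowSums(tab)              # number of samples in each cluster
  ratio <- tab / n_q               # n_q^j / n_q, divided row by row
  # Assumption: empty cells contribute 0 (convention 0 * log 0 = 0).
  terms <- ifelse(tab > 0, tab * log2(ratio), 0)
  -sum(terms) / (n * log2(l))
}

classes <- factor(c("a", "a", "b", "b"))
entropy_sketch(c(1, 1, 2, 2), classes)  # clusters match the classes
entropy_sketch(c(1, 2, 1, 2), classes)  # clusters mix the classes evenly
```

The first call corresponds to a perfect clustering and the second to the least informative one for two balanced classes.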
The feature scores are computed using the method given by argument method.
The score for feature i is defined as:
S_i = 1 + \frac{1}{\log_2 k} \sum_{q=1}^k p(i,q) \log_2 p(i,q),
where p(i,q) is the probability that the i-th feature contributes to basis q:
p(i,q) = \frac{W(i,q)}{\sum_{r=1}^k W(i,r)}
The feature scores are real values within the range [0,1]. The higher the feature score the more basis-specific the corresponding feature.
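The score defined above can be sketched in base R as follows. The helper feature_score_sketch and the basis matrix W are hypothetical, and zero entries are handled with the convention 0 log 0 = 0, which is an assumption.

```r
# Sketch of the feature score S_i (hypothetical helper, not the package method).
feature_score_sketch <- function(W) {
  k <- ncol(W)                 # number of basis components
  p <- W / rowSums(W)          # p(i, q): relative contribution of feature i to basis q
  # Assumption: zero entries contribute 0 (convention 0 * log 0 = 0).
  plogp <- ifelse(p > 0, p * log2(p), 0)
  1 + rowSums(plogp) / log2(k)
}

# Hypothetical basis matrix: 2 features, 3 basis components.
W <- rbind(c(1, 0, 0),   # contributes to a single basis component
           c(1, 1, 1))   # contributes equally to all basis components

feature_score_sketch(W)
```

The first feature is fully basis-specific and gets the highest possible score, while the second is spread evenly and gets the lowest.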
A heatmap-like custom function, with parameters tuned for displaying NMF results.
The function used to draw the heatmap is a mixture of the function heatmap.2 from the gplots package and the function heatmap.plus from the heatmap.plus package. It allows extra annotation rows to be added using the ColSideColor argument.
See heatmap.2 and heatmap.plus.
An apply-like method for objects of class NMF.
When argument MARGIN=1, it calls the base method apply to apply function FUN to the rows of the basis component matrix.
When MARGIN=2, it calls the base method apply to apply function FUN to the columns of the mixture coefficient matrix.
See apply for more details on the output format.
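The two cases can be illustrated with plain matrices and base apply. The matrices W and H below are hypothetical stand-ins for the basis and mixture coefficient matrices of a fitted model; the sketch does not call nmfApply itself.

```r
# Hypothetical factors: 3 features, 2 basis components, 4 samples.
W <- matrix(1:6, nrow = 3)   # basis component matrix (features x components)
H <- matrix(1:8, nrow = 2)   # mixture coefficient matrix (components x samples)

# Equivalent of MARGIN = 1: FUN applied to each row of the basis matrix.
row_result <- apply(W, 1, sum)

# Equivalent of MARGIN = 2: FUN applied to each column of the coefficient matrix.
col_result <- apply(H, 2, sum)

row_result
col_result
```

The first result has one value per feature and the second one value per sample.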
When what='samples' the computation is performed on the mixture coefficient matrix, or on the transposed basis matrix when what='features'.
For each column, the dominant basis component is computed as the row index for which the entry is the maximum within the column.
If argument prob=FALSE (default), the result is a factor.
Otherwise it returns a list with two elements: element predict contains the computed indexes (as a factor) and element prob contains the vector of the associated probabilities, that is the relative contribution of the maximum entry within each column.
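The rule can be sketched in base R on a plain coefficient matrix. The matrix H is hypothetical, and the relative contribution is taken to be the maximum entry divided by the column sum, which is one reading of the description above rather than a confirmed implementation detail.

```r
# Hypothetical mixture coefficient matrix: 2 basis components, 3 samples.
H <- rbind(c(0.7, 0.2, 0.1),
           c(0.3, 0.8, 0.9))

# Dominant basis component: row index of the maximum entry in each column.
dominant <- apply(H, 2, which.max)

# Assumption: relative contribution = maximum entry / column sum.
prob <- apply(H, 2, max) / colSums(H)

list(predict = factor(dominant), prob = prob)
```

Passing the transposed basis matrix instead of H would give the analogous result for features.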
The purity measures how well a clustering method recovers the classes defined by a factor known a priori (i.e. one knows the true class labels). Suppose we are given l categories, while the clustering method generates k clusters. Purity is given by:
Purity = \frac{1}{n} \sum_{q=1}^k \max_{1 \leq j \leq l} n_q^j,
where:
- n is the total number of samples;
- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).
The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.
See Kim and Park (2007).
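A minimal base R sketch of the purity formula follows. The helper purity_sketch and the label vectors are hypothetical illustration data, not the package method.

```r
# Sketch of the purity measure (hypothetical helper, not the package method).
purity_sketch <- function(clusters, classes) {
  tab <- table(clusters, classes)  # n_q^j: rows are clusters q, columns are classes j
  # For each cluster keep the size of its largest class, then normalise by n.
  sum(apply(tab, 1, max)) / sum(tab)
}

classes <- factor(c("a", "a", "b", "b", "b", "b"))
purity_sketch(c(1, 1, 2, 2, 2, 2), classes)  # clusters match the classes
purity_sketch(c(1, 1, 1, 2, 2, 2), classes)  # one sample of class b falls in cluster 1
```

In the second call, cluster 1 holds two samples of class a and one of class b, so five of the six samples sit in the majority class of their cluster.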
The residuals of object are computed using the objective function associated with the NMF algorithm that returned object.
When called with track=TRUE, the whole residuals track is returned, if available. Note that method nmf does not compute the residuals track, unless explicitly required.
It is an S4 method defined for the associated generic function from package stats (see residuals).
The RSS is the residual sum of squares between target and its estimation by object. Hutchins et al. (2008) used
the variation of the RSS in combination with Lee and Seung's algorithm
to estimate the correct number of basis vectors. The optimal rank is chosen
where the graph of the RSS first shows an inflexion
point. See references.
Note that this way of estimation may not be suitable for all models. Indeed, if the NMF optimization problem is not based on the Frobenius norm, the RSS is not directly linked to the quality of approximation of the NMF model.
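As a sketch, the residual sum of squares between a target matrix and the product of two factors can be computed in base R as below. The helper rss_sketch and the matrices are hypothetical, and the estimation of the target is assumed to be the matrix product W %*% H.

```r
# Sketch of the RSS (hypothetical helper, not the package method).
# Assumption: the model estimates the target as the product W %*% H.
rss_sketch <- function(target, W, H) {
  sum((target - W %*% H)^2)
}

# Hypothetical factors: 3 features, 2 basis components, 4 samples.
W <- matrix(c(1, 2, 3,
              0, 1, 1), nrow = 3)
H <- matrix(c(1, 0,
              0, 1,
              1, 1,
              2, 1), nrow = 2)

V_exact <- W %*% H                            # target reproduced exactly by the model
V_noisy <- V_exact
V_noisy[1, 1] <- V_noisy[1, 1] + 0.5          # perturb a single entry

rss_sketch(V_exact, W, H)
rss_sketch(V_noisy, W, H)
```

Computing this quantity for fits of increasing rank gives the RSS curve whose first inflexion point is used to choose the rank.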
This sparseness measure quantifies how much energy of a vector is packed into only few components. It is defined by:
Sparseness(x) = \frac{\sqrt{n} - \frac{\sum |x_i|}{\sqrt{\sum x_i^2}}}{\sqrt{n}-1},
where n is the length of x.
The sparseness is a real number in [0,1]. It is equal to 1 if and only if x contains a single nonzero component, and is equal to 0 if and only if all components of x are equal.
It interpolates smoothly between these two extreme values.
The closer the sparseness is to 1, the sparser the vector.
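The measure can be sketched for a single vector in base R as follows. The helper sparseness_sketch and the example vectors are hypothetical; applying it to the columns of the basis or coefficient matrix is left out.

```r
# Sketch of the sparseness of a vector (hypothetical helper, not the package method).
sparseness_sketch <- function(x) {
  n <- length(x)
  # Ratio of the L1 norm to the L2 norm, rescaled to lie in [0, 1].
  (sqrt(n) - sum(abs(x)) / sqrt(sum(x^2))) / (sqrt(n) - 1)
}

sparseness_sketch(c(0, 0, 3, 0))   # single nonzero component
sparseness_sketch(c(2, 2, 2, 2))   # all components equal
sparseness_sketch(c(4, 1, 0, 0))   # intermediate case
```

The first two calls reproduce the two extreme cases described above; the third falls strictly between them.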
Renaud Gaujoux renaud@cbio.uct.ac.za
Brunet, J. P., Tamayo, P., Golub, T. R., and Mesirov, J. P. (2004). Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci U S A 101(12), 4164–4169.
Kim, H. and Park, H. (2007). Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics. http://dx.doi.org/10.1093/bioinformatics/btm134.
Hoyer, P. O. (2004). Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research 5, 1457–1469.
Carmona-Saez, P., Pascual-Marqui, R., Tirado, F., Carazo, J., and Pascual-Montano, A. (2006). Biclustering of gene expression data by non-smooth non-negative matrix factorization. BMC Bioinformatics 7(1), 78.
# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50; counts <- c(5, 5, 8);
V <- syntheticNMF(n, counts, noise=TRUE)
## Not run: metaHeatmap(V)

# build the class factor
groups <- as.factor(do.call('c', lapply(seq(3), function(x) rep(x, counts[x]))))

# perform default NMF
res <- nmf(V, 2)
res
## Not run: metaHeatmap(res, class=groups)
## Not run: metaHeatmap(res, 'features')

# see the predicted clusters of samples
predict(res)

# compute entropy and purity
entropy(res, class=groups)
purity(res, class=groups)

# perform NMF with the right number of basis components
res <- nmf(V, 3)
## Not run: metaHeatmap(res)
## Not run: metaHeatmap(res, 'features')
entropy(res, class=groups)
purity(res, class=groups)