seriate.dist {seriation} | R Documentation |
Unidimensional seriation tries to arrange objects in a linear order given available data.
Given a dissimilarity matrix, seriation typically tries to move small dissimilarity values as close as possible to the diagonal of the symmetric dissimilarity matrix, or to minimize the dissimilarities between neighboring objects.
## S3 method for class 'dist': seriate(x, method = NULL, control = NULL, ...)
x: an object of class dist.
method: a character string with the name of the seriation method (default: "TSP").
control: a list of control options passed on to the seriation algorithm.
...: further arguments (unused).
Currently the following methods are implemented:
"ARSA"
A simulated annealing heuristic for anti-Robinson seriation developed by Brusco et al. (2007).
"BBURCG"
A branch-and-bound implementation by Brusco and Stahl (2005) minimizing the unweighted gradient measure.
"BBWRCG"
A branch-and-bound implementation by Brusco and Stahl (2005) minimizing the weighted gradient measure.
"TSP"
A traveling salesperson problem solver can be used (see solve_TSP in package TSP). The solver method can be passed on via the control argument, e.g., control = list(method = "insertion").
Since a tour returned by a TSP solver is a closed circle and we are looking for a path representing a linear order, the best cutting point has to be found. Climer and Zhang (2006) suggest adding a dummy city with equal distance to every other city before generating the tour. The position of this dummy city in an optimal tour of minimal length is the best cutting point (it lies between the two most distant cities).
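The cutting step can be illustrated with a short sketch (plain Python for illustration only; the function name cut_tour_at_dummy and the dummy label "D" are made up here and are not part of the seriation or TSP packages):

```python
# Sketch: turn a closed TSP tour into a linear order by cutting it at the
# dummy city (Climer and Zhang, 2006). Hypothetical helper, illustrative only.
def cut_tour_at_dummy(tour, dummy="D"):
    """Rotate the tour to start just after the dummy city, then drop the dummy."""
    i = tour.index(dummy)
    # path from the city after the dummy around to the city before it
    return tour[i + 1:] + tour[:i]

# a closed tour ... b e D a c d (back to b) ... becomes the path a c d b e
print(cut_tour_at_dummy(["b", "e", "D", "a", "c", "d"]))  # ['a', 'c', 'd', 'b', 'e']
```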
"Chen"
This method starts by generating a sequence of correlation matrices R^1, R^2, .... R^1 is the correlation matrix of the original distance matrix D (supplied to the function as x), and

R^{n+1} = phi(R^n),

where phi(.) computes the correlation matrix of its argument. The rank of R^n falls with increasing n. The first R^n in the sequence which has rank 2 is found; projecting all points in this matrix onto its first two eigenvectors places them on an ellipse, and the order of the points along the ellipse is the resulting order.

The ellipse can be cut at either of the two intersection points (top or bottom) of the vertical axis with the ellipse. This implementation uses the topmost cutting point.
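The iteration itself can be sketched as follows (plain Python for illustration; the helper names corr_matrix, rank and chen_sequence are hypothetical and this is not the code used by the package; the sketch assumes no constant rows, which would make a correlation undefined):

```python
import math

def corr_matrix(M):
    """Pearson correlation matrix of the rows of a square matrix M."""
    n = len(M)
    def corr(u, v):
        mu, mv = sum(u) / n, sum(v) / n
        cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
        su = math.sqrt(sum((a - mu) ** 2 for a in u))
        sv = math.sqrt(sum((b - mv) ** 2 for b in v))
        return cov / (su * sv)   # assumes no constant rows
    return [[corr(M[i], M[j]) for j in range(n)] for i in range(n)]

def rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n, r = len(A), 0
    for col in range(n):
        piv = max(range(r, n), key=lambda i: abs(A[i][col]))
        if abs(A[piv][col]) < tol:
            continue                      # no usable pivot in this column
        A[r], A[piv] = A[piv], A[r]
        for i in range(r + 1, n):
            f = A[i][col] / A[r][col]
            for j in range(col, n):
                A[i][j] -= f * A[r][j]
        r += 1
    return r

def chen_sequence(D, max_iter=50):
    """Iterate R^{n+1} = phi(R^n) until the rank drops to 2 (or give up)."""
    R = corr_matrix(D)
    while rank(R) > 2 and max_iter > 0:
        R = corr_matrix(R)
        max_iter -= 1
    return R
```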
"MDS"
Use multidimensional scaling techniques to find a linear order. Note that unidimensional scaling would be more appropriate but is very hard to solve; MDS generally provides good results.

By default, metric MDS (cmdscale in stats) is used. For general dissimilarities, non-metric MDS can be used instead; the choices are isoMDS and sammon from MASS. The method can be specified as the element method ("cmdscale", "isoMDS" or "sammon") in control.
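The idea of ordering objects by the first coordinate of a metric-MDS embedding can be sketched as follows (plain Python for illustration, using power iteration instead of the full eigendecomposition that cmdscale performs; mds_order is a hypothetical name, not a package function):

```python
import math

def mds_order(D, iters=200):
    """Order objects by their first classical-MDS coordinate. Sketch only."""
    n = len(D)
    # Gram matrix B = -1/2 * J * D^2 * J with J = I - 11'/n (double centering)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    rm = [sum(row) / n for row in sq]                 # row means of D^2
    gm = sum(rm) / n                                  # grand mean of D^2
    B = [[-0.5 * (sq[i][j] - rm[i] - rm[j] + gm) for j in range(n)]
         for i in range(n)]
    # power iteration for the leading eigenvector of B (the first coordinate)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        w = [sum(B[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    # the seriation order is the ranking along this single coordinate
    return sorted(range(n), key=lambda i: v[i])
```

For points that actually lie on a line, the recovered order is the order of the points along that line (up to reversal, since an eigenvector's sign is arbitrary).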
"HC"
The order of the leaf nodes in a dendrogram obtained by hierarchical clustering can be used as a very simple seriation technique. This method applies hierarchical clustering (hclust) to x. The clustering method can be given as a "method" element in the control list; if omitted, the default "complete" is used.
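A toy sketch of the idea (plain Python, naive O(n^3) single linkage; hclust in stats is the real implementation and supports several linkage methods; hc_order is a hypothetical name):

```python
# Sketch: the leaf order produced by naive single-linkage agglomeration.
# Each cluster is kept as an ordered list of leaves; merging concatenates
# the lists, and the final list is the "HC" seriation order.
def hc_order(D):
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > 1:
        # find the two clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge, keeping left-right order
        del clusters[b]
    return clusters[0]

D = [[0, 9, 3, 7],
     [9, 0, 8, 2],
     [3, 8, 0, 6],
     [7, 2, 6, 0]]
print(hc_order(D))  # [0, 2, 1, 3]
```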
"GW"
, "OLO"
These methods also use the order of the leaf nodes in a dendrogram (see method "HC"), but the leaf nodes are reordered.
A dendrogram (binary tree) with n leaves has n-1 internal nodes (subtrees) and therefore 2^{n-1} possible leaf orderings: at each internal node the left and right subtrees (or leaves) can be swapped, or, in terms of a dendrogram, flipped.
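The flip count can be checked with a tiny enumerator (plain Python; representing trees as nested tuples is a choice made just for this sketch):

```python
# Sketch: enumerate all leaf orders of a binary dendrogram reachable by
# flipping internal nodes. A tree is a leaf label or a (left, right) tuple.
def leaf_orders(tree):
    if not isinstance(tree, tuple):        # a single leaf
        yield [tree]
        return
    left, right = tree
    for l in leaf_orders(left):
        for r in leaf_orders(right):
            yield l + r                    # keep this node's orientation
            yield r + l                    # flip the left and right subtree

# ((a, b), c) has n = 3 leaves, n-1 = 2 internal nodes, 2^2 = 4 leaf orders
orders = list(leaf_orders((("a", "b"), "c")))
print(len(orders))  # 4
```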
Method "GW" uses an algorithm developed by Gruvaeus and Wainer (1972) and implemented in package gclus (Hurley 2004). The clusters are ordered at each level so that the objects at the edge of each cluster are adjacent to the nearest object outside the cluster. The method produces a unique order.
Method "OLO" (optimal leaf ordering, Bar-Joseph et al., 2001) produces an optimal leaf ordering with respect to minimizing the sum of the distances along the (Hamiltonian) path connecting the leaves in the given order. The time complexity of the algorithm is O(n^3).
Note that non-finite distance values are not allowed.
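The objective OLO minimizes is simply the length of the Hamiltonian path through the leaves in their displayed order; as a sketch (plain Python, hypothetical helper name, not the dynamic-programming algorithm itself):

```python
# Sketch: the criterion minimized by optimal leaf ordering -- the sum of
# distances between consecutive objects in the leaf order.
def path_length(D, order):
    return sum(D[a][b] for a, b in zip(order, order[1:]))

D = [[0, 1, 2],
     [1, 0, 3],
     [2, 3, 0]]
print(path_length(D, [0, 1, 2]))  # 1 + 3 = 4
print(path_length(D, [1, 0, 2]))  # 1 + 2 = 3, the better leaf order here
```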
Both methods start with a dendrogram created by hclust. A clustering method (default "complete") can be specified as the "method" element of the control list. Alternatively, an hclust object can be supplied in a control element named "hclust".
Returns the order as an object of class ser_permutation.
Z. Bar-Joseph, E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29.
Brusco, M., Koehn, H.F., and Stahl, S. (2007): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, conditionally accepted.
Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer.
Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.
Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206.
Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806.
Sharlee Climer, Weixiong Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.
solve_TSP in TSP, hclust in stats, criterion, seriate.matrix.
data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(1:nrow(x)),]
d <- dist(x)

## default seriation
order <- seriate(d)
order

## plot
def.par <- par(no.readonly = TRUE)
layout(cbind(1, 2), respect = TRUE)
pimage(d, main = "Random")
pimage(d, order, main = "Reordered")
par(def.par)

## compare quality
rbind(
  random = criterion(d),
  reordered = criterion(d, order)
)