seriate.dist {seriation}R Documentation

Seriation of objects in a dissimilarity matrix

Description

Unidimensional seriation ties to arrange objects in linear order given available data.

Given a dissimilarity matrix seriation typically tries to move small dissimilarity values as close as possible to the diagonal of the symmetric dissimilarity matrix or tries to minimize the dissimilarities between neighboring objects.

Usage

## S3 method for class 'dist':
seriate(x, method = NULL, control = NULL, ...)

Arguments

x an object of class dist.
method a character string with the name of the seriation method (default: "TSP").
control a list of control options passed on to the seriation algorithms.
... further arguments (unused).

Details

Currently the following methods are implemented:

"ARSA"
Anti-Robinson seriation by simulated annealing.

A heuristic developed by Brusco et al (2007).

"BBURCG"
Anti-Robinson seriation (unweighted)

A branch-and-bound implementation by Brusco and Stahl (2005).

"BBWRCG"
Anti-Robinson seriation (weighted)

A branch-and-bound implementation by Brusco and Stahl (2005).

"TSP"
Traveling salesperson problem solver.

A solver in TSP can be used (see solve_TSP in package TSP). The solver method can be passed on via the control argument, e.g. control = list(method = "insertion").

Since a tour returned by a TSP solver is a connected circle and we are looking for a path representing a linear order, we need to find the best cutting point. Climer and Zhang (2006) suggest to add a dummy city with equal distance to each other city before generating the tour. The place of this dummy city in an optimal tour with minimal length is the best cutting point (it lies between the most distant cities).

"Chen"
Rank-two ellipse seriation (Chen 2002).

This method starts with generating a sequence of correlation matrices R^1, R^2, .... R^1 is the correlation matrix of the original distance matrix D (supplied to the function as x), and

R^{n+1} = phi R^n,

where phi calculates the correlation matrix.

The rank of the matrix R^n falls with increasing n. The first R^n in the sequence which has a rank of 2 is found. Projecting all points in this matrix on the first two eigenvectors, all points fall on an ellipse. The order of the points on this ellipse is the resulting order.

The ellipse can be cut at the two interception points (top or bottom) of the vertical axis with the ellipse. In this implementation the top most cutting point is used.

"MDS"
Multidimensional scaling (MDS).

Use multidimensional scaling techniques to find an linear order. Note that unidimensional scaling would be more appropriate but is very hard to solve. Generally, MDS provides good results.

By default, metric MDS (cmdscale in stats) is used. In case of of general dissimilarities, non-metric MDS can be used. The choices are isoMDS and sammon from MASS. The method can be specified as the element method ("cmdscale", "isoMDS" or "sammon") in control.

"HC"
Hierarchical clustering.

Using the order of the leaf nodes in a dendrogram obtained by hierarchical clustering can be used as a very simple seriation technique. This method applies hierarchical clustering (hclust) to x. The clustering method can be given using a "method" element in the control list. If omitted, the default "complete" is used.

"GW", "OLO"
Hierarchical clustering with optional reordering.

Uses also the order of the leaf nodes in a dendrogram (see method "HC"), however, the leaf notes are reordered.

A dendrogram (binary tree) has 2^{n-1} internal nodes (subtrees) and the same number of leaf orderings. That is, at each internal node the left and right subtree (or leaves) can be swapped, or, in terms of a dendrogram, be flipped.

Method "GW" uses an algorithm developed by Gruvaeus and Wainer (1972) and implemented in package gclus (Hurley 2004). The clusters are ordered at each level so that the objects at the edge of each cluster are adjacent to that object outside the cluster to which it is nearest. The method produces an unique order.

Method "OLO" (Optimal leaf ordering, Bar-Joseph et al., 2001) produces an optimal leaf ordering with respect to the minimizing the sum of the distances along the (Hamiltonian) path connecting the leaves in the given order. The time complexity of the algorithm is O(n^3). Note that non-finite distance values are not allowed.

Both methods start with a dendrogram created by hclust. As the "method" element in the control list a clustering method (default "complete") can be specified. Alternatively, a hclust object can be supplied using an element named "hclust".

Value

Returns the order as an object of class ser_permutation.

References

Z. Bar-Joseph, E. D. Demaine, D. K. Gifford, and T. Jaakkola. (2001): Fast Optimal Leaf Ordering for Hierarchical Clustering. Bioinformatics, 17(1), 22–29.

Brusco, M., Koehn, H.F., and Stahl, S. (2007): Heuristic Implementation of Dynamic Programming for Matrix Permutation Problems in Combinatorial Data Analysis. Psychometrika, conditionally accepted.

Brusco, M., and Stahl, S. (2005): Branch-and-Bound Applications in Combinatorial Data Analysis. New York: Springer.

Chen, C. H. (2002): Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices. Statistica Sinica, 12(1), 7–29.

Gruvaeus, G. and Wainer, H. (1972): Two Additions to Hierarchical Cluster Analysis, British Journal of Mathematical and Statistical Psychology, 25, 200–206.

Hurley, Catherine B. (2004): Clustering Visualizations of Multidimensional Data. Journal of Computational and Graphical Statistics, 13(4), 788–806.

Sharlee Climer, Weixiong Zhang (2006): Rearrangement Clustering: Pitfalls, Remedies, and Applications, Journal of Machine Learning Research, 7(Jun), 919–943.

See Also

solve_TSP in TSP, hclust in stats, criterion, seriate.matrix.

Examples

data("iris")
x <- as.matrix(iris[-5])
x <- x[sample(1:nrow(x)),]
d <- dist(x)

## default seriation
order <- seriate(d)
order

## plot
def.par <- par(no.readonly = TRUE)
layout(cbind(1,2), respect = TRUE)

pimage(d, main = "Random")
pimage(d, order, main = "Reordered")

par(def.par)

## compare quality
rbind(
    random = criterion(d),
    reordered = criterion(d, order)
)


[Package seriation version 0.1-1 Index]