ls_fit_ultrametric {clue}R Documentation

Least Squares Fit of Ultrametrics to Dissimilarities

Description

Find the ultrametric minimizing least squares distance (Euclidean dissimilarity) to a given dissimilarity object.

Usage

ls_fit_ultrametric(x, control = list())

Arguments

x a dissimilarity object inheriting from class "dist".
control a list of control parameters. See Details.

Details

With L(u) = sum (x - u)^2, the problem to be solved is minimizing L over all u satisfying the ultrametric constraints (i.e., for all i, j, k, u_{ij} <= max(u_{ik}, u_{jk})). This problem is known to be NP hard (Krivanek and Moravek, 1986).

We follow de Soete (1986) to use a heuristic based on an SUMT (Sequential Unconstrained Minimization Technique) approach in turn simplifying the suggestions in Carroll and Pruzansky (1980). One iteratively minimizes L(u) + rho_k P(u), where P(u) is a non-negative function penalizing violations of the ultrametric constraints such that P(u) is zero iff u is an ultrametric. The rho values are increased according to the rule rho_{k+1} = q rho_k for some constant q > 1, until convergence is obtained in the sense that the Euclidean distance between successive solutions u_k and u_{k+1} is small enough. We then use a final rounding step to ensure that the returned object exactly satisfies the ultrametric constraints. The starting value u_0 is obtained by “random shaking” of the given dissimilarity object.

The unconstrained minimizations are carried out using either optim or nlm, using the analytic gradients given in Carroll and Pruzansky (1980). The following control parameters can be provided via the control argument.

method
a character string, or NULL. If not given, "CG" is used. If equal to "nlm", minimization is carried out using nlm. Otherwise, optim is used with method as the given method.
control
a list of control parameters to be passed to the minimization routine in case optim is used.
eps
the absolute convergence tolerance. Defaults to sqrt(.Machine$double.eps).
q
a double greater than one controlling the growth of the rho_k as described above. Defaults to 10.
verbose
a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

The default optimization using conjugate gradients should work reasonably well for medium to large size problems. For “small” ones, using nlm is usually faster. Note that the number of ultrametric constraints is of the order n^3, where n is the number of objects in the dissimilarity object, suggesting to use the SUMT approach in favor of constrOptim.

It should be noted that the SUMT approach is a heuristic which can not be guaranteed to find the global minimum. Standard practice would recommend to use the best solution found in “sufficiently many” replications of the base algorithm.

Value

An object of class "cl_ultrametric" containing the optimal ultrametric distances.

References

J. D. Carroll and S. Pruzansky (1980). Discrete and hybrid scaling models. In E. D. Lantermann and H. Feger (eds.), Similarity and Choice. Bern (Switzerland): Huber.

M. Krivanek and J. Moravek (1986). NP-hard problems in hierarchical tree clustering. Acta Informatica, 23, 311–323.

G. de Soete (1986). A least squares algorithm for fitting an ultrametric tree to a dissimilarity matrix. Pattern Recognition Letters, 2, 133–137.

See Also

cl_median for computing median hierarchies by least squares fitting average ultrametric distances.

Examples

## Least squares fit of an ultrametric to the Miller-Nicely consonant
## phoneme confusion data.
data("Phonemes")
## Note that the Phonemes data set has the consonant misclassification
## probabilities, i.e., the similarities between the phonemes.
d <- 1 - as.dist(Phonemes)
u <- ls_fit_ultrametric(d, control = list(verbose = TRUE))
## Cophenetic correlation:
cor(d, u)
## Dendrogram:
plot(u)
## ("Basically" the same as Figure 1 in de Soete (1986).)

[Package clue version 0.2-0 Index]