dissimilarity {clue}    R Documentation
Compute the dissimilarity between (ensembles) of partitions or hierarchies.
cl_dissimilarity(x, y = NULL, method = "euclidean", ...)
x
an ensemble of partitions or hierarchies and dissimilarities, or something coercible to that (see cl_ensemble).
y
NULL (default), or as for x.
method
a character string specifying one of the built-in methods for computing dissimilarity, or a function to be taken as a user-defined method. If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods.
...
further arguments to be passed to methods.
If y is given, its components must be of the same kind as those of x (i.e., components must either all be partitions, or all be hierarchies or dissimilarities).
If all components are partitions, the following built-in methods for measuring dissimilarity between two partitions with respective membership matrices u and v (brought to a common number of columns) are available:
"euclidean"
"manhattan"
"comemberships"
"symdiff"
"Rand"
"GV1"
"BA/d"
"BA/A"
is the minimum number of single element moves (move
from one class to another or a new one) needed to transform one
partition into the other. Introduced in Rubin (1967).
"BA/C"
is the minimum number of lattice moves for transforming one partition into the other, where partitions are said to be connected by a lattice move if one is just finer than the other (i.e., there is no other partition between them) in the partition lattice (see cl_meet). Equivalently, with z the join of x and y and S giving the number of classes, this can be written as S(x) + S(y) - 2 S(z). Attributed to David Pavy.
"BA/D"
is the “pair-bonds” distance, which can be defined as S(x) + S(y) - 2 S(z), with z the meet of x and y and S the supervaluation (i.e., non-decreasing with respect to the partial order on the partition lattice) function sum_i (n_i (n_i - 1)) / (n (n - 1)), where the n_i are the numbers of objects in the respective classes of the partition (such that n_i (n_i - 1) / 2 are the numbers of pair bonds in the classes), and n the total number of objects.
"BA/E"
is the normalized information distance, defined as
1 - I / H, where I is the average mutual information
between the partitions, and H is the average entropy of the
meet z of the partitions. Introduced in Rajski (1961).
(Boorman and Arabie also discuss a distance measure (B) based on the minimum number of set moves needed to transform one partition into the other. Unlike the A and C distance measures, it is hard to compute (Day, 1981) and is (currently) not provided.)
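As a rough illustration of the C and E formulas above, the following base R sketch works directly on vectors of class ids rather than on clue objects. The helper names (join_ids, n_classes, entropy, info_distance) are made up for this sketch and are not part of clue. The two id vectors are the ones used in the Boorman and Arabie example in the Examples. This is a sketch of the formulas as stated, not of how clue computes them.

```r
## Class ids of the two hard partitions from the Boorman and Arabie example.
ids1 <- c(1, 2, 2, 2, 3, 3, 2, 2)
ids2 <- c(1, 1, 2, 2, 3, 3, 4, 4)

## Hypothetical helper: class ids of the join (finest common coarsening).
## The smallest label is propagated within the classes of a and of b
## until nothing changes any more.
join_ids <- function(a, b) {
  z <- seq_along(a)
  repeat {
    old <- z
    z <- ave(z, a, FUN = min)
    z <- ave(z, b, FUN = min)
    if (all(z == old)) break
  }
  match(z, unique(z))
}

## S(.) for the lattice move formula: the number of classes.
n_classes <- function(ids) length(unique(ids))

## Lattice move count S(x) + S(y) - 2 S(z), with z the join.
lattice_moves <- n_classes(ids1) + n_classes(ids2) -
  2 * n_classes(join_ids(ids1, ids2))
lattice_moves  # 3 for this pair

## Entropy of a vector of probabilities (zero cells are skipped).
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

## Normalized information distance 1 - I / H, where H is the entropy of
## the meet (the cross-classification) and I = H(a) + H(b) - H.
## Undefined (NaN) if both partitions consist of a single class.
info_distance <- function(a, b) {
  joint <- table(a, b) / length(a)
  h_a <- entropy(rowSums(joint))
  h_b <- entropy(colSums(joint))
  h_meet <- entropy(joint)
  1 - (h_a + h_b - h_meet) / h_meet
}

info_distance(ids1, ids2)
```

The ratio I / H does not depend on the base of the logarithm, so natural logarithms are used here for convenience.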
"VI"
If ... has an argument named weights, it is taken to specify case weights.
"Mallows"
p, alpha, and beta, respectively.

For hard partitions, both Manhattan and squared Euclidean dissimilarity give twice the transfer distance (Charon et al., 2005), which is the minimum number of objects that must be removed so that the implied partitions (restrictions to the remaining objects) are identical. This is also known as the R-metric in Day (1981), i.e., the number of augmentations and removals of single objects needed to transform one partition into the other, and as the partition-distance in Gusfield (2002), and equals twice the single element moves distance of Boorman and Arabie.
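The relation between the Manhattan dissimilarity and the transfer distance for hard partitions can be checked on a small case. The base R sketch below does not use clue: it builds 0/1 membership matrices, minimizes over all column permutations by brute force (only sensible for a handful of classes), and computes the transfer distance as n minus the best one-to-one overlap between classes. The helper names membership and perms are hypothetical and exist only for this illustration.

```r
## Class ids of two hard partitions of 8 objects.
ids1 <- c(1, 2, 2, 2, 3, 3, 2, 2)
ids2 <- c(1, 1, 2, 2, 3, 3, 4, 4)
n <- length(ids1)
k <- 4  # common number of columns

## Hypothetical helper: 0/1 membership matrix with k columns.
membership <- function(ids, k) {
  m <- matrix(0, nrow = length(ids), ncol = k)
  m[cbind(seq_along(ids), ids)] <- 1
  m
}

## Hypothetical helper: all permutations of 1:k, one per row.
perms <- function(k) {
  if (k == 1) return(matrix(1L, 1, 1))
  p <- perms(k - 1)
  do.call(rbind, lapply(seq_len(k), function(i) {
    cbind(i, ifelse(p >= i, p + 1L, p))
  }))
}

u <- membership(ids1, k)
v <- membership(ids2, k)
P <- perms(k)

## Minimal Manhattan and squared Euclidean differences over all
## column permutations of v.
manhattan <- min(apply(P, 1, function(p) sum(abs(u - v[, p]))))
sq_euclid <- min(apply(P, 1, function(p) sum((u - v[, p])^2)))

## Transfer distance: n minus the largest total overlap achievable by
## a one-to-one matching of classes.
overlap <- crossprod(u, v)
best <- max(apply(P, 1, function(p) sum(overlap[cbind(seq_len(k), p)])))
transfer <- n - best

c(manhattan = manhattan, sq_euclid = sq_euclid, transfer = transfer)
```

For this pair the sketch gives 6, 6 and 3, i.e., both dissimilarities equal twice the transfer distance.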
For hard partitions, the pair-bonds (Boorman-Arabie D) distance is identical to the Rand distance, and can also be written as the Manhattan distance between the co-membership matrices corresponding to the partitions, or equivalently, their symdiff distance, normalized by n (n - 1).
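A small base R check of these identities, again on plain vectors of class ids and without clue, might look as follows. The function pair_bond_rate is a hypothetical helper implementing the supervaluation S from the pair-bonds definition; the Rand distance is taken here as the rate of distinct pairs of objects that are in the same class in exactly one of the partitions.

```r
ids1 <- c(1, 2, 2, 2, 3, 3, 2, 2)
ids2 <- c(1, 1, 2, 2, 3, 3, 4, 4)
n <- length(ids1)

## Hypothetical helper: S = sum_i n_i (n_i - 1) / (n (n - 1)).
pair_bond_rate <- function(ids) {
  sizes <- as.vector(table(ids))
  sum(sizes * (sizes - 1)) / (length(ids) * (length(ids) - 1))
}

## The meet is the cross-classification of the two partitions.
meet <- interaction(ids1, ids2, drop = TRUE)
ba_d <- pair_bond_rate(ids1) + pair_bond_rate(ids2) -
  2 * pair_bond_rate(meet)

## Manhattan distance between the 0/1 co-membership matrices,
## normalized by n (n - 1).
co1 <- outer(ids1, ids1, "==") * 1
co2 <- outer(ids2, ids2, "==") * 1
manhattan_co <- sum(abs(co1 - co2)) / (n * (n - 1))

## Rate of distinct pairs co-classified in exactly one partition.
rand_dist <- mean((co1 != co2)[upper.tri(co1)])

c(ba_d = ba_d, manhattan_co = manhattan_co, rand_dist = rand_dist)
```

All three values agree (18/56 for this pair), in line with the statement above.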
If all components are hierarchies, available built-in methods for measuring dissimilarity between two hierarchies with respective ultrametrics u and v are as follows.
"euclidean"
"manhattan"
"cophenetic"
"gamma"
"symdiff"
"Chebyshev"
"Lyapunov"
"BO"
If ... has an argument named delta, it is taken to specify the partition dissimilarity delta to be employed.
The measures based on ultrametrics also allow computing dissimilarity with “raw” dissimilarities on the underlying objects (R objects inheriting from class "dist").
If a user-defined dissimilarity method is to be employed, it must be a function taking two clusterings as its arguments.
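A minimal sketch of such a user-defined method is given below. The core function pair_mismatch_rate is a hypothetical helper working on plain class id vectors; the wrapper user_diss takes two clusterings and assumes that clue::cl_class_ids() can extract class ids from them. The call to cl_dissimilarity is only attempted if clue is installed.

```r
## Hypothetical dissimilarity on class id vectors: the rate of distinct
## pairs of objects that are in the same class in exactly one partition.
pair_mismatch_rate <- function(ia, ib) {
  same_a <- outer(ia, ia, "==")
  same_b <- outer(ib, ib, "==")
  mean((same_a != same_b)[upper.tri(same_a)])
}

## Wrapper taking two clusterings as its arguments.
## Assumption: clue::cl_class_ids() returns the class ids of a partition.
user_diss <- function(a, b) {
  pair_mismatch_rate(as.integer(clue::cl_class_ids(a)),
                     as.integer(clue::cl_class_ids(b)))
}

if (requireNamespace("clue", quietly = TRUE)) {
  P1 <- clue::as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
  P2 <- clue::as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
  print(clue::cl_dissimilarity(P1, P2, method = user_diss))
}
```

Keeping the computation in a function of plain id vectors makes it easy to test independently of the clustering objects.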
Symmetric dissimilarity objects of class "cl_dissimilarity" are implemented as symmetric proximity objects with self-proximities identical to zero, and inherit from class "cl_proximity". They can be coerced to dense square matrices using as.matrix. It is possible to use 2-index matrix-style subscripting for such objects; unless this uses identical row and column indices, this results in a (non-symmetric dissimilarity) object of class "cl_cross_dissimilarity".
Symmetric dissimilarity objects also inherit from class "dist" (although they currently do not “strictly” extend this class), thus making it possible to use them directly for clustering algorithms based on dissimilarity matrices of this class, see the examples.
If y is NULL, an object of class "cl_dissimilarity" containing the dissimilarities between all pairs of components of x. Otherwise, an object of class "cl_cross_dissimilarity" with the dissimilarities between the components of x and the components of y.
S. A. Boorman and P. Arabie (1972). Structural measures and the method of sorting. In R. N. Shepard, A. K. Romney, & S. B. Nerlove (eds.), Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, 1: Theory (pages 225–249). New York: Seminar Press.
S. A. Boorman and D. C. Olivier (1973). Metrics on spaces of finite trees. Journal of Mathematical Psychology, 10, 26–59.
I. Charon, L. Denoeud, A. Guénoche and O. Hudry (2005). Maximum Transfer Distance Between Partitions. Technical Report 2005D003, Ecole Nationale Supérieure des Télécommunications — Paris. http://www.enst.fr/_data/files/docs/id_515_1128675112_271.pdf
W. E. H. Day (1981). The complexity of computing metric distances between partitions. Mathematical Social Sciences, 1, 269–287.
E. Dimitriadou, A. Weingessel and K. Hornik (2002). A combination scheme for fuzzy clustering. International Journal of Pattern Recognition and Artificial Intelligence, 16, 901–912.
A. D. Gordon and M. Vichi (2001). Fuzzy partition models for fitting a set of partitions. Psychometrika, 66, 229–248.
D. Gusfield (2002). Partition-distance: A problem and class of perfect graphs arising in clustering. Information Processing Letters, 82, 159–164.
N. Jardine and E. Sibson (1971). Mathematical Taxonomy. London: Wiley.
M. Meila (2003). Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth (eds.), Learning Theory and Kernel Machines, pages 173–187. Springer-Verlag: Lecture Notes in Computer Science 2777.
C. Rajski (1961). A metric space of discrete probability distributions. Information and Control, 4, 371–377.
J. Rubin (1967). Optimal classification into groups: An approach for solving the taxonomy problem. Journal of Theoretical Biology, 15, 103–144.
D. Zhou, J. Li and H. Zha (2005). A new Mallows distance based metric for comparing clusterings. In Proceedings of the 22nd International Conference on Machine Learning (Bonn, Germany, August 07–11, 2005), pages 1028–1035. ICML '05, volume 119. ACM Press, New York, NY. DOI: http://doi.acm.org/10.1145/1102351.1102481
library("clue")

## An ensemble of partitions.
data("CKME")
pens <- CKME[1 : 30]
diss <- cl_dissimilarity(pens)
summary(c(diss))
cl_dissimilarity(pens[1:5], pens[6:7])
## Equivalently, using subscripting.
diss[1:5, 6:7]
## Can use the dissimilarities for "secondary" clustering
## (e.g. obtaining hierarchies of partitions):
hc <- hclust(diss)
plot(hc)

## Example from Boorman and Arabie (1972).
P1 <- as.cl_partition(c(1, 2, 2, 2, 3, 3, 2, 2))
P2 <- as.cl_partition(c(1, 1, 2, 2, 3, 3, 4, 4))
cl_dissimilarity(P1, P2, "BA/A")
cl_dissimilarity(P1, P2, "BA/C")

## Hierarchical clustering.
d <- dist(USArrests)
x <- hclust(d)
cl_dissimilarity(x, d, "cophenetic")
cl_dissimilarity(x, d, "gamma")