distance {analogue} | R Documentation |
Flexibly calculates distance or dissimilarity measures between a training set "x" and a fossil or test set "y". If "y" is not supplied then the pairwise dissimilarities between samples in the training set, "x", are calculated.
distance(x, y, method = c("euclidean", "SQeuclidean", "chord", "SQchord", "bray", "chi.square", "SQchi.square", "information", "chi.distance", "manhattan", "kendall", "gower", "alt.gower", "mixed"), weights = NULL, R = NULL)
x | data frame or matrix containing the training set samples. |
y | data frame or matrix containing the fossil or test set samples. |
method | character; which choice of dissimilarity coefficient to use. One of the listed options. See Details below. |
weights | numeric; vector of weights for each descriptor. |
R | numeric; vector of ranges for each descriptor. |
A range of dissimilarity coefficients can be used to calculate dissimilarity between samples. The following are currently available:
euclidean | d[jk] = sqrt(sum((x[ij] - x[ik])^2)) |
SQeuclidean | d[jk] = sum((x[ij] - x[ik])^2) |
chord | d[jk] = sqrt(sum((sqrt(x[ij]) - sqrt(x[ik]))^2)) |
SQchord | d[jk] = sum((sqrt(x[ij]) - sqrt(x[ik]))^2) |
bray | d[jk] = sum(abs(x[ij] - x[ik])) / sum(x[ij] + x[ik]) |
chi.square | d[jk] = sqrt(sum((x[ij] - x[ik])^2 / (x[ij] + x[ik]))) |
SQchi.square | d[jk] = sum((x[ij] - x[ik])^2 / (x[ij] + x[ik])) |
information | d[jk] = sum((x[ij] * log((2 * x[ij]) / (x[ij] + x[ik]))) + (x[ik] * log((2 * x[ik]) / (x[ij] + x[ik])))) |
chi.distance | d[jk] = sqrt(sum((x[ij] - x[ik])^2 / (x[i+] / x[++]))) |
manhattan | d[jk] = sum(abs(x[ij] - x[ik])) |
kendall | d[jk] = sum(MAX[i] - min(x[ij], x[ik])) |
gower | d[jk] = sum(abs(x[ij] - x[ik]) / R[i]) |
alt.gower | d[jk] = sqrt(2 * sum(abs(x[ij] - x[ik]) / R[i])) |
where R[i] is the range of proportions for descriptor (variable) i.
mixed | d[jk] = sum(w[i] * s[jki]) / sum(w[i]) |
where w[i] is the weight for descriptor i and s[jki] is the similarity between samples j and k for descriptor (variable) i.
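As an illustration of how the formulae above read in practice, the following sketch transcribes a few of them into base R for a single pair of samples. The vectors xj and xk are made-up data, and the code is a hand transcription for illustration, not the code used inside distance.

```r
## Two hypothetical samples described by four descriptors (proportions)
xj <- c(0.10, 0.40, 0.30, 0.20)
xk <- c(0.20, 0.20, 0.40, 0.20)

## Direct transcriptions of some of the listed coefficients;
## every sum runs over the descriptors
euclidean  <- sqrt(sum((xj - xk)^2))
chord      <- sqrt(sum((sqrt(xj) - sqrt(xk))^2))
bray       <- sum(abs(xj - xk)) / sum(xj + xk)
manhattan  <- sum(abs(xj - xk))
chi.square <- sqrt(sum((xj - xk)^2 / (xj + xk)))

round(c(euclidean = euclidean, chord = chord, bray = bray,
        manhattan = manhattan, chi.square = chi.square), 4)
```

The squared variants (SQeuclidean, SQchord, SQchi.square) are the same expressions without the outer sqrt().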
A matrix of dissimilarities where columns are the samples in "y" and the rows the samples in "x". If "y" is not provided then a square, symmetric matrix of pairwise sample dissimilarities for the training set "x" is returned.
The dissimilarities are calculated in native R code. As such, other implementations (see See Also below) will be quicker. This is done for one main reason: it is hoped that a user-defined function can be supplied as argument "method", allowing users to extend the available coefficients.
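To give an idea of what such user-extension could look like, here is a minimal sketch in base R. The helper cross_dissim and the coefficient sq_chord are hypothetical names introduced for illustration only; they are not part of analogue, and this is not how distance itself is implemented.

```r
## Hypothetical helper, not part of analogue: apply a user-supplied
## coefficient FUN(a, b) to every (row of x, row of y) pair
cross_dissim <- function(x, y, FUN) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  out <- matrix(NA_real_, nrow = nrow(x), ncol = nrow(y),
                dimnames = list(rownames(x), rownames(y)))
  for (j in seq_len(nrow(x))) {
    for (k in seq_len(nrow(y))) {
      out[j, k] <- FUN(x[j, ], y[k, ])
    }
  }
  out
}

## A user-defined coefficient: squared chord distance
sq_chord <- function(a, b) sum((sqrt(a) - sqrt(b))^2)

## Dummy data: 3 training samples, 2 fossil samples, 4 descriptors
set.seed(1)
train  <- matrix(runif(12), ncol = 4, dimnames = list(LETTERS[1:3], NULL))
fossil <- matrix(runif(8),  ncol = 4, dimnames = list(letters[1:2], NULL))

d <- cross_dissim(train, fossil, sq_chord)
dim(d)   # rows are training samples, columns are fossil samples
```

Looping in plain R keeps the coefficient swappable at the cost of speed, which is the same trade-off described in the note above.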
The other advantage of distance over other implementations is the simplicity of calculating only the required pairwise sample dissimilarities between each fossil sample ("y") and each training set sample ("x"). To do this in other implementations, you would need to merge the two sets of samples, calculate the full dissimilarity matrix and then subset it to achieve similar results.
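The merge, calculate and subset route described above can be sketched with dist from base R. The data are dummy values and the object names (both, full, cross) are chosen here for illustration.

```r
## Dummy data: 5 training samples, 3 fossil samples, 4 descriptors
set.seed(42)
train  <- matrix(runif(20), ncol = 4, dimnames = list(LETTERS[1:5], NULL))
fossil <- matrix(runif(12), ncol = 4, dimnames = list(letters[1:3], NULL))

## 1. merge the two sets of samples
both <- rbind(train, fossil)

## 2. calculate the full dissimilarity matrix for all 8 samples
full <- as.matrix(dist(both, method = "euclidean"))

## 3. keep only the training set rows and the fossil columns
cross <- full[rownames(train), rownames(fossil)]
dim(cross)
```

Only 15 of the 64 entries in full are needed here; the rest is computed and then discarded, which is the overhead that supplying both "x" and "y" to distance avoids.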
Gavin L. Simpson
Faith, D.P., Minchin, P.R. and Belbin, L. (1987) Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69, 57–68.
Gavin, D.G., Oswald, W.W., Wahl, E.R. and Williams, J.W. (2003) A statistical approach to evaluating distance metrics and analog assignments for pollen records. Quaternary Research 60, 356–367.
Kendall, D.G. (1970) A mathematical approach to seriation. Philosophical Transactions of the Royal Society of London - Series B 269, 125–135.
Legendre, P. and Legendre, L. (1998) Numerical Ecology, 2nd English Edition. Elsevier Science BV, The Netherlands.
Overpeck, J.T., Webb III, T. and Prentice, I.C. (1985) Quantitative interpretation of fossil pollen spectra: dissimilarity coefficients and the method of modern analogues. Quaternary Research 23, 87–108.
Prentice, I.C. (1980) Multidimensional scaling as a research tool in Quaternary palynology: a review of theory and methods. Review of Palaeobotany and Palynology 31, 71–104.
vegdist in package vegan, daisy in package cluster, and dist provide comparable functionality for the case of missing "y" and are implemented in compiled code, so will be faster.
## simple example using dummy data
train <- data.frame(matrix(abs(runif(200)), ncol = 10))
rownames(train) <- LETTERS[1:20]
colnames(train) <- as.character(1:10)
fossil <- data.frame(matrix(abs(runif(100)), ncol = 10))
colnames(fossil) <- as.character(1:10)
rownames(fossil) <- letters[1:10]

## calculate distances/dissimilarities between train and fossil
## samples
test <- distance(train, fossil)

## using a different coefficient, chi-square distance
test <- distance(train, fossil, method = "chi.distance")

## calculate pairwise distances/dissimilarities for training
## set samples
test2 <- distance(train)

## calculate Gower's general coefficient for mixed data
## first, make a couple of variables factors
fossil[,4] <- factor(sample(rep(1:4, length = 10), 10))
train[,4] <- factor(sample(rep(1:4, length = 20), 20))
## now fit the mixed coefficient
test3 <- distance(train, fossil, "mixed")

## Example from page 260 of Legendre & Legendre (1998)
x1 <- t(c(2,2,NA,2,2,4,2,6))
x2 <- t(c(1,3,3,1,2,2,2,5))
Rj <- c(1,4,2,4,1,3,2,5) # supplied ranges
distance(x1, x2, method = "mixed", R = Rj)
## note this gives 1 - 0.66 (not 0.66 as the answer in
## Legendre & Legendre) as this is expressed as a
## distance whereas Legendre & Legendre describe the
## coefficient as similarity coefficient
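The Legendre & Legendre value quoted in the last example can be checked by hand from the mixed formula, sum(w[i] * s[jki]) / sum(w[i]). The sketch below is base R only and rests on two assumptions that are not taken from the analogue source: all eight descriptors are treated as quantitative, with partial similarity 1 - |difference| / range, and a descriptor gets weight 0 when either value is missing and weight 1 otherwise.

```r
x1 <- c(2, 2, NA, 2, 2, 4, 2, 6)
x2 <- c(1, 3, 3, 1, 2, 2, 2, 5)
Rj <- c(1, 4, 2, 4, 1, 3, 2, 5)   # supplied ranges

## partial similarities (assumed form for quantitative descriptors)
s <- 1 - abs(x1 - x2) / Rj

## assumed weights: 0 where either value is missing, 1 otherwise
w <- as.numeric(!is.na(x1) & !is.na(x2))

## weighted mean similarity; the NA term carries weight 0 and is dropped
similarity <- sum(w * s, na.rm = TRUE) / sum(w)

round(c(similarity = similarity, distance = 1 - similarity), 2)
```

The similarity rounds to the 0.66 given by Legendre & Legendre, and its complement is the value expressed as a distance.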