gower.dist {StatMatch}    R Documentation

Computes Gower's Distance

Description

This function computes Gower's distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.

Usage

gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE)

Arguments

data.x A matrix or a data frame containing variables that should be used in the computation of the distance.
Columns of mode numeric will be considered as interval scaled variables; columns of mode character or class factor will be considered as categorical nominal variables; columns of class ordered will be considered as categorical ordinal variables; and columns of mode logical will be considered as binary asymmetric variables (see Details for further information).
Missing values (NA) are allowed.
If only data.x is supplied, the dissimilarities between rows of data.x will be computed.
data.y A numeric matrix or data frame with the same variables, of the same type, as those in data.x. Dissimilarities between rows of data.x and rows of data.y will be computed. If not provided, by default it is assumed equal to data.x and only dissimilarities between rows of data.x will be computed.
rngs A vector with the ranges used to scale the variables. Its length must be equal to the number of variables in data.x. For nonnumeric variables, set the corresponding element to 1 or NA. When rngs=NULL (default), the range of a numeric variable is estimated by jointly considering the values of the variable in data.x and those in data.y. Therefore, with rngs=NULL, the range for a variable "X1" is:
rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) - 
               min(data.x[,"X1"], data.y[,"X1"])
KR.corr When TRUE (default), the extension of Gower's dissimilarity measure proposed by Kaufman and Rousseeuw (1990) is used. Otherwise, when KR.corr=FALSE, the original Gower (1971) dissimilarity is used.
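
As an illustration of the default behaviour of rngs described above, the following base R sketch computes the joint range of each numeric column across two data frames. The helper name joint_ranges and the toy data are assumptions made for this example and are not part of StatMatch; dropping NAs with na.rm=TRUE is likewise an assumption. Nonnumeric columns get NA, which the rngs argument accepts.

```r
# Hypothetical helper (not part of StatMatch): joint range of each numeric
# column over data.x and data.y; NA for nonnumeric columns.
joint_ranges <- function(data.x, data.y = data.x) {
  sapply(names(data.x), function(v) {
    if (is.numeric(data.x[[v]])) {
      vals <- c(data.x[[v]], data.y[[v]])
      # na.rm = TRUE is an assumption: missing values are simply ignored
      max(vals, na.rm = TRUE) - min(vals, na.rm = TRUE)
    } else {
      NA_real_
    }
  })
}

# Toy data, assumed for illustration only
dx <- data.frame(X1 = c(1, 4, 2), G = c("a", "b", "a"),
                 stringsAsFactors = FALSE)
dy <- data.frame(X1 = c(0, 10), G = c("b", "b"),
                 stringsAsFactors = FALSE)

rngs <- joint_ranges(dx, dy)
rngs  # X1 is 10 (= 10 - 0), G is NA
```

A vector built this way could then be supplied as rngs=rngs, for instance to keep the scaling fixed across several calls.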

Details

This function computes distances among records when variables of different types (categorical and continuous) have been observed. To handle the different types of variables, Gower's dissimilarity coefficient (Gower, 1971) is used.

By default (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used. The final dissimilarity between the ith and jth units is obtained as a weighted average of the dissimilarities for each variable:

d(i,j) = sum_k(delta_ijk * d_ijk ) / sum_k( delta_ijk )

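In particular, d_ijk represents the distance between the ith and jth unit computed considering the kth variable. It depends on the nature of the variable, as determined by the column type (see the description of data.x in Arguments); numeric variables are scaled by their range (see rngs).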

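As far as the weight delta_ijk is concerned, it is 0 when x_ik or x_jk is NA, or when the kth variable is logical and x_ik = x_jk = FALSE, and 1 otherwise.

In practice, NAs and pairs of cases with x_ik = x_jk = FALSE do not contribute to the distance computation.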
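
To make the weighted formula concrete, here is a minimal base R sketch for a single pair of units. It is not the StatMatch implementation: the helper gower_pair and the toy units are hypothetical, the per-variable distances are assumed to follow the usual definitions of Gower's coefficient (range-scaled absolute difference for numeric variables, 0/1 mismatch for nominal and logical ones), and ordered variables are left out.

```r
# Hypothetical helper (not part of StatMatch): Gower dissimilarity between
# two units xi and xj, given as lists holding the same variables.
# rngs holds the range of each numeric variable (ignored for the others).
gower_pair <- function(xi, xj, rngs) {
  num <- 0  # accumulates sum_k delta_ijk * d_ijk
  den <- 0  # accumulates sum_k delta_ijk
  for (k in seq_along(xi)) {
    a <- xi[[k]]
    b <- xj[[k]]
    # delta_ijk = 0: a missing value in either unit
    if (is.na(a) || is.na(b)) next
    if (is.logical(a)) {
      # delta_ijk = 0: asymmetric binary variable with both values FALSE
      if (!a && !b) next
      d <- as.numeric(a != b)
    } else if (is.numeric(a)) {
      # interval scaled variable: absolute difference over the range
      d <- abs(a - b) / rngs[[k]]
    } else {
      # nominal variable: simple mismatch
      d <- as.numeric(a != b)
    }
    num <- num + d
    den <- den + 1
  }
  if (den == 0) NA_real_ else num / den
}

# Toy units, assumed for illustration only
u1 <- list(x1 = TRUE,  x2 = "a", x3 = 2.0)
u2 <- list(x1 = FALSE, x2 = "a", x3 = 4.0)
u3 <- list(x1 = FALSE, x2 = "b", x3 = NA)

d12 <- gower_pair(u1, u2, rngs = c(NA, NA, 4))  # (1 + 0 + 0.5) / 3 = 0.5
d23 <- gower_pair(u2, u3, rngs = c(NA, NA, 4))  # only x2 contributes: 1
```

In the second pair, x1 is dropped because both values are FALSE and x3 is dropped because of the NA, so the dissimilarity rests on x2 alone.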

Value

A matrix of distances between the rows of data.x and the rows of data.y.

Author(s)

Marcello D'Orazio madorazi@istat.it

References

Gower, J. C. (1971), “A general coefficient of similarity and some of its properties”. Biometrics, 27, 623–637.

Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

See Also

daisy, dist

Examples


x1 <- as.logical(rbinom(10,1,0.5)) 
x2 <- sample(letters, 10, replace=TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)

# matrix of distances among observations in xx
gower.dist(xx)

# matrix of distances between the first three obs. in xx
# and the remaining ones
gower.dist(data.x=xx[1:3,], data.y=xx[4:10,])


[Package StatMatch version 0.6 Index]