nncluster {nnclust}    R Documentation

Fast clustering with restarted minimum spanning tree.

Description

Uses Prim's algorithm to build a minimum spanning tree for each cluster, stopping when the nearest-neighbour distance rises above a specified threshold. Returns a set of clusters and a set of 'outliers' not in any cluster. trimCluster tidies up the output by converting small clusters to outliers, clusterMember returns cluster membership for the original data points, and nearestCluster assigns outliers to the nearest cluster.

Usage

nncluster(x, threshold, fill = 0.95, maxclust = 20, give.up = 500, verbose = FALSE, start = NULL)
trimCluster(nnclust, size = 10)
clusterMember(nnclust, outlier = TRUE)
nearestCluster(nnclust, threshold = Inf, outlier = FALSE)

Arguments

x          Data matrix.
threshold  Threshold for stopping the tree building within a cluster. Tree building stops when the squared Euclidean distance from the tree to the closest point not yet in it is greater than this. If threshold is a vector, the elements are used in succession, with the last element repeated as necessary.
fill       Stop when the clusters make up this fraction of the data.
maxclust   Stop at this many clusters.
give.up    Stop when fewer than this many pairs have nearest-neighbour distance less than threshold.
verbose    Print some cluster summaries before restarting?
nnclust    An object of class nncluster, as returned by nncluster.
size       Clusters smaller than this are added to the 'outlier' set.
outlier    If FALSE, use NA as the cluster identifier for outliers.
start      Integer index of the observation at which to start the minimum spanning tree.
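
Because threshold is compared with a squared distance, it is easy to pick a value on the wrong scale. The base-R sketch below shows the relationship and one reading of the recycling rule for a vector threshold. The helper threshold_for is hypothetical (not part of nnclust) and assumes that one element of threshold is consumed per cluster.

```r
# Two points from the scaled faithful data (a built-in data set).
x <- scale(faithful)
p <- x[1, ]
q <- x[2, ]

# The threshold is compared with the SQUARED Euclidean distance,
# so threshold = 0.1 corresponds to a distance of sqrt(0.1).
sq_dist <- sum((p - q)^2)
stopifnot(isTRUE(all.equal(sqrt(sq_dist), as.numeric(dist(rbind(p, q))))))

# Hypothetical helper: the threshold assumed to apply to the k-th cluster
# when `threshold` is a vector, repeating the last element as necessary.
threshold_for <- function(threshold, k) threshold[pmin(k, length(threshold))]

threshold_for(c(0.05, 0.1, 0.2), 1:5)
# 0.05 0.10 0.20 0.20 0.20
```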

Details

Works best for well-separated clusters in up to 8 dimensions, and sample sizes up to hundreds of thousands.
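
To make the stopping rule from the Description concrete, here is a brute-force base-R sketch of growing one tree with Prim's algorithm until the closest remaining point is farther than the threshold, then restarting on the remaining points. It is illustrative only: grow_cluster is a hypothetical name, the search is O(n^2) rather than using a fast nearest-neighbour finder, it records members but not the tree edges, and it ignores the fill, maxclust and give.up stopping rules.

```r
# Hypothetical helper, not the package's implementation.
# Returns the row indices reached from `start` without ever taking a
# step whose squared Euclidean length exceeds `threshold`.
grow_cluster <- function(x, threshold, start = 1L) {
  in_tree <- logical(nrow(x))
  in_tree[start] <- TRUE
  # Squared distance from every point to the current tree.
  best <- rowSums(sweep(x, 2, x[start, ])^2)
  best[in_tree] <- Inf
  repeat {
    if (all(in_tree)) break
    nxt <- which.min(best)
    # Stop when the closest remaining point is beyond the threshold.
    if (best[nxt] > threshold) break
    in_tree[nxt] <- TRUE
    best <- pmin(best, rowSums(sweep(x, 2, x[nxt, ])^2))
    best[in_tree] <- Inf
  }
  which(in_tree)
}

# Toy data: two well-separated groups on a line.
x <- rbind(cbind(c(0, 0.1, 0.2), 0), cbind(c(5, 5.1), 0))

# Restart on whatever is left until every point has been visited.
clusters <- list()
remaining <- seq_len(nrow(x))
while (length(remaining) > 0) {
  members <- grow_cluster(x[remaining, , drop = FALSE], threshold = 0.05)
  clusters[[length(clusters) + 1L]] <- remaining[members]
  remaining <- remaining[-members]
}
clusters
```

On this toy input the first tree collects rows 1 to 3 and the restart collects rows 4 and 5.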

If you want a complete minimum spanning tree, run mst on the outlier set and then use nnfind to find the shortest links connecting the clusters. When there are well-separated clusters this will be faster than running mst once on the whole data set.

clusterMember returns a vector of integers indicating cluster membership. Outliers are treated as a separate cluster if outlier is TRUE; otherwise they are coded as NA. nearestCluster assigns each outlier that lies within distance threshold of a cluster to the cluster whose nearest member is closest.
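
The two codings and the nearest-cluster rule can be illustrated in base R on a toy data set. This is a sketch under assumptions, not the package code: assign_nearest is a hypothetical helper, the label chosen for the outlier group is arbitrary, and the threshold is assumed here to be on the squared Euclidean scale.

```r
# Toy data: two clusters on a line plus two unclustered points.
x <- matrix(c(0, 0.1, 0.2, 5, 5.1, 0.5, 9), ncol = 1)
clusters <- list(1:3, 4:5)   # row numbers of each cluster

# Membership with outliers coded as NA (the outlier = FALSE convention).
id_na <- rep(NA_integer_, nrow(x))
for (k in seq_along(clusters)) id_na[clusters[[k]]] <- k

# Membership with outliers as their own group (the outlier = TRUE
# convention); the label used for that group is an arbitrary choice here.
id_grp <- ifelse(is.na(id_na), length(clusters) + 1L, id_na)

# Hypothetical helper: give each outlier the label of its nearest
# clustered point, provided that point is closer than `threshold`.
assign_nearest <- function(x, id, threshold = Inf) {
  clustered <- which(!is.na(id))
  for (i in which(is.na(id))) {
    d <- rowSums(sweep(x[clustered, , drop = FALSE], 2, x[i, ])^2)
    j <- which.min(d)
    if (d[j] < threshold) id[i] <- id[clustered[j]]
  }
  id
}

assign_nearest(x, id_na, threshold = 1)
# 1 1 1 2 2 1 NA
```

The point at 0.5 joins cluster 1, while the point at 9 is too far from either cluster and stays NA.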

trimCluster returns a new nncluster object with small clusters converted to outliers. There must be at least one cluster larger than size.
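
Since trimCluster requires at least one cluster larger than size, it can help to check the cluster sizes first. The sketch below does so using the rows component of each cluster and only runs if nnclust is installed. The choice of size = 10 and the use of table to summarise the result are illustrative.

```r
has_nnclust <- requireNamespace("nnclust", quietly = TRUE)
if (has_nnclust) {
  x <- scale(faithful)
  a <- nnclust::nncluster(x, threshold = 0.1, give.up = 0, fill = 1)

  # Cluster sizes from the `rows` component of every element except
  # the last one, which holds the outliers.
  cl <- unclass(a)
  sizes <- vapply(cl[-length(cl)], function(e) length(e$rows), integer(1))

  # Only trim when at least one cluster is larger than `size`.
  if (any(sizes > 10)) {
    b <- nnclust::trimCluster(a, size = 10)
    # With outlier = FALSE the outliers are coded as NA.
    print(table(nnclust::clusterMember(b, outlier = FALSE), useNA = "ifany"))
  }
}
```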

Value

A list of class nncluster. Each element but the last describes a cluster, with components mst containing the tree, x containing the data, and rows containing row numbers in the initial data set.
The last element describes the unclustered outliers and has no mst component.
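
As a quick way to see this structure, the following sketch inspects the first and last elements of a fitted object. It only runs if nnclust is installed, and the variable names are illustrative.

```r
has_nnclust <- requireNamespace("nnclust", quietly = TRUE)
if (has_nnclust) {
  x <- scale(faithful)
  a <- nnclust::nncluster(x, threshold = 0.1, give.up = 0, fill = 1)

  first <- a[[1]]           # a cluster when any were found: mst, x and rows
  last  <- a[[length(a)]]   # the outlier set

  print(names(first))
  print(head(first$rows))   # row numbers in the original data
  print(is.null(last$mst))  # the outlier element has no mst component
}
```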

Note

The performance of this algorithm depends critically on the performance of the nearest-neighbour finder, and can decay catastrophically if too many uninformative variables are added.

The performance can also be poor if the data are close to being ordered on some of the variables.

Author(s)

Thomas Lumley

See Also

mst, nnfind

Examples

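library(nnclust)
# Standardise both columns so that each contributes equally to distances
x <- scale(faithful)
a <- nncluster(x, threshold = 0.1, give.up = 0, fill = 1)
a
# Colour each point by its cluster; outliers form a separate group
id <- clusterMember(a)
plot(faithful, col = id, pch = 19)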

[Package nnclust version 2.2]