dbscan {fpc}R Documentation

Clustering: DBSCAN density reachability and connectivity

Description

Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).

Usage

  dbscan(data, eps, MinPts = 5, scale = FALSE, method = c("hybrid", "raw",
    "dist"), seeds = TRUE, showplot = FALSE, countmode = NULL)
  ## S3 method for class 'dbscan':
  print(x, ...)
  ## S3 method for class 'dbscan':
  plot(x, data, ...)
  ## S3 method for class 'dbscan':
  predict(object, data, newdata = NULL,
predict.max=1000, ...)

Arguments

data data matrix, data.frame, dissimilarity matrix or dist-object
eps Reachability Distance
MinPts Reachability minimum no. of points
scale scale the data
method "dist" treats data as distance matrix (relatively fast but memory expensive), "raw" treats data as raw data and avoids calculating a distance matrix (saves memory but may be slow), "hybrid" expects also raw data, but calculates partial distance matrices (very fast with moderate memory requirements )
seeds FALSE to not include the isseed-vector in the dbscan-object
showplot 0 = no plot, 1 = plot per iteration, 2 = plot per subiteration
countmode NULL or vector of point numbers at which to report progress
x object of class dbscan.
object object of class dbscan.
newdata matrix or data.frame with raw data to predict
predict.max max. batch size for predictions
... Further arguments transferred to plot methods.

Details

Clusters require a minimum no of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps around any point which satisfies the seed condition is a cluster member (recursively). Some points may not belong to any clusters (noise).

We have clustered a 100.000 x 2 dataset in 40 minutes on a Pentium M 1600 MHz.

print.dbscan shows a statistic of the number of points belonging to the clusters that are seeds and border points.

plot.dbscan distinguishes between seed and border points by plot symbol.

Value

predict.dbscan gives out a vector of predicted clusters for the points in newdata.
dbscan gives out an object of class 'dbscan' which is a LIST with components

cluster integer vector coding cluster membership with noise observations (singletons) coded as 0
isseed logical vector indicating whether a point is a seed (not border, not noise)
eps parameter eps
MinPts parameter MinPts

Note

this is a simplified version of the original algorithm (no K-D-trees used), thus we have o(n^2) instead of o(n*log(n))

Author(s)

Jens Oehlschlaegel, based on a draft by Christian Hennig.

References

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Examples

  set.seed(665544)
  n <- 600
  x <- cbind(runif(10, 0, 10)+rnorm(n, sd=0.2), runif(10, 0, 10)+rnorm(n,
    sd=0.2))
  par(bg="grey40")
  ds <- dbscan(x, 0.2, showplot=TRUE)
  ds
  plot(ds, x)

  x2 <- matrix(0,nrow=4,ncol=2)
  x2[1,] <- c(5,2)
  x2[2,] <- c(8,3)
  x2[3,] <- c(4,4)
  x2[4,] <- c(9,9)
  predict(ds, x, x2)

  n <- 600
  x <- cbind((1:3)+rnorm(n, sd=0.2), (1:3)+rnorm(n, sd=0.2))

  system.time(ds <- dbscan(x, 0.3, countmode=NULL, method="raw"))[3] 
  system.time(dsb <- dbscan(x, 0.3, countmode=NULL, method="hybrid"))[3]
  system.time(dsc <- dbscan(dist(x), 0.3, countmode=NULL,
    method="dist"))[3]

[Package fpc version 1.2-3 Index]