cluster.Gen {clusterSim} | R Documentation |
Random cluster generation with known structure of clusters (optionally with noisy variables and outliers)
cluster.Gen(numObjects=50, means=NULL, cov=NULL, fixedCov=TRUE, model=1, dataType="m", numCategories=NULL, numNoisyVar=0, numOutliers=0, rangeOutliers=c(1,10), inputType="csv2", inputHeader=TRUE, inputRowNames=TRUE, outputCsv="", outputCsv2="", outputColNames=TRUE, outputRowNames=TRUE)
numObjects |
number of objects in each cluster - positive integer value or vector with the same size as nrow(means),
e.g. numObjects=c(50,20) |
means |
matrix of cluster means (e.g. means=matrix(c(0,8,0,8),2,2)). If means=NULL, the matrix is read from the means_<modelNumber>.csv file |
cov |
covariance matrix (the same for each cluster, e.g. cov=matrix(c(1,0,0,1),2,2)).
If cov=NULL, the matrix is read from the cov_<modelNumber>.csv file.
Note: this argument cannot be used to generate clusters with different covariance matrices. To generate such clusters, set fixedCov to FALSE and use an appropriate model |
model |
model number:
model=1 - no cluster structure. Observations are simulated from a uniform distribution over the unit hypercube, with the number of dimensions (variables) given by the numNoisyVar argument;
model=2 - means and covariances are taken from the arguments means and cov (see Example 1);
model=3,4,...,20 - see file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf;
model=21,22,... - if fixedCov=TRUE, means are read from means_<modelNumber>.csv and the covariance matrix for all clusters is read from cov_<modelNumber>.csv; if fixedCov=FALSE, means are read from means_<modelNumber>.csv and a separate covariance matrix is read for each cluster from cov_<modelNumber>_<clusterNumber>.csv |
fixedCov |
if fixedCov=TRUE, the covariance matrix is the same for all clusters; if fixedCov=FALSE, each cluster is generated from a different covariance matrix - see model |
dataType |
"m" - metric (ratio, interval), "o" - ordinal, "s" - symbolic interval |
numCategories |
number of categories (for ordinal data only). Positive integer value or vector with the same size as ncol(means) plus number of noisy variables. |
numNoisyVar |
number of noisy variables. For model=1 this is the total number of variables |
numOutliers |
number of outliers (for metric and symbolic interval data only). A positive integer gives the number of outliers; a value between 0 and 1 gives the fraction of outliers in the whole data set |
rangeOutliers |
range for outliers (for metric and symbolic interval data only). The default range is [1, 10]. The outliers are generated independently for each variable for the whole data set from a uniform distribution. The generated values are randomly added to the maximum of the j-th variable or subtracted from the minimum of the j-th variable |
inputType |
"csv" - a dot as decimal point or "csv2" - a comma as decimal point in
means_<modelNumber>.csv and cov_<modelNumber>.csv files |
inputHeader |
inputHeader=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a header row |
inputRowNames |
inputRowNames=TRUE indicates that the input files (means_<modelNumber>.csv; cov_<modelNumber...>.csv) contain a first column with row names or with the number of objects (positive integer values) |
outputCsv |
optional, name of csv file with generated data (first column contains id, second - number of cluster and others - data) |
outputCsv2 |
optional, name of csv (a comma as decimal point and a semicolon as field separator) file with generated data (first column contains id, second - number of cluster and others - data) |
outputColNames |
outputColNames=TRUE indicates that the output file (given by the outputCsv and outputCsv2 parameters) contains a first row with column names |
outputRowNames |
outputRowNames=TRUE indicates that the output file (given by the outputCsv and outputCsv2 parameters) contains a vector of row names |
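The input-file arguments above can be illustrated with a minimal base R sketch. It assumes model=21 with fixedCov=TRUE and the defaults inputType="csv2", inputHeader=TRUE and inputRowNames=TRUE. The matrix values and the use of write.csv2 to produce the files are assumptions for illustration only; the layout actually expected by cluster.Gen is described in clusterGen_details.pdf.

```r
# Hypothetical two-cluster, two-variable configuration (illustrative values)
means <- matrix(c(0, 8, 0, 8), 2, 2)
cov <- matrix(c(1, 0.5, 0.5, 1), 2, 2)

# Written to a temporary directory here so the sketch leaves no files behind
dir <- tempdir()
meansFile <- file.path(dir, "means_21.csv")
covFile <- file.path(dir, "cov_21.csv")

# write.csv2 uses a comma as decimal point and a semicolon as field separator,
# and by default writes a header row and a first column of row names
write.csv2(means, meansFile)
write.csv2(cov, covFile)

# Read the files back with the matching reader to check the round trip
meansBack <- as.matrix(read.csv2(meansFile, header = TRUE, row.names = 1))
covBack <- as.matrix(read.csv2(covFile, header = TRUE, row.names = 1))
```

Example 6 states that such files are to be placed in the working directory, so for an actual call the two paths would point there rather than to tempdir().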
See file $R_HOME\library\clusterSim\pdf\clusterGen_details.pdf for further details
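The description of numOutliers and rangeOutliers can be sketched in base R as follows. This is not the clusterSim implementation: sketchOutliers is a hypothetical helper, and the rounding of a fractional numOutliers and the equal chance of placing an outlier above or below the data are assumptions.

```r
# Illustrative sketch only; sketchOutliers is a hypothetical helper name
sketchOutliers <- function(data, numOutliers, rangeOutliers = c(1, 10)) {
  # A value between 0 and 1 is read as a fraction of the data set
  # (rounding to the nearest integer is an assumption)
  if (numOutliers > 0 && numOutliers < 1) {
    numOutliers <- round(numOutliers * nrow(data))
  }
  out <- matrix(NA_real_, nrow = numOutliers, ncol = ncol(data))
  for (j in seq_len(ncol(data))) {
    # Offsets are drawn independently for each variable from a uniform distribution
    offset <- runif(numOutliers, rangeOutliers[1], rangeOutliers[2])
    # Each offset is added to the maximum or subtracted from the minimum
    # of the j-th variable, chosen at random (equal probability assumed)
    above <- runif(numOutliers) < 0.5
    out[, j] <- ifelse(above, max(data[, j]) + offset, min(data[, j]) - offset)
  }
  out
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)        # 100 objects, 2 variables
o <- sketchOutliers(x, numOutliers = 0.1) # 10% of the data set
```

With the default range, every generated value lies at least 1 unit outside the observed range of its variable.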
clusters |
cluster number for each object. For model=1 each object belongs to its own cluster, so this variable contains the object numbers |
data |
generated data: for metric and ordinal data - a matrix with objects in rows and variables in columns; for symbolic interval data - a three-dimensional structure in which the first dimension represents the object number, the second the variable number, and the third contains the lower and upper bounds of the intervals |
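The three-dimensional structure returned for symbolic interval data can be pictured with a small base R stand-in. The object res below is built by hand and is not produced by cluster.Gen; the interval half-width and the ordering of the third dimension (lower bound first, upper bound second) are assumptions for illustration.

```r
# Hypothetical stand-in for a result with dataType="s"
numObjects <- 4
numVars <- 2
centers <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), numObjects, numVars)
halfWidth <- 0.5  # illustrative value, not a cluster.Gen argument

# Dimensions: object number, variable number, interval bound
symbolic <- array(NA_real_, dim = c(numObjects, numVars, 2))
symbolic[, , 1] <- centers - halfWidth  # lower bounds (assumed first)
symbolic[, , 2] <- centers + halfWidth  # upper bounds (assumed second)
res <- list(clusters = c(1, 1, 2, 2), data = symbolic)

# Interval of variable 2 for object 3
interval <- res$data[3, 2, ]

# Matrices of all lower and all upper bounds
lower <- res$data[, , 1]
upper <- res$data[, , 2]
```

Indexing with two fixed positions and an empty third one returns the pair of bounds for a single object and variable, while fixing only the third index returns an objects-by-variables matrix.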
Marek Walesiak marek.walesiak@ue.wroc.pl, Andrzej Dudek andrzej.dudek@ue.wroc.pl
Department of Econometrics and Computer Science, University of Economics, Wroclaw, Poland http://keii.ue.wroc.pl/clusterSim
Billard, L., Diday, E. (2006), Symbolic data analysis. Conceptual statistics and data mining, Wiley, Chichester.
Qiu, W., Joe, H. (2006), Generation of random clusters with specified degree of separation, "Journal of Classification", vol. 23, 315-334.
Steinley, D., Henson, R. (2005), OCLUS: an analytic method for generating clusters with known overlap, "Journal of Classification", vol. 22, 221-250.
Walesiak, M., Dudek, A. (2008), Identification of noisy variables for nonmetric and symbolic data in cluster analysis, In: C. Preisach, H. Burkhardt, L. Schmidt-Thieme, R. Decker (Eds.), Data analysis, machine learning and applications, Springer-Verlag, Berlin, Heidelberg, 85-92.
# Example 1
library(clusterSim)
means <- matrix(c(0,7,0,7),2,2)
cov <- matrix(c(1,0,0,1),2,2)
grnd <- cluster.Gen(numObjects=60,means=means,cov=cov,model=2,
  numOutliers=8)
colornames <- c("red","blue","green")
grnd$clusters[grnd$clusters==0] <- length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 2
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="m",numNoisyVar=2)
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 3
library(clusterSim)
grnd <- cluster.Gen(50,model=4,dataType="o",numCategories=7,
  numNoisyVar=2)
plotCategorial(grnd$data,,grnd$clusters,ask=TRUE)

# Example 4 (1 nonnoisy variable and 2 noisy variables, 3 clusters)
library(clusterSim)
grnd <- cluster.Gen(c(40,60,20), model=2, means=c(2,14,25),
  cov=c(1.5,1.5,1.5),numNoisyVar=2)
colornames <- c("red","blue","green")
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 5
library(clusterSim)
grnd <- cluster.Gen(c(20,35,20,25),model=14,dataType="m",numNoisyVar=1,
  fixedCov=FALSE, numOutliers=0.1, outputCsv2="data14.csv")
data <- as.matrix(grnd$data)
colornames <- c("red","blue","green","brown","black")
grnd$clusters[grnd$clusters==0] <- length(colornames)
plot(grnd$data,col=colornames[grnd$clusters],ask=TRUE)

# Example 6 (this example needs files means_24.csv
# and cov_24.csv to be placed in working directory)
# library(clusterSim)
# grnd <- cluster.Gen(c(50,80,20),model=24,dataType="m",numNoisyVar=1,
#   numOutliers=10, rangeOutliers=c(1,5))
# print(grnd)
# data <- as.data.frame(grnd$data)
# colornames <- c("red","blue","green","brown")
# grnd$clusters[grnd$clusters==0] <- length(colornames)
# plot(data,col=colornames[grnd$clusters],ask=TRUE)

# Example 7 (this example needs files means_25.csv and cov_25_1.csv,
# cov_25_2.csv, cov_25_3.csv, cov_25_4.csv, cov_25_5.csv
# to be placed in working directory)
# library(clusterSim)
# grnd <- cluster.Gen(c(40,30,20,35,45),model=25,numNoisyVar=3,fixedCov=FALSE)
# data <- as.data.frame(grnd$data)
# colornames <- c("red","blue","green","magenta","brown")
# plot(data,col=colornames[grnd$clusters],ask=TRUE)