kml {kml} | R Documentation |
KmL is a new implementation of k-means for longitudinal data (or trajectories). The algorithm can deal with missing values and
provides an easy way to rerun the algorithm several times, varying the starting conditions and/or the number of clusters sought.
This page describes the algorithm. For an overview of the package, see kml-package.
kml(Object, nbClusters = 2:6, nbRedrawing = 20, saveFreq = 100, maxIt = 200, printCal = FALSE, printTraj = FALSE, distance=function(x,y){return(dist(t(cbind(x,y))))})
Object |
[ClusterizLongData]: contains trajectories to clusterize as well as pre-existing clusterizations. |
nbClusters |
[vector(numeric)]: Vector containing the numbers of clusters with which kml must work. By default, nbClusters is 2:6, which indicates that kml must search for partitions with 2, then 3, ... up to 6 clusters. The maximum number of clusters is 25. |
nbRedrawing |
[numeric]: Sets the number of k-means runs (with different starting conditions) that must be performed for each number of clusters. |
saveFreq |
[numeric]: Long computations can take several days, so it is possible to save the ClusterizLongData object once in a while. saveFreq defines the frequency of the saving process: the ClusterizLongData object is saved every saveFreq clusterization calculations. The object is saved in the file objectName.Rdata in the current folder. |
maxIt |
[numeric]: Sets a limit on the number of iterations if convergence is not reached. |
printCal |
[logical]: If TRUE, the Calinski criterion is printed on screen during computation (if the number of redrawings is large, this can slow the overall calculation process). |
printTraj |
[logical]: If TRUE, each step of k-means is printed on screen during the calculation. This slows the overall calculation process by a factor of 25, see "optimisation" below. |
distance |
[function(numeric,numeric)]: Function that computes the distance between two trajectories. The default function is the Euclidean distance with Gower adjustment (the Gower adjustment takes missing values into account). Changing the distance can slow the overall calculation process by a factor of 25, see "optimisation" below. |
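To illustrate what the default distance computes, the sketch below spells out a Euclidean distance with Gower adjustment in base R and compares it with the default value of the distance argument. The helper name gowerEuclid and the two sample trajectories are hypothetical. The rescaling rule follows the documented behaviour of stats::dist when values are missing, which is assumed here to be what the adjustment amounts to.

```r
# Hypothetical helper: Euclidean distance that skips time points where either
# trajectory is missing, then rescales the sum to the full trajectory length.
gowerEuclid <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)        # time points observed in both trajectories
  if (!any(ok)) return(NA_real_)     # nothing to compare
  sqrt(sum((x[ok] - y[ok])^2) * length(x) / sum(ok))
}

# Default value of the distance argument, copied from the usage line
defaultDist <- function(x, y) { return(dist(t(cbind(x, y)))) }

x <- c(1, 2, NA, 4)   # made-up trajectory with one missing value
y <- c(2, 2, 3, 6)    # made-up complete trajectory

d1 <- gowerEuclid(x, y)
d2 <- as.numeric(defaultDist(x, y))
```

Any replacement passed as distance should keep the same two-argument signature, for example function(x, y) dist(t(cbind(x, y)), method = "manhattan").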
kml works on an object of class ClusterizLongData. For each number included in nbClusters, kml computes a Clusterization, then stores it in the field clusters of the ClusterizLongData object according to its number of clusters. The algorithm starts over as many times as specified by nbRedrawing. By default, it is executed for 2, 3, 4, 5 and 6 clusters 20 times each, that is 100 times in total.
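The total number of runs is therefore length(nbClusters) times nbRedrawing. A minimal sketch of this loop structure in base R follows. It uses stats::kmeans as a stand-in for the k-means that kml runs internally (an assumption: kmeans does not handle missing values), and the trajectories are made up.

```r
set.seed(1)
# Made-up data: 30 trajectories (rows) measured at 5 time points (columns)
traj <- rbind(matrix(rnorm(75, mean = 0), ncol = 5),
              matrix(rnorm(75, mean = 5), ncol = 5))

nbClusters  <- 2:6   # same defaults as in the usage line
nbRedrawing <- 20

runs <- list()
for (k in nbClusters) {
  for (r in seq_len(nbRedrawing)) {
    # each call to kmeans draws new random starting centers
    runs[[length(runs) + 1]] <- kmeans(traj, centers = k, iter.max = 200)
  }
}
nbRuns <- length(runs)   # 5 cluster counts x 20 redrawings
```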
When a Clusterization has been found, it is added to the slot clusters. clusters is a list of 25 sublists called c1, c2, c3 and so on up to c25. The sublist cX stores all the Clusterization objects with X clusters. Inside a sublist, the Clusterization objects are sorted from the largest Calinski criterion to the smallest, so the best are stored first.
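The ordering inside a sublist can be sketched as follows. The helper calinski implements the standard Calinski-Harabasz formula (between-cluster variance divided by within-cluster variance, each scaled by its degrees of freedom). Whether kml uses exactly this variant is not stated on this page, so treat it as an assumption. The helper name, the data, and the use of stats::kmeans are all illustrative.

```r
# Hypothetical helper: Calinski-Harabasz criterion for a partition of the rows of traj
calinski <- function(traj, part) {
  n <- nrow(traj)
  k <- length(unique(part))
  overall <- colMeans(traj)
  between <- 0
  within  <- 0
  for (g in unique(part)) {
    rows    <- traj[part == g, , drop = FALSE]
    center  <- colMeans(rows)
    between <- between + nrow(rows) * sum((center - overall)^2)
    within  <- within + sum(sweep(rows, 2, center)^2)
  }
  (between / (k - 1)) / (within / (n - k))
}

set.seed(2)
# Made-up data: 20 trajectories at 5 time points, drawn from two groups
traj <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
              matrix(rnorm(50, mean = 4), ncol = 5))

# Three redrawings with 2 clusters, then stored best first, as in a sublist such as c2
fits   <- lapply(1:3, function(i) kmeans(traj, centers = 2))
scores <- sapply(fits, function(f) calinski(traj, f$cluster))
ord    <- order(scores, decreasing = TRUE)
c2     <- fits[ord]
sortedScores <- scores[ord]
```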
Note that the Clusterization objects are saved throughout the algorithm. If the user interrupts the execution of kml, the result is not lost. If the user runs kml on an object, then runs kml again on the same object, the Clusterization objects computed the second time are added to the ones already present in the object (unless you "clear" some list, see Object["clusters","clear"]<-value in ClusterizLongData).
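The periodic saving controlled by saveFreq can be pictured with a generic checkpointing loop in base R. This is only a sketch: how kml implements saving internally is not described here. The loop body is a stand-in for one clusterization calculation, and the file is written to a temporary directory rather than the current folder.

```r
saveFreq <- 3
results  <- list()
outFile  <- file.path(tempdir(), "objectName.Rdata")   # temporary location for the sketch
nbSaves  <- 0

for (i in 1:10) {
  results[[i]] <- i^2              # stand-in for one clusterization calculation
  if (i %% saveFreq == 0) {        # checkpoint every saveFreq calculations
    save(results, file = outFile)
    nbSaves <- nbSaves + 1
  }
}

# After an interruption, the last checkpoint can be reloaded
checkpoint <- new.env()
load(outFile, envir = checkpoint)
nbRecovered <- length(checkpoint$results)
```

With these values, the calculations completed up to the last checkpoint are recoverable. Only those computed after it would have to be redone.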
A ClusterizLongData object, to which some Clusterization objects have been added.
Behind kml, even if the end user does not see it, there are two different procedures, one written in R and one written in C. With the default distance (Euclidean with Gower adjustment) and the default printTraj (FALSE), kml calls a compiled and optimized C procedure. Otherwise, it uses the R procedure.
The C procedure is 25 times faster than the R one. So we advise using the R procedure 1/ to try a new method (such as a new distance) or 2/ to "see" the very first cluster construction, in order to check that everything goes right, and then switching to the C procedure (as we do in the Example section).
If, for a specific use, you need a different distance, feel free to contact the author.
Christophe Genolini
PSIGIAM: Paris Sud Innovation Group in Adolescent Mental Health
INSERM U669 / Maison de Solenn / Paris
Contact author: <genolini@u-paris10.fr>
Raphaël Ricaud
Laboratoire "Sport & Culture" / "Sports & Culture" Laboratory
University of Paris 10 / Nanterre
Article submitted
Web site: http://christophe.genolini.free.fr/kml
Overview: kml-package
Classes : ClusterizLongData, Clusterization, ArtificialLongData
Methods : clusterizLongData, generateArtificialLongData, choice
### Generation of some data
cld1 <- as.cld(generateArtificialLongData())

### We suspect 2, 3, 4 or 5 clusters, we want 3 redrawing.
#     We want to "see" what happen (so printCal and printTraj are TRUE)
kml(cld1,2:6,3,printCal=TRUE,printTraj=TRUE)

### 4 seems to be the best. But to be sure, we try more redrawing 4 or 6 only.
#     We don't want to see again, we want to get the result as fast as possible.
kml(cld1,c(4,6),10)