seqrep {TraMineR} | R Documentation |
The function attempts to find an optimal set of representative sequences that exhibits the key features of the whole sequence data set, the goal being to make the latter easier to interpret.
seqrep(seqdata, criterion="density", score=NULL, decreasing=TRUE, trep=0.25, nrep=NULL, tsim=0.1, dmax=NULL, dist.matrix=NULL, ...)
seqdata |
a state sequence object as defined by the seqdef function. |
criterion |
the representativeness criterion for sorting the candidate list. One of "freq" (sequence
frequency), "density" (neighborhood density), "mscore" (mean state frequency), "dist"
(centrality) and "prob" (sequence likelihood). See details. |
score |
an optional vector containing the representativeness scores used to sort the sequences in the candidate list. The length of the vector must be equal to the number of sequences in the sequence object. |
decreasing |
if a score vector is provided, indicates whether the objects in the candidate list must be sorted in
ascending or descending order of this score. Default is TRUE, i.e. descending. The first object in the candidate list
is then supposed to be the most representative. |
trep |
coverage threshold, i.e. minimum proportion of sequences that should have a representative in their
neighborhood (neighborhood diameter is defined by tsim ). |
nrep |
number of representative sequences. If NULL (default), the size of the representative set is
controlled by trep . |
tsim |
neighborhood diameter as a proportion of the maximum (theoretical) distance. Defaults to 0.1 (10%). This diameter is used to evaluate redundancy. |
dmax |
maximum theoretical distance. The neighborhood diameter is defined as a proportion of this
maximum distance. If NULL , it is derived from the distance matrix. |
dist.matrix |
a matrix containing the pairwise distances between sequences in seqdata . If NULL , the
matrix is computed by calling the seqdist function. In that case, optional arguments to be passed to
the seqdist function (see ... hereafter) should also be provided. |
... |
optional arguments to be passed to the seqdist function, mainly dist.method specifying the
metric for computing the distance matrix, norm for normalizing the distances, indel and sm for
indel and substitution costs when Optimal Matching metric is chosen. See seqdist manual page for
details. |
The representative set is obtained by a heuristic that first builds a sorted list of candidates using a representativeness
score and then eliminates redundancy. The available criteria for sorting the candidate list are: sequence frequency, neighborhood density, mean state frequency, centrality and sequence likelihood.
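The two-step heuristic can be illustrated with a minimal base R sketch. It is not the TraMineR implementation: the helper `select_reps` and the toy data are hypothetical, the scores are assumed to be "higher is better", and the `nrep` option is left out.

```r
# Hypothetical helper, not part of TraMineR: greedy selection of representatives.
# d     : symmetric matrix of pairwise distances
# score : one representativeness score per sequence (higher = more representative)
# tsim, trep, dmax mirror the seqrep arguments of the same name
select_reps <- function(d, score, tsim = 0.1, trep = 0.25, dmax = max(d)) {
  radius <- tsim * dmax
  candidates <- order(score, decreasing = TRUE)  # sorted candidate list
  reps <- integer(0)
  for (i in candidates) {
    # Redundancy check: skip a candidate lying in the neighborhood
    # of an already selected representative
    if (length(reps) > 0 && any(d[i, reps] <= radius)) next
    reps <- c(reps, i)
    # Coverage: share of sequences with a representative in their neighborhood
    coverage <- mean(rowSums(d[, reps, drop = FALSE] <= radius) > 0)
    if (coverage >= trep) break
  }
  reps
}

# Toy data: six points on a line stand in for sequences
x <- c(0, 0.1, 0.2, 5, 5.1, 10)
d <- as.matrix(dist(x))
score <- c(6, 5, 4, 3, 2, 1)  # illustrative scores only

reps <- select_reps(d, score, tsim = 0.1, trep = 0.8)
print(x[reps])
```

In this toy run the second and third points are dropped as redundant with the first, and selection stops once the coverage threshold is reached.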
The sequence frequency criterion uses the sequence frequencies as representativeness score. The more frequent a sequence the more representative it is supposed to be. Hence, sequences are sorted in decreasing frequency order.
The neighborhood density criterion uses the number of sequences in the neighborhood of each
candidate sequence, i.e. its density. This requires setting the neighborhood diameter tsim, which we suggest
defining as a given proportion of the maximal theoretical distance between two sequences. Sequences are sorted in decreasing density order.
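As an illustration, the density score could be computed from a distance matrix along the following lines. This is a sketch under assumptions: the toy distances come from points on a line rather than from seqdist, the largest observed distance stands in for the theoretical maximum, and each sequence is counted in its own neighborhood.

```r
# Toy distance matrix: five points on a line stand in for sequences
x <- c(0, 0.8, 1.6, 3, 20)
d <- as.matrix(dist(x))

tsim <- 0.1
dmax <- max(d)          # assumption: largest observed distance used as dmax
radius <- tsim * dmax   # neighborhood diameter

# Number of sequences within the neighborhood of each candidate (self included)
density <- rowSums(d <= radius)

# Candidate list sorted in decreasing density order
candidates <- order(density, decreasing = TRUE)
print(density)
print(candidates[1])
```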
The mean state frequency criterion is the mean value of the transversal frequencies of the successive states. Let s=(s_1, s_2, ..., s_l) be a sequence of length l and f(s_1), f(s_2), ..., f(s_l) the frequencies of the states at (time-)position t_1, t_2, ..., t_l. The mean state frequency is the sum of the state frequencies divided by the sequence length
MSF(s) = (1/l) * sum_{i=1}^{l} f(s_i)
The lower and upper boundaries of MSF are 0 and 1. MSF is equal to 1 when all the sequences in the set are the same, i.e. when there is a single distinct sequence. The most representative sequence is the one with the highest score.
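A small worked example may help. The sketch below computes the mean state frequency in base R for four toy sequences stored as rows of a character matrix. The data and variable names are hypothetical, and the code does not use TraMineR itself.

```r
# Toy data: 4 sequences of length 3, one row per sequence
seqs <- rbind(c("A", "A", "B"),
              c("A", "B", "B"),
              c("A", "A", "B"),
              c("C", "B", "B"))

# For each sequence, average the transversal frequency of its state
# at each position (share of sequences showing that state at that position)
msf <- sapply(seq_len(nrow(seqs)), function(i) {
  f <- sapply(seq_len(ncol(seqs)), function(t) mean(seqs[, t] == seqs[i, t]))
  mean(f)
})
print(msf)
```

The first three sequences obtain a score of 0.75, while the fourth, which starts with the rare state C, obtains a lower one.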
The centrality criterion uses the sum of distances to all other sequences as a representativeness criterion. The smaller the sum, the more representative the sequence.
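For a given distance matrix, this score reduces to a row sum. A minimal sketch with toy distances, where points on a line stand in for sequences:

```r
# Toy distance matrix: five points on a line stand in for sequences
x <- c(0, 1, 2, 3.5, 20)
d <- as.matrix(dist(x))

# Sum of distances to all other sequences (the diagonal contributes 0)
centrality <- rowSums(d)

# Ascending order: the candidate with the smallest sum comes first
candidates <- order(centrality)
print(candidates[1])
```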
The sequence likelihood P(s) is defined as the product of the probabilities with which each of its observed successive states is supposed to occur at its position. Let s = s_1, s_2, ..., s_l be a sequence of length l. Then
P(s)=P(s_1,1) * P(s_2,2) * ... * P(s_l,l)
with P(s_t,t) the probability of observing state s_t at position t.
The question is how to determine the state probabilities P(s_t,t). One commonly used method for
computing them is to postulate a Markov model, which can be of various orders. The implemented criterion considers the
probabilities derived from the first-order Markov model, that is, each P(s_t,t), t>1, is set to the
transition rate p(s_t | s_{t-1}) estimated across sequences from the observations at positions t
and t-1. For t=1, we set P(s_1,1) to the observed frequency of the state s_1 at position 1.
Since the likelihood P(s) is generally very small, we use -log P(s) as the sorting criterion. The latter quantity is
minimal when P(s) is equal to 1, so the sequences are sorted in ascending order of their score.
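The computation can be sketched in base R on toy data. This is an illustration only: the variable names are hypothetical, and the transition rates are assumed here to be pooled over all pairs of successive positions, which may differ from what seqrep does internally.

```r
# Toy data: 4 sequences of length 3, one row per sequence
seqs <- rbind(c("A", "A", "B"),
              c("A", "B", "B"),
              c("A", "A", "B"),
              c("C", "B", "B"))
states <- sort(unique(as.vector(seqs)))
L <- ncol(seqs)

# P(s_1, 1): observed state frequencies at position 1
p1 <- prop.table(table(factor(seqs[, 1], levels = states)))

# Transition rates p(s_t | s_{t-1}), pooled over all positions (assumption)
from <- factor(as.vector(seqs[, -L]), levels = states)
to   <- factor(as.vector(seqs[, -1]), levels = states)
trate <- unclass(prop.table(table(from, to), margin = 1))

# Score of each sequence: -log P(s), lower means more representative
neg_loglik <- apply(seqs, 1, function(s) {
  p <- c(p1[[s[1]]], trate[cbind(s[-L], s[-1])])
  -sum(log(p))
})
print(order(neg_loglik)[1])
```

With these toy sequences, the second one (A, B, B) has the highest likelihood and therefore the smallest score.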
For more details, see Gabadinho et al., 2009.
An object of class stslist.rep. This is actually a state sequence object (containing a list of state
sequences) with the following additional attributes:
Scores |
a vector with the representativeness score of each sequence in the original set given the chosen criterion. |
Distances |
a matrix with the distance of each sequence to its nearest representative. |
Statistics |
contains several quality measures for each representative sequence in the set: number of sequences attributed to the representative, number of sequences in the representative's neighborhood, mean distance to the representative. |
Quality |
overall quality measure. |
Print, plot and summary methods are available. More elaborate plots are produced by the seqplot
function using the type="r" argument, or the seqrplot alias.
Gabadinho, A., G. Ritschard, M. Studer and N. S. Müller (2009). Summarizing Sets of Categorical Sequences, In International Conference on Knowledge Discovery and Information Retrieval, Madeira, 6-8 October, INSTICC.
library(TraMineR)

## Defining a sequence object with the data in columns 10 to 25
## (family status from age 15 to 30) in the biofam data set
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
                "Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)

## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)

## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, dist.matrix=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
plot(biofam.rep)