anneal {subselect} | R Documentation |
Given a set of variables, a Simulated Annealing algorithm seeks a k-variable subset which is optimal, as a surrogate for the whole set, with respect to a given criterion.
anneal( mat, kmin, kmax = kmin, nsol = 1, niter = 1000, exclude = NULL, include = NULL, improvement = TRUE, setseed = FALSE, cooling = 0.05, temp = 1, coolfreq = 1, criterion = "RM", pcindices = "first_k", initialsol=NULL, force=FALSE, tolval=10*.Machine$double.eps)
mat |
a covariance or correlation matrix of the variables from which the k-subset is to be selected. |
kmin |
the cardinality of the smallest subset that is wanted. |
kmax |
the cardinality of the largest subset that is wanted. |
nsol |
the number of initial/final subsets (runs of the algorithm). |
niter |
the number of iterations of the algorithm for each initial subset. |
exclude |
a vector of variables (referenced by their row/column
numbers in matrix mat) that are to be forcibly excluded from
the subsets. |
include |
a vector of variables (referenced by their row/column
numbers in matrix mat) that are to be forcibly included in
the subsets. |
improvement |
a logical variable indicating whether or not the
best final subset (for each cardinality) is to be passed as input to a
local improvement algorithm (see function improve). |
setseed |
logical variable indicating whether to fix an initial seed for the random number generator, which will be re-used in future calls to this function whenever setseed is again set to TRUE. |
cooling |
variable in the ]0,1[ interval indicating the rate of geometric cooling for the Simulated Annealing algorithm. |
temp |
positive variable indicating the initial temperature for the Simulated Annealing algorithm. |
coolfreq |
positive integer indicating the number of iterations of the algorithm between coolings of the temperature. By default, the temperature is cooled at every iteration. |
criterion |
character variable indicating which criterion
is to be used in judging the quality of the subsets. Currently, only
the RM, RV and GCD criteria are supported, and are referenced as "RM",
"RV" or "GCD" (see References, rm.coef,
rv.coef and gcd.coef for further
details). |
pcindices |
either a vector of ranks of Principal Components that are to be
used for comparison with the k-variable subsets (for the GCD
criterion only, see gcd.coef) or the default text
first_k. The latter will associate PCs 1 to k with each
cardinality k that has been requested by the user. |
initialsol |
vector, matrix or 3-d array of initial solutions
for the simulated annealing search. If a single cardinality is
required, initialsol may be: a vector of length k, in
which case it is used as the initial solution for all nsol
final solutions that are requested; a 1 x k matrix (as
produced by the $bestsets output value of the algorithm functions
anneal, genetic, or improve) or
a 1 x k x 1 array (as produced by the
$subsets output value), in
which case it will be treated as the above k-vector; or an
nsol x k matrix or nsol x k x 1 3-d
array, in which case each row (dimension 1) will be used
as the initial solution for each of the nsol final solutions
requested. If more than one cardinality is requested,
initialsol can be a
length(kmin:kmax) x kmax matrix (as produced by the
$bestsets output value of the algorithm functions), in which case
each row will be replicated to produce the initial solution for all
nsol final solutions requested in each cardinality, or an
nsol x kmax x length(kmin:kmax) 3-d array (as
produced by the
$subsets output value), in which case each row (dimension 1)
is interpreted as a different initial solution.
If the exclude and/or include options are used,
initialsol must also respect those requirements. |
force |
a logical variable indicating whether, for large data
sets (currently p > 400), the algorithm should proceed
anyway, regardless of possible memory problems which may crash the
R session. |
tolval |
the tolerance level for the reciprocal of the 2-norm condition number of the correlation/covariance matrix, i.e., for the ratio of the smallest to the largest eigenvalue of the input matrix. Matrices with a reciprocal of the condition number smaller than tolval will abort the search algorithm. |
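The shapes accepted by initialsol can be easier to see when built by hand. The sketch below uses base R only and constructs a zero-padded 3-d array laid out like the $subsets output value for kmin = 2, kmax = 3 and nsol = 4. The helper respects_constraints is a hypothetical name introduced here for illustration, not part of the package, and the call to anneal is left commented out as an assumption about how such an array would be passed.

```r
# Illustrative settings (same as in the Examples: two cardinalities, four runs).
kmin <- 2
kmax <- 3
nsol <- 4

# nsol x kmax x length(kmin:kmax) array, padded with zeros for
# cardinalities smaller than kmax (as in the $subsets output value).
init_array <- array(0, dim = c(nsol, kmax, length(kmin:kmax)))
init_array[, 1:2, 1] <- matrix(c(3, 6), nrow = nsol, ncol = 2, byrow = TRUE)
init_array[, 1:3, 2] <- matrix(c(4, 5, 6), nrow = nsol, ncol = 3, byrow = TRUE)

# Hypothetical helper: does one initial solution honour include/exclude?
respects_constraints <- function(sol, include = NULL, exclude = NULL) {
  vars <- sol[sol > 0]                 # drop the zero padding
  all(include %in% vars) && !any(vars %in% exclude)
}

# One logical per (solution, cardinality) pair.
ok <- apply(init_array, c(1, 3), respects_constraints, exclude = 1)
all(ok)

# Assumed usage (requires the subselect package, not run here):
# anneal(cor(swiss), kmin, kmax, nsol = nsol, exclude = 1,
#        initialsol = init_array)
```

Checking the constraints before the call is a convenience only; the argument description above states the requirement but not how violations are reported.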
An initial k-variable subset (for k ranging from kmin to kmax) of a full set of p variables is randomly selected and passed on to a Simulated Annealing algorithm. The algorithm then selects a random subset in the neighbourhood of the current subset (the neighbourhood of a subset S being defined as the family of all k-variable subsets which differ from S by a single variable), and decides whether to replace the current subset according to the Simulated Annealing rule, i.e., either (i) always, if the alternative subset's value of the criterion is higher; or (ii) with probability exp((ac-cc)/t) if the alternative subset's value of the criterion (ac) is lower than that of the current solution (cc), where the parameter t (temperature) decreases throughout the iterations of the algorithm. For each cardinality k, the stopping criterion for the algorithm is the number of iterations (niter), which is controlled by the user. Also controlled by the user are the initial temperature (temp), the rate of geometric cooling of the temperature (cooling) and the frequency with which the temperature is cooled, as measured by coolfreq, the number of iterations after which the temperature is multiplied by 1-cooling.
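The acceptance rule and cooling schedule described above can be sketched in a few lines of base R. This is a minimal illustration, not the package's Fortran routine: toy_criterion and toy_anneal are hypothetical names, the criterion is a simple stand-in (the square root of the share of total variance recovered by regressing all variables on the subset) rather than a call to rm.coef, and keeping track of the best subset visited is an assumption of this sketch.

```r
# Stand-in criterion (hypothetical; not the package's rm.coef).
toy_criterion <- function(mat, idx) {
  sub   <- mat[idx, idx, drop = FALSE]
  cross <- mat[, idx, drop = FALSE]
  sqrt(sum(diag(cross %*% solve(sub, t(cross)))) / sum(diag(mat)))
}

toy_anneal <- function(mat, k, niter = 1000, cooling = 0.05,
                       temp = 1, coolfreq = 1) {
  p <- ncol(mat)
  stopifnot(k >= 1, k < p)
  current <- sample.int(p, k)            # random initial k-subset
  cc <- toy_criterion(mat, current)      # criterion of current subset
  best <- current
  best_value <- cc
  t <- temp
  for (i in seq_len(niter)) {
    # Neighbour: swap one variable in the subset for one outside it.
    outside <- setdiff(seq_len(p), current)
    candidate <- current
    candidate[sample.int(k, 1)] <- outside[sample.int(length(outside), 1)]
    ac <- toy_criterion(mat, candidate)  # criterion of alternative subset
    # Accept if not worse; otherwise with probability exp((ac - cc) / t).
    if (ac >= cc || runif(1) < exp((ac - cc) / t)) {
      current <- candidate
      cc <- ac
    }
    # Assumption of this sketch: remember the best subset seen so far.
    if (cc > best_value) {
      best <- current
      best_value <- cc
    }
    # Geometric cooling every coolfreq iterations.
    if (i %% coolfreq == 0) t <- t * (1 - cooling)
  }
  list(bestset = sort(best), bestvalue = best_value)
}

set.seed(1)
result <- toy_anneal(cor(swiss), k = 2, niter = 200)
result$bestset
result$bestvalue
```

Each run evaluates the criterion once for the initial subset and once per iteration, so a single run makes niter + 1 criterion calls.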
Optionally, the best k-variable subset produced by Simulated Annealing may be passed as input to a restricted local search algorithm, for possible further improvement.
The user may force variables to be included and/or excluded from the k-subsets, and may specify initial solutions.
For each cardinality k, the total number of calls to the procedure which computes the criterion values is nsol x (niter + 1). These calls are the dominant computational effort in each iteration of the algorithm.
In order to improve computation times, the bulk of computations is carried out by a Fortran routine. Further details about the Simulated Annealing algorithm can be found in Reference 1 and in the comments to the Fortran code (in the src subdirectory for this package). For datasets with a very large number of variables (currently p > 400), it is necessary to set the force argument to TRUE for the function to run, but this may cause a session crash if there is not enough memory available.
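As a quick worked example of the call count, the settings used in the Examples (nsol = 4, niter = 10, cardinalities 2 to 3) give the figures computed below. The variable names are chosen here for illustration, and the total across cardinalities is simply the per-cardinality count times the number of cardinalities; it ignores any extra work done when improvement = TRUE.

```r
nsol  <- 4
niter <- 10
kmin  <- 2
kmax  <- 3

# Criterion evaluations per cardinality: nsol x (niter + 1).
calls_per_cardinality <- nsol * (niter + 1)

# Summed over all requested cardinalities.
total_calls <- calls_per_cardinality * length(kmin:kmax)

calls_per_cardinality  # 44
total_calls            # 88
```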
The function checks for ill-conditioning of the input matrix (specifically, it checks whether the ratio of the input matrix's smallest and largest eigenvalues is less than tolval). For an ill-conditioned input matrix, execution is aborted. The function trim.matrix may be used to obtain a well-conditioned input matrix.
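The conditioning check can be reproduced approximately in base R before calling the function. The sketch below is an assumption about how such a check might look, not the package's internal code; eigen_ratio and is_ill_conditioned are hypothetical helper names.

```r
# Ratio of the smallest to the largest eigenvalue of a symmetric matrix.
eigen_ratio <- function(mat) {
  ev <- eigen(mat, symmetric = TRUE, only.values = TRUE)$values
  min(ev) / max(ev)
}

# Compare against the same default tolerance as the tolval argument.
is_ill_conditioned <- function(mat, tolval = 10 * .Machine$double.eps) {
  eigen_ratio(mat) < tolval
}

# A well-behaved correlation matrix.
is_ill_conditioned(cor(swiss))

# Duplicating a column makes the correlation matrix (numerically) singular,
# so its eigenvalue ratio drops to roughly machine-precision level.
swiss_dup <- cbind(swiss, Fertility.copy = swiss$Fertility)
eigen_ratio(cor(swiss_dup))
```

For a matrix like the second one, trim.matrix is the package's documented route to a well-conditioned input.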
A list with five items:
subsets |
An nsol x kmax x length(kmin:kmax) 3-dimensional
array, giving for each cardinality (dimension 3) and each solution
(dimension 1) the list of variables (referenced by their row/column
numbers in matrix mat) in the subset (dimension 2). (For
cardinalities smaller than kmax, the extra final positions are set
to zero). |
values |
An nsol x length(kmin:kmax)
matrix, giving for each cardinality (columns), the criterion values
of the nsol (rows) subsets obtained. |
bestvalues |
A length(kmin:kmax) vector giving
the best values of the criterion obtained for each cardinality. If
improvement is TRUE, these values result from the final
restricted local search algorithm (and may therefore exceed the
largest value for that cardinality in values). |
bestsets |
A length(kmin:kmax) x kmax
matrix, giving, for each cardinality (rows), the variables
(referenced by their row/column numbers in matrix mat) in
the best k-subset that was found. |
call |
The function call which generated the output. |
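Because subsets are returned as row/column numbers with zero padding, a small amount of post-processing is usually needed to recover variable names. The sketch below rebuilds by hand the $bestsets matrix shown in the first example of the Examples section (so it runs without the package) and maps it to the column names of swiss; best_names is a name chosen here for illustration.

```r
# $bestsets as printed in the first example (hand-copied, not recomputed).
bestsets <- matrix(c(3, 6, 0,
                     4, 5, 6),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("Card.2", "Card.3"),
                                   c("Var.1", "Var.2", "Var.3")))

var_names <- colnames(swiss)

# For each cardinality, drop the zero padding and look up the names.
best_names <- lapply(seq_len(nrow(bestsets)), function(i) {
  idx <- bestsets[i, ]
  var_names[idx[idx > 0]]
})
names(best_names) <- rownames(bestsets)

best_names
```

The same indexing applies to each slice of the $subsets array, one cardinality (dimension 3) at a time.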
1) Cadima, J., Cerdeira, J. Orestes and Minhoto, M. (2004). Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47, 225-236.
2) Cadima, J. and Jolliffe, I.T. (2001). Variable Selection and the Interpretation of Principal Subspaces. Journal of Agricultural, Biological and Environmental Statistics, 6, 62-79.
rm.coef, rv.coef, gcd.coef, genetic, improve, leaps, trim.matrix.
# For illustration of use, a small data set with very few iterations
# of the algorithm.

data(swiss)
anneal(cor(swiss),2,3,nsol=4,niter=10,criterion="RM")

##$subsets
##, , Card.2
##
##           Var.1 Var.2 Var.3
##Solution 1     3     6     0
##Solution 2     4     5     0
##Solution 3     1     2     0
##Solution 4     3     6     0
##
##, , Card.3
##
##           Var.1 Var.2 Var.3
##Solution 1     4     5     6
##Solution 2     3     5     6
##Solution 3     3     4     6
##Solution 4     4     5     6
##
##
##$values
##              card.2    card.3
##Solution 1 0.8016409 0.9043760
##Solution 2 0.7982296 0.8769672
##Solution 3 0.7945390 0.8777509
##Solution 4 0.8016409 0.9043760
##
##$bestvalues
##   Card.2    Card.3
##0.8016409 0.9043760
##
##$bestsets
##       Var.1 Var.2 Var.3
##Card.2     3     6     0
##Card.3     4     5     6
##
##$call
##anneal(cor(swiss), 2, 3, nsol = 4, niter = 10, criterion = "RM")

#
# Excluding variable number 6 from the subsets.
#

data(swiss)
anneal(cor(swiss),2,3,nsol=4,niter=10,criterion="RM",exclude=c(6))

##$subsets
##, , Card.2
##
##           Var.1 Var.2 Var.3
##Solution 1     4     5     0
##Solution 2     4     5     0
##Solution 3     4     5     0
##Solution 4     4     5     0
##
##, , Card.3
##
##           Var.1 Var.2 Var.3
##Solution 1     1     2     5
##Solution 2     1     2     5
##Solution 3     1     2     5
##Solution 4     1     4     5
##
##
##$values
##              card.2    card.3
##Solution 1 0.7982296 0.8791856
##Solution 2 0.7982296 0.8791856
##Solution 3 0.7982296 0.8791856
##Solution 4 0.7982296 0.8686515
##
##$bestvalues
##   Card.2    Card.3
##0.7982296 0.8791856
##
##$bestsets
##       Var.1 Var.2 Var.3
##Card.2     4     5     0
##Card.3     1     2     5
##
##$call
##anneal(cor(swiss), 2, 3, nsol = 4, niter = 10, criterion = "RM",
##       exclude=c(6))

# Specifying initial solutions: using the subsets produced by
# simulated annealing for one criterion (RM, by default) as initial
# solutions for the simulated annealing search with a different criterion.

data(swiss)
rmresults<-anneal(cor(swiss),2,3,nsol=4,niter=10, setseed=TRUE)
anneal(cor(swiss),2,3,nsol=4,niter=10,criterion="gcd",
       initialsol=rmresults$subsets)

##$subsets
##, , Card.2
##
##           Var.1 Var.2 Var.3
##Solution 1     3     6     0
##Solution 2     3     6     0
##Solution 3     3     6     0
##Solution 4     3     6     0
##
##, , Card.3
##
##           Var.1 Var.2 Var.3
##Solution 1     4     5     6
##Solution 2     4     5     6
##Solution 3     3     4     6
##Solution 4     4     5     6
##
##
##$values
##              card.2   card.3
##Solution 1 0.8487026 0.925372
##Solution 2 0.8487026 0.925372
##Solution 3 0.8487026 0.798864
##Solution 4 0.8487026 0.925372
##
##$bestvalues
##   Card.2    Card.3
##0.8487026 0.9253720
##
##$bestsets
##       Var.1 Var.2 Var.3
##Card.2     3     6     0
##Card.3     4     5     6
##
##$call
##anneal(cor(swiss), 2, 3, nsol = 4, niter = 10, criterion = "gcd",
##       initialsol = rmresults$subsets)