TMC {CombMSC} | R Documentation |
Computes convex combinations of model selection criteria. The function is very customizable, allowing the user to specify what type of model is to be tested, which criteria are to be used, and many other options described below.
TMC(num.Iter = 50, data.Size = 100, make.Data = gen.Data, make.Params = gen.Params, model.List, weight.Vector = rep(1, times = length(model.List)), msc.List, fit.Model = fit.Models, stepSize = 0.05, sumstats = list("Median Rank" = median), huge = FALSE, var.Frame = data.frame(), par.Sigma = 1, data.Sigma = 1, barebones = FALSE, allow.Negs = FALSE, thresholds = c(1, 2, 3, 5, 10), test.Size = 0, scale.Frame = TRUE, use.Ranks = TRUE, ...)
num.Iter |
The number of iterations. This will be the total number of times that the entire loop described in the Details section will be executed. |
data.Size |
For time-series (and possibly other extended types), the size of each simulated data set. |
make.Data |
The name (not quoted) of a function used to simulate data. It must take the result of make.Params as its only argument. A sensible default is gen.Data for time series and regression. |
make.Params |
The name (not quoted) of a function used to simulate parameters. Must take a single model as its only argument. A sensible default is gen.Params for time series and regression. |
model.List |
A list of candidate models. The true model will be chosen from this list in each iteration, and the MSC values of every model in this list will then be calculated, from which the rank of the true model is computed. Utility functions for constructing such model lists are make.Model.List.Reg and make.Model.List.TS . |
weight.Vector |
A numeric vector, the same length as model.List, of the weights (probabilities) of each model. Used to choose the true model at each iteration. Need not be scaled, but must be nonnegative. To construct a vector of weights for individual models based on a prior distribution on the number of terms (or complexity) of the underlying model, use weightsGivenSize . Another possible utility function, which weights models of only a specified size, is weight.Only.N . |
msc.List |
A list of model selection criterion functions. The length must be more than 1, but should not be much larger than 3 to avoid computational overflow. The recommended number of MSCs is 3. Each function must take a fitted model object (produced by fit.Model) as its only argument. Commonly used functions include AIC , BIC , and for time series models, holdout.Mean for mean absolute deviation on a holdout sample, and holdout.Med for the median absolute deviation. This list, however, is by no means exhaustive, and new MSC functions can easily be written; see the details below. |
fit.Model |
The function used to fit the models defined by model.List. Whenever possible, we recommend that this be a built-in R function, e.g., lm or arima. |
stepSize |
The mesh of the grid of convex combinations. Bear in mind the number of convex combinations will be roughly proportional to (1/stepSize)^length(msc.List), so don't make stepSize too small, especially if msc.List is longer than 3! |
sumstats |
The summary functions of the distributions of ranks. Used for graphical displays of the final msc object. Note that the average and also all the summary functions generated by thresholds (see below) are automatically included in the final object, so there is no need to put them in this list. |
huge |
Must be set to TRUE if the matrix of convex combinations will be larger than roughly 500000. This safeguard prevents unexpectedly long calculations. |
var.Frame |
For models with covariates, this should be the data.frame containing them. For other models, it is ignored. |
par.Sigma |
This argument may be passed to make.Params, and is the standard deviation used in gen.Params.lmFormula , for example. |
data.Sigma |
An optional argument to be passed to make.Data. |
barebones |
For large computations, we recommend this be set to TRUE. It will throw away the individual ranks at each iteration, updating only the summary functions, in order to reduce space requirements. If barebones is TRUE, the summary functions are restricted to pre-defined functions which can be updated dynamically, such as the mean, and cannot include functions which require the whole sample, such as the median. |
allow.Negs |
If TRUE, the matrix of convex combinations will be expanded to include linear combinations with negative weights. Greatly increases computation, and is rarely helpful. |
thresholds |
Must be a numeric vector. Included as a simple way to generate summary functions: for each element k of this vector, the summary function P(Rank > k) will be computed and included in the final object. Note that if barebones is set to TRUE, the elements of thresholds are the ONLY summary functions the user can specify. This restriction ensures that the barebones routine does not need to keep track of all the ranks from individual iterations, but can instead retain only the updated summary function values. |
test.Size |
The size of the subset of each sample to be used as a holdout sample. Ordinarily, this is set to 0, but for certain MSCs, namely those whose names begin with "holdout", it needs to be set to a nonzero number to be useful. A common rule of thumb is to set the size to roughly ten percent of the total sample size. Note, however, that whenever this argument is nonzero, the function will slow down considerably, since it must then fit all models twice: once with the full sample, and once with only the "training" sample, which excludes the holdout sample. |
scale.Frame |
Logical indicating whether var.Frame should be scaled first. If TRUE, each column will be centered by its mean and divided by its standard deviation. |
use.Ranks |
Logical. If TRUE, then in each iteration, the msc values for each criterion will be scaled by taking ranks. If FALSE, then they will be scaled by standardizing instead. |
... |
Other arguments to be passed to other functions. |
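To give a feel for how stepSize and the length of msc.List drive the size of the problem, the following is a minimal sketch that enumerates nonnegative weight vectors summing to 1 on a grid. The helper convex.grid is hypothetical and written in base R only; how TMC builds its own matrix of combinations is not shown on this page, so its row counts may differ from these.

```r
# Hypothetical helper (not part of CombMSC): enumerate all weight vectors
# on a grid of mesh `step` that are nonnegative and sum to 1.
# Working on an integer grid avoids floating-point equality problems.
convex.grid <- function(n.msc, step = 0.05) {
  n <- round(1 / step)                         # number of steps from 0 to 1
  grid <- expand.grid(rep(list(0:n), n.msc))   # full integer grid
  keep <- rowSums(grid) == n                   # keep rows whose weights sum to 1
  w <- as.matrix(grid[keep, , drop = FALSE]) / n
  dimnames(w) <- list(NULL, paste0("w", seq_len(n.msc)))
  w
}

# Number of combinations for 2, 3 and 4 criteria at step = 0.05.
sizes <- sapply(2:4, function(k) nrow(convex.grid(k, step = 0.05)))
names(sizes) <- paste(2:4, "MSCs")
print(sizes)   # 21, 231 and 1771 rows
```

Each additional criterion multiplies the work considerably, which is the reason for the warnings attached to stepSize and huge.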
The basic algorithm is as follows:
After these steps have been iterated num.Iter times, the summary functions specified in sumstats, as well as the average and threshold functions defined by thresholds, are computed for each convex combination.
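The individual steps are not reproduced on this page. The sketch below is one reading of the loop, pieced together from the argument descriptions: choose a true model using the weights, simulate parameters and data, fit every candidate, evaluate each criterion, and record the rank of the true model under each convex combination. It uses only base R and lm, not TMC's internals, and every name in it (models, W, one.iteration, and so on) is hypothetical.

```r
set.seed(1)
n <- 40
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# Candidate models, given as right-hand sides of lm formulae.
models  <- c("x1", "x2", "x1 + x2")
weights <- c(1, 1, 1)                  # unscaled prior over the candidates
mscs    <- list(AIC = AIC, BIC = BIC)

# Convex combinations of the two criteria: (w, 1 - w).
w <- seq(0, 1, by = 0.25)
W <- cbind(AIC = w, BIC = 1 - w)

one.iteration <- function() {
  truth <- sample(seq_along(models), 1, prob = weights)
  # Simulate parameters, then a response from the chosen model.
  mm   <- model.matrix(as.formula(paste("~", models[truth])), X)
  beta <- rnorm(ncol(mm), sd = 1)
  dat  <- cbind(X, y = drop(mm %*% beta) + rnorm(n, sd = 1))
  # Fit every candidate and evaluate every criterion on it.
  fits <- lapply(models, function(m) lm(as.formula(paste("y ~", m)), data = dat))
  vals <- sapply(mscs, function(f) sapply(fits, f))   # models x criteria
  # Put the criteria on a common scale by ranking (cf. use.Ranks = TRUE).
  scaled <- apply(vals, 2, rank)
  # Score each model under each combination; rank 1 is the smallest score.
  combined <- scaled %*% t(W)                         # models x combinations
  apply(combined, 2, function(s) rank(s)[truth])
}

ranks <- replicate(20, one.iteration())   # combinations x iterations
summary.table <- data.frame(W,
                            mean.rank   = rowMeans(ranks),
                            P.rank.gt.1 = rowMeans(ranks > 1))
print(summary.table)
```

The last two columns play the role of the average and of a threshold summary with k = 1, computed for each convex combination.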
New model selection functions, or additional methods for existing ones, can easily be written. The object initially passed to each such function will be of class "fmo", a class used internally in TMC. An fmo object will contain at least the following components:
full |
The result of applying fit.Model to the full data set generated by gen.Data. |
train |
The result of applying fit.Model to only the training part of the data set (that is, the data set less any observations held out for MSC functions involving a holdout sample). If test.Size = 0, this is NULL. |
test.Frame |
If test.Size = 0, this is NULL. |
test.Vector |
If test.Size = 0, this is NULL. |
S2 |
Used by Cp; stored to avoid recalculating it for every criterion. |
Thus, to write a new model selection criterion function, one should create a generic function with a method for class "fmo", and further methods for whatever classes of model objects for which one can actually compute the criterion directly. The method for class "fmo" is typically very simple, and usually involves calling another method of the same function on some part of the fmo object itself, typically the full component for ordinary model selection criteria or the train component for criteria involving a holdout sample. For example, see PRESS.
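The pattern described above might look as follows for a small-sample corrected AIC. This is a minimal sketch, not code from the package: the generic AICc and its methods are hypothetical, and the fmo object is a hand-built mock containing only the full and train components, since TMC constructs the real objects internally.

```r
# Hypothetical criterion: small-sample corrected AIC.
AICc <- function(obj, ...) UseMethod("AICc")

# Method for fitted lm objects, where the criterion is actually computed.
AICc.lm <- function(obj, ...) {
  k <- attr(logLik(obj), "df")   # number of estimated parameters
  n <- nobs(obj)                 # number of observations used in the fit
  AIC(obj) + 2 * k * (k + 1) / (n - k - 1)
}

# Method for class "fmo": delegate to the model fitted on the full data.
AICc.fmo <- function(obj, ...) AICc(obj$full, ...)

# Mock fmo object for illustration only.
fit  <- lm(dist ~ speed, data = cars)
mock <- structure(list(full = fit, train = NULL), class = "fmo")
print(AICc(mock))
```

A criterion based on a holdout sample would delegate to the train component instead, and models fitted by other functions (arima, for instance) would each need their own method.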
gen.Data, gen.Params, and fit.Models are intended to be sensible defaults, but they certainly need not be the only functions one uses for simulating parameters, simulating data, and fitting models. New methods can easily be written for all three functions. To do this, it is recommended that one create a new class, create a list of model specifications (e.g., model formulae or order specifications) of this new class, and then write methods for gen.Params, etc. for this new class.
An object of class msc or, if barebones is TRUE, an object of class barebones, which inherits from msc. It contains the following components:
call |
The matched call |
Sum.Stats |
A data.frame, with each row representing a convex combination of MSCs. The first columns (three, when three MSCs are used) give the weights corresponding to the combination, and the remaining columns give the values of all summary statistics corresponding to the combination. |
var.Frame |
For models containing covariates, a data.frame containing them. |
error.Iterations |
Iteration numbers in which the attempt to fit the true model to the simulated data set resulted in an error, thus making it impossible to compute a rank. |
num.Errors |
The length of error.Iterations. |
time.Taken |
The total length of time to complete the call. |
simulated.Models |
The formula corresponding to the true model chosen in each iteration. |
simulation.Attempts |
The number of attempts needed, during each iteration, to simulate data successfully. Mainly useful for diagnostic purposes when simulation of time series results in non-stationary data. |
ranks.Mat |
A matrix containing the ranks corresponding to each combination for every iteration. One can use this, for example, to calculate the values of new summary functions. |
simulated.Data |
A list of data vectors simulated at each iteration. |
simulated.Parameters |
A list of vectors containing the simulated parameters from each iteration. |
simulated.Models |
A list of the actual models chosen (from the prior given by weight.Vector ). Each will be an element of model.List . |
Plus several other components which are taken directly from the call, for convenience in later processing.
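As an illustration of computing new summary functions from the stored ranks, here is a small sketch on a mock matrix. It assumes that rows correspond to combinations and columns to iterations; that orientation is an assumption and should be checked against the dimensions of the ranks.Mat component of an actual result.

```r
set.seed(42)
# Mock stand-in for a ranks.Mat component: 4 combinations x 25 iterations,
# with ranks between 1 and 6. The orientation is an assumption.
ranks.Mat <- matrix(sample(1:6, 4 * 25, replace = TRUE), nrow = 4)

# New summary functions, applied per combination (that is, per row).
upper.quartile <- apply(ranks.Mat, 1, quantile, probs = 0.75)
p.gt.2 <- rowMeans(ranks.Mat > 2)   # analogue of a threshold with k = 2

print(cbind(upper.quartile, p.gt.2))
```

Summaries of this kind need the full set of ranks, so they are only possible when barebones is FALSE.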
Andrew K. Smith
A more complete description of the algorithm used, as well as a discussion of its properties and illustrations of its potential utility, can be found at http://www.isye.gatech.edu/~asmith/combmsc.pdf.
# Regression example
vars <- rnorm(60)
dim(vars) <- c(20, 3)
vars <- data.frame(vars)
result <- TMC(num.Iter = 3, model.List = make.Model.List.Reg(vars),
              msc.List = list(BIC, AIC, PRESS), var.Frame = vars)

# Time Series Example
modList <- make.Model.List.TS(c(1, 0, 1, 0, 0, 1))
result2 <- TMC(num.Iter = 3, model.List = modList,
               msc.List = list(BIC, holdout.Mean, AIC), test.Size = 10)