earth {earth}    R Documentation
Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".
## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"), data, weights = NULL, wp = NULL,
    scale.y = (NCOL(y)==1), subset = NULL, na.action = na.fail, glm = NULL,
    trace = 0, keepxy = FALSE, nfold = 0, stratify = TRUE, ...)

## Default S3 method:
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"), weights = NULL, wp = NULL,
    scale.y = (NCOL(y)==1), subset = NULL, na.action = na.fail, glm = NULL,
    trace = 0, keepxy = FALSE, nfold = 0, stratify = TRUE, ...)

## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"), weights = NULL, wp = NULL,
    scale.y = (NCOL(y)==1), subset = NULL, na.action = na.fail, glm = NULL,
    trace = 0, nk = max(21, 2 * ncol(x) + 1), degree = 1,
    penalty = if(degree > 1) 3 else 2, thresh = 0.001, minspan = 0,
    newvar.penalty = 0, fast.k = 20, fast.beta = 1, linpreds = FALSE,
    allowed = NULL, pmethod = "backward", nprune = NULL, Object = NULL,
    Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
    Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE,
    Use.beta.cache = TRUE, ...)
To start off, look at the arguments formula, data, x, y, nk, and degree.
Many users will find that those arguments are all they need, plus in some
cases keepxy, nprune, penalty, minspan, and trace.
For GLM models, use the glm argument.
For cross validation, use the nfold argument.
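As a quick illustration of those arguments, here is a minimal session (the trees data ships with R; the settings here are illustrative only, not a recommendation):

```r
library(earth)

# Additive model (degree=1 is the default)
a <- earth(Volume ~ ., data = trees)
summary(a)

# Second-order interactions, with 5-fold cross validation
a2 <- earth(Volume ~ ., data = trees, degree = 2, nfold = 5)
```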
formula
    Model formula.

data
    Data frame for formula.

x
    Matrix or dataframe containing the independent variables.

y
    Vector containing the response variable, or, in the case of multiple
    responses, a matrix or dataframe whose columns are the values for each
    response.

subset
    Index vector specifying which cases to use, i.e., which rows in x to use.
    Default is NULL, meaning all.
weights
    Weight vector (not yet supported).

wp
    Vector of response weights.
    Default is NULL, meaning no response weights.
    If specified, wp must have an element for each column of y (after
    factors, if any, have been expanded).
    Note for mda::mars users: earth's internal normalization of wp is
    different from mars. Earth uses wp <- sqrt(wp/mean(wp)) and mars uses
    wp <- sqrt(wp/sum(wp)). Thus in earth, a wp with all elements equal is
    equivalent to no wp. For models built with wp, multiply the GCV
    calculated by mars by length(wp) to compare it to earth's GCV.
scale.y
    Scale y in the forward pass for better numeric stability [added Jan 2009].
    Scaling here means subtract the mean and divide by the standard deviation.
    Default is NCOL(y)==1, i.e., scale y unless y has multiple columns.
na.action
    NA action. Default is na.fail, and only na.fail is supported.
    (Why? Because adding support to earth for other NA actions is easy, but
    making sure that they are handled correctly internally in predict,
    plotmo, etc. is tricky. It is more reliable to make the user remove NAs
    before calling earth.)
glm
    NULL (default) or a list of arguments to glm.
    See the "Generalized linear models" section below. Example:
        earth(y ~ x, glm=list(family=binomial))
trace
    Trace earth's execution. Default is 0. Values:
        0   none
        .5  cross validation
        1   overview
        2   forward pass
        3   pruning
        4   model mats, memory use, more pruning, etc.
        5   ...
keepxy
    Set to TRUE to retain the following in the returned value: x and y
    (or data), subset, and weights. Default is FALSE.
    The function update.earth and friends will use these instead of
    searching for them in the environment at the time update.earth is
    invoked. This argument also affects the amount of data kept when the
    nfold argument is used (see cv.list in the "Value" section below).

The following arguments are for the forward pass.
nk
    Maximum number of model terms before pruning. Includes the intercept.
    Default is max(21, 2*NCOL(x)+1).
    The number of terms created by the forward pass will be less than nk if
    a forward stopping condition is reached before nk terms, or if the
    forward pass drops one side of a hinge pair to prevent linear
    dependencies. See the "Forward pass" section below.
degree
    Maximum degree of interaction (Friedman's mi). Default is 1, meaning
    build an additive model (i.e., no interaction terms).
penalty
    Generalized Cross Validation (GCV) penalty per knot.
    Default is if(degree>1) 3 else 2.
    A value of 0 penalizes only terms, not knots.
    The value -1 is treated specially to mean no penalty, so GCV = RSS/n.
    Theory suggests values in the range of about 2 to 4.
    In practice, for big data sets larger values can be useful to force a
    smaller model. The FAQ section below has some information on GCVs.
thresh
    Forward stepping threshold. Default is 0.001.
    This is one of the arguments used to decide when forward stepping
    should terminate. See the "Forward pass" section below.
minspan
    Minimum distance between knots.
    Use a value of 1 to consider all x values (which is good if the data
    are not noisy). The default value is 0. The value 0 is treated
    specially and means calculate the minspan internally, as per Friedman's
    MARS paper section 3.8 with alpha = 0.05.
    Set trace>=2 to see the calculated value.
    This argument is intended to increase resistance to runs of noise in
    the input data. Higher values will increase smoothness in your model.
    Note: predictor value extremes are ineligible for knots regardless of
    the minspan setting, as per the MARS paper equation 45.
newvar.penalty
    Penalty for adding a new variable in the forward pass (Friedman's
    gamma, equation 74 in the MARS paper).
    Default is 0, meaning no penalty for adding a new variable.
    Useful non-zero values range from about 0.01 to 0.2 — you will need to
    experiment. This argument can mitigate the effects of collinearity or
    concurvity in the input data, but anecdotal evidence is that it does
    not work very well. If you know two variables are strongly correlated
    then you would do better to delete one of them before calling earth.
fast.k
    Maximum number of parent terms considered at each step of the forward
    pass. Friedman invented this parameter to speed up the forward pass
    (see the Fast MARS paper section 3.0). Default is 20.
    Values of 0 or less are treated specially (as being equivalent to
    infinity), meaning no Fast MARS.
    Typical values, apart from 0, are 20, 10, or 5.
    In general, with a lower fast.k (say 5), earth is faster; with a higher
    fast.k, or with fast.k disabled (set to 0), earth builds a better
    model. However it is not unusual to get a slightly better model with a
    lower fast.k, and you may need to experiment.
fast.beta
    Fast MARS aging coefficient, as described in the Fast MARS paper
    section 3.1. Default is 1. A value of 0 sometimes gives better results.
linpreds
    Index vector specifying which predictors should enter linearly, as
    in lm. The default is FALSE, meaning all predictors enter in the
    standard MARS fashion, i.e., in hinge functions.
    A predictor's index in linpreds is the column number in the input
    matrix x after factors have been expanded.
    Examples are given in the FAQ section below.
    Note: in the current implementation, the GCV penalty for predictors
    that enter linearly is the same as that for predictors with knots.
    That is not quite correct; linear terms should be penalized less.
allowed
    Function specifying which predictors can interact and how.
    Default is NULL, meaning all standard MARS terms are allowed.
    Earth calls the allowed function just before adding a term in the
    forward pass. If allowed returns TRUE the term goes into the model as
    usual; if allowed returns FALSE the term is discarded.
    Examples are given in the FAQ section below.
    Your allowed function should have the following prototype:
        allowed <- function(degree, pred, parents, namesx, first) {...}
    where
    degree is the interaction degree of the candidate term.
    Will be 1 for additive terms.
    pred is the index of the candidate predictor.
    The predictor's index is the column number in the input matrix x after
    factors have been expanded.
    parents is the candidate parent term's row in dirs.
    namesx is optional and if present is the column names of x after
    factors have been expanded.
    first is optional and if present is TRUE the first time your allowed
    function is invoked for the current model, and thereafter FALSE.
nfold
    Number of cross validation folds.
    Default is 0, i.e., no cross validation.
    If greater than 1, earth first builds a standard model as usual, with
    all the data. It then builds nfold cross validated models, measuring
    R-Squared on the left-out data each time. The final cross validation
    R-Squared is the mean of these R-Squareds.
    If a binomial or poisson model (see the glm argument), then further
    statistics are calculated.
    See the "Cross validation" section below for details.
stratify
    Only applies if nfold>1. Default is TRUE.
    Stratify the cross validation samples so that, for each column of the
    response y (after factors have been expanded), an approximately equal
    number of cases with a non-zero response occur in each cross validation
    subset. That means that if y is a factor, there will be approximately
    equal numbers of each factor level in each fold (see the "Factors"
    section below). We say "approximately equal" because the number of
    occurrences of a factor level may not be exactly divisible by the
    number of folds.

The following arguments are for the pruning pass.
pmethod
    Pruning method. Default is "backward".
    One of: backward, none, exhaustive, forward, seqrep.
    Use none to retain all the terms created by the forward pass.
    If y has multiple columns, then only backward or none is allowed.
    Pruning can take a while if exhaustive is chosen and the model is big
    (more than about 30 terms). The current version of the leaps package
    used during pruning does not allow user interrupts (i.e., you have to
    kill your R session to interrupt; in Windows hit Ctrl-Alt-Delete or
    from the command line use tskill).
nprune
    Maximum number of terms (including intercept) in the pruned model.
    Default is NULL, meaning all terms created by the forward pass
    (but typically not all terms will remain after pruning).
    Use this to reduce exhaustive search time, or to enforce an upper
    bound on the model size.

The following arguments are for internal or advanced use.
Object
    Earth object to be updated, for use by update.earth.
Get.crit
    Criterion function for model selection during pruning. By default a
    function that returns the GCV. See the "Pruning pass" section below.

Eval.model.subsets
    Function to evaluate model subsets — see notes in source code.

Print.pruning.pass
    Function to print pruning pass results — see notes in source code.
Force.xtx.prune
    Default is FALSE.
    This argument pertains to subset evaluation in the pruning pass.
    By default, if y has a single column then earth calls the leaps
    routines; if y has multiple columns then earth calls
    EvalSubsetsUsingXtx. The leaps routines are more accurate but do not
    support multiple responses (leaps is based on the QR decomposition and
    EvalSubsetsUsingXtx is based on the inverse of X'X).
    Setting Force.xtx.prune=TRUE forces use of EvalSubsetsUsingXtx, even
    if y has a single column.
Use.beta.cache
    Default is TRUE.
    Using the "beta cache" takes more memory but is faster (by 20% and
    often much more for large models). The beta cache uses
    nk * nk * ncol(x) * sizeof(double) bytes.
    Set Use.beta.cache=FALSE to save memory.
    (The beta cache is an innovation in this implementation of MARS and
    does not appear in Friedman's papers. It is not related to the
    fast.beta argument.)
...
    Dots are passed on to earth.fit.
Value

An object of class "earth" which is a list with the components listed
below. Term refers to a term created during the forward pass (each line of
the output from format.earth is a term).
Term number 1 is always the intercept.
rss
    Residual sum-of-squares (RSS) of the model (summed over all responses
    if y has multiple columns).
rsq
    1 - rss/rss.null.
    R-Squared of the model (calculated over all responses).
    A measure of how well the model fits the training data.
    Note that rss.null is sum((y - mean(y))^2).
gcv
    Generalized Cross Validation (GCV) of the model (summed over all
    responses). The GCV is calculated using the penalty argument.
    For details of the GCV calculation, see equation 30 in Friedman's MARS
    paper and earth:::get.gcv.
grsq
    1 - gcv/gcv.null.
    An estimate of the predictive power of the model (calculated over all
    responses). Unlike rsq, in MARS models grsq can be negative.
    A negative grsq would indicate a severely over-parameterized model —
    a model that would not generalize well even though it may be a good
    fit to the training data.
    Watch the GRSq take a nose dive in this example:
        earth(mpg ~ ., data=mtcars, pmethod="none", trace=3)
bx
    Matrix of basis functions applied to x.
    Each column corresponds to a selected term.
    Each row corresponds to a row in the input matrix x, after taking
    subset. See model.matrix.earth for an example of bx handling.
    Example bx:
         (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
    [1,]           1           0.0           4.6                  0
    [2,]           1           0.0           4.3                  0
    [3,]           1           0.0           4.1                  0
    ...
dirs
    Matrix with one row per MARS term, and with ij-th element equal to:
         0  if predictor j is not in term i
        -1  if an expression of the form pmax(const - xj) is in term i
         1  if an expression of the form pmax(xj - const) is in term i
         2  if predictor j enters term i linearly.
    This matrix includes all terms generated by the forward pass,
    including those not in selected.terms.
    Note that the terms may not be in pairs, because the forward pass
    deletes linearly dependent terms before handing control to the
    pruning pass.
    Example dirs:
                                Girth Height
    (Intercept)                     0      0  # intercept
    h(Girth-12.9)                   1      0  # 2nd term uses Girth
    h(12.9-Girth)                  -1      0  # 3rd term uses Girth
    h(Girth-12.9)*h(Height-76)      1      1  # 4th term uses Girth and Height
    ...
cuts
    Matrix with ij-th element equal to the cut point for predictor j in
    term i. This matrix includes all terms generated by the forward pass,
    including those not in selected.terms.
    Note that the terms may not be in pairs, because the forward pass
    deletes linearly dependent terms before handing control to the
    pruning pass.
    Note for programmers: the precedent is to use dirs for term names etc.
    and to use cuts only where cut information is needed.
    Example cuts:
                                Girth Height
    (Intercept)                     0      0  # intercept, no cuts
    h(Girth-12.9)                12.9      0  # 2nd term has cut at 12.9
    h(12.9-Girth)                12.9      0  # 3rd term has cut at 12.9
    h(Girth-12.9)*h(Height-76)   12.9     76  # 4th term has two cuts
    ...
selected.terms
    Vector of term numbers in the best model.
    Can be used as a row index vector into cuts and dirs.
    The first element selected.terms[1] is always 1, the intercept.
prune.terms
    A matrix specifying which terms appear in which pruning pass subsets.
    The row index of prune.terms is the model size.
    (The model size is the number of terms in the model. The intercept is
    considered to be a term.)
    Each row is a vector of term numbers for the best model of that size.
    An element is 0 if the term is not in the model, thus prune.terms is a
    lower triangular matrix, with dimensions nprune x nprune.
    The model selected by the pruning pass is at row number
    length(selected.terms).
    Example prune.terms:
    [1,] 1 0 0 0 0 0 0  # intercept-only model
    [2,] 1 2 0 0 0 0 0  # best 2 term model uses terms 1,2
    [3,] 1 2 4 0 0 0 0  # best 3 term model uses terms 1,2,4
    [4,] 1 2 6 9 0 0 0  # and so on
    ...
rss.per.response
    A vector of the RSS for each response.
    Length is the number of responses, i.e., ncol(y) after factors in y
    have been expanded.
    The rss component above is equal to sum(rss.per.response).

rsq.per.response
    A vector of the R-Squared for each response.
    Length is the number of responses.

gcv.per.response
    A vector of the GCV for each response.
    Length is the number of responses.
    The gcv component above is equal to sum(gcv.per.response).

grsq.per.response
    A vector of the GRSq for each response.
    Length is the number of responses.
rss.per.subset
    A vector of the RSS for each model subset generated by the pruning
    pass. Length is nprune.
    For multiple responses, the RSS is summed over all responses for each
    subset. The null RSS (i.e., the RSS of an intercept-only model) is
    rss.per.subset[1].
    The rss above is rss.per.subset[length(selected.terms)].

gcv.per.subset
    A vector of the GCV for each model in prune.terms. Length is nprune.
    For multiple responses, the GCV is summed over all responses for each
    subset. The null GCV (i.e., the GCV of an intercept-only model) is
    gcv.per.subset[1].
    The gcv above is gcv.per.subset[length(selected.terms)].
fitted.values
    Fitted values. A matrix with dimensions nrow(y) x ncol(y) after
    factors in y have been expanded.

residuals
    Residuals. A matrix with dimensions nrow(y) x ncol(y) after factors
    in y have been expanded.

coefficients
    Regression coefficients.
    A matrix with dimensions length(selected.terms) x ncol(y) after
    factors in y have been expanded.
    Each column holds the least squares coefficients from regressing that
    column of y on bx.
    The first row holds the intercept coefficient(s).

penalty
    The GCV penalty used during pruning.
    A copy of earth's penalty argument.
call
    The call used to invoke earth.

terms
    Model frame terms.
    This component exists only if the model was built using earth.formula.

namesx
    Column names of x, generated internally by earth when necessary so
    each column of x has a name.
    Used, for example, by predict.earth to name columns if necessary.

namesx.org
    Original column names of x.

levels
    Levels of y if y is a factor;
    c(FALSE,TRUE) if y is logical;
    else NULL.
wp
    Copy of the wp argument to earth.

The following fields appear only if earth's argument keepxy is TRUE.

x, y, data, subset, weights
    Copies of the corresponding arguments to earth.
    Only exist if keepxy=TRUE.

The following fields appear only if earth's glm argument is used.
glm.list
    List of GLM models. Each element is the value returned by earth's
    internal call to glm for each response.
    Thus if there is a single response (or a single binomial pair, see the
    "Binomial pairs" section below) this will be a one element list and
    you access the GLM model with my.earth.model$glm.list[[1]].

glm.coefficients
    GLM regression coefficients.
    Analogous to the coefficients field described above, but for the GLM
    model(s). A matrix with dimensions length(selected.terms) x ncol(y)
    after factors in y have been expanded.
    Each column holds the coefficients from the GLM regression of that
    column of y on bx.
    This duplicates, for convenience, information in glm.list.

glm.bpairs
    NULL unless there are paired binomial columns.
    A logical vector, derived internally by earth, or a copy of the bpairs
    specified by the user in the glm list.
    See the "Binomial pairs" section below.

The following fields appear only if the nfold argument is greater than 1.
cv.rsq.tab
    Matrix with nfold+1 rows and nresponse+1 columns, where nresponse is
    the number of responses, i.e., ncol(y) after factors in y have been
    expanded. The first nresponse elements of a row are the RSq's on the
    left-out data for each response of the model generated at that row's
    fold. The final column holds the row mean (a weighted mean if wp is
    specified). The final row of the table holds the column means.
    The values in this final row are the CV-RSqs printed by summary.earth.
    Example for a single response model:
                y  mean
    fold 1  0.909 0.909
    fold 2  0.869 0.869
    fold 3  0.952 0.952
    fold 4  0.157 0.157
    fold 5  0.961 0.961
    mean    0.769 0.769
    Example for a multiple response model:
               y1    y2    y3  mean
    fold 1  0.915 0.951 0.944 0.937
    fold 2  0.962 0.970 0.970 0.968
    fold 3  0.914 0.940 0.942 0.932
    fold 4  0.907 0.929 0.925 0.920
    fold 5  0.947 0.987 0.979 0.971
    mean    0.929 0.955 0.952 0.946
cv.maxerr.tab
    Like cv.rsq.tab but is the MaxErr at each fold.
    This is the signed max absolute value at each fold.
    Also, results are aggregated using the signed max absolute value
    instead of the mean.
    The signed max absolute value is the maximum of the absolute
    difference between the predicted and observed response values,
    multiplied by -1 if the sign of the difference is negative.
cv.deviance.tab
    Like cv.rsq.tab but is the MeanDev at each fold. Binomial models only.

cv.calib.int.tab
    Like cv.rsq.tab but is the CalibInt at each fold. Binomial models only.

cv.calib.slope.tab
    Like cv.rsq.tab but is the CalibSlope at each fold. Binomial models only.

cv.auc.tab
    Like cv.rsq.tab but is the AUC at each fold. Binomial models only.

cv.cor.tab
    Like cv.rsq.tab but is the cor at each fold. Poisson models only.
cv.nterms
    Vector of length nfold+1.
    Number of MARS terms in the model generated at each cross validation
    fold, with the final element being the mean of these.

cv.nvars
    Vector of length nfold+1.
    Number of predictors in the model generated at each cross validation
    fold, with the final element being the mean of these.

cv.groups
    Specifies which cases went into which folds.
    Vector of length equal to the number of cases, with elements taking
    values in 1:nfold.
cv.list
    List of earth models, one model for each fold.
    These fold models have extra fields cv.rsq and cv.rsq.per.response
    (and, if keepxy is set, also cv.test.y and cv.test.fitted.values).
    To save memory, lengthy fields in the fold models are removed, unless
    you use the keepxy argument.
    The "lengthy fields" are $bx, $fitted.values, and $residuals.
Contents
. Other implementations
. Limitations
. Multiple response models
. Generalized linear models
. Factors
. Binomial pairs
. The forward pass
. The pruning pass
. Execution time
. Memory use
. Cross validation
. Cross validating binomial and poisson models
. Using earth with fda and mda
. Migrating from mda::mars
. Standard model functions
. Frequently asked questions
Other implementations
The results are similar to but not identical to other Multivariate Adaptive Regression Splines implementations. The differences stem from the forward pass where very small implementation differences (or perturbations of the input data) can cause rather different selection of terms and knots (although similar GRSq's). The backward passes give identical or near identical results, given the same forward pass results.
The source code of earth is derived from mars in the mda package written by
Trevor Hastie and Robert Tibshirani.
See also mars.to.earth.
The term "MARS" is trademarked and licensed exclusively to Salford Systems http://www.salfordsystems.com. Their implementation uses an engine written by Friedman and has some features not in earth.
StatSoft also have an implementation which they call MARSplines http://www.statsoft.com/textbook/stmars.html.
Limitations
The following aspects of MARS are mentioned in Friedman's papers but not
implemented in earth:
i) Piecewise cubic models
ii) Model slicing (plotmo
goes part way)
iii) Handling missing values
iv) Automatic grouping of categorical predictors into subsets
v) Fast MARS h parameter
Multiple response models
If y has k columns then earth builds k simultaneous models.
(Note that y will have multiple columns if a factor in y is expanded by
earth; see the "Factors" section below for details.)
Each model has the same set of basis functions (the same bx,
selected.terms, dirs, and cuts) but different coefficients (the returned
coefficients will have k columns).
The models are built and pruned as usual but with the GCVs and RSSs
averaged across all k responses.
Since earth attempts to optimize for all models simultaneously, the results will not be as "good" as building the models independently, i.e., the GCV of the combined model will usually not be as good as the GCVs for independently built models. However, the combined model may be a better model in other senses, depending on what you are trying to achieve. For example, it could be useful for earth to select the set of MARS terms that is best across all responses. This would typically be the case in a multiple response logistic model if some responses have a very small number of successes.
Note that automatic scaling of y (via the scale.y argument) does not take
place if y has multiple columns. You may want to scale your y columns
before calling earth so each y column gets the appropriate weight during
model building (a y column with a big variance will influence the model
more than a column with a small variance). You could do this by calling
scale before invoking earth, or by setting the scale.y argument, or by
using the wp argument.
Here are a couple of (artificial) examples to show some of the ways
multiple responses can be specified. Note that data.frames can't be used
on the left side of a formula, so cbind is used in the first example.
The examples use the standard technique of specifying a tag lvol= to name
a column.

    earth(cbind(Volume, lvol=log(Volume)) ~ ., data=trees)

    attach(trees)
    earth(data.frame(Girth,Height), data.frame(Volume, lvol=log(Volume)))

Don't use a plus sign on the left side of the tilde. You might think that
specifies a multiple response, but instead it arithmetically adds the
columns.
For more details on using residual errors averaged over multiple responses see section 4.1 of Hastie, Tibshirani, and Buja Flexible Discriminant Analysis by Optimal Scoring, JASA, December 1994 http://www-stat.stanford.edu/~hastie/Papers/fda.pdf.
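As a sketch of weighting one response column more heavily than another via the wp argument described above (the weights here are arbitrary, chosen only to illustrate the syntax):

```r
library(earth)

# wp needs one element per column of y:
# here, weight the lvol response twice as much as Volume
a <- earth(cbind(Volume, lvol = log(Volume)) ~ ., data = trees, wp = c(1, 2))
```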
Generalized linear models
Earth builds a GLM model if the glm argument is specified.
Earth builds the model as usual and then invokes glm on the MARS basis
matrix bx.
In more detail, the model is built as follows.
Earth first builds a standard MARS model, including the internal call to
lm.fit on bx after the pruning pass.
(See "The forward pass" and "The pruning pass" sections below.)
Thus knot positions and terms are determined as usual and all the standard
fields in earth's return value will be present.
Earth then invokes glm for the response on bx with the parameters
specified in the glm argument to earth.
For multiple response models (when y has multiple columns), the call to
glm is repeated independently for each response.
The results go into three extra fields in earth's return value:
glm.list, glm.coefficients, and glm.bpairs.
Earth's internal call to glm is made with the glm arguments x, y, and
model set TRUE.
Use summary(my.model) as usual to see the model.
Use summary(my.model, details=TRUE) to see more details, but note that the
printed P-values of the GLM coefficients are meaningless.
This is because of the amount of preprocessing by earth — the mantra is
"variable selection overstates significance of the selected variables".
Use plot(my.model$glm.list[[1]]) to plot the (first) glm model.
The examples below show how to specify earth-glm models.
The examples are only to illustrate the syntax and are not necessarily
useful models. In the examples pmethod="none", because otherwise with
these artificial models earth tends to prune away everything except the
intercept term. You wouldn't normally use pmethod="none".
Also, trace=1, so if you run these examples you can see how earth expands
the input matrices (as explained in the "Factors" and "Binomial pairs"
sections below).
1. Two-level factor or logical response.
The response is converted to a single column of 1s and 0s.

    a1 <- earth(survived ~ ., data=etitanic, # c.f. Harrell "Reg Mod Strat" ch. 12
                degree=2, trace=1,
                glm=list(family=binomial))

    a1a <- earth(etitanic[,-2], etitanic[,2], # equivalent but using earth.default
                 degree=2, trace=1,
                 glm=list(family=binomial))

2. Factor response.
This example is for a factor with more than two levels.
(For factors with just two levels, see the previous example.)
The factor pclass is expanded to three indicator columns (whereas in a
direct call to glm, pclass would be treated as logical: the first level
versus all other levels).

    a2 <- earth(pclass ~ ., data=etitanic, trace=1,
                glm=list(family=binomial))

3. Binomial model specified with a column pair.
This is a single response model but specified with a pair of columns:
see the "Binomial pairs" section below.
For variety, this example uses a probit link and (unnecessarily)
increases maxit.

    ldose <- rep(0:5, 2) - 2 # V&R 4th ed. p. 191
    sex <- factor(rep(c("male", "female"), times=c(6,6)))
    numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)
    pair <- cbind(numdead, numalive=20 - numdead)

    a3 <- earth(pair ~ sex + ldose, trace=1, pmethod="none",
                glm=list(family=binomial(link=probit), maxit=100))

4. Double binomial response (i.e., a multiple response model) specified
with two column pairs.

    numdead2 <- c(2,8,11,12,20,23,0,4,6,16,12,14) # bogus data
    doublepair <- cbind(numdead, numalive=20-numdead,
                        numdead2=numdead2, numalive2=30-numdead2)

    a4 <- earth(doublepair ~ sex + ldose, trace=1, pmethod="none",
                glm=list(family="binomial"))

5. Poisson model.

    counts <- c(18,17,15,20,10,20,25,13,12) # Dobson 1990 p. 93
    outcome <- gl(3,1,9)
    treatment <- gl(3,3)

    a5 <- earth(counts ~ outcome + treatment, trace=1, pmethod="none",
                glm=list(family=poisson))

6. Standard earth model, the long way.

    a6 <- earth(numdead ~ sex + ldose, trace=1, pmethod="none",
                glm=list(family=gaussian(link=identity)))

    print(a6$coefficients == a6$glm.coefficients) # all TRUE
Factors
Factors in x:
Earth treats factors in x in the same way as standard R models such as lm
(where x is taken to mean the right hand side of the formula).
Thus factors are expanded using the current setting of
options("contrasts").
Factors in y:
Earth treats factors in the response in a non-standard way that makes use
of earth's ability to handle multiple responses.
A two level factor (or logical) is converted to a single indicator column
of 1s and 0s. A factor with three or more levels is converted into k
indicator columns of 1s and 0s, where k is the number of levels (the
contrasts matrix is diagonal, see contr.earth.response).
This happens regardless of the options("contrasts") setting and regardless
of whether the factors are ordered or unordered.
For example, if a column in y is a factor with levels A, B, and C, the
column will be expanded to three columns like this (the actual data will
vary but each row will have a single 1):

    A B C   # one column for each factor level
    0 1 0   # each row has a single 1
    1 0 0
    0 0 1
    0 0 1
    0 0 1
    ...

In distinction, a standard treatment contrast on the rhs of a model with
an intercept would have no first "A" column (to prevent linear
dependencies on the rhs of the model formula).
This expansion to multiple columns (which only happens for factors with more than two levels) means that earth will build a multiple response model as described in the "Multiple responses" section above.
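The diagonal expansion above can be mimicked in a few lines of base R (for illustration only; earth does this internally via contr.earth.response):

```r
# Expand a factor into one indicator column per level, as earth does for y
y <- factor(c("B", "A", "C", "C"))
ind <- diag(nlevels(y))[as.integer(y), ]   # one row per case
colnames(ind) <- levels(y)                 # one column per factor level
ind                                        # each row has a single 1
```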
Paired binomial response columns in y are treated specially — see the
"Binomial pairs" section below.
Use trace=1 or higher to see the column names of the x and y matrices
after factor processing. Use trace=4 to see the first few rows of x and y
after factor processing.
Here is an example which uses the etitanic data to predict the passenger
class (not necessarily a sensible thing to do but provides a good example
here):

    > data(etitanic)
    > head(etitanic) # pclass and sex are unordered factors
      pclass survived    sex    age sibsp parch
    1    1st        1 female 29.000     0     0
    2    1st        1   male  0.917     1     2
    3    1st        0 female  2.000     1     2
    > earth(pclass ~ ., data=etitanic, trace=1) # note col names in x and y below
    x is a 1046 by 5 matrix: 1=survived, 2=sexmale, 3=age, 4=sibsp, 5=parch
    y is a 1046 by 3 matrix: 1=1st, 2=2nd, 3=3rd
    rest not shown here...
Binomial pairs
This section is only relevant if you use earth's glm
argument
with a binomial or quasibinomial family
.
Users of the glm
function will be familiar with
the technique of specifying a binomial response as a two-column matrix,
with a column for the number of successes and a column for the failures.
Earth automatically detects when such columns are present in y
(by looking for adjacent columns which both have entries greater than 1).
The first column only is used to build the standard earth model.
Both columns are then passed to earth's internal call to glm
.
As always, use trace=1
to see how the columns
of x
and y
are expanded.
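The detection rule described above can be sketched in a few lines of R (a hypothetical illustration only, not earth's actual code — the function name is invented here):

```r
# Hypothetical sketch of the automatic detection described above:
# a column and its right-hand neighbour are taken to be a binomial
# success/failure pair if both contain entries greater than 1.
looks.like.bpair <- function(y, i) {    # y is the response matrix, i a column index
    i < ncol(y) && any(y[, i] > 1) && any(y[, i + 1] > 1)
}
```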
You can override this automatic detection by including a bpairs
parameter.
This is usually (always?) unnecessary. For example

    glm=list(family=binomial, bpairs=c(TRUE, FALSE))

specifies that there are two columns in the response, with the second paired with the first. These examples

    glm=list(family=binomial, bpairs=c(TRUE, FALSE, TRUE, FALSE))
    glm=list(family=binomial, bpairs=c(1,3))    # equivalent

specify that the 1st and 2nd columns are a binomial pair, and the 3rd and 4th columns another binomial pair.
The forward pass
Understanding the details of the forward and pruning passes
will help you understand earth's return value and
the admittedly large number of arguments.
The result of the forward pass is the MARS basis matrix bx
and
the set of terms defined by dirs
and cuts
(these are all fields in earth's return value,
but the bx
here includes all terms before trimming
back to selected.terms
).
The forward pass adds terms in pairs until the first of the following conditions is met:
i) reach maximum number of terms (nterms >= nk)
ii) reach DeltaRSq threshold (DeltaRSq < thresh)
, where
DeltaRSq is the difference in R-Squared caused by adding the current term pair,
and thresh
is the argument to earth
iii) reach max RSq (RSq > 1-thresh)
iv) reach min GRSq (GRSq < -10)
(-10 is a pathologically bad GRSq)
v) no new term increases the RSq (reached numerical limits).
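The conditions above can be summarized in an R sketch (purely illustrative — earth's forward pass is implemented in C, and the function below is an invented name):

```r
# Illustrative sketch of the forward-pass stopping tests listed above.
# rsq.old and rsq.new are the RSq before and after adding the term pair.
stop.reason <- function(nterms, nk, rsq.old, rsq.new, grsq, thresh) {
    if (nterms >= nk)                return("reached nk")           # (i)
    if (rsq.new - rsq.old < thresh)  return("DeltaRSq < thresh")    # (ii)
    if (rsq.new > 1 - thresh)        return("RSq > 1-thresh")       # (iii)
    if (grsq < -10)                  return("GRSq < -10")           # (iv)
    NULL    # keep adding terms; condition (v) is a numerical test not shown here
}
```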
Set trace>=1
to see the stopping condition and
trace>=2
to trace the forward pass.
You can disable all termination conditions
except (i) and (v) by setting thresh=0
.
See the FAQ below "Why do I get fewer terms than nk?".
Note that GCVs (via GRSq) are used during the forward pass only as one of the
(more unusual) stopping conditions and in trace
prints.
Changing the penalty
argument does not change the knot positions.
The various stopping conditions mean that the actual number of terms
created by the forward pass may be less than nk
.
There are other
reasons why the actual number of terms may be less than nk
:
(i) the forward pass discards one side of a term pair
if it adds nothing to the model —
but the forward pass counts terms as if they were actually created in pairs,
and (ii) as a final step, the forward pass deletes linearly dependent terms, if any,
so all terms in dirs
and cuts
are independent.
And remember that the pruning pass will further discard terms.
The pruning pass
The pruning pass is handed the sets of terms created by the forward pass. Its job is to find the subset of those terms that gives the lowest GCV. The following description of the pruning pass explains how various fields in earth's returned value are generated.
The pruning pass works like this:
it determines the subset of terms in bx
(using pmethod
)
with the lowest RSS for each model size in 1:nprune
(see the Force.xtx.prune
argument above for some details).
It saves the RSS and term numbers for each such subset in rss.per.subset
and prune.terms
.
It then applies the Get.crit
function with penalty
to rss.per.subset
to yield gcv.per.subset
.
Finally it chooses the model with the lowest value in gcv.per.subset
,
puts its term numbers into selected.terms
,
and updates bx
by keeping only the selected.terms
.
After the pruning pass, earth runs lm.fit
to determine the
fitted.values
, residuals
, and coefficients
, by regressing
the response y
on bx
.
If y
has multiple columns then lm.fit
is called
for each column.
If a glm
argument is passed to earth,
earth runs glm
on (each column of) y
in addition to the above call to lm.fit
.
Set trace>=3
to trace the pruning pass.
By default Get.crit
is earth:::get.gcv
.
Alternative Get.crit
functions can be defined.
See the source code of get.gcv
for an example.
Execution time: "I wanna go fast"
For a given set of input data, the following can increase the speed of the forward pass:
i) decreasing fast.k
ii) decreasing nk
(because fewer forward pass terms)
iii) decreasing degree
iv) increasing thresh
(because fewer forward pass terms)
v) increasing minspan
.
The backward pass is normally much faster than the forward pass,
unless pmethod="exhaustive"
.
Reducing nprune
reduces exhaustive search time.
One strategy is to first build a large model
and then adjust pruning parameters such as nprune
using update.earth
.
The following very rough rules of thumb apply for large models.
Using minspan=1
instead of the default 0
will increase times by 20 to 50%.
Using fast.k=5
instead of the default 20
can give substantial speed gains
but will sometimes give a much smaller GRSq
.
Using an allowed
function slows down model building by about 10%.
Memory use
Earth does not impose specific limits on the model size.
Model size is limited only by the amount of memory on your system,
the maximum memory addressable by R, and your patience.
On a 32 bit machine with x
and y
of type
double (no factors),
the number of bytes of memory used by earth is about
    8 * (nk^2 * ncol(x) + (nrow(x) * (3 + 2*nk + ncol(x)/2)))

Earth prints the results of the above calculation if
trace>=4
.
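For quick estimates, the formula above can be wrapped as a small helper (the function name is invented here, and this is only the rough approximation stated above):

```r
# Rough estimate of earth's memory use in bytes, per the formula above
# (assumes a 32 bit machine with double x and y and no factors).
earth.memory.bytes <- function(nrow.x, ncol.x, nk)
    8 * (nk^2 * ncol.x + nrow.x * (3 + 2 * nk + ncol.x / 2))

earth.memory.bytes(1e4, 100, 21)    # about 8e6 bytes
```

Note that this estimates only earth's own allocation; the total process memory reported by the operating system will be considerably larger, as the example below shows.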
Memory use peaks in the forward pass.
The bulk of the forward pass is implemented in C.
It allocates memory "outside of R" and so
memory.size will not report the memory it uses.
Before calling earth
, R itself will of course allocate
memory over and above the amount calculated above.
To reduce total memory usage, it sometimes helps to remove
variables
and call gc before invoking earth
.
Earth uses more memory if any elements of the
x
and y
arguments are not double,
because it must convert them to double internally.
The same applies if the subset
argument is used.
Earth uses more memory if trace>=2
(because DUP=TRUE
is required to pass predictor names
to earth's internal call to .C
).
Increasing the degree
does not change the memory requirement
but greatly increases the running time.
Here is an example of memory use:
the earth test suite builds a model using earth.default
with a 1e4 by 100 input matrix with nk=21
.
The Windows XP task manager
reports that the peak memory use when building this model is 47 MBytes.
Using the formula interface to earth pushes memory to 62 MBytes.
Increasing the number of rows in the input matrix to 1e5 pushes memory to 240 MBytes.
Cross validation
Use cross validation to get an estimate of RSq on independent data.
Example (note the nfold
parameter):
    a <- earth(survived ~ ., data=etitanic, degree=2, nfold=10)
    summary(a)    # note the CV-RSq field

Cross validation is done if
nfold
is greater than 1 (typically 10).
Earth first builds a standard model with all the data as usual.
This means that all the standard fields in earth's return value
appear as usual.
Earth then builds nfold
cross validated models.
It measures RSq on the test data (i.e., the left-out data) for each fold.
The final cross validation RSq is the mean of these RSq's.
Use summary.earth to see this final value and its standard deviation
across cross validation folds.
The cross validation results go into extra fields in earth's return value.
All of these have a cv
prefix —
see the "Value" section above for details.
For multiple response models, at each fold earth calculates the RSq for each
response independently, and combines these by taking their mean
(or weighted mean if the wp
argument is used).
With trace=.5
or higher,
earth prints out progress information as cross validation proceeds.
For example
    CV fold 3: CV-RSq 0.622 ntrain-nz 384 ntest-nz 43

shows that for cross validation fold number 3, the RSq on the test set (i.e., the left-out data) is 0.622. The printout also shows the number of non-zero values in the observed response in the fold's training set and test set. This is useful if you have a binary or factor response and want to check that you have enough examples of each factor level in each fold. With the
stratify
argument (which is enabled by default),
earth attempts to keep the numbers of each level
constant across folds.
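A stratified fold assignment of this kind can be sketched in base R (a hypothetical illustration only — earth's actual implementation is in earth.cv.R):

```r
# Illustrative sketch: deal each response level out round-robin so the
# levels are spread as evenly as possible across the nfold folds.
stratified.folds <- function(y, nfold) {
    fold <- integer(length(y))
    for (level in unique(y)) {
        idx <- sample(which(y == level))    # shuffle within the level
        fold[idx] <- rep_len(seq_len(nfold), length(idx))
    }
    fold
}
```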
For reproducibility, call set.seed before calling earth with nfold
.
Cross validating binomial and poisson models
If you cross validate a binomial or poisson model
(specified using earth's nfold
and glm
arguments),
earth returns the following additional statistics.
Each of these is measured on the test set for each fold,
and averaged across all folds
(except that the signed max absolute value instead of the average is used for MaxErr
).
Use summary.earth to see these statistics
and their standard deviation across folds.
CV-RSq       cross validated R-Squared, identical to CV-RSq for non-glm models
MaxErr       signed max absolute difference between the predicted and observed
             response. This is the maximum of the absolute differences between
             the predicted and observed response values, multiplied by -1 if
             the sign of the difference is negative.
MeanDev      deviance divided by the response length
CalibInt,
CalibSlope   calibration intercept and slope (from regressing the observed
             response on the predicted response)
AUC          (binomial models only) area under the ROC curve
cor          (poisson models only) correlation between the predicted and
             observed response
See the source code in earth.cv.R
for details.
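For example, the MaxErr statistic described above amounts to the following (a hypothetical helper for illustration, not the earth package's code):

```r
# Signed max absolute difference: the largest absolute error,
# carrying the sign of that error.
signed.max.err <- function(observed, predicted) {
    differences <- predicted - observed
    differences[which.max(abs(differences))]
}

signed.max.err(observed = c(0, 0, 0), predicted = c(1, -3, 2))    # -3
```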
For multiple response models, at each fold earth calculates these statistics for each
response independently, and combines them by taking their mean,
or weighted mean if the wp
argument is used
(but takes the signed max absolute value instead of the mean for MaxErr
)
[TODO should do the same for CalibInt
, CalibSlope
?].
Taking the mean is a rather dubious way of combining results from
what are essentially quite different models,
but can nevertheless be useful.
Explanations of the above statistics can be found in the following (and many other) references:
T. Fawcett (2004) ROC Graphs: Notes and Practical Considerations for Researchers. Revised version of Technical report HP Laboratories. http://home.comcast.net/~tom.fawcett/public_html/papers
J. Pearce and S. Ferrier (2000) Evaluating the predictive performance of habitat models developed using logistic regression
F. Harrell (2001) Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
Using earth with fda and mda
Earth
can be used with fda
and mda
in the mda
package. Earth will generate a multiple response model.
Use the fda/mda
argument keep.fitted=TRUE
if
you want to call plot.earth
later
(actually only necessary for large datasets, see the description of
keep.fitted
in fda
).
Use the earth
argument keepxy=TRUE
if you want to call
update.earth
or plotmo
later.
Example:
    library(mda)
    (a <- fda(Species~., data=iris, keep.fitted=TRUE, method=earth, keepxy=TRUE))
    plot(a)
    summary(a$fit)    # examine earth model embedded in fda model
    plot(a$fit)
    plotmo(a$fit, ycolumn=1, ylim=c(-1.5,1.5), clip=FALSE)
    plotmo(a$fit, ycolumn=2, ylim=c(-1.5,1.5), clip=FALSE)
Migrating from mda::mars
Changing code from mda::mars
to earth
is usually just a matter
of changing the call from "mars
" to "earth
".
But there are a few argument differences and
earth will issue a warning if you give it a mars
-only argument.
The resulting model will be similar but not identical because of small implementation differences which are magnified by the inherent instability of the MARS forward pass.
If you are further processing the output of earth you will need to
consider differences in the returned value. The header of the
source file mars.to.earth.R
describes these.
Perhaps the most important is that mars
returns the
MARS basis matrix in a field called "x
"
whereas earth
returns "bx
".
Also, earth
returns "dirs
" rather than "factors
",
and in earth
this matrix can have entries of value 2 for linear predictors.
See also mars.to.earth.
Standard model functions
Standard model functions such as case.names
are provided for earth
objects and are not explicitly documented.
Many of these give warnings when the results are not what you may expect.
Pass warn=FALSE
to these functions to turn off just these warnings.
FREQUENTLY ASKED QUESTIONS
What are your plans for earth?
We would like to add support of case weights (to allow boosting), but that won't happen anytime soon.
How can I establish variable importance?
Use the evimp
function.
See its help page for more details.
The summary.earth
function lists the predictors
in order of estimated importance
using the nsubsets
criterion of evimp
.
Which predictors were added to the model first?
You can see the forward pass adding terms with trace=2
or higher.
But remember, pruning will usually remove some of the terms.
You can also use
    summary(my.model, decomp="none")

which will list the basis functions remaining after pruning, in the order they were added by the forward pass.
Which predictors are actually in the model?
The following function will give a list of predictors in the model:
    get.used.pred.names <- function(obj)    # obj is an earth object
    {
        any1 <- function(x) any(x != 0)    # like any but no warning if x is double
        names(which(apply(obj$dirs[obj$selected.terms, , drop=FALSE], 2, any1)))
    }

How can I train on one set of data and test on another?
The example below demonstrates one way to train on 80% of the data and test on the remaining 20%.
    train.subset <- sample(1:nrow(trees), .8 * nrow(trees))
    test.subset <- (1:nrow(trees))[-train.subset]
    a <- earth(Volume ~ ., data = trees[train.subset, ])
    yhat <- predict(a, newdata = trees[test.subset, ])
    y <- trees$Volume[test.subset]
    print(1 - sum((y - yhat)^2) / sum((y - mean(y))^2))    # print R-Squared

In practice a dataset larger than the one in the example should be used for splitting. The model variance is too high with this small set — run the example a few times to see how the model changes as
sample
splits the dataset differently on each run.
Also, remember that the test set should not be used for parameter tuning
because you will be optimizing for the test set —
instead use GCVs, separate parameter selection sets, or techniques
such as cross-validation.
Why do I get fewer terms than nk
, even with pmethod="none"
?
There are several conditions that can terminate the forward pass,
and reaching nk
is just one of them.
See the "Forward pass" section above.
Setting earth's argument thresh
to zero is treated as a special case:
thresh=0
disables all termination conditions except nk
and conditions involving numerical limits.
With thresh=0
, the measured GRSq
(and thus the efficacy of the pruning pass)
should be treated with skepticism,
especially if you get the warning
effective number of GCV parameters >= number of cases
.
Why do I get fewer terms than nprune
?
The pruning pass selects a model with the lowest GCV
that has nprune
or fewer terms.
Thus the nprune
argument specifies the maximum
number of permissible terms in the final pruned model.
You can work around this because you will get exactly nprune
terms if you specify pmethod="none"
.
Compare the output of these two examples:
    earth(Volume ~ ., data = trees, trace=3)
    earth(Volume ~ ., data = trees, trace=3, pmethod="none")

Another way to get exactly
nprune
terms is to specify penalty = -1
.
This special value of penalty
causes earth to set the GCV to RSS/nrow(x)
.
Since the training RSS always decreases with more terms,
the pruning pass will choose the maximum allowable number of terms.
An example:
earth(Volume ~ ., data = trees, trace=3, penalty=-1)
Is it best to hold down model size with nk
or nprune
?
If you want the best possible small model, build a big model
(by specifying a big nk
)
and prune it back (by specifying a small nprune
).
This is better than directly building a small model by
specifying a small nk
, because the pruning pass can look at all the
terms whereas the forward pass can only see one term ahead.
However, it is much faster building a small model by specifying a small nk
.
Can you give an example of the linpreds
argument?
With linpreds
you can specify which predictors should enter linearly,
instead of in hinge functions.
The linpreds
argument does not stipulate that a predictor must enter the model,
only that if it enters it should enter linearly.
Starting with
    a1 <- earth(Volume ~ ., data = trees)
    plotmo(a1)

we see in the
plotmo
graphs or by running evimp
that Height
is not as important as Girth
.
For corroborative evidence that Girth
is a more reliable
indicator of Volume
you can use pairs
:
    pairs(trees, panel = panel.smooth)

Since we want the simplest model that describes the data, we can specify that
Height
should enter linearly:
    a2 <- earth(Volume ~ ., data = trees, linpreds = 2)    # 2 is Height column
    summary(a2)

which yields

    Expression: -7.41 + 0.418 * Height + 5.86 * pmax(0, Girth - 12.9) - 2.41 * pmax(0, 12.9 - Girth)

In this example, the second, simpler model has almost the same RSS as the first model. We can make both
Girth
and Height
enter linearly with
    a3 <- earth(Volume ~ ., data = trees, linpreds = c(1,2))

or with (the single TRUE is recycled to the length of
linpreds
)
    a4 <- earth(Volume ~ ., data = trees, linpreds = TRUE)

But specifying that all predictors should enter linearly is not really a useful thing to do. In our simple example, the all-linear MARS model is the same as a standard linear model

    a5 <- lm(Volume ~ ., data = trees)

(compare the
summary
for each) but in general that will not be true.
Earth will not include a linear predictor if that predictor does not improve the model.
Can you give an example of the allowed
argument?
You can specify how variables are allowed to enter MARS terms
with the allowed
argument.
The interface is flexible but requires a bit of programming. We start with a simple example, which completely excludes one predictor from the model:
    example1 <- function(degree, pred, parents)    # returns TRUE if allowed
    {
        pred != 2    # disallow predictor 2, which is "Height"
    }
    a1 <- earth(Volume ~ ., data = trees, allowed = example1)
    print(summary(a1))

But that's not much use, because it's simpler to exclude the predictor from the input matrix when invoking earth:
    a2 <- earth(Volume ~ . - Height, data = trees)

The example below is more useful. It prevents the specified predictor from being used in interaction terms. (The example is artificial because it's unlikely you would want to single out humidity from interactions in the ozone data.)
The parents
argument is the candidate parent's row in the dirs
matrix
(dirs
is described in the "Value" section above).
Each entry of parents
is 0, 1, -1, or 2, and you index
parents
on the predictor index.
Thus parents[pred]
is 0 if pred
is not in the parent term.
    example2 <- function(degree, pred, parents)
    {
        # disallow humidity in terms of degree > 1
        # 3 is the "humidity" column in the input matrix
        if (degree > 1 && (pred == 3 || parents[3]))
            return(FALSE)
        TRUE
    }
    a3 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example2)
    print(summary(a3))

The following example allows only the specified predictors in interaction terms:
    example3 <- function(degree, pred, parents)
    {
        # allow only humidity and temp in terms of degree > 1
        # 3 and 4 are the "humidity" and "temp" columns
        allowed.set <- c(3,4)
        if (degree > 1 &&
                (all(pred != allowed.set) || any(parents[-allowed.set])))
            return(FALSE)
        TRUE
    }
    a4 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example3)
    print(summary(a4))

The basic MARS model building strategy is always applied even when there is an
allowed
function.
For example, earth considers a term for addition only
if all factors of that term except the new one are already in a model term.
This means that an allowed
function that inhibits, say, all degree 2
terms will also effectively inhibit higher degrees too, because
there will be no degree 2 terms for earth to extend to degree 3.
You can expect model building to be about 10% slower with an allowed
function
because of the time taken to invoke the allowed
function.
Using predictor names instead of indices in the "allowed" function.
You can use predictor names instead of indices using
the optional namesx
argument.
If present, namesx
is the column names of x
after factors have been expanded.
The first example above (the one that disallows Height
) can be rewritten as
    example1a <- function(degree, pred, parents, namesx)
    {
        namesx[pred] != "Height"
    }

Comparing strings is inefficient, and the above example can be rewritten a little more efficiently using the optional
first
argument.
If present, this is TRUE the first time your allowed function is called for
the current model and thereafter FALSE.
    iheight <- 0    # column index of "Height"
    example1b <- function(degree, pred, parents, namesx, first)
    {
        if (first) {
            # first time this function is invoked, so
            # stash column index of "Height" in iheight
            iheight <<- which(namesx == "Height")    # note use of <<- not <-
            if (length(iheight) != 1)
                stop("no Height in ", paste(namesx, collapse=" "))
        }
        pred != iheight
    }

How does
summary.earth
order terms?
With decomp="none"
,
the terms are ordered as created by the forward pass.
With the default decomp="anova"
,
the terms are ordered in increasing order of interaction.
In detail:
(i) terms are sorted first on degree of interaction
(ii) then terms with a linpreds
linear factor before standard terms
(iii) then on the predictors (in the order of the columns in the input matrix)
(iv) and finally on increasing knot values.
It's actually earth:::reorder.earth
that does the ordering.
summary.earth
lists predictors with weird names that aren't in x
. What gives?
You probably have factors in your x
matrix,
and earth is applying contrasts
.
See the "Factors" section above.
Why pmax
and not max
in the output from summary.earth
(with style="pmax"
)?
With pmax
the earth equation is an R expression
that can handle multiple cases.
Thus the expression is consistent with the
way predict.earth
works — you can give predict
multiple cases (i.e., multiple rows in the input matrix)
and it will return a vector of predicted values.
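The difference is easy to see in base R:

```r
girth <- c(10, 14, 20)    # three cases
pmax(0, girth - 13)       # elementwise hinge, one value per case: 0 1 7
max(0, girth - 13)        # collapses all cases to a single value: 7
```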
What about boosting MARS?
If you want to boost, use boosted trees rather than boosted MARS — you will get better results.
More precisely, although gradient boosted MARS gives
better results than plain MARS,
if you would like to improve prediction performance (at the cost
of a more complicated and less interpretable model)
you will usually get better results with
boosted trees (via, say, the gbm
package) than with boosted MARS.
See Gillian Ward (2007) Statistics in Ecological Modeling:
Presence-Only Data and Boosted Mars (Doctoral Thesis)
http://www-stat.stanford.edu/~hastie/THESES/Gill_Ward.pdf.
This could change as the state of the art advances.
What about bagging MARS?
The caret
package provides functions for bagging MARS
(and for parameter selection).
What is a GCV, in simple terms?
GCVs are important for MARS because the pruning pass uses GCVs to evaluate model subsets.
In general terms, when testing a model (not necessarily a MARS model) we want to test generalization performance and so want to measure error on independent data, i.e., not on the training data. Often a decent set of independent data is unavailable and so we resort to cross validation or leave-one-out methods. But that can be painfully slow. As an alternative, for certain forms of model we can use a formula to approximate the error that would be determined by leave-one-out validation — that approximation is the GCV. The formula adjusts (i.e., increases) the training RSS to take into account the flexibility of the model. Summarizing, the GCV approximates the RSS (divided by the number of cases) that would be measured on independent data. Even when the approximation is not that good, it is usually good enough for comparing models during pruning.
GCVs were introduced by Craven and Wahba, and extended by Friedman for MARS. See Hastie et al. p216 and the Friedman MARS paper. GCV stands for "Generalized Cross Validation", a perhaps misleading term.
The GRSq
measure used in the earth package standardizes the raw GCV,
in the same way that R-Squared standardizes the RSS.
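As a sketch of how these quantities relate (following the general form of Friedman's GCV — this mirrors earth:::get.gcv only approximately, and the effective-parameter count below is an assumption based on the MARS paper, not the package's exact code):

```r
# Approximate GCV: training RSS per case, inflated to account for model
# flexibility (more terms means a larger effective number of parameters).
gcv <- function(rss, ncases, nterms, penalty) {
    enp <- nterms + penalty * (nterms - 1) / 2    # effective number of parameters
    rss / (ncases * (1 - enp / ncases)^2)
}

# GRSq standardizes the GCV in the same way R-Squared standardizes the RSS,
# using the GCV of the intercept-only (null) model as the baseline.
grsq <- function(gcv.model, gcv.null) 1 - gcv.model / gcv.null
```

With this formulation, adding terms decreases the RSS but increases enp, so the GCV trades training fit against model flexibility.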
If GCVs are so important, why don't linear models use them?
First a few words about overfitting. An overfit model fits the training data well but will not give good predictions on new data. The idea is that the training data captures the underlying structure in the system being modeled plus noise. We want to model the underlying structure and ignore the noise. An overfit model models the specific realization of noise in the training data and thus is too specific to the training data.
The more flexible a model, the more its propensity to overfit the training data. Linear models are constrained, with usually only a few parameters, and don't have the tendency to overfit like more flexible models such as MARS. This means that for linear models, the RSS on the data used to build the model is usually an adequate measure of generalization ability.
This is no longer true if you do automatic variable selection on linear models,
because the process of selecting variables increases the flexibility
of the model. Hence the AIC — as used in, say, drop1
.
The GCV, AIC, and friends are means to the same end.
Depending on what information is available during model building,
we use one of these statistics to estimate model generalization performance
for the purpose of selecting a model.
What happened to get.nterms.per.degree
,
get.nused.preds.per.subset
, and reorder.earth
?
From release 1.3.0, some earth functions are no longer public,
to help simplify the user interface.
The functions are still available (and stable) if you need them —
use for example earth:::reorder.earth()
.
What happened to the ppenalty
argument?
This was removed (release 1.3.1) because it is no longer needed.
The ponly
argument of update.earth
is a more flexible way of achieving the same end.
Author

Stephen Milborrow, derived from mda::mars
by Trevor Hastie and Robert Tibshirani.
The approach used for GLMs was motivated by work done by Jane Elith and John Leathwick (a representative paper is listed in the references below).
The evimp
function uses ideas from Max Kuhn's caret
package
http://cran.r-project.org/web/packages/caret/index.html.
Users are encouraged to send feedback — use milbo AT sonic PERIOD net http://www.milbo.users.sonic.net.
References

The primary references are the Friedman papers.
Readers may find the MARS section in Hastie, Tibshirani,
and Friedman a more accessible introduction.
The Wikipedia article is recommended for an elementary introduction.
Faraway takes a hands-on approach,
using the ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
.)
Friedman and Silverman is recommended background reading for the MARS paper.
Earth's pruning pass uses the leaps
package which is based on
techniques in Miller.
Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141 http://www.salfordsystems.com/doc/MARS.pdf
Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, http://www-stat.stanford.edu/research/index.html
Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z
Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning http://www-stat.stanford.edu/~hastie/pub.htm
Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://users.bigpond.net.au/amiller
Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines
See also

Start with summary.earth
, plot.earth
,
plotmo
, and evimp
.
etitanic
evimp
format.earth
mars.to.earth
model.matrix.earth
ozone1
plot.earth.models
plot.earth
plotd
plotmo
predict.earth
residuals.earth
summary.earth
update.earth
Examples

    a <- earth(Volume ~ ., data = trees)
    summary(a, digits = 2, style = "pmax")
    # yields:
    # Call: earth(formula=Volume~., data=trees)
    #
    # Volume =
    #   23
    #   +  5.7 * pmax(0, Girth - 13)
    #   -  2.9 * pmax(0, 13 - Girth)
    #   + 0.72 * pmax(0, Height - 76)
    #
    # Selected 4 of 5 terms, and 2 of 2 predictors
    # Estimated importance: Girth Height
    # Number of terms at each degree of interaction: 1 3 (additive model)
    # GCV 11  RSS 213  GRSq 0.96  RSq 0.97