earth {earth}                                R Documentation

Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

Usage

## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"),
   data, weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
   na.action = na.fail, glm = NULL, trace = 0,
   keepxy = FALSE, nfold=0, stratify=TRUE, ...)

## Default S3 method:
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0,
    keepxy = FALSE, nfold=0, stratify=TRUE, ...)

## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, scale.y = (NCOL(y)==1), subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0,
    nk = max(21, 2 * ncol(x) + 1), degree = 1, 
    penalty = if(degree > 1) 3 else 2, thresh = 0.001, 
    minspan = 0, newvar.penalty = 0, fast.k = 20, fast.beta = 1, 
    linpreds = FALSE, allowed = NULL,
    pmethod = "backward", nprune = NULL, Object = NULL,
    Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
    Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE,
    Use.beta.cache = TRUE, ...)

Arguments

To start off, look at the arguments formula, data, x, y, nk, and degree. Many users will find that those arguments are all they need, plus in some cases keepxy, nprune, penalty, minspan, and trace. For GLM models, use the glm argument. For cross validation, use the nfold argument.

formula Model formula.
data Data frame for formula.
x Matrix or data frame containing the independent variables.
y Vector containing the response variable, or, in the case of multiple responses, a matrix or data frame whose columns are the values for each response.
subset Index vector specifying which cases to use, i.e., which rows in x to use. Default is NULL, meaning all.
weights Weight vector (not yet supported).
wp Vector of response weights. Default is NULL, meaning no response weights. If specified, wp must have an element for each column of y (after factors, if any, have been expanded).
Note for mda::mars users: earth's internal normalization of wp is different from mars. Earth uses wp<-sqrt(wp/mean(wp)) and mars uses wp<-sqrt(wp/sum(wp)). Thus in earth, a wp with all elements equal is equivalent to no wp. For models built with wp, multiply the GCV calculated by mars by length(wp) to compare it to earth's GCV.
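For example, a constant wp makes the difference visible (a standalone illustration, not earth's internal code):
wp <- c(2, 2, 2)
sqrt(wp / mean(wp))   # earth's normalization: all 1s, equivalent to no wp
sqrt(wp / sum(wp))    # mars's normalization: about 0.577 for each element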
scale.y Scale y in the forward pass for better numeric stability [added Jan 2009]. Scaling here means subtract the mean and divide by the standard deviation. Default is NCOL(y)==1, i.e., scale y unless y has multiple columns.
na.action NA action. Default is na.fail, and only na.fail is supported. (Why? Because adding support to earth for other NA actions is easy, but making sure that is handled correctly internally in predict, plotmo, etc. is tricky. It is more reliable to make the user remove NAs before calling earth.)
glm NULL (default) or a list of arguments to glm. See the "Generalized linear models" section below. Example:
earth(y~x, glm=list(family=binomial))
trace Trace earth's execution. Default is 0. Values:
0 none
.5 cross validation
1 overview
2 forward pass
3 pruning
4 model mats, memory use, more pruning, etc.
5 ...
keepxy Set to TRUE to retain the following in the returned value: x and y (or data), subset, and weights. Default is FALSE. The function update.earth and friends will use these instead of searching for them in the environment at the time update.earth is invoked. This argument also affects the amount of data kept when the nfold argument is used (see cv.list in the "Value" section below).

The following arguments are for the forward pass.
nk Maximum number of model terms before pruning. Includes the intercept. Default is max(21,2*NCOL(x)+1). The number of terms created by the forward pass will be less than nk if a forward stopping condition is reached before nk terms, or if the forward pass drops one side of a hinge pair to prevent linear dependencies. See the "Forward pass" section below.
degree Maximum degree of interaction (Friedman's mi). Default is 1, meaning build an additive model (i.e., no interaction terms).
penalty Generalized Cross Validation (GCV) penalty per knot. Default is if(degree>1) 3 else 2. A value of 0 penalizes only terms, not knots. The value -1 is treated specially to mean no penalty, so GCV=RSS/n. Theory suggests values in the range of about 2 to 4. In practice, for big data sets larger values can be useful to force a smaller model. The FAQ section below has some information on GCVs.
thresh Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate. See the "Forward pass" section below.
minspan Minimum distance between knots.
Use a value of 1 to consider all x values (which is good if the data are not noisy).
The default value is 0. The value 0 is treated specially and means calculate the minspan internally as per Friedman's MARS paper section 3.8 with alpha = 0.05. Set trace>=2 to see the calculated value.
This argument is intended to increase resistance to runs of noise in the input data. Higher values will increase smoothness in your model.
Note: predictor value extremes are ineligible for knots regardless of the minspan setting, as per the MARS paper equation 45.
newvar.penalty Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). Default is 0, meaning no penalty for adding a new variable. Useful non-zero values range from about 0.01 to 0.2 — you will need to experiment. This argument can mitigate the effects of collinearity or concurvity in the input data, but anecdotal evidence is that it does not work very well. If you know two variables are strongly correlated then you would do better to delete one of them before calling earth.
fast.k Maximum number of parent terms considered at each step of the forward pass. Friedman invented this parameter to speed up the forward pass (see the Fast MARS paper section 3.0). Default is 20. Values of 0 or less are treated specially (as being equivalent to infinity), meaning no Fast MARS. Typical values, apart from 0, are 20, 10, or 5. In general, with a lower fast.k (say 5), earth is faster; with a higher fast.k, or with fast.k disabled (set to 0), earth builds a better model. However it is not unusual to get a slightly better model with a lower fast.k, and you may need to experiment.
fast.beta Fast MARS aging coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.
linpreds Index vector specifying which predictors should enter linearly, as in lm.
The default is FALSE, meaning all predictors enter in the standard MARS fashion, i.e., in hinge functions.
A predictor's index in linpreds is the column number in the input matrix x after factors have been expanded. Examples are given in the FAQ section below.
Note: in the current implementation, the GCV penalty for predictors that enter linearly is the same as that for predictors with knots. That is not quite correct; linear terms should be penalized less.
allowed Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed.
Earth calls the allowed function just before adding a term in the forward pass. If allowed returns TRUE the term goes into the model as usual; if allowed returns FALSE the term is discarded. Examples are given in the FAQ section below.
Your allowed function should have the following prototype
allowed <- function(degree, pred, parents, namesx, first) {.....}
where
degree is the interaction degree of the candidate term. Will be 1 for additive terms.
pred is the index of the candidate predictor, i.e., the column number in the input matrix x after factors have been expanded.
parents is the candidate parent term's row in dirs.
namesx is optional and if present is the column names of x after factors have been expanded.
first is optional and if present is TRUE the first time your allowed function is invoked for the current model, and thereafter FALSE.
nfold Number of cross validation folds. Default is 0, i.e., no cross validation. If greater than 1, earth first builds a standard model as usual, with all the data. It then builds nfold cross validated models, measuring R-Squared on the left-out data each time. The final cross validation R-Squared is the mean of these R-Squareds. If a binomial or poisson model (see the glm argument), then further statistics are calculated. See the "Cross validation" section below for details.
stratify Only applies if nfold>1. Default is TRUE. Stratify the cross validation samples so that, for each column of the response y (after factors have been expanded), an approximately equal number of cases with a non-zero response occur in each cross validation subset. That means that if y is a factor, there will be approximately equal numbers of each factor level in each fold (see the "Factors" section below). We say "approximately equal" because the number of occurrences of a factor level may not be exactly divisible by the number of folds.

The following arguments are for the pruning pass.
pmethod Pruning method. Default is "backward". One of: backward none exhaustive forward seqrep. Use none to retain all the terms created by the forward pass. If y has multiple columns, then only backward or none is allowed. Pruning can take a while if exhaustive is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e., you have to kill your R session to interrupt; in Windows hit Ctrl-Alt-Delete or from the command line use tskill).
nprune Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to reduce exhaustive search time, or to enforce an upper bound on the model size.

The following arguments are for internal or advanced use.
Object Earth object to be updated, for use by update.earth.
Get.crit Criterion function for model selection during pruning. By default a function that returns the GCV. See the "Pruning pass" section below.
Eval.model.subsets Function to evaluate model subsets — see notes in source code.
Print.pruning.pass Function to print pruning pass results — see notes in source code.
Force.xtx.prune Default is FALSE. This argument pertains to subset evaluation in the pruning pass. By default, if y has a single column then earth calls the leaps routines; if y has multiple columns then earth calls EvalSubsetsUsingXtx. The leaps routines are more accurate but do not support multiple responses (leaps is based on the QR decomposition and EvalSubsetsUsingXtx is based on the inverse of X'X). Setting Force.xtx.prune=TRUE forces use of EvalSubsetsUsingXtx, even if y has a single column.
Use.beta.cache Default is TRUE. Using the "beta cache" takes more memory but is faster (by 20% and often much more for large models). The beta cache uses nk * nk * ncol(x) * sizeof(double) bytes. Set Use.beta.cache=FALSE to save memory. (The beta cache is an innovation in this implementation of MARS and does not appear in Friedman's papers. It is not related to the fast.beta argument.)
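For example, for a model with 100 expanded predictor columns and the default nk of 2*100+1 (an illustrative calculation only):
nk <- 201; npreds <- 100
nk * nk * npreds * 8 / 2^20   # about 31 MBytes for the beta cache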
... Dots are passed on to earth.fit.

Value

An object of class "earth" which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output from format.earth is a term). Term number 1 is always the intercept.

rss Residual sum-of-squares (RSS) of the model (summed over all responses if y has multiple columns).
rsq 1-rss/rss.null. R-Squared of the model (calculated over all responses). A measure of how well the model fits the training data. Note that rss.null is sum((y - mean(y))^2).
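For example, rsq can be reproduced by hand (a sketch for a single-response model on the trees data):
a <- earth(Volume ~ ., data = trees)
rss.null <- sum((trees$Volume - mean(trees$Volume))^2)
all.equal(a$rsq, 1 - a$rss / rss.null)   # TRUE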
gcv Generalized Cross Validation (GCV) of the model (summed over all responses). The GCV is calculated using the penalty argument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper and earth:::get.gcv.
grsq 1-gcv/gcv.null. An estimate of the predictive power of the model (calculated over all responses). Unlike rsq, in MARS models grsq can be negative. A negative grsq would indicate a severely over-parameterized model — a model that would not generalize well even though it may be a good fit to the training data. Watch the GRSq take a nose dive in this example:
earth(mpg~., data=mtcars, pmethod="none", trace=3)
bx Matrix of basis functions applied to x. Each column corresponds to a selected term. Each row corresponds to a row in the input matrix x, after taking subset. See model.matrix.earth for an example of bx handling. Example bx:
    (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
[1,]          1           0.0           4.6                   0
[2,]          1           0.0           4.3                   0
[3,]          1           0.0           4.1                   0
...
dirs Matrix with one row per MARS term, and with ij-th element equal to

0 if predictor j is not in term i
-1 if an expression of the form pmax(const - xj) is in term i
1 if an expression of the form pmax(xj - const) is in term i
2 if predictor j enters term i linearly.

This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example dirs:
                       Girth Height
(Intercept)                0  0 #intercept
h(Girth-12.9)              1  0 #2nd term uses Girth
h(12.9-Girth)             -1  0 #3rd term uses Girth
h(Girth-12.9)*h(Height-76) 1  1 #4th term uses Girth and Height
...
cuts Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Note for programmers: the precedent is to use dirs for term names etc. and to only use cuts where cut information is needed. Example cuts:
                           Girth Height
(Intercept)                    0   0  #intercept, no cuts
h(Girth-12.9)               12.9   0  #2nd term has cut at 12.9
h(12.9-Girth)               12.9   0  #3rd term has cut at 12.9
h(Girth-12.9)*h(Height-76)  12.9  76  #4th term has two cuts
...
selected.terms Vector of term numbers in the best model. Can be used as a row index vector into cuts and dirs. The first element selected.terms[1] is always 1, the intercept.
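Example of using selected.terms as a row index (assuming a fitted model a):
a <- earth(Volume ~ ., data = trees)
a$dirs[a$selected.terms, , drop=FALSE]   # directions of the selected terms only
a$cuts[a$selected.terms, , drop=FALSE]   # cut points of the selected terms only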
prune.terms A matrix specifying which terms appear in which pruning pass subsets. The row index of prune.terms is the model size. (The model size is the number of terms in the model. The intercept is considered to be a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus prune.terms is a lower triangular matrix, with dimensions nprune x nprune. The model selected by the pruning pass is at row number length(selected.terms). Example prune.terms:
[1,]    1  0  0  0  0  0  0 #intercept-only model
[2,]    1  2  0  0  0  0  0 #best 2 term model uses terms 1,2
[3,]    1  2  4  0  0  0  0 #best 3 term model uses terms 1,2,4
[4,]    1  2  6  9  0  0  0 #and so on
...
rss.per.response A vector of the RSS for each response. Length is the number of responses, i.e., ncol(y) after factors in y have been expanded. The rss component above is equal to sum(rss.per.response).
rsq.per.response A vector of the R-Squared for each response. Length is the number of responses.
gcv.per.response A vector of the GCV for each response. Length is the number of responses. The gcv component above is equal to sum(gcv.per.response).
grsq.per.response A vector of the GRSq for each response. Length is the number of responses.
rss.per.subset A vector of the RSS for each model subset generated by the pruning pass. Length is nprune. For multiple responses, the RSS is summed over all responses for each subset. The null RSS (i.e., the RSS of an intercept-only model) is rss.per.subset[1]. The rss above is rss.per.subset[length(selected.terms)].
gcv.per.subset A vector of the GCV for each model in prune.terms. Length is nprune. For multiple responses, the GCV is summed over all responses for each subset. The null GCV (i.e., the GCV of an intercept-only model) is gcv.per.subset[1]. The gcv above is gcv.per.subset[length(selected.terms)].
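These relationships can be checked directly (a sketch, assuming a fitted model a):
a <- earth(Volume ~ ., data = trees)
isel <- length(a$selected.terms)
all.equal(a$rss, a$rss.per.subset[isel], check.attributes=FALSE)   # TRUE
all.equal(a$gcv, a$gcv.per.subset[isel], check.attributes=FALSE)   # TRUE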
fitted.values Fitted values. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
residuals Residuals. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
coefficients Regression coefficients. A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the least squares coefficients from regressing that column of y on bx. The first row holds the intercept coefficient(s).
penalty The GCV penalty used during pruning. A copy of earth's penalty argument.
call The call used to invoke earth.
terms Model frame terms. This component exists only if the model was built using earth.formula.
namesx Column names of x, generated internally by earth when necessary so each column of x has a name. Used, for example, by predict.earth to name columns if necessary.
namesx.org Original column names of x.
levels Levels of y if y is a factor; c(FALSE,TRUE) if y is logical; else NULL.
wp Copy of the wp argument to earth.

The following fields appear only if earth's argument keepxy is TRUE.
x
y
data
subset
weights Copies of the corresponding arguments to earth. Only exist if keepxy=TRUE.

The following fields appear only if earth's glm argument is used.
glm.list List of GLM models. Each element is the value returned by earth's internal call to glm for each response.
Thus if there is a single response (or a single binomial pair, see the "Binomial pairs" section below) this will be a one element list and you access the GLM model with my.earth.model$glm.list[[1]].
glm.coefficients GLM regression coefficients. Analogous to the coefficients field described above but for the GLM model(s). A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the coefficients from the GLM regression of that column of y on bx. This duplicates, for convenience, information in glm.list.
glm.bpairs NULL unless there are paired binomial columns. A logical vector, derived internally by earth, or a copy of the bpairs specified by the user in the glm list. See the "Binomial pairs" section below.

The following fields appear only if the nfold argument is greater than 1.
cv.rsq.tab Matrix with nfold+1 rows and nresponse+1 columns, where nresponse is the number of responses, i.e., ncol(y) after factors in y have been expanded. The first nresponse elements of a row are the RSq's on the left-out data for each response of the model generated at that row's fold. The final column holds the row mean (a weighted mean if wp is specified). The final row of the table holds the column means. The values in this final row are the CV-RSqs printed by summary.earth.

Example for a single response model:
            y  mean
fold 1  0.909 0.909
fold 2  0.869 0.869
fold 3  0.952 0.952
fold 4  0.157 0.157
fold 5  0.961 0.961
mean    0.769 0.769
Example for a multiple response model:
          y1   y2    y3  mean
fold 1 0.915 0.951 0.944 0.937
fold 2 0.962 0.970 0.970 0.968
fold 3 0.914 0.940 0.942 0.932
fold 4 0.907 0.929 0.925 0.920
fold 5 0.947 0.987 0.979 0.971
mean   0.929 0.955 0.952 0.946
cv.maxerr.tab Like cv.rsq.tab but is the MaxErr at each fold. This is the signed max absolute value at each fold. Also, results are aggregated using the signed max absolute value instead of the mean. The signed max absolute value is the maximum of the absolute difference between the predicted and observed response values, multiplied by -1 if the sign of the difference is negative.
cv.deviance.tab Like cv.rsq.tab but is the MeanDev at each fold. Binomial models only.
cv.calib.int.tab Like cv.rsq.tab but is the CalibInt at each fold. Binomial models only.
cv.calib.slope.tab Like cv.rsq.tab but is the CalibSlope at each fold. Binomial models only.
cv.auc.tab Like cv.rsq.tab but is the AUC at each fold. Binomial models only.
cv.cor.tab Like cv.rsq.tab but is the cor at each fold. Poisson models only.
cv.nterms Vector of length nfold+1. Number of MARS terms in the model generated at each cross validation fold, with the final element being the mean of these.
cv.nvars Vector of length nfold+1. Number of predictors in the model generated at each cross validation fold, with the final element being the mean of these.
cv.groups Specifies which cases went into which folds. Vector of length equal to the number of cases, with elements taking values in 1:nfold.
cv.list List of earth models, one model for each fold. These fold models have extra fields cv.rsq and cv.rsq.per.response (and, if keepxy is set, also cv.test.y and cv.test.fitted.values). To save memory, lengthy fields in the fold models are removed, unless you use the keepxy argument. The "lengthy fields" are $bx, $fitted.values, and $residuals.
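Example of examining the cross validation fields (a sketch using the field names documented above):
set.seed(1)
a <- earth(Volume ~ ., data = trees, nfold = 5)
table(a$cv.groups)                           # number of cases in each fold
sapply(a$cv.list, function(mod) mod$cv.rsq)  # RSq on the left-out data at each fold
a$cv.rsq.tab                                 # per-fold and mean RSq's in tabular form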

Note

Contents

. Other implementations
. Limitations
. Multiple response models
. Generalized linear models
. Factors
. Binomial pairs
. The forward pass
. The pruning pass
. Execution time
. Memory use
. Cross validation
. Cross validating binomial and poisson models
. Using earth with fda and mda
. Migrating from mda::mars
. Standard model functions
. Frequently asked questions

Other implementations

The results are similar to, but not identical to, those of other Multivariate Adaptive Regression Splines implementations. The differences stem from the forward pass, where very small implementation differences (or perturbations of the input data) can cause rather different selections of terms and knots (although with similar GRSq's). The backward passes give identical or near-identical results, given the same forward pass results.

The source code of earth is derived from mars in the mda package written by Trevor Hastie and Robert Tibshirani. See also mars.to.earth.

The term "MARS" is trademarked and licensed exclusively to Salford Systems http://www.salfordsystems.com. Their implementation uses an engine written by Friedman and has some features not in earth.

StatSoft also have an implementation which they call MARSplines http://www.statsoft.com/textbook/stmars.html.

Limitations

The following aspects of MARS are mentioned in Friedman's papers but not implemented in earth:

i) Piecewise cubic models
ii) Model slicing (plotmo goes part way)
iii) Handling missing values
iv) Automatic grouping of categorical predictors into subsets
v) Fast MARS h parameter

Multiple response models

If y has k columns then earth builds k simultaneous models. (Note that y will have multiple columns if a factor in y is expanded by earth; see the "Factors" section below for details.) Each model has the same set of basis functions (the same bx, selected.terms, dirs and cuts) but different coefficients (the returned coefficients will have k columns). The models are built and pruned as usual but with the GCVs and RSSs averaged across all k responses.

Since earth attempts to optimize for all models simultaneously, the results will not be as "good" as building the models independently, i.e., the GCV of the combined model will usually not be as good as the GCVs for independently built models. However, the combined model may be a better model in other senses, depending on what you are trying to achieve. For example, it could be useful for earth to select the set of MARS terms that is best across all responses. This would typically be the case in a multiple response logistic model if some responses have a very small number of successes.

Note that automatic scaling of y (via the scale.y argument) does not take place if y has multiple columns. You may want to scale your y columns before calling earth so each y column gets the appropriate weight during model building (a y column with a big variance will influence the model more than a column with a small variance). You could do this by calling scale before invoking earth, or by setting the scale.y argument, or by using the wp argument.

Here are a couple of (artificial) examples to show some of the ways multiple responses can be specified. Note that data.frames can't be used on the left side of a formula, so cbind is used in the first example. The examples use the standard technique of specifying a tag lvol= to name a column.

earth(cbind(Volume,lvol=log(Volume)) ~ ., data=trees)
attach(trees)
earth(data.frame(Girth,Height), data.frame(Volume,lvol=log(Volume)))
Don't use a plus sign on the left side of the tilde. You might think that specifies a multiple response, but instead it arithmetically adds the columns.

For more details on using residual errors averaged over multiple responses see section 4.1 of Hastie, Tibshirani, and Buja Flexible Discriminant Analysis by Optimal Scoring, JASA, December 1994 http://www-stat.stanford.edu/~hastie/Papers/fda.pdf.

Generalized linear models

Earth builds a GLM model if the glm argument is specified. Earth builds the model as usual and then invokes glm on the MARS basis matrix bx.

In more detail, the model is built as follows. Earth first builds a standard MARS model, including the internal call to lm.fit on bx after the pruning pass. (See "The forward pass" and "The pruning pass" sections below). Thus knot positions and terms are determined as usual and all the standard fields in earth's return value will be present. Earth then invokes glm for the response on bx with the parameters specified in the glm argument to earth. For multiple response models (when y has multiple columns), the call to glm is repeated independently for each response. The results go into three extra fields in earth's return value: glm.list, glm.coefficients, and glm.bpairs.

Earth's internal call to glm is made with the glm arguments x, y, and model set to TRUE.

Use summary(my.model) as usual to see the model. Use summary(my.model, details=TRUE) to see more details, but note that the printed P-values of the GLM coefficients are meaningless. This is because of the amount of preprocessing by earth — the mantra is "variable selection overstates significance of the selected variables". Use plot(my.model$glm.list[[1]]) to plot the (first) glm model.

The examples below show how to specify earth-glm models. The examples are only to illustrate the syntax and not necessarily useful models. In the examples pmethod="none", otherwise with these artificial models earth tends to prune away everything except the intercept term. You wouldn't normally use pmethod="none". Also, trace=1, so if you run these examples you can see how earth expands the input matrices (as explained in the "Factors" and "Binomial pairs" sections below).

1. Two-level factor or logical response. The response is converted to a single column of 1s and 0s.

a1 <- earth(survived ~ ., data=etitanic, # c.f. Harrell "Reg Mod Strat" ch. 12
             degree=2, trace=1,
             glm=list(family=binomial))

a1a <- earth(etitanic[,-2], etitanic[,2], # equivalent but using earth.default
             degree=2, trace=1,
             glm=list(family=binomial))
2. Factor response. This example is for a factor with more than two levels. (For factors with just two levels, see the previous example.) The factor pclass is expanded to three indicator columns (whereas in a direct call to glm, pclass would be treated as logical: the first level versus all other levels).
a2 <- earth(pclass ~ ., data=etitanic, trace=1,
            glm=list(family=binomial))
3. Binomial model specified with a column pair. This is a single response model but specified with a pair of columns: see the "Binomial pairs" section below. For variety, this example uses a probit link and (unnecessarily) increases maxit.
ldose <- rep(0:5, 2) - 2 # V&R 4th ed. p. 191
sex <- factor(rep(c("male", "female"), times=c(6,6)))
numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)
pair <- cbind(numdead, numalive=20 - numdead)
a3 <- earth(pair ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family=binomial(link=probit), maxit=100))
4. Double binomial response (i.e., a multiple response model) specified with two column pairs.
numdead2 <- c(2,8,11,12,20,23,0,4,6,16,12,14) # bogus data
doublepair <- cbind(numdead, numalive=20-numdead,
                    numdead2=numdead2, numalive2=30-numdead2)
a4 <- earth(doublepair ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family="binomial"))
5. Poisson model.
counts <- c(18,17,15,20,10,20,25,13,12) # Dobson 1990 p. 93
outcome <- gl(3,1,9)
treatment <- gl(3,3)
a5 <- earth(counts ~ outcome + treatment, trace=1, pmethod="none",
            glm=list(family=poisson))
6. Standard earth model, the long way.
a6 <- earth(numdead ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family=gaussian(link=identity)))
print(a6$coefficients == a6$glm.coefficients)  # all TRUE

Factors

Factors in x: Earth treats factors in x in the same way as standard R models such as lm (where x is taken to mean the right hand side of the formula). Thus factors are expanded using the current setting of options("contrasts").

Factors in y: Earth treats factors in the response in a non-standard way that makes use of earth's ability to handle multiple responses. A two level factor (or logical) is converted to a single indicator column of 1s and 0s. A factor with three or more levels is converted into k indicator columns of 1s and 0s, where k is the number of levels (the contrasts matrix is diagonal, see contr.earth.response). This happens regardless of the options("contrasts") setting and regardless of whether the factors are ordered or unordered. For example, if a column in y is a factor with levels A, B, and C, the column will be expanded to three columns like this (the actual data will vary but each row will have a single 1):

A B C  # one column for each factor level
0 1 0  # each row has a single 1
1 0 0
0 0 1
0 0 1
0 0 1
...
In contrast, a standard treatment contrast on the rhs of a model with an intercept would have no first "A" column (to prevent linear dependencies on the rhs of the model formula).

This expansion to multiple columns (which only happens for factors with more than two levels) means that earth will build a multiple response model as described in the "Multiple response models" section above.

Paired binomial response columns in y are treated specially — see the "Binomial pairs" section below.

Use trace=1 or higher to see the column names of the x and y matrices after factor processing. Use trace=4 to see the first few rows of x and y after factor processing.

Here is an example which uses the etitanic data to predict the passenger class (not necessarily a sensible thing to do but provides a good example here):

> data(etitanic)
> head(etitanic) # pclass and sex are unordered factors

  pclass survived    sex    age sibsp parch
1    1st        1 female 29.000     0     0
2    1st        1   male  0.917     1     2
3    1st        0 female  2.000     1     2

> earth(pclass ~ ., data=etitanic, trace=1) # note col names in x and y below

x is a 1046 by 5 matrix: 1=survived, 2=sexmale, 3=age, 4=sibsp, 5=parch
y is a 1046 by 3 matrix: 1=1st, 2=2nd, 3=3rd
rest not shown here...

Binomial pairs

This section is only relevant if you use earth's glm argument with a binomial or quasibinomial family.

Users of the glm function will be familiar with the technique of specifying a binomial response as a two-column matrix, with a column for the number of successes and a column for the failures. Earth automatically detects when such columns are present in y (by looking for adjacent columns which both have entries greater than 1). The first column only is used to build the standard earth model. Both columns are then passed to earth's internal call to glm. As always, use trace=1 to see how the columns of x and y are expanded.

You can override this automatic detection by including a bpairs parameter. This is usually (always?) unnecessary. For example

glm=list(family=binomial, bpairs=c(TRUE, FALSE))
specifies that there are two columns in the response with the second paired with the first. These examples
glm=list(family=binomial, bpairs=c(TRUE, FALSE, TRUE, FALSE))
glm=list(family=binomial, bpairs=c(1,3)) # equivalent
specify that the 1st and 2nd columns are a binomial pair and the 3rd and 4th columns another binomial pair.

The forward pass

Understanding the details of the forward and pruning passes will help you understand earth's return value and the admittedly large number of arguments. The result of the forward pass is the MARS basis matrix bx and the set of terms defined by dirs and cuts (these are all fields in earth's return value, but the bx here includes all terms before trimming back to selected.terms).

The forward pass adds terms in pairs until the first of the following conditions is met:

i) reach maximum number of terms (nterms >= nk)
ii) reach DeltaRSq threshold (DeltaRSq < thresh), where DeltaRSq is the difference in R-Squared caused by adding the current term pair, and thresh is the argument to earth
iii) reach max RSq (RSq > 1-thresh)
iv) reach min GRSq (GRSq < -10) (-10 is a pathologically bad GRSq)
v) no new term increases the RSq (reached numerical limits).

Set trace>=1 to see the stopping condition and trace>=2 to trace the forward pass.

You can disable all termination conditions except (i) and (v) by setting thresh=0. See the FAQ below "Why do I get fewer terms than nk?".

Note that GCVs (via GRSq) are used during the forward pass only as one of the (more unusual) stopping conditions and in trace prints. Changing the penalty argument does not change the knot positions.

The various stopping conditions mean that the actual number of terms created by the forward pass may be less than nk. There are other reasons why the actual number of terms may be less than nk: (i) the forward pass discards one side of a term pair if it adds nothing to the model — but the forward pass counts terms as if they were actually created in pairs, and, (ii) as a final step, the forward pass deletes linearly dependent terms, if any, so all terms in dirs and cuts are independent. And remember that the pruning pass will further discard terms.

The pruning pass

The pruning pass is handed the sets of terms created by the forward pass. Its job is to find the subset of those terms that gives the lowest GCV. The following description of the pruning pass explains how various fields in earth's returned value are generated.

The pruning pass works like this: it determines the subset of terms in bx (using pmethod) with the lowest RSS for each model size in 1:nprune (see the Force.xtx.prune argument above for some details). It saves the RSS and term numbers for each such subset in rss.per.subset and prune.terms. It then applies the Get.crit function with penalty to rss.per.subset to yield gcv.per.subset. Finally it chooses the model with the lowest value in gcv.per.subset, puts its term numbers into selected.terms, and updates bx by keeping only the selected.terms.

After the pruning pass, earth runs lm.fit to determine the fitted.values, residuals, and coefficients, by regressing the response y on bx. If y has multiple columns then lm.fit is called for each column.

If a glm argument is passed to earth, earth runs glm on (each column of) y in addition to the above call to lm.fit.

Set trace>=3 to trace the pruning pass.

By default Get.crit is earth:::get.gcv. Alternative Get.crit functions can be defined. See the source code of get.gcv for an example.
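For reference, here is a standalone sketch of the standard MARS GCV formula (equation 30 in the MARS paper). It is an illustration, not earth's actual code: it charges the penalty for one knot per pair of non-intercept terms, and it does not handle the special case penalty=-1 (where GCV=RSS/n).
gcv.sketch <- function(rss, nterms, penalty, ncases)
{
    # nterms includes the intercept
    enp <- nterms + penalty * (nterms - 1) / 2   # effective number of parameters
    rss / (ncases * (1 - enp / ncases)^2)
}
a <- earth(Volume ~ ., data = trees)
c(a$gcv, gcv.sketch(a$rss, length(a$selected.terms), a$penalty, nrow(trees)))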

Execution time: "I wanna go fast"

For a given set of input data, the following can increase the speed of the forward pass:

i) decreasing fast.k
ii) decreasing nk (because fewer forward pass terms)
iii) decreasing degree
iv) increasing thresh (because fewer forward pass terms)
v) increasing minspan.

The backward pass is normally much faster than the forward pass, unless pmethod="exhaustive". Reducing nprune reduces exhaustive search time. One strategy is to first build a large model and then adjust pruning parameters such as nprune using update.earth.

The following very rough rules of thumb apply for large models. Using minspan=1 instead of the default 0 will increase times by 20 to 50%. Using fast.k=5 instead of the default 20 can give substantial speed gains but will sometimes give a much smaller GRSq. Using an allowed function slows down model building by about 10%.

Memory use

Earth does not impose specific limits on the model size. Model size is limited only by the amount of memory on your system, the maximum memory addressable by R, and your patience. On a 32 bit machine with x and y of type double (no factors), the number of bytes of memory used by earth is about

8 * (nk^2 * ncol(x) + (nrow(x) * (3 + 2*nk + ncol(x)/2))).
Earth prints the results of the above calculation if trace>=4. Memory use peaks in the forward pass. The bulk of the forward pass is implemented in C. It allocates memory "outside of R" and so memory.size will not report the memory it uses.
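For example, plugging a 1e4 by 100 double matrix and nk=21 into the formula (an illustrative calculation only):
nr <- 1e4; nc <- 100; nk <- 21
8 * (nk^2 * nc + nr * (3 + 2*nk + nc/2)) / 2^20   # roughly 7.6 MBytes
# R's own allocations come on top of this, as described below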

Before calling earth, R itself will of course allocate memory over and above the amount calculated above. To reduce total memory usage, it sometimes helps to remove variables and call gc before invoking earth.

Earth uses more memory if any elements of the x and y arguments are not double, because it must convert them to double internally. The same applies if the subset argument is used. Earth uses more memory if trace>=2 (because DUP=TRUE is required to pass predictor names to earth's internal call to .C). Increasing the degree does not change the memory requirement but greatly increases the running time.

Here is an example of memory use: the earth test suite builds a model using earth.default with a 1e4 by 100 input matrix with nk=21. The Windows XP task manager reports that the peak memory use when building this model is 47 MBytes. Using the formula interface to earth pushes memory to 62 MBytes. Increasing the number of rows in the input matrix to 1e5 pushes memory to 240 MBytes.

Cross validation

Use cross validation to get an estimate of RSq on independent data. Example (note the nfold parameter):

a <- earth(survived ~ ., data=etitanic, degree=2, nfold=10)
summary(a) # note the CV-RSq field
Cross validation is done if nfold is greater than 1 (typically 10). Earth first builds a standard model with all the data as usual. This means that all the standard fields in earth's return value appear as usual. Earth then builds nfold cross validated models. It measures RSq on the test data (i.e., the left-out data) for each fold. The final cross validation RSq is the mean of these RSq's. Use summary.earth to see this final value and its standard deviation across cross validation folds.

The cross validation results go into extra fields in earth's return value. All of these have a cv prefix — see the "Value" section above for details.

For multiple response models, at each fold earth calculates the RSq for each response independently, and combines these by taking their mean (or weighted mean if the wp argument is used).

With trace=.5 or higher, earth prints out progress information as cross validation proceeds. For example

CV fold 3: CV-RSq 0.622  ntrain-nz 384 ntest-nz 43
shows that for cross validation fold number 3, the RSq on the test set (i.e., the left-out data) is 0.622. The printout also shows the number of non-zero values in the observed response in the fold's training set and test set. This is useful if you have a binary or factor response and want to check that you have enough examples of each factor level in each fold. With the stratify argument (which is enabled by default), earth attempts to keep the numbers of each level constant across folds.

For reproducibility, call set.seed before calling earth with nfold.

Cross validating binomial and poisson models

If you cross validate a binomial or poisson model (specified using earth's nfold and glm arguments), earth returns the following additional statistics. Each of these is measured on the test set for each fold, and averaged across all folds (except that the signed max absolute value, instead of the average, is used for MaxErr). Use summary.earth to see these statistics and their standard deviation across folds.

CV-RSq cross validated R-Squared, identical to CV-RSq for non-glm models

MaxErr signed max absolute difference between the predicted and observed response. This is the maximum of the absolute differences between the predicted and observed response values, multiplied by -1 if the sign of the difference is negative (a sketch in R is given after this list).

MeanDev deviance divided by the number of cases (i.e., the mean deviance)

CalibInt CalibSlope calibration intercept and slope (from regressing the observed response on the predicted response)

AUC (binomial models only) area under the ROC curve

cor (poisson models only) correlation between the predicted and observed response

See the source code in earth.cv.R for details.
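The "signed max absolute value" used for MaxErr can be sketched in R as follows (an illustration, not the actual code in earth.cv.R):
signed.max.abs <- function(observed, predicted)
{
    d <- predicted - observed
    d[which.max(abs(d))]   # the largest-magnitude difference, keeping its sign
}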

For multiple response models, at each fold earth calculates these statistics for each response independently, and combines them by taking their mean, or weighted mean if the wp argument is used (but takes the signed max absolute value instead of the mean for MaxErr) [TODO should do the same for CalibInt, CalibSlope?]. Taking the mean is a rather dubious way of combining results from what are essentially quite different models, but can nevertheless be useful.

Explanations of the above statistics can be found in the following (and many other) references:

T. Fawcett (2004) ROC Graphs: Notes and Practical Considerations for Researchers. Revised version of Technical report HP Laboratories. http://home.comcast.net/~tom.fawcett/public_html/papers

J. Pearce and S. Ferrier (2000) Evaluating the predictive performance of habitat models developed using logistic regression

F. Harrell (2001) Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS

Using earth with fda and mda

Earth can be used with fda and mda in the mda package. Earth will generate a multiple response model. Use the fda/mda argument keep.fitted=TRUE if you want to call plot.earth later (actually only necessary for large datasets, see the description of keep.fitted in fda). Use the earth argument keepxy=TRUE if you want to call update.earth or plotmo later. Example:

library(mda)
(a <- fda(Species~., data=iris, keep.fitted=TRUE, method=earth, keepxy=TRUE))
plot(a)
summary(a$fit)  # examine earth model embedded in fda model
plot(a$fit)
plotmo(a$fit, ycolumn=1, ylim=c(-1.5,1.5), clip=FALSE)
plotmo(a$fit, ycolumn=2, ylim=c(-1.5,1.5), clip=FALSE)

Migrating from mda::mars

Changing code from mda::mars to earth is usually just a matter of changing the call from "mars" to "earth". But there are a few argument differences and earth will issue a warning if you give it a mars-only argument.

The resulting model will be similar but not identical because of small implementation differences which are magnified by the inherent instability of the MARS forward pass.

If you are further processing the output of earth you will need to consider differences in the returned value. The header of the source file mars.to.earth.R describes these. Perhaps the most important is that mars returns the MARS basis matrix in a field called "x" whereas earth returns "bx". Also, earth returns "dirs" rather than "factors", and in earth this matrix can have entries of value 2 for linear predictors.

See also mars.to.earth.

Standard model functions

Standard model functions such as case.names are provided for earth objects and are not explicitly documented. Many of these give warnings when the results are not what you may expect. Pass warn=FALSE to these functions to turn off just these warnings.

FREQUENTLY ASKED QUESTIONS

What are your plans for earth?

We would like to add support for case weights (to allow boosting), but that won't happen anytime soon.

How can I establish variable importance?

Use the evimp function. See its help page for more details.

The summary.earth function lists the predictors in order of estimated importance using the nsubsets criterion of evimp.

Which predictors were added to the model first?

You can see the forward pass adding terms with trace=2 or higher. But remember, pruning will usually remove some of the terms. You can also use

summary(my.model, decomp="none")
which will list the basis functions remaining after pruning, in the order they were added by the forward pass.

Which predictors are actually in the model?

The following function will give a list of predictors in the model:

get.used.pred.names <- function(obj) # obj is an earth object
{
  any1 <- function(x) any(x != 0)    # like any but no warning if x is double
  names(which(apply(obj$dirs[obj$selected.terms,,drop=FALSE],2,any1)))
}
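For example, using the ozone1 data that comes with the earth package:
a <- earth(O3 ~ ., data = ozone1, degree = 2)
get.used.pred.names(a)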
How can I train on one set of data and test on another?

The example below demonstrates one way to train on 80% of the data and test on the remaining 20%.

train.subset <- sample(1:nrow(trees), .8 * nrow(trees))
test.subset <- (1:nrow(trees))[-train.subset]
a <- earth(Volume ~ ., data = trees[train.subset, ])
yhat <- predict(a, newdata = trees[test.subset, ])
y <- trees$Volume[test.subset]
print(1 - sum((y - yhat)^2) / sum((y - mean(y))^2)) # print R-Squared
In practice a dataset larger than the one in the example should be used for splitting. The model variance is too high with this small set — run the example a few times to see how the model changes as sample splits the dataset differently on each run. Also, remember that the test set should not be used for parameter tuning because you will be optimizing for the test set — instead use GCVs, separate parameter selection sets, or techniques such as cross-validation.

Why do I get fewer terms than nk, even with pmethod="none"?

There are several conditions that can terminate the forward pass, and reaching nk is just one of them. See the "Forward pass" section above.

Setting earth's argument thresh to zero is treated as a special case: thresh=0 disables all termination conditions except nk and conditions involving numerical limits. With thresh=0, the measured GRSq (and thus the efficacy of the pruning pass) should be treated with skepticism, especially if you get the warning
effective number of GCV parameters >= number of cases.

Why do I get fewer terms than nprune?

The pruning pass selects a model with the lowest GCV that has nprune or fewer terms. Thus the nprune argument specifies the maximum number of permissible terms in the final pruned model.

You can work around this: you will get exactly nprune terms if you specify pmethod="none". Compare the output of these two examples:

earth(Volume ~ ., data = trees, trace=3)
earth(Volume ~ ., data = trees, trace=3, pmethod="none")
Another way to get exactly nprune terms is to specify penalty = -1. This special value of penalty causes earth to set the GCV to RSS/nrow(x). Since the training RSS always decreases with more terms, the pruning pass will choose the maximum allowable number of terms. An example:
earth(Volume ~ ., data = trees, trace=3, penalty=-1)

Is it best to hold down model size with nk or nprune?

If you want the best possible small model, build a big model (by specifying a big nk) and prune it back (by specifying a small nprune). This is better than directly building a small model by specifying a small nk, because the pruning pass can look at all the terms whereas the forward pass can only see one term ahead. However, it is much faster building a small model by specifying a small nk.

Can you give an example of the linpreds argument?

With linpreds you can specify which predictors should enter linearly, instead of in hinge functions. The linpreds argument does not stipulate that a predictor must enter the model, only that if it enters it should enter linearly. Starting with

a1 <- earth(Volume ~ ., data = trees)
plotmo(a1)
we see in the plotmo graphs or by running evimp that Height is not as important as Girth. For corroborating evidence that Girth is a more reliable indicator of Volume you can use pairs:
pairs(trees, panel = panel.smooth)
Since we want the simplest model that describes the data, we can specify that Height should enter linearly:
a2 <- earth(Volume ~ ., data = trees, linpreds = 2)  # 2 is Height column
summary(a2)
which yields
Expression:
  -7.41
  + 0.418 * Height
  +  5.86 * pmax(0,  Girth -   12.9)
  -  2.41 * pmax(0,   12.9 -  Girth)
In this example, the second simpler model has almost the same RSS as the first model. We can make both Girth and Height enter linearly with
a3 <- earth(Volume ~ ., data = trees, linpreds = c(1,2))
or with (the single TRUE is recycled to the length of linpreds)
a4 <- earth(Volume ~ ., data = trees, linpreds = TRUE)
But specifying that all predictors should enter linearly is not really a useful thing to do. In our simple example, the all-linear MARS model is the same as a standard linear model
a5 <- lm(Volume ~ ., data = trees)
(compare the summary for each) but in general that will not be true. Earth will not include a linear predictor if that predictor does not improve the model.

Can you give an example of the allowed argument?

You can specify how variables are allowed to enter MARS terms with the allowed argument.

The interface is flexible but requires a bit of programming. We start with a simple example, which completely excludes one predictor from the model:

example1  <- function(degree, pred, parents)   # returns TRUE if allowed
{
    pred != 2  # disallow predictor 2, which is "Height"
}
a1 <- earth(Volume ~ ., data = trees, allowed = example1)
print(summary(a1))
But that's not much use, because it's simpler to exclude the predictor from the input matrix when invoking earth:
a2 <- earth(Volume ~ . - Height, data = trees)
The example below is more useful. It prevents the specified predictor from being used in interaction terms. (The example is artificial because it's unlikely you would want to single out humidity from interactions in the ozone data.)

The parents argument is the candidate parent's row in the dirs matrix (dirs is described in the "Value" section above). Each entry of parents is 0, 1, -1, or 2, and you index parents on the predictor index. Thus parents[pred] is 0 if pred is not in the parent term.

example2 <- function(degree, pred, parents)
{
    # disallow humidity in terms of degree > 1
    # 3 is the "humidity" column in the input matrix
    if (degree > 1 && (pred == 3 || parents[3]))
        return(FALSE)
    TRUE
}
a3 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example2)
print(summary(a3))
The following example allows only the specified predictors in interaction terms:
example3 <- function(degree, pred, parents)
{
    # allow only humidity and temp in terms of degree > 1
    # 3 and 4 are the "humidity" and "temp" columns
    allowed.set = c(3,4)
    if (degree > 1 &&
           (all(pred != allowed.set) || any(parents[-allowed.set])))
        return(FALSE)
    TRUE
}
a4 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example3)
print(summary(a4))
The basic MARS model building strategy is always applied even when there is an allowed function. For example, earth considers a term for addition only if all factors of that term except the new one are already in a model term. This means that an allowed function that inhibits, say, all degree 2 terms will also effectively inhibit higher degrees too, because there will be no degree 2 terms for earth to extend to degree 3.

You can expect model building to be about 10% slower with an allowed function because of the time taken to invoke the allowed function.

Using predictor names instead of indices in the "allowed" function. You can use predictor names instead of indices using the optional namesx argument. If present, namesx is the column names of x after factors have been expanded. The first example above (the one that disallows Height) can be rewritten as

example1a  <- function(degree, pred, parents, namesx)
{
    namesx[pred] != "Height"
}
Comparing strings is inefficient and the above example can be rewritten a little more efficiently using the optional first argument. If present, this is TRUE the first time your allowed function is called for the current model and thereafter FALSE.
iheight <- 0    # column index of "Height"

example1b  <- function(degree, pred, parents, namesx, first)
{
    if (first) {
        # first time this function is invoked, so
        # stash column index of "Height" in iheight
        iheight <<- which(namesx == "Height")   # note use of <<- not <-
        if (length(iheight) != 1)
            stop("no Height in ", paste(namesx, collapse=" "))
    }
    pred != iheight
}
How does summary.earth order terms?

With decomp="none", the terms are ordered as created by the forward pass.

With the default decomp="anova", the terms are ordered in increasing order of interaction. In detail:
(i) terms are sorted first on degree of interaction
(ii) then terms with a linpreds linear factor before standard terms
(iii) then on the predictors (in the order of the columns in the input matrix)
(iv) and finally on increasing knot values.

It's actually earth:::reorder.earth that does the ordering.

summary.earth lists predictors with weird names that aren't in x. What gives?

You probably have factors in your x matrix, and earth is applying contrasts. See the "Factors" section above.

Why pmax and not max in the output from summary.earth (with style="pmax")?

With pmax the earth equation is an R expression that can handle multiple cases. Thus the expression is consistent with the way predict.earth works — you can give predict multiple cases (i.e., multiple rows in the input matrix) and it will return a vector of predicted values.

What about boosting MARS?

If you want to boost, use boosted trees rather than boosted MARS — you will get better results.

More precisely, although gradient boosted MARS gives better results than plain MARS, if you would like to improve prediction performance (at the cost of a more complicated and less interpretable model) you will usually get better results with boosted trees (via, say, the gbm package) than with boosted MARS. See Gillian Ward (2007) Statistics in Ecological Modeling: Presence-Only Data and Boosted Mars (Doctoral Thesis) http://www-stat.stanford.edu/~hastie/THESES/Gill_Ward.pdf.

This could change as the state of the art advances.

What about bagging MARS?

The caret package provides functions for bagging MARS (and for parameter selection).

What is a GCV, in simple terms?

GCVs are important for MARS because the pruning pass uses GCVs to evaluate model subsets.

In general terms, when testing a model (not necessarily a MARS model) we want to test generalization performance and so want to measure error on independent data, i.e., not on the training data. Often a decent set of independent data is unavailable and so we resort to cross validation or leave-one-out methods. But that can be painfully slow. As an alternative, for certain forms of model we can use a formula to approximate the error that would be determined by leave-one-out validation — that approximation is the GCV. The formula adjusts (i.e., increases) the training RSS to take into account the flexibility of the model. Summarizing, the GCV approximates the RSS (divided by the number of cases) that would be measured on independent data. Even when the approximation is not that good, it is usually good enough for comparing models during pruning.

GCVs were introduced by Craven and Wahba, and extended by Friedman for MARS. See Hastie et al. p216 and the Friedman MARS paper. GCV stands for "Generalized Cross Validation", a perhaps misleading term.

The GRSq measure used in the earth package standardizes the raw GCV, in the same way that R-Squared standardizes the RSS.
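This relationship can be checked on a fitted model (a sketch using fields documented in the "Value" section above):
a <- earth(Volume ~ ., data = trees)
gcv.null <- a$gcv.per.subset[1]   # GCV of the intercept-only model
all.equal(a$grsq, 1 - a$gcv / gcv.null, check.attributes=FALSE)   # TRUE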

If GCVs are so important, why don't linear models use them?

First a few words about overfitting. An overfit model fits the training data well but will not give good predictions on new data. The idea is that the training data captures the underlying structure in the system being modeled plus noise. We want to model the underlying structure and ignore the noise. An overfit model models the specific realization of noise in the training data and thus is too specific to the training data.

The more flexible a model, the more its propensity to overfit the training data. Linear models are constrained, with usually only a few parameters, and don't have the tendency to overfit like more flexible models such as MARS. This means that for linear models, the RSS on the data used to build the model is usually an adequate measure of generalization ability.

This is no longer true if you do automatic variable selection on linear models, because the process of selecting variables increases the flexibility of the model. Hence the AIC — as used in, say, drop1. The GCV, AIC, and friends are means to the same end. Depending on what information is available during model building, we use one of these statistics to estimate model generalization performance for the purpose of selecting a model.

What happened to get.nterms.per.degree, get.nused.preds.per.subset, and reorder.earth?

From release 1.3.0, some earth functions are no longer public, to help simplify the user interface. The functions are still available (and stable) if you need them — use for example earth:::reorder.earth().

What happened to the ppenalty argument?

This was removed (release 1.3.1) because it is no longer needed. The ponly argument of update.earth is a more flexible way of achieving the same end.

Author(s)

Stephen Milborrow, derived from mda::mars by Trevor Hastie and Robert Tibshirani.

The approach used for GLMs was motivated by work done by Jane Elith and John Leathwick (a representative paper is listed in the references below).

The evimp function uses ideas from Max Kuhn's caret package http://cran.r-project.org/web/packages/caret/index.html.

Users are encouraged to send feedback — use milbo AT sonic PERIOD net http://www.milbo.users.sonic.net.

References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. The Wikipedia article is recommended for an elementary introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x.) Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses the leaps package which is based on techniques in Miller.

Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23

Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141 http://www.salfordsystems.com/doc/MARS.pdf

Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, http://www-stat.stanford.edu/research/index.html

Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z

Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning http://www-stat.stanford.edu/~hastie/pub.htm

Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html

Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://users.bigpond.net.au/amiller

Wikipedia article on MARS http://en.wikipedia.org/wiki/Multivariate_adaptive_regression_splines

See Also

Start with summary.earth, plot.earth, plotmo, and evimp.

etitanic
evimp
format.earth
mars.to.earth
model.matrix.earth
ozone1
plot.earth.models
plot.earth
plotd
plotmo
predict.earth
residuals.earth
summary.earth
update.earth

Examples

a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2, style = "pmax")

# yields:
#    Call: earth(formula=Volume~., data=trees)
#
#    Volume =
#      23
#      +  5.7 * pmax(0,  Girth -     13)
#      -  2.9 * pmax(0,     13 -  Girth)
#      + 0.72 * pmax(0, Height -     76)
#
#    Selected 4 of 5 terms, and 2 of 2 predictors
#    Estimated importance: Girth Height
#    Number of terms at each degree of interaction: 1 3 (additive model)
#    GCV 11    RSS 213    GRSq 0.96    RSq 0.97
