earth {earth}R Documentation

Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers "Multivariate Adaptive Regression Splines" and "Fast MARS".

Usage

## S3 method for class 'formula':
earth(formula = stop("no 'formula' arg"),
   data, weights = NULL, wp = NULL, subset = NULL,
   na.action = na.fail, glm = NULL, trace = 0, keepxy = FALSE, ...)

## Default S3 method:
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0, keepxy = FALSE, ...)

## S3 method for class 'fit':
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
    weights = NULL, wp = NULL, subset = NULL,
    na.action = na.fail, glm = NULL, trace = 0,
    penalty = if(degree > 1) 3 else 2, nk = max(21, 2 * ncol(x) + 1),
    degree = 1, linpreds = FALSE, allowed = NULL, thresh = 0.001,
    minspan = 0, newvar.penalty = 0, fast.k = 20, fast.beta = 1,
    pmethod = "backward", nprune = NULL, Object = NULL,
    Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets,
    Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE,
    Use.beta.cache = TRUE, ...)

Arguments

All arguments are optional except formula, or x and y. NAs are not allowed.

To start off, look at the arguments formula, x, y, nk, degree and trace. Many users will find that those arguments are all they need, plus in some cases keepxy, nprune, penalty, and minspan. Users who are interested in GLM models will want to look at the glm argument.

formula Model formula.
data Data frame for formula.
x Matrix or dataframe containing the independent variables.
y Vector containing the response variable, or, in the case of multiple responses, a matrix or dataframe whose columns are the response values for each variable. The "Factors" section below explains how factors are expanded in y. If the y values are very big or very small you may get better results if you scale y first for better numerical stability in the forward pass.
subset Index vector specifying which cases to use i.e. which rows in x to use. Default is NULL, meaning all.
weights Weight vector (not yet supported).
wp Vector of response weights. Default is NULL, meaning no response weights. If specified, wp must have an element for each column of y after factors have been expanded.
Note for mda::mars users: earth's internal normalisation of wp is different from mda::mars. Earth uses wp<-sqrt(wp/mean(wp)) and mda::mars uses wp<-sqrt(wp/sum(wp)). Thus in earth, a wp with all elements equal is equivalent to no wp. For models built with wp, multiply the GCV calculated by mda::mars by length(wp) to compare it to earth's GCV.
na.action NA action. Default is na.fail, and only na.fail is supported.
glm NULL (default) or a list of arguments to glm. See the "Generalised linear models" section below. Example:
earth(y~x, glm=list(family=binomial))
trace Trace earth's execution. Default is 0. Values:
0 none 1 overview 2 forward pass 3 pruning 4 model mats, memory use, more pruning, etc. 5 ...
keepxy Set to TRUE to retain the following in the returned value: x and y (or data), subset, weights, and wp. Default is FALSE.
update.earth and friends will use these instead of searching for them in the environment at the time update.earth is invoked.

The following arguments are for the forward pass.
penalty Generalised Cross Validation (GCV) penalty per knot. Default is if(degree>1) 3 else 2. A value of 0 penalises only terms, not knots. The value -1 is a special case, meaning no penalty, so GCV=RSS/n. Theory suggests values in the range of about 2 to 4. In practice, for big data sets larger values can be useful to force a smaller model. The FAQ sections below has some information on GCVs.
nk Maximum number of model terms before pruning. Includes the intercept. Default is max(21,2*NCOL(x)+1). The number of terms created by the forward pass will be less than nk if there are linearly dependent terms which must be discarded, or if a forward stopping condition is reached before nk terms. See the section below on the forward pass.
degree Maximum degree of interaction (Friedman's mi). Default is 1, meaning build an additive model (i.e. no interaction terms).
linpreds Index vector specifying which predictors should enter linearly, as in lm.
The default is FALSE, meaning all predictors enter in the standard MARS fashion i.e. in hinge functions.
A predictor's index in linpreds is the column number in the input matrix x after factors have been expanded. Examples are given in the FAQ section below.
Note: in the current implementation, the GCV penalty for predictors that enter linearly is the same as that for predictors with knots. That is not quite correct; linear terms should be penalised less.
allowed Function specifying which predictors can interact and how. Default is NULL, meaning all standard MARS terms are allowed.
Earth calls the allowed function just before adding a term in the forward pass. If allowed returns TRUE the term goes into the model as usual; if allowed returns FALSE the term is discarded. Examples are given in the FAQ section below.
Your allowed function should have the following prototype
allowed <- function(degree, pred, parents, namesx, first) {.....}
where
degree is the interaction degree of the candidate term. Will be 1 for additive terms.
pred is the index of the candidate predictor. A predictor's index in linpreds is the column number in the input matrix x after factors have been expanded. parents is the candidate parent term's row in dirs.
namesx is optional and if present is the column names of x after factors have been expanded.
first is optional and if present is TRUE the first time your allowed function is invoked for the current model; and thereafter FALSE.
thresh Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate. See the section below on the forward pass.
minspan Minimum distance between knots.
Use a value of 1 to consider all x values (which is good if the data are not noisy).
The default value is 0. The value 0 is treated specially and means calculate the minspan internally as per Friedman's MARS paper section 3.8 with alpha = 0.05. Set trace>=2 to see the calculated value.
(The default was 1 prior to Mar 2008 but was changed for compatibility with mda::mars and Friedman's papers).
This argument is intended to increase resistance to runs of noise in the input data. Higher values will increase smoothness in your model.
Note: predictor value extremes are ineligible for knots regardless of the minspan setting, as per the MARS paper equation 45.
newvar.penalty Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). Default is 0, meaning no penalty for adding a new variable. Useful non-zero values range from about 0.01 to 0.2 — you will need to experiment. This argument can mitigate the effects of colinearity or concurvity in the input data, but anecdotal evidence is that it does not work very well. If you know two variables are strongly correlated then you would do better to delete one of them before calling earth.
fast.k Maximum number of parent terms considered at each step of the forward pass. Friedman invented this parameter to speed up the forward pass (see the Fast MARS paper section 3.0). Default is 20. Values of 0 or less are equivalent to infinity, meaning no Fast MARS. Typical values, apart from 0, are 20, 10, or 5. In general, with a lower fast.k (say 5), earth is faster; with a higher fast.k, or with fast.k disabled (set to 0), earth builds a better model. However it is not unusual to get a slightly better model with a lower fast.k, and you may need to experiment.
fast.beta Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.

The following arguments are for the pruning pass.
One strategy is to build a large model and then trim it back using update.earth with the arguments below.
pmethod Pruning method. Default is "backward". One of: backward none exhaustive forward seqrep. If y has multiple columns, then only backward or none is allowed. Pruning can take a while if exhaustive is chosen and the model is big (more than about 30 terms). The current version of the leaps package used during pruning does not allow user interrupts (i.e. you have to kill your R session to interrupt; in Windows you can use tskill).
nprune Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms created by the forward pass (but typically not all terms will remain after pruning). Use this to reduce exhaustive search time, or to enforce an upper bound on the model size.

The following arguments are for internal or advanced use.
Object Earth object to be updated, for use by update.earth.
Get.crit Criterion function for model selection during pruning. By default a function that returns the GCV. See the section below on the pruning pass.
Eval.model.subsets Function to evaluate model subsets — see notes in source code.
Print.pruning.pass Function to print pruning pass results — see notes in source code.
Force.xtx.prune Default is FALSE. This argument pertains to subset evaluation in the pruning pass. By default, if y has a single column then earth calls the leaps routines; if y has multiple columns then earth calls EvalSubsetsUsingXtx. The leaps routines are based on the QR decomposition; the EvalSubsetsUsingXtx routine is based on the inverse of X'X. The leaps routines are more accurate but do not support multiple responses. Setting Force.xtx.prune=TRUE forces use of EvalSubsetsUsingXtx, even if y has a single column.
Use.beta.cache Default is TRUE. Using the "beta cache" takes more memory but is faster (by 20% and often much more for large models). The beta cache uses nk * nk * ncol(x) * sizeof(double) bytes. Set Use.beta.cache=FALSE to save memory.
... Dots are passed on to earth.fit.

Value

An object of class "earth" which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output from format.earth is a term). Term number 1 is always the intercept.

rss Residual sum-of-squares (RSS) of the model (summed over all responses).
rsq 1-rss/rss.null. R-Squared of the model (calculated over all responses). A measure of how well the model fits the training data.
gcv Generalised Cross Validation (GCV) of the model (summed over all responses) The GCV is calculated using the penalty argument. For details of the GCV calculation, see equation 30 in Friedman's MARS paper and earth:::get.gcv.
grsq 1-gcv/gcv.null. An estimate of the predictive power of the model (calculated over all responses). Unlike rsq, in MARS models grsq can be negative. A negative grsq would indicate a severely over parameterised model — a model that would not generalise well even though it may be a good fit to the training data. Watch the GRSq take a nose dive in this example:
earth(mpg~., data=mtcars, pmethod="none", trace=3)
bx Matrix of basis functions applied to x. Each column corresponds to a selected term. Each row corresponds to a row in in the input matrix x, after taking subset. See model.matrix.earth for an example of bx handling. Example bx:
    (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
[1,]          1           0.0           4.6                   0
[2,]          1           0.0           4.3                   0
[3,]          1           0.0           4.1                   0
...
dirs Matrix with one row per MARS term, and with with ij-th element equal to

0 if predictor j is not in term i
-1 if a term factor of the form pmax(c - xj) is in term i
1 if a term factor of the form pmax(xj - c) is in term i
2 if predictor j enters term i linearly [added Jan 2008].

This matrix includes all terms generated by the forward.pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example dirs:
                       Girth Height
(Intercept)                0  0 #intercept
h(Girth-12.9)              1  0 #2nd term uses Girth
h(12.9-Girth)             -1  0 #3rd term uses Girth
h(Girth-12.9)*h(Height-76) 1  1 #4th term uses Girth and Height
...
cuts Matrix with ij-th element equal to the cut point for predictor j in term i. This matrix includes all terms generated by the forward.pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Note for programmers: the precedent is to use dirs for term names etc. and to only use cuts where cut information needed. Example cuts:
                           Girth Height
(Intercept)                    0   0  #intercept, no cuts
h(Girth-12.9)               12.9   0  #2nd term has cut at 12.9
h(12.9-Girth)               12.9   0  #3rd term has cut at 12.9
h(Girth-12.9)*h(Height-76)  12.9  76  #4th term has two cuts
...
selected.terms Vector of term numbers in the best model. Can be used as a row index vector into cuts and dirs. The first element selected.terms[1] is always 1, the intercept.
prune.terms A matrix specifying which terms appear in which subsets. The row index of prune.terms is the model size. (The model size is the number of terms in the model. The intercept is considered to be a term.) Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus prune.terms is a lower triangular matrix, with dimensions nprune x nprune. The model selected by the pruning pass is at row number length(selected.terms). Example prune.terms:
[1,]    1  0  0  0  0  0  0 #intercept-only model
[2,]    1  2  0  0  0  0  0 #best 2 term model uses terms 1,2
[3,]    1  2  4  0  0  0  0 #best 3 term model uses terms 1,2,4
[4,]    1  2  6  9  0  0  0 #and so on
...
rss.per.response A vector of the RSS for each response. Length is the number of responses i.e. ncol(y) after factors in y have been expanded. The rss component above is equal to sum(rss.per.response).
rsq.per.response A vector of the R-Squared for each response. Length is the number of responses.
gcv.per.response A vector of the GCV for each response. Length is the number of responses. The gcv component above is equal to sum(gcv.per.response).
grsq.per.response A vector of the GRSq for each response. Length is the number of responses.
rss.per.subset A vector of the RSS for each model subset generated by the pruning pass. Length is nprune. For multiple reponses, the RSS is summed over all responses for each subset. The null RSS (i.e. the RSS of an intercept only-model) is rss.per.subset[1]. The rss above is
rss.per.subset[length(selected.terms)].
gcv.per.subset A vector of the GCV for each model in prune.terms. Length is is nprune. For multiple reponses, the GCV is summed over all responses for each subset. The null GCV (i.e. the GCV of an intercept-only model) is gcv.per.subset[1]. The gcv above is gcv.per.subset[length(selected.terms)].
fitted.values Fitted values. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
residuals Residuals. A matrix with dimensions nrow(y) x ncol(y) after factors in y have been expanded.
coefficients Regression coefficients. A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the least squares coefficients from regressing that column of y on bx. The first row holds the intercept coefficients.
penalty The GCV penalty used during pruning. A copy of earth's penalty argument.
ppenalty Deprecated. A copy of earth's penalty argument.
call The call used to invoke earth.
terms Model frame terms. This component exists only if the model was built using earth.formula.
namesx Column names of x, generated internally by earth when necessary so each column of x has a name. Used, for example, by predict.earth to name columns if necessary.
namesx.org Original column names of x.
x
y
data
subset
wp
weights Copies of the corresponding arguments to earth. Only exist if keepxy=TRUE.

The following fields are NULL unless earth's glm argument is used.
glm.list List of GLM models. Each element is the value returned by earth's internal call to glm for each response.
Thus if there is a single response (or a single binomial pair, see the "Binomial pairs" section below) this will be a one element list and you access the GLM model with my.earth.model$glm.list[[1]].
glm.coefficients GLM regression coefficients. Analogous to the coefficients field described above but for the GLM model(s). A matrix with dimensions length(selected.terms) x ncol(y) after factors in y have been expanded. Each column holds the coefficients from the GLM regression of that column of y on bx. This duplicates, for convenience, information in glm.list.
glm.bpairs NULL unless there are paired binomial columns. A logical vector, derived internally by earth, or a copy the bpairs specified by the user in the glm list. See the "Binomial pairs" section below.

Note

Contents

. Other implementations
. Limitations
. Multiple response models
. Generalised linear models
. Factors
. Binomial pairs
. The forward pass
. The pruning pass
. Execution time
. Memory use
. Using earth with fda and mda
. Migrating from mda::mars
. Standard model functions
. Frequently asked questions

Other implementations

The results are similar to but not identical to other Multivariate Adaptive Regression Splines implementations. The differences stem from the forward pass where very small implementation differences (or perturbations of the input data) can cause rather different selection of terms and knots (although similar GRSq's). The backward passes give identical or near identical results, given the same forward pass results.

The source code of earth is derived from mars in the mda package written by by Trevor Hastie and Robert Tibshirani. See also mars.to.earth.

The term "MARS" is trademarked and licensed exclusively to Salford Systems http://www.salfordsystems.com. Their implementation uses an engine written by Friedman and offers more features than earth.

StatSoft also have an implementation which they call MARSplines http://www.statsoft.com/textbook/stmars.html.

Limitations

The following aspects of MARS are mentioned in Friedman's papers but not implemented in earth:

i) Piecewise cubic models
ii) Model slicing (plotmo goes part way)
iii) Handling missing values
iv) Automatic grouping of categorical predictors into subsets
v) Fast MARS h parameter
vi) Cross validation to determine penalty

Multiple response models

If y has k columns (after possible expansion of factors to dummy variables) then earth builds k simultaneous models. Each model has the same set of basis functions (the same bx, selected.terms, dirs and cuts) but different coefficients (the returned coefficients will have k columns). The models are built and pruned as usual but with the GCVs and RSSs averaged across all k responses.

Since earth attempts to optimise for all models simultaneously, the results will not be as "good" as building the models independently i.e. the GCV of the combined model will usually not be as good as the GCVs for independently built models. However, the combined model may be a better model in other senses, depending on what you are trying to achieve. For example, it could be useful for earth to select the set of MARS terms that is best across all responses. This would typically be the case in a multiple response logistic model if some responses have a very small number of successes.

You may want to scale your y columns before calling earth so each y column gets equal weight during model building (the RSS for large y's is bigger than for small y's.). You could also use the wp argument.

Here are a couple of (artificial) examples to show some of the ways multiple responses can be specified. Note that data.frames can't be used on the left side of a formula so cbind is used in the first example. The examples use the standard technique of specifying a tag lvol= to name a column.

earth(cbind(Volume,lvol=log(Volume)) ~ ., data=trees)
attach(trees)
earth(data.frame(Girth,Height), data.frame(Volume,lvol=log(Volume)))
For more details on using residual errors averaged over multiple responses see section 4.1 of Hastie, Tibshirani, and Buja Flexible Discriminant Analysis by Optimal Scoring, JASA, December 1994 http://www-stat.stanford.edu/~hastie/Papers/fda.pdf.

Generalised linear models

Earth builds a GLM model if the glm argument is specified. Earth builds the model as usual and then invokes glm on the MARS basis matrix bx.

In more detail, the model is built as follows. Earth first builds a standard MARS model, including the internal call to lm.fit on bx after the pruning pass. (See "The forward pass" and "The pruning pass" sections below). Thus knot positions and terms are determined as usual and all the standard fields in earth's return value will be present. Earth then invokes glm for the response on bx with the parameters specified in the glm argument to earth. For multiple reponse models, the call to glm is repeated independently for each response. The results go into three extra fields in earth's return value: glm.list, glm.coefficients, and glm.bpairs.

Earth's internal call to glm is made with the glm arguments x, y, and model set TRUE.

Use summary(my.model) as usual to see the model. Use summary(my.model, details=TRUE) to see more details, but note that the printed P-values of the GLM coefficients are meaningless. This is because of the amount of preprocessing by earth — the mantra is "variable selection overstates significance". Use plot(my.model$glm.list[[1]]) to plot the (first) glm model.

The examples below show how to specify earth-glm models. The examples are only to illustrate the syntax and not necessarily useful models. In the examples pmethod="none", otherwise with these artificial models earth tends to prune away everything except the intercept term. You wouldn't normally use pmethod="none". Also, trace=1, so if you run these examples you can see how earth expands the input matrices (see the "Factors" and "Binomial pairs" sections below).

1. Two-level factor or logical response. The response is converted to a single column of 1s and 0s.

a1 <- earth(survived ~ ., data=etitanic,   # c.f. Harrell "Reg Mod Strat" ch. 12
             degree=2, trace=1,
             glm=list(family=binomial))

a1a <- earth(etitanic[,-2], etitanic[,2],  # equivalent but using earth.default
             degree=2, trace=1,
             glm=list(family=binomial))
2. Factor response. This example is for a factor with more than two levels. (For factors with just two levels, see the previous example.) The factor pclass is expanded to three indicator columns (whereas in a direct call to glm, pclass would be treated as logical: the first level versus all other levels).
a2 <- earth(pclass ~ ., data=etitanic, trace=1, 
            glm=list(family=binomial))
3. Binomial model specified with a column pair. This is a single reponse model but specified with a pair of columns: see the "Binomial pairs" section below. For variety, this example uses a probit link and (unnecessarily) increases maxit.
ldose <- rep(0:5, 2) - 2 # V&R 4th ed. p. 191
sex <- factor(rep(c("male", "female"), times=c(6,6)))
numdead <- c(1,4,9,13,18,20,0,2,6,10,12,16)
pair <- cbind(numdead, numalive=20 - numdead)
a3 <- earth(pair ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family=binomial(link=probit), maxit=100))
4. Double binomial response (i.e. a multiple response model) specified with two column pairs.
numdead2 <- c(2,8,11,12,20,23,0,4,6,16,12,14) # bogus data
doublepair <- cbind(numdead, numalive=20-numdead,
                    numdead2=numdead2, numalive2=30-numdead2)
a4 <- earth(doublepair ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family="binomial"))
5. Poisson model.
counts <- c(18,17,15,20,10,20,25,13,12) # Dobson 1990 p. 93
outcome <- gl(3,1,9)
treatment <- gl(3,3)
a5 <- earth(counts ~ outcome + treatment, trace=1, pmethod="none",
            glm=list(family=poisson))
6. Standard earth model, the long way.
a6 <- earth(numdead ~ sex + ldose, trace=1, pmethod="none",
            glm=list(family=gaussian(link=identity)))
print(a6$coefficients == a6$glm.coefficients)  # all TRUE

Factors

Factors in x: Earth treats factors in x in the same way as standard R models such as lm (where x is taken to mean the right side of the formula). Thus factors are expanded using the current setting of options("contrasts").

Factors in y: A two level factor (or logical) is converted to a single indicator column of 1s and 0s. A factor with three or more levels is converted into k indicator columns of 1s and 0s, where k is the number of levels (the contrasts matrix is diagonal, see contr.earth.response). This happens regardless of the options("contrasts") setting and regardless of whether the factors are ordered or unordered. For example, if a column in y is a factor with levels A, B, and C, the column will be expanded to three columns like this (the actual data will vary but each row will have a single 1):

A B C  # one column for each factor level
0 1 0  # each row has a single 1
1 0 0
0 0 1
0 0 1
0 0 1
...
In distinction, a standard treatment contrast in a model with an intercept would have no first "A" column.

This expansion to multiple columns means that earth will build a multiple response model as described in the "Multiple responses" section above.

Paired binomial response columns in y are treated specially. — see the "Binomial pairs" section below.

Use trace=1 or higher to see the column names of the x and y matrices after factor processing. Use trace=4 to see the first few rows of x and y after factor processing.

Here is an example which uses the etitanic data to predict the passenger class (not necessarily a sensible thing to do but provides a good example here):

> data(etitanic)
> head(etitanic) # pclass and sex are unordered factors

  pclass survived    sex    age sibsp parch
1    1st        1 female 29.000     0     0
2    1st        1   male  0.917     1     2
3    1st        0 female  2.000     1     2

> earth(pclass ~ ., data=etitanic, trace=1) # note col names in x and y below

x is a 1046 by 5 matrix: 1=survived, 2=sexmale, 3=age, 4=sibsp, 5=parch
y is a 1046 by 3 matrix: 1=1st, 2=2nd, 3=3rd
rest deleted...
This way of handling factors is incompatible with previous versions and was introduced in earth version 2.0 [June 2008].

Binomial pairs

This section is only relevant if you use earth's glm argument with a binomial or quasibinomial family.

Users of the glm function will be familiar with the technique of specifying a binomial response as a two-column matrix, with a column for the number of successes and a column for the failures. Earth automatically detects when such columns are present in y (by looking for adjacent columns which both have entries greater than 1). The first column only is used to build the standard earth model. Both columns are then passed to earth's internal call to glm. As always, use trace=1 to see how the columns of x and y are expanded.

You can override this automatic detection by including a bpairs parameter. This is usually (always?) unnecessary. For example

glm=list(family=binomial, bpairs=c(TRUE, FALSE))
specifies that there are two columns in the response with the second paired with the first. These examples
glm=list(family=binomial, bpairs=c(TRUE, FALSE, TRUE, FALSE))
glm=list(family=binomial, bpairs=c(1,3)) # equivalent
specify that the 1st and 2nd columns are a binomial pair and the 3rd and 4th columns another binomial pair.

The forward pass

Understanding the details of the forward and pruning passes will help you understand earth's return value and the admittedly large number of arguments. The result of the forward pass is the MARS basis matrux bx and the set of terms defined by dirs and cuts (these are all fields in earth's return value, but the bx here includes all terms before trimming back to selected.terms).

The forward pass adds terms in pairs until the first of the following conditions is met:

i) reach maximum number of terms (nterms >= nk)
ii) reach DeltaRSq threshold (DeltaRSq < thresh) where DeltaRSq is the difference in R-Squared caused by adding the current term pair
iii) reach max RSq (RSq > 1-thresh) where thresh is the argument to earth
iv) reach min GRSq (GRSq < -10).

Set trace>=1 to see the stopping condition and trace>=2 to trace the forward pass

Note that GCVs (via GRSq) are used during the forward pass only as one of the stopping conditions and in trace prints. Changing the penalty argument does not change the knot positions.

The various stopping conditions mean that the actual number of terms created by the forward pass may be less than nk. There are other reasons why the actual number of terms may be less than nk: (i) the forward pass discards one side of a term pair if it adds nothing to the model — but the forward pass counts terms as if they were actually created in pairs, and, (ii) as a final step, the forward pass deletes linearly dependent terms, if any, so all terms in dirs and cuts are independent. And remember that the pruning pass will further discard terms.

The pruning pass

The pruning pass is handed the sets of terms created by the forward pass. Its job is to find the subset of those terms that gives the lowest GCV. The following description of the pruning pass explains how various fields in earth's returned value are generated.

The pruning pass works like this: it determines the subset of terms in bx (using pmethod) with the lowest RSS for each model size in 1:nprune (see the Force.xtx.prune argument above for some details). It saves the RSS and term numbers for each such subset in rss.per.subset and prune.terms. It then applies the Get.crit function with penalty to rss.per.subset to yield gcv.per.subset. It updates bx by keeping only the selected.terms. Finally it chooses the model with the lowest value in gcv.per.subset, and puts its term numbers into selected.terms.

After the pruning pass, earth runs lm.fit to determine the fitted.values, residuals, and coefficients, by regressing the response y on bx. If y has multiple columns then lm.fit is called for each column.

If a glm argument is passed to earth, earth runs glm on y in addition to the above call to lm.fit.

Set trace>=3 to trace the pruning pass.

By default Get.crit is earth:::get.gcv. Alternative Get.crit functions can be defined. See the source code of get.gcv for an example.

Execution time

For a given set of input data, the following can increase the speed of the forward pass:

i) decreasing fast.k
ii) decreasing nk
iii) decreasing degree
iv) increasing threshold
v) increasing min.span.

The backward pass is normally much faster than the forward pass, unless pmethod="exhaustive". Reducing nprune reduces exhaustive search time. One strategy is to first build a large model and then adjust pruning parameters such as nprune using update.earth.

The following very rough rules of thumb apply for large models. Using minspan=1 instead of the default 0 will increase times by 20 to 50%. Using fast.k=5 instead of the default 20 can give you substantial speed gains but will sometimes give you a much smaller GRSq. Using an allowed function slows down model building by about 10%.

Memory use

Earth does not impose specific limits on the model size. Model size is limited only by the amount of memory on your system, the maximum memory addressable by R, and your patience. On a 32 bit machine with x and y of type double (no factors), the number of bytes of memory used by earth is about

8 * (nk^2 * ncol(x) + (nrow(x) * (3 + 2*nk + ncol(x)/2))).
Earth prints the results of the above calculation if trace>=4. Memory use peaks in the forward pass. The bulk of the forward pass is implemented in C. It allocates memory "outside of R" and so memory.size will not report the memory it uses.

Before calling earth, R itself will of course allocate memory over and above the amount calculated above. To reduce total memory usage, it sometimes helps to remove variables and call gc before invoking earth.

Earth uses more memory if any elements of the x and y arguments are not double, because it must convert them to double internally. The same applies if the subset argument is used. Earth uses more memory if trace>=2 (because DUP=TRUE is required to pass predictor names to earth's internal call to .C). Increasing the degree does not change the memory requirement but greatly increases the running time.

Here is an example of memory use: the earth test suite builds a model using earth.default with a 1e4 by 100 input matrix with nk=21. The Windows XP task manager reports that the peak memory use when building this model is 47 MBytes. Using the formula interface to earth pushes memory to 62 MBytes. Increasing the number of rows in the input matrix to 1e5 pushes memory to 240 MBytes. These numbers are for earth version 1.3 (Mar 2008), which requires less memory than previous versions.

Using earth with fda and mda

Earth can be used with fda and mda in the mda package. Earth will generate a multiple response model. Use the fda/mda argument keep.fitted=TRUE if you want to call plot.earth later (actually only necessary for large datasets, see the description of keep.fitted in fda). Use the earth argument keepxy=TRUE if you want to call update.earth or plotmo later. Example:

library(mda)
(a <- fda(Species~., data=iris, keep.fitted=TRUE, method=earth, keepxy=TRUE))
plot(a)
summary(a$fit)  # examine earth model embedded in fda model
plot(a$fit)
plotmo(a$fit, ycolumn=1, ylim=c(-1.5,1.5), clip=FALSE)
plotmo(a$fit, ycolumn=2, ylim=c(-1.5,1.5), clip=FALSE)

Migrating from mda::mars

Using earth is usually just a matter of changing the call from "mars" to "earth". But there are a few argument differences and earth will issue a warning if you give it a mars-only argument. See also mars.to.earth.

The resulting model will be similar but not identical because of small implementation differences which are magnified by the inherent instability of the MARS forward pass.

If you are further processing the output of earth you will need to consider differences in the returned value. The header of the source file mars.to.earth.R describes these. Perhaps the most important is that mars returns the MARS basis matrix in a field called "x" whereas earth returns "bx". Also, earth returns "dirs" rather than "factors", and in earth this matrix can have entries of value 2 for linear predictors.

Standard model functions

Standard model functions such as case.names are provided for earth objects and are not explicitly documented. Many of these give warnings when the results are not what you may expect. Pass warn=FALSE to these functions to turn of just these warnings.

FREQUENTLY ASKED QUESTIONS

What are your plans for earth?

We would like to get plotmo working with factors. Then functions for model cross-validation. Proper support of weights (to allow boosting) would be nice too.

How can I establish variable importance?

Use the evimp function. See its help page for more details.

The summary.earth function lists the predictors in order of estimated importance using the nsubsets criterion of evimp.

Which predictors were added to the model first?

You can see the forward pass adding terms with trace=2 or higher. But remember, pruning will usually remove some of the terms. You can also use

summary(my.model, decomp="none")
which will list the basis functions remaining after pruning, in the order they were added by the forward pass.

Which predictors are actually in the model?

The following function will give you list of predictors in the model:

get.used.pred.names <- function(obj) # obj is an earth object
{
  any1 <- function(x) any(x != 0)    # like any but no warning if x is double
  names(which(apply(obj$dirs[obj$selected.terms,,drop=FALSE],2,any1)))
}
How can I train on one set of data and test on another?

The example below demonstrates one way to train on 80% of the data and test on the remaining 20%.

train.subset <- sample(1:nrow(trees), .8 * nrow(trees))
test.subset <- (1:nrow(trees))[-train.subset]
a <- earth(Volume ~ ., data = trees[train.subset, ])
yhat <- predict(a, newdata = trees[test.subset, ])
y <- trees$Volume[test.subset]
print(1 - sum((y - yhat)^2) / sum((y - mean(y))^2)) # print R-Squared
In practice a dataset larger than the one in the example should be used for splitting. The model variance is too high with this small set — run the example a few times to see how the model changes as sample splits the dataset differently on each run. Also, remember that the test set should not be used for parameter tuning because you will be optimising for the test set — use GCVs or separate parameter selections sets for that.

Why do I get fewer terms than nk, even with prune="none"?

There are several conditions that can terminate the forward pass, and reaching nk is just one of them. See the section above on the forward pass.

Why do I get fewer terms than nprune?

The pruning pass selects a model with the lowest GCV that has nprune or fewer terms. Thus the nprune argument specifies the maximum number of permissible terms in the final pruned model.

You can work around this because you will get exactly nprune terms if you specify pmethod="none". Compare the output of these two examples:

earth(Volume ~ ., data = trees, trace=3)
earth(Volume ~ ., data = trees, trace=3, pmethod="none")
Another way to get exactly nprune terms is to specify penalty = -1. This special value of penalty causes earth to set the GCV to RSS/nrow(x). Since the training RSS always decreases with more terms, the pruning pass will choose the maximum allowable number of terms. An example:
earth(Volume ~ ., data = trees, trace=3, penalty=-1)

Is it best to hold down model size with nk or nprune?

If you want the best possible small model, build a big model (by specifying a big nk) and prune it back (by specifying a small nprune). This is better than directly building a small model by specifying a small nk, because the pruning pass can look at all the terms whereas the forward pass can only see one term ahead. However, it is much faster building a small model with a small nk.

Can you give an example of the linpreds argument?

With linpreds you can specify which predictors should enter linearly, instead of in hinge functions. The lindeps argument does not stipulate that a predictor must enter the model, only that if it enters it should enter linearly. Starting with

a1 <- earth(Volume ~ ., data = trees)
plotmo(a1)
we see in the plotmo graphs or by running evimp that Height is not as important as Girth. For collaborative evidence that Girth is a more reliable indicator of Volume you can use pairs:
pairs(trees, panel = panel.smooth)
Since we want the simplest model that describes the data, we can specify that Height should enter linearly:
a2 <- earth(Volume ~ ., data = trees, linpreds = 2)  # 2 is Height column
summary(a2)
which yields
Expression:
  -7.41
  + 0.418 * Height
  +  5.86 * pmax(0,  Girth -   12.9)
  -  2.41 * pmax(0,   12.9 -  Girth)
In this example, the second simpler model has almost the same RSS as the first model. We can make both Girth and Height enter linearly with
a3 <- earth(Volume ~ ., data = trees, linpreds = c(1,2))
or with (the single TRUE is recycled to the length of linpreds)
a4 <- earth(Volume ~ ., data = trees, linpreds = TRUE)
But specifying that all predictors should enter linearly is not really a useful thing to do. In our simple example, the all-linear MARS model is the same as a standard linear model
a5 <- lm(Volume ~ ., data = trees)
(compare the summary for each) but in general that will not be true. Earth will not include a linear predictor if that predictor does not improve the model.

Can you give an example of the allowed argument?

You can specify how variables are allowed to enter MARS terms with the allowed argument.

The interface is flexible but requires a bit of programming. We start with a simple example, which completely excludes one predictor from the model:

example1  <- function(degree, pred, parents)   # returns TRUE if allowed
{
    pred != 2  # disallow predictor 2, which is "Height"
}
a1 <- earth(Volume ~ ., data = trees, allowed = example1)
print(summary(a1))
But that's not much use, because it's simpler to exclude the predictor from the input matrix when invoking earth:
a2 <- earth(Volume ~ . - Height, data = trees)
The example below is more useful. It prevents the specified predictor from being used in interaction terms. (The example is artificial because it's unlikely you would want to single out humidity from interactions in the ozone data.)

The parents argument is the candidate parent's row in the dirs matrix (dirs is described in the Value section above). Each entry of parents is 0, 1, -1, or 2, and you index parents on the predictor index. Thus parents[pred] is 0 if pred is not in the parent term.

example2 <- function(degree, pred, parents)
{
    # disallow humidity in terms of degree > 1
    # 3 is the "humidity" column in the input matrix
    if (degree > 1 && (pred == 3 || parents[3]))
        return(FALSE)
    TRUE
}
a3 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example2)
print(summary(a3))
The following example allows only the specified predictors in interaction terms:
example3 <- function(degree, pred, parents)
{
    # allow only humidity and temp in terms of degree > 1
    # 3 and 4 are the "humidity" and "temp" columns
    allowed.set = c(3,4)
    if (degree > 1 &&
           (all(pred != allowed.set) || any(parents[-allowed.set])))
        return(FALSE)
    TRUE
}
a4 <- earth(O3 ~ ., data = ozone1, degree = 2, allowed = example3)
print(summary(a4))
Using predictor names instead of indices in the "allowed" function. You can use predictor names instead of indices using the optional namesx argument. If present, namesx is the column names of x after factors have been expanded. The first example (the one that disallows Height) can be rewritten as
example1a  <- function(degree, pred, parents, namesx)
{
    namesx[pred] != "Height"  
}
Comparing strings is inefficient and the above example can be rewritten a little more efficiently using the optional first argument. If present, this is TRUE the first time your allowed function is called for the current model and thereafter FALSE.
iheight <- 0    # column index of "Height"

example1b  <- function(degree, pred, parents, namesx, first)
{
    if (first) {
        # first time this function is invoked, so
        # stash column index of "Height" in iheight
        iheight <<- which(namesx == "Height")   # note use of <<- not <-
        if (length(iheight) != 1)
            stop("no Height in ", paste(namesx, collapse=" "))
    }
    pred != iheight
}
The basic MARS model building strategy is always applied even when there is an allowed function. For example, earth considers a term for addition only if all factors of that term except the new one are already in a term. This means that an allowed function that inhibits, say, all degree 2 terms will also effectively inhibit higher degrees too, because there will be no degree 2 terms for earth to extend to degree 3.

You can expect model building to be about 10% slower with an allowed function.

How does summary.earth order terms?

With decomp="none", the terms are ordered as created by the forward pass.

With the default decomp="anova", the terms are ordered in increasing order of interaction. In detail:
(i) terms are sorted first on degree of interaction
(ii) then terms with a linpreds linear factor before standard terms
(iii) then on the predictors in the factors (in the order of the columns in the input matrix)
(ii) and finally on increasing knot values.

It's actually earth:::reorder.earth that does the ordering.

A slightly different and less predictable "anova" ordering was used prior to earth version 1.2 (Jan 2008). The change also affects the ordering of degree2 plotmo plots.

summary.earth lists predictors with weird names that aren't in x. What gives?

You probably have factors in your x matrix, and earth is applying contrasts. See the "Factors" section above.

Why pmax and not max in the output from summary.earth (with style="pmax")?

With pmax the earth equation is an R expression that can handle multiple cases. Thus the expression is consistent with the way predict.earth works — you can give predict multiple cases (i.e. multiple rows in the input matrix) and it will return a vector of predicted values.

What about boosting MARS?

If you want to boost, use boosted trees rather than boosted MARS — you will get better results.

More precisely, although gradient boosted MARS gives better results than plain MARS, if you would like to improve prediction performance (at the cost of a more complicated and less interpretable model) you will usually get better results with boosted trees (via, say, the gbm package) than with boosted MARS. See Gillian Ward (2007) Statistics in Ecological Modeling: Presence-Only Data and Boosted Mars (Doctoral Thesis) http://www-stat.stanford.edu/~hastie/THESES/Gill_Ward.pdf.

This could change as the state of the art advances.

What about bagging MARS?

Max Kuhn's caret package provides functions for bagging MARS (and for parameter selection).

What is a GCV, in simple terms?

GCVs are important for MARS because the pruning pass uses GCVs to evaluate model subsets.

In general terms, when testing a model (not necessarily a MARS model) we want to test generalisation performance and so want to measure error on independent data i.e. not on the training data. Often a decent set of independent data is unavailable and so we resort to cross validation or leave-one-out methods. But that can be painfully slow. For certain forms of model we can use a formula to approximate the error that would determined by leave-one-out validation — that approximation is the GCV. The formula adjusts (i.e. increases) the training RSS to take into account the flexibility of the model. Summarising, the GCV approximates the RSS that would be measured on independent data. Even when the approximation is not that good, it is usually good enough for comparing models during pruning.

GCVs were introduced by Craven and Wahba, and extended by Friedman for MARS. See Hastie et al. p216 and the Friedman MARS paper. GCV stands for "Generalised Cross Validation", a perhaps misleading term. For example, the terms selected by the pruning pass for each subset size are the same whether the GCV or RSS is used to select terms. Actual cross validation during pruning would choose terms for each subset that are different in general from those selected by the RSS on the full training set.

The GRSq measure used in the earth package standardises the raw GCV, in the same way that R-Squared standardises the RSS.

If GCVs are so important, why don't linear models use them?

The more flexible a model, the more its propensity to overfit the training data. (An overfit model fits the training data well but will not give good predictions on new data.) Linear models are constrained, with usually only a few parameters, and don't have the tendency to overfit the data like more flexible models such as MARS. This means that for linear models, the RSS on the data used to build the model is usually an adequate measure of generalisation ability.

This is no longer true if you do automatic variable selection on linear models, because the process of selecting variables increases the flexibility of the model. Hence the AIC — as used in, say, drop1. The GCV, AIC, and friends are means to the same end. Depending on what information is available during model building. we use one of these statistics to estimate model generalisation performance for the purpose of selecting a model.

What happened to get.nterms.per.degree, get.nused.preds.per.subset, and reorder.earth?

From release 1.3.0, some earth functions are no longer public, to help simplify the user interface. The functions are still available (and stable) if you need them — use for example earth:::reorder.earth().

What happened to the ppenalty argument?

This was removed (release 1.3.1) because it is no longer needed. The ponly argument of update.earth is a more flexible way of achieving the same end.

Author(s)

Stephen Milborrow, derived from mda::mars by Trevor Hastie and Robert Tibshirani.

The approach used for GLMs was motivated by work done by Jane Elith and John Leathwick (a representative paper is listed in the references below).

The evimp function uses ideas from Max Kuhn's caret package http://cran.r-project.org/web/packages/caret/index.html.

Users are encouraged to send feedback — use milbo AT sonic PERIOD net http://www.milbo.users.sonic.net.

References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x). Friedman and Silverman is recommended background reading for the MARS paper. Earth's pruning pass uses the leaps package which is based on techniques in Miller.

Faraway (2005) Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23

Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141 http://www.salfordsystems.com/doc/MARS.pdf

Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www.milbo.users.sonic.net/earth/Friedman-FastMars.pdf, http://www-stat.stanford.edu/research/index.html

Friedman and Silverman (1989) Flexible Parsimonious Smoothing and Additive Modeling Technometrics, Vol. 31, No. 1. http://links.jstor.org/sici?sici=0040-1706%28198902%2931%3A1%3C3%3AFPSAAM%3E2.0.CO%3B2-Z

Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning http://www-stat.stanford.edu/~hastie/pub.htm

Leathwick, J.R., Rowe, D., Richardson, J., Elith, J., & Hastie, T. (2005) Using multivariate adaptive regression splines to predict the distributions of New Zealand's freshwater diadromous fish Freshwater Biology, 50, 2034-2052 http://www-stat.stanford.edu/~hastie/pub.htm, http://www.botany.unimelb.edu.au/envisci/about/staff/elith.html

Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression http://users.bigpond.net.au/amiller

See Also

Start with summary.earth, plotmo, and evimp.

contr.earth.response, etitanic, evimp, format.earth, mars.to.earth, model.matrix.earth, ozone1, plot.earth.models, plot.earth, plotmo, predict.earth, residuals.earth, summary.earth, update.earth

Examples

a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2, style = "pmax")

# yields:
#    Call: earth(formula=Volume~., data=trees)
#
#    Volume =
#      23
#      +  5.7 * pmax(0,  Girth -     13)
#      -  2.9 * pmax(0,     13 -  Girth)
#      + 0.72 * pmax(0, Height -     76)
#
#    Selected 4 of 5 terms, and 2 of 2 predictors
#    Estimated importance: Girth Height
#    Number of terms at each degree of interaction: 1 3 (additive model)
#    GCV 11    RSS 213    GRSq 0.96    RSq 0.97

[Package earth version 2.0-2 Index]