earth {earth}    R Documentation

Earth: Multivariate Adaptive Regression Splines

Description

Build a regression model using the techniques in Friedman's papers ‘Multivariate Adaptive Regression Splines’ and ‘Fast MARS’.

Usage

## S3 method for class 'formula':
earth(formula, data, ...)

## Default S3 method:
earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"),
      subset = NULL, weights = NULL, na.action = na.fail,  
      penalty = if(degree > 1) 3 else 2, trace = 0,
      degree = 1, nk = max(21, 2 * NCOL(x) + 1), 
      thresh = 0.001, minspan = 0, newvar.penalty = 0, 
      fast.k = 20, fast.beta = 1, fast.h = NULL,
      pmethod = "backward", ppenalty = penalty, nprune = NULL,
      Object  = NULL, Get.crit = get.gcv,
      Eval.model.subsets = eval.model.subsets,
      Print.pruning.pass = print.pruning.pass, ...)

Arguments

All arguments are optional except formula, or x and y. The data, or x and y, arguments are treated as numeric. NAs are not allowed.

To start off, look at the arguments formula, x, y, trace, degree, nk, and nprune.

formula Model formula.
data Data frame.
x Matrix containing the independent variables.
y Vector containing the response variable. If the y values are very big or very small then you may get better results if you scale y first.
subset Index vector specifying which rows in x and elements of y to use. Default is NULL, meaning all.
weights Weight vector (not yet supported).
na.action NA action. Default is na.fail, and only na.fail is supported.
penalty GCV penalty per knot. Default is if(degree>1) 3 else 2. A value of 0 penalises only terms, not knots. The value -1 is a special case, meaning no penalty, so GCV=RSS/n. Theory suggests values in the range of about 2 to 3. In practice, larger values can be useful for big models. See also ppenalty.
trace Trace earth's execution. Default is 0. Values:
0 none
1 overview
2 forward
3 pruning
4 more pruning
5 ...

The following arguments are for the forward pass
degree Maximum degree of interaction (Friedman's mi). Default is 1, meaning build an additive model.
nk Maximum number of model terms before pruning. Includes the intercept. Default is max(21,2*NCOL(x)+1). The number of terms created by the forward pass will be less than nk if there are linearly dependent terms which must be discarded, or if a forward stopping condition is reached before nk terms. See the section below on the forward pass.
thresh Forward stepping threshold. This is one of the arguments used to decide when forward stepping should terminate. See the section below on the forward pass. Default is 0.001.
minspan Minimum distance between knots. Set trace>=2 to see the calculated value. Values:
<0 add to the internally calculated min span (i.e. decrease span).
0 (default) use internally calculated min span as per Friedman's MARS paper section 3.8 with alpha = 0.05. Intended to increase resistance to runs of noise in the input data.
>0 use instead of the internally calculated min span. Thus a value of 1 means consider all knots.
newvar.penalty Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). This argument can mitigate the effects of collinearity or concurvity in the input data. Default is 0. Useful non-zero values range from about 0.01 to 0.2 — you will need to experiment.
fast.k Maximum number of considered parent terms, as described in Friedman's Fast MARS paper section 3.0. Default is 20. The special value -1 is equivalent to infinity, meaning no Fast MARS. Typical values, apart from -1, range from about 20 to 5, in steps of 5 or 10.
fast.beta Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results.
fast.h Fast MARS h, as described in the Fast MARS paper section 4.0. (not yet implemented).

The following arguments are for the pruning pass
pmethod Pruning method. One of: "backward", "none", "exhaustive", "forward", "seqrep". Default is "backward". Model subset evaluation for pruning uses the leaps package. Pruning can take a while if "exhaustive" is chosen and the model is big (more than about 30 terms). The current version of leaps does not allow user interrupts (i.e. you have to kill your R session to interrupt).
ppenalty Like penalty but for the pruning pass. Default is penalty.
nprune Maximum number of terms (including intercept) in the pruned model. Default is NULL, meaning all terms. Use this to reduce exhaustive search time, or to enforce a maximum model size. Often used with update.earth.

The following arguments are for internal or advanced use
Object Earth object to be updated, for use by update.earth.
Get.crit Criterion function for model selection during pruning. By default a function that returns the GCV. See the section below on the pruning pass.
Eval.model.subsets Function used to evaluate model subsets (see notes in source code).
Print.pruning.pass Function used to print pruning pass results (see notes in source code).
... earth.formula: arguments passed to earth.default.
earth.default: unused, but provided for generic/method consistency.

Value

An object of class ‘earth’ which is a list with the components listed below. Term refers to a term created during the forward pass (each line of the output from format.earth is a term). Term number 1 is always the intercept.

fitted.values Fitted values
residuals Residuals
coefficients Least squares coefficients for columns in bx. Each value corresponds to a selected term. coefficients[1] is the intercept.
rss Residual sum-of-squares of the model. Equal to rssVec[length(selected.terms)]. See also rssVec below.
rsq 1-rss/rss.null. R-Squared of the model. A measure of how well the model fits the training data.
gcv Generalised Cross Validation value (GCV) of the model. Equal to gcvVec[length(selected.terms)]. See also gcvVec below. For details of the GCV calculation, see equation 30 in Friedman's MARS paper and earth:::get.gcv.
grsq 1-gcv/gcv.null. An estimate of the predictive power of the model.
Unlike rsq, grsq can be negative. A negative grsq would indicate a severely over-parameterised model, one that would not generalise well even though it may be a good fit to the training data. Example of a negative grsq:
earth(mpg ~ ., data = mtcars, pmethod = "none", trace = 4)

bx Matrix of basis functions applied to x. Each column corresponds to a selected term. Each row corresponds to a row in the input matrix x, after taking subset. See model.matrix.earth for an example of bx handling. Example bx:
    (Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(...
[1,]          1           0.0           4.6                   0
[2,]          1           0.0           4.3                   0
[3,]          1           0.0           4.1                   0
...
dirs Matrix with ij-th element equal to 1 if term i has a factor of the form x_j > c, equal to -1 if term i has a factor of the form x_j <= c, and to 0 if x_j is not in term i. This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example dirs:
                       Girth Height
(Intercept)                0      0  #no factors in intercept
h(Girth-12.9)              1      0  #2nd term uses Girth
h(12.9-Girth)             -1      0  #3rd term uses Girth
h(Girth-12.9)*h(Height-76) 1      1  #4th term uses Girth and Height
...
cuts Matrix with ij-th element equal to the cut point for variable j in term i. This matrix includes all terms generated by the forward pass, including those not in selected.terms. Note that the terms may not be in pairs, because the forward pass deletes linearly dependent terms before handing control to the pruning pass. Example cuts:
                           Girth Height
(Intercept)                  0.0      0  #intercept, no cuts
h(Girth-12.9)               12.9      0  #2nd term has cut at 12.9
h(12.9-Girth)               12.9      0  #3rd term has cut at 12.9
h(Girth-12.9)*h(Height-76)  12.9     76  #4th term has two cuts
...
selected.terms Vector of term numbers in the best model. Can be used as a row index vector into cuts and dirs. The first element selected.terms[1] is always 1, the intercept.
rssVec Residual sum-of-squares for each model size considered by the pruning pass. The length of rssVec is nprune. The null RSS (i.e. the RSS of an intercept-only model) is rssVec[1]. The RSS of the selected model is rssVec[length(selected.terms)].
gcvVec GCV for each model in prune.terms. The length of gcvVec is nprune. The null GCV (i.e. the GCV of an intercept-only model) is gcvVec[1]. The GCV of the selected model is gcvVec[length(selected.terms)].
prune.terms The row index of prune.terms is the model size (the model size is the number of terms in the model). Each row is a vector of term numbers for the best model of that size. An element is 0 if the term is not in the model, thus prune.terms is a lower triangular matrix, with dimensions nprune x nprune. The model selected by the pruning pass is at row length(selected.terms). Example prune.terms:
[1,]    1  0  0  0  0  0  0  #intercept-only model
[2,]    1  2  0  0  0  0  0  #best 2 term model uses terms 1,2.
[3,]    1  2  4  0  0  0  0  #best 3 term model uses terms 1,2,4
[4,]    1  2  9  8  0  0  0
...
ppenalty The GCV penalty used during pruning. A copy of earth's ppenalty argument.
call The call used to invoke earth.
terms Model frame terms. This component exists only if the model was built using earth.formula.

Note

Standard Model Functions

Standard model functions such as case.names are provided for earth objects and are not explicitly documented.

Other Implementations

The results are similar but not identical to those of other Multivariate Adaptive Regression Splines implementations. The differences stem from the forward pass, where very small implementation differences (or perturbations of the input data) can cause rather different selection of terms and knots. The backward passes give identical or near identical results, given the same forward pass results.

The source code of earth is derived from mars in the mda package written by Trevor Hastie and Robert Tibshirani. Unlike earth, mda::mars allows multiple responses. See also mars.to.earth.

The term ‘MARS’ is trademarked and licensed exclusively to Salford Systems http://www.salfordsystems.com. Their implementation uses an engine written by Friedman and offers more features than earth.

Limitations

Multiple responses are not yet supported.

There is no special support for factors.

The following aspects of MARS are mentioned in Friedman's papers but not implemented in earth:
i) Piecewise cubic models
ii) Specifying which predictors must enter linearly
iii) Specifying which predictors can interact
iv) Model slicing (plotmo goes part way)
v) Handling missing values
vi) Logistic regression and special handling of categorical predictors
vii) Fast MARS h parameter.

The Forward Pass

The forward pass adds MARS terms in pairs until the first of the following conditions is met:
i) reach maximum number of terms (nterms>=nk).
ii) reach DeltaRSq threshold (DeltaRSq<thresh) where DeltaRSq is the difference in R-Squared caused by adding the current term pair.
iii) reach max RSq (RSq>1-thresh).
iv) reach min GRSq (GRSq< -10).

Set trace>=2 to see the stopping condition.
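
The four stopping conditions listed above can be sketched as a small helper. This is only an illustration: the function name, its arguments, and the order in which the conditions are checked are assumptions, not part of earth.

```r
# Hypothetical helper (not part of earth): report which of the four
# stopping conditions, if any, holds after adding a term pair.
forward_stop_reason <- function(nterms, nk, delta_rsq, rsq, grsq,
                                thresh = 0.001) {
  if (nterms >= nk)       return("reached maximum number of terms")
  if (delta_rsq < thresh) return("reached DeltaRSq threshold")
  if (rsq > 1 - thresh)   return("reached max RSq")
  if (grsq < -10)         return("reached min GRSq")
  NA_character_  # no condition met: the forward pass would continue
}

# The last term pair improved R-Squared by less than thresh
forward_stop_reason(nterms = 7, nk = 21, delta_rsq = 0.0004,
                    rsq = 0.95, grsq = 0.93)
```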

The result of the forward pass is the set of terms defined by $dirs and $cuts. As a final step, the forward pass deletes linearly dependent terms, if any, so all terms in $dirs and $cuts are independent.
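
To illustrate how dirs and cuts together define a set of terms, the sketch below rebuilds basis columns from them using the hinge function h(z) = pmax(0, z). The helper name is hypothetical, and the matrices are typed in by hand from the example dirs and cuts in the Value section with the first three rows of trees; earth's internal code may well differ.

```r
# Hypothetical helper (not part of earth): rebuild basis columns
# from a dirs matrix and a cuts matrix.
basis_from_dirs_cuts <- function(x, dirs, cuts) {
  bx <- matrix(1, nrow = nrow(x), ncol = nrow(dirs),
               dimnames = list(NULL, rownames(dirs)))
  for (i in seq_len(nrow(dirs))) {
    for (j in seq_len(ncol(dirs))) {
      if (dirs[i, j] == 1)         # factor of the form h(x_j - c)
        bx[, i] <- bx[, i] * pmax(0, x[, j] - cuts[i, j])
      else if (dirs[i, j] == -1)   # factor of the form h(c - x_j)
        bx[, i] <- bx[, i] * pmax(0, cuts[i, j] - x[, j])
    }
  }
  bx
}

x <- cbind(Girth = c(8.3, 8.6, 8.8), Height = c(70, 65, 63))
term_names <- c("(Intercept)", "h(Girth-12.9)", "h(12.9-Girth)",
                "h(Girth-12.9)*h(Height-76)")
dirs <- matrix(c(0, 0,  1, 0,  -1, 0,  1, 1), ncol = 2, byrow = TRUE,
               dimnames = list(term_names, colnames(x)))
cuts <- matrix(c(0, 0,  12.9, 0,  12.9, 0,  12.9, 76), ncol = 2, byrow = TRUE,
               dimnames = list(term_names, colnames(x)))
bx <- basis_from_dirs_cuts(x, dirs, cuts)
print(bx)
```

The intercept has no factors, so its column stays at 1; a degree-2 term is the product of its two hinge factors.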

Note that GCVs (via GRSq) are used during the forward pass only as one of the stopping conditions and in trace prints.

The Pruning Pass

The pruning pass is handed the sets of MARS terms created by the forward pass and works like this: it determines the subset of terms (using pmethod) with the lowest RSS for each model size in 1:nprune. It saves the RSS and term numbers for each such subset in rssVec and prune.terms. It then applies the Get.crit function with ppenalty to rssVec to yield gcvVec. It chooses the model with lowest value in gcvVec, and puts its term numbers into selected.terms. Finally, it runs lm to determine the fitted.values, residuals, and coefficients, by regressing the input vector y on the selected.terms of bx.
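
The subset search described above can be sketched in base R. This is a simplified illustration, not earth's code: it uses plain backward elimination with lm.fit rather than the leaps package, the function name is hypothetical, and the basis matrix is built by hand from the knots shown in the example cuts.

```r
# Hypothetical sketch of a backward subset search over a basis matrix.
# Returns the RSS and the term numbers for the chosen model of each size.
prune_backward_sketch <- function(bx, y) {
  nterms <- ncol(bx)
  rss_of <- function(cols)
    sum(lm.fit(bx[, cols, drop = FALSE], y)$residuals^2)
  prune_terms <- matrix(0, nrow = nterms, ncol = nterms)
  rss_vec <- numeric(nterms)
  current <- seq_len(nterms)            # start from the full model
  for (size in nterms:1) {
    prune_terms[size, seq_len(size)] <- current
    rss_vec[size] <- rss_of(current)
    if (size > 1) {
      candidates <- current[-1]         # never drop the intercept (term 1)
      rss_without <- sapply(candidates,
                            function(term) rss_of(setdiff(current, term)))
      # drop the term whose removal increases the RSS the least
      current <- setdiff(current, candidates[which.min(rss_without)])
    }
  }
  list(rssVec = rss_vec, prune.terms = prune_terms)
}

bx <- with(trees, cbind(1, pmax(0, Girth - 12.9), pmax(0, 12.9 - Girth),
                        pmax(0, Height - 76)))
res <- prune_backward_sketch(bx, trees$Volume)
print(res$prune.terms)   # lower triangular, one row per model size
print(res$rssVec)
```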

By default Get.crit is earth:::get.gcv. Alternative Get.crit functions can be defined. See the source code of get.gcv for an example.
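
As a rough guide to what such a criterion computes, the sketch below evaluates a GCV from an RSS vector. It assumes the effective number of parameters is nterms + penalty * (nterms - 1) / 2, that is, one knot per pair of non-intercept terms; this is one reading of Friedman's equation 30 and may not match get.gcv exactly. The function name, its argument list, and the RSS values are illustrative only and do not reproduce the Get.crit interface.

```r
# Hypothetical criterion function (not earth's get.gcv).
gcv_sketch <- function(rss, nterms, penalty, ncases) {
  if (penalty == -1)
    return(rss / ncases)                 # special case: GCV = RSS/n
  # assumed effective number of parameters: terms plus penalty per knot
  nparams <- nterms + penalty * (nterms - 1) / 2
  rss / (ncases * (1 - nparams / ncases)^2)
}

rss_vec <- c(1000, 400, 250, 240)        # made-up RSS for model sizes 1..4
gcv_vec <- gcv_sketch(rss_vec, seq_along(rss_vec), penalty = 2, ncases = 31)
best_size <- which.min(gcv_vec)          # size of the model that would be selected
grsq <- 1 - gcv_vec[best_size] / gcv_vec[1]
print(round(gcv_vec, 2))
print(best_size)
```

With these made-up values the four-term model has the lowest RSS but not the lowest GCV, which shows how the penalty trades fit against model size.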

Testing on New Data

This example demonstrates one way to train on 80% of the data and test on the remaining 20%. (Repeated runs of the code show the high variance of R-Squared associated with a model built from a small dataset from which many parameters have to be estimated.)

# randomly choose 80% of the rows of trees for training
train.subset <- sample(1:nrow(trees), floor(.8 * nrow(trees)))
test.subset <- (1:nrow(trees))[-train.subset]
a <- earth(Volume~., data=trees[train.subset, ])
# predict on the held-out rows and compare with the observed response
yhat <- predict(a, newdata=trees[test.subset, ])
y <- trees$Volume[test.subset]
print(1 - sum((y - yhat)^2)/sum((y - mean(y))^2)) # print R-Squared

Large Models and Execution Time

For a given set of input data, the following can increase the speed of the forward pass:
i) decreasing fast.k
ii) decreasing nk
iii) decreasing degree
iv) increasing thresh
v) increasing minspan.

The backward pass is normally much faster than the forward pass, unless pmethod="exhaustive". Reducing nprune reduces exhaustive search time. One strategy is to first do a forward pass with pmethod="none" and then use update.earth to adjust pruning parameters.

For big models, earth is much faster than mda::mars.

Using fast.k

In general, with a low fast.k (say 5), earth is faster; with a high fast.k, or with fast.k disabled (set to -1), earth builds a better model. However it is not unusual to get a better model with a lower fast.k. You will need to experiment using your data.
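
One way to run such an experiment is to fit the same model with a few fast.k values and compare the grsq component. The sketch below is only a starting point: the choice of values and of degree = 2 is arbitrary, and on a dataset as small as trees the results may well be identical. It is guarded so that it does nothing if the earth package is not installed.

```r
# Compare grsq across a few fast.k settings (illustrative values only)
if (requireNamespace("earth", quietly = TRUE)) {
  fast_k_values <- c(5, 10, 20)
  grsq_by_fast_k <- sapply(fast_k_values, function(k)
    earth::earth(Volume ~ ., data = trees, degree = 2, fast.k = k)$grsq)
  names(grsq_by_fast_k) <- fast_k_values
  print(grsq_by_fast_k)
}
```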

Warning and Error Messages

Earth prints most error and warning messages without printing the ‘call’. If you are mystified by a warning message, try setting options(warn=2) and using traceback.

Author(s)

Stephen Milborrow, derived from mda::mars by Trevor Hastie and Robert Tibshirani.

This is an early release and users are encouraged to send feedback — use milboATsonicPERIODnet.

References

The primary references are the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani, and Friedman a more accessible introduction. Faraway takes a hands-on approach, using the ozone data to compare mda::mars with other techniques. (If you use Faraway's examples with earth instead of mars, use $bx instead of $x). Earth's pruning pass uses leaps which is based on techniques in Miller.

Faraway Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23

Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141

Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www-stat.stanford.edu/research/index.html

Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning http://www-stat.stanford.edu/~hastie/pub.htm

Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression

See Also

format.earth, get.nterms.per.degree, get.nused.preds.per.subset, mars.to.earth, model.matrix.earth, ozone1, plot.earth.models, plot.earth, plotmo, predict.earth, reorder.earth, summary.earth, update.earth

Examples

a <- earth(Volume ~ ., data = trees)
summary(a, digits = 2)

# yields:
#    Call:
#    earth(formula = Volume ~ ., data = trees)
#    
#    Expression:
#      23 
#      +  5.7 * pmax(0,  Girth -     13) 
#      -  2.9 * pmax(0,     13 -  Girth) 
#      + 0.72 * pmax(0, Height -     76) 
#    
#    Number of cases: 31
#    Selected 4 of 5 terms, and 2 of 2 predictors
#    Number of terms at each degree of interaction: 1 3 (additive model)
#    GCV: 11     RSS: 213     GRSq: 0.96     RSq: 0.97 
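
The printed expression can also be evaluated by hand, which may help in reading it. The sketch below hard-codes the rounded coefficients from the summary above, so its results only approximate those of predict; the function name is illustrative.

```r
# Hypothetical helper: evaluate the summary expression by hand,
# using the coefficients as rounded by summary(a, digits = 2)
volume_by_hand <- function(Girth, Height) {
  23 +
    5.7  * pmax(0, Girth - 13) -
    2.9  * pmax(0, 13 - Girth) +
    0.72 * pmax(0, Height - 76)
}

volume_by_hand(Girth = 14, Height = 80)   # 23 + 5.7 + 0.72 * 4, about 31.58
```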

[Package earth version 0.1-3 Index]