earth {earth} | R Documentation |
Build a regression model using the techniques in Friedman's papers ‘Multivariate Adaptive Regression Splines’ and ‘Fast MARS’.
## S3 method for class 'formula': earth(formula, data, ...) ## Default S3 method: earth(x = stop("no 'x' arg"), y = stop("no 'y' arg"), weights = NULL, subset = NULL, na.action = na.fail, penalty = if(degree > 1) 3 else 2, trace = 0, keepxy = FALSE, degree = 1, nk = max(21, 2 * NCOL(x) + 1), thresh = 0.001, minspan = 1, newvar.penalty = 0, fast.k = 20, fast.beta = 1, fast.h = NULL, pmethod = "backward", ppenalty = penalty, nprune = NULL, Object = NULL, Get.crit = get.gcv, Eval.model.subsets = eval.model.subsets, Print.pruning.pass = print.pruning.pass, Force.xtx.prune = FALSE, ...)
All arguments are optional except formula
, or x
and y
.
NAs are not allowed.
To start off, look at the arguments
formula
,
x
,
y
,
degree
,
nk
, and
nprune
.
formula |
Model formula. |
data |
Data frame for formula .
|
x |
Matrix containing the independent variables. |
y |
Vector containing the response variable,
or in the case of multiple responses,
a matrix whose columns are the response values for each variable.
If the y values are very big or very small then you may get
better results if you scale y first for
better numerical stability in the forward pass.
|
weights |
Weight vector (not yet supported). |
subset |
Index vector specifying which cases to use i.e. which rows in x to use.
Default is NULL, meaning all.
|
na.action |
NA action. Default is na.fail , and only na.fail is supported.
|
penalty |
Generalized Cross Validation (GCV) penalty per knot.
Default is if(degree>1) 3 else 2 .
A value of 0 penalises only terms, not knots.
The value -1 is a special case, meaning no penalty, so GCV=RSS/n.
Theory suggests values in the range of about 2 to 3.
In practice for big data sets, larger values can be useful to force a smaller model.
See also ppenalty .
|
trace |
Trace earth's execution. Default is 0. Values:
0 none 1 overview 2 forward 3 pruning 4 more pruning 5 ... |
keepxy |
Set to TRUE to retain x , y , and subset in the returned value.
Default is FALSEThe following arguments are for the forward pass |
degree |
Maximum degree of interaction (Friedman's mi). Default is 1, meaning build an additive model. |
nk |
Maximum number of model terms before pruning.
Includes the intercept.
Default is max(21,2*NCOL(x)+1) .
The number of terms created by the forward pass will be
less than nk if there are linearly dependent terms
which must be discarded,
or if a forward stopping condition is reached before nk terms.
See the section below on the forward pass.
|
thresh |
Forward stepping threshold. Default is 0.001. This is one of the arguments used to decide when forward stepping should terminate. See the section below on the forward pass. |
minspan |
Minimum distance between knots. The default value of 1 means consider all knots. The special value of 0 means calculate the minspan internally as per Friedman's MARS paper section 3.8 with alpha = 0.05. Intended to increase resistance to runs of noise in the input data. Set trace>=2 to see the calculated value.Note: the default minspan was 0 prior to earth version 1.1 (July 2007). |
newvar.penalty |
Penalty for adding a new variable in the forward pass (Friedman's gamma, equation 74 in the MARS paper). This argument can mitigate the effects of colinearity or concurvity in the input data. Default is 0, meaning no penalty for adding a new variable. Useful non-zero values range from about 0.01 to 0.2 — you will need to experiment. |
fast.k |
Maximum number of considered parent terms, as
as described in Friedman's Fast MARS paper section 3.0.
Default is 20.
The special value -1 is equivalent to infinity, meaning no Fast MARS.
Typical values, apart from -1, are 20, 10, or 5.
In general, with a lower fast.k (say 5), earth is faster;
with a higher fast.k , or with fast.k disabled (set to -1), earth builds
a better model.
However it is not unusual to get a slightly better model with a lower fast.k ,
and you may need to experiment.
|
fast.beta |
Fast MARS ageing coefficient, as described in the Fast MARS paper section 3.1. Default is 1. A value of 0 sometimes gives better results. |
fast.h |
Fast MARS h, as described in the Fast MARS paper section 4.0.
(not yet implemented). The following arguments are for the pruning pass. One strategy is to build a large model and then trim it back using update.earth with the arguments below.
|
pmethod |
Pruning method.
Default is "backward" .
One of: backward none exhaustive forward seqrep .
If y has multiple columns, then only backward or none
is allowed.
Pruning can take a while if exhaustive is chosen and
the model is big (more than about 30 terms).
The current version of the leaps package
used during pruning does not allow user interrupts
(i.e. you have to kill your R session to interrupt).
|
ppenalty |
Like penalty but for the pruning pass.
Default is penalty .
|
nprune |
Maximum number of terms (including intercept) in the pruned model.
Default is NULL, meaning all terms.
Use this to reduce exhaustive search time, or to enforce a maximum model size. The following arguments are for internal or advanced use |
Object |
Earth object to be updated, for use by update.earth .
|
Get.crit |
Criterion function for model selection during pruning. By default a function that returns the GCV. See the section below on the pruning pass. |
Eval.model.subsets |
Function to evaluate model subsets — see notes in source code. |
Print.pruning.pass |
Function to print pruning pass results — see notes in source code. |
Force.xtx.prune |
Default is FALSE.
This argument pertains to subset evaluation in the pruning pass.
By default,
if y has a single column then earth calls the leaps routines;
if y has multiple columns then earth calls EvalSubsetsUsingXtx .
The leaps routines are based on the QR decomposition;
the EvalSubsetsUsingXtx routine is based on the inverse of X'X.
The leaps routines are more accurate but do not support multiple responses.
Setting Force.xtx.prune=TRUE forces use of EvalSubsetsUsingXtx , even
if y has a single column.
|
... |
earth.formula : arguments passed to earth.default .
earth.default : unused, but provided for generic/method consistency.
|
An object of class ‘earth’ which is a list with the components
listed below.
Term refers to a term created during the forward pass
(each line of the output from format.earth
is a term).
Term number 1 is always the intercept.
bx |
Matrix of basis functions applied to x .
Each column corresponds to a selected term.
Each row corresponds to a row in in the input matrix x ,
after taking subset .
See model.matrix.earth for an example of bx handling.
Example bx :(Intercept) h(Girth-12.9) h(12.9-Girth) h(Girth-12.9)*h(... [1,] 1 0.0 4.6 0 [2,] 1 0.0 4.3 0 [3,] 1 0.0 4.1 0 ... |
dirs |
Matrix with ij-th element equal to 1 if term
i has a factor of the form x_j > c, equal to -1 if
term i has a factor of the form x_j <= c, and to 0 if
x_j is not in term i.
This matrix includes all terms generated by the forward.pass,
including those not in selected.terms .
Note that the terms may not be in pairs, because the forward pass
deletes linearly dependent terms before handing control to the pruning pass.
Example dirs :Girth Height (Intercept) 0 0 #no factors in intercept h(Girth-12.9) 1 0 #2nd term uses Girth h(12.9-Girth) -1 0 #3rd term uses Girth h(Girth-12.9)*h(Height-76) 1 1 #4th term uses Girth and Height ... |
cuts |
Matrix with ij-th element equal to the cut point
for variable j in term i.
This matrix includes all terms generated by the forward.pass,
including those not in selected.terms .
Note that the terms may not be in pairs, because the forward pass
deletes linearly dependent terms before handing control to the pruning pass.
Example cuts :Girth Height (Intercept) 0.0 0 #intercept, no cuts h(Girth-12.9) 12.9 0 #2nd term has cut at 12.9 h(12.9-Girth) 12.9 0 #3rd term has cut at 12.9 h(Girth-12.9)*h(Height-76) 12.9 76 #4th term has two cuts ... |
selected.terms |
Vector of term numbers in the best model.
Can be used as a row index vector into cuts and dirs .
The first element selected.terms[1] is always 1, the intercept.
|
prune.terms |
A matrix specifying which terms appear in which subsets.
The row index of prune.terms is the model size
(the model size is the number of terms in the model).
Each row is a vector of term numbers for the best model of that size.
An element is 0 if the term is not in the model, thus prune.terms is a
lower triangular matrix, with dimensions nprune x nprune .
The model selected by the pruning pass
is at row length(selected.terms) .
Example prune.terms :[1,] 1 0 0 0 0 0 0 #intercept-only model [2,] 1 2 0 0 0 0 0 #best 2 term model uses terms 1,2. [3,] 1 2 4 0 0 0 0 #best 3 term model uses terms 1,2,4 [4,] 1 2 9 8 0 0 0 ... |
rss |
Residual sum-of-squares (RSS) of the model
(summed over all responses if y has multiple columns).
|
rsq |
1-rss/rss.null .
R-Squared of the model
(calculated over all responses if y has multiple columns).
A measure of how well the model fits the training data.
|
gcv |
Generalised Cross Validation (GCV) of the model
(summed over all responses if y has multiple columns).
The GCV is calculated using ppenalty (as are all returned GCVs).
For details of the GCV calculation, see
equation 30 in Friedman's MARS paper and earth:::get.gcv .
|
grsq |
1-gcv/gcv.null .
An estimate of the predictive power of the model
(calculated over all responses if y has multiple columns).
Unlike rsq , grsq can be negative.
A negative grsq would indicate
a severely over parameterised model — a model that
would not generalise well
even though it may be a good fit to the training data.
Example of a negative grsq :earth(mpg~., data=mtcars, pmethod="none", trace=4)
|
rss.per.response |
A vector of the RSS for each response.
Length is ncol(y) .
The rss component above is equal to sum(rss.per.response) .
|
rsq.per.response |
A vector of the R-Squared for each response.
Length is ncol(y) .
|
gcv.per.response |
A vector of the GCV for each response.
Length is ncol(y) .
The gcv component above is equal to sum(gcv.per.response) .
|
grsq.per.response |
A vector of the GRSq for each response. Length is ncol(y) .
|
rss.per.subset |
A vector of the RSS for each model subset generated by the pruning pass.
Length is nprune .
If y has multiple columns, the RSS is summed over all responses for each subset.
The null RSS (i.e. the RSS of an intercept only-model) is rss.per.subset[1] .
The rss above isrss.per.subset[length(selected.terms)] .
|
gcv.per.subset |
A vector of the GCV for each model in prune.terms .
Length is is nprune .
If y has multiple columns, the GCV is summed over all responses for each subset.
The null GCV (i.e. the GCV of an intercept-only model) is gcv.per.subset[1] .
The gcv above is gcv.per.subset[length(selected.terms)] .
|
fitted.values |
Fitted values.
A matrix with dimensions nrow(y) x ncol(y) .
|
residuals |
Residuals.
A matrix with dimensions nrow(y) x ncol(y) .
|
coefficients |
Regression coefficients.
A matrix with dimensions length(selected.terms) x ncol(y) .
Each column holds the least squares coefficients from regressing that
column of y on bx .
The first row holds the intercept coefficients.
|
ppenalty |
The GCV penalty used during pruning.
A copy of earth's ppenalty argument.
|
call |
The call used to invoke earth .
|
terms |
Model frame terms.
This component exists only if the model was built using earth.formula .
|
x |
|
y |
|
subset |
Copy of input arguments x , y , and subset .
These components exist only if keepxy=TRUE .
|
Standard Model Functions
Standard model functions such as case.names
are provided for earth
objects and are not explicitly documented.
Other Implementations
The results are similar to but not identical to other Multivariate Adaptive Regression Splines implementations. The differences stem from the forward pass where very small implementation differences (or perturbations of the input data) can cause rather different selection of terms and knots. The backward passes give identical or near identical results, given the same forward pass results.
The source code of earth
is derived from mars
in the mda
package written by
by Trevor Hastie and Robert Tibshirani.
See also mars.to.earth
.
Earth's principal advantages over mda::mars are that it is
much faster and provides plotting and printing methods.
The term ‘MARS’ is trademarked and licensed exclusively to
Salford Systems http://www.salfordsystems.com.
Their implementation uses an engine written by Friedman and
offers more features than earth
.
Limitations
The following aspects of MARS
are mentioned in Friedman's papers but not implemented in earth
:
i) Piecewise cubic models
ii) Specifying which predictors must enter linearly
iii) Specifying which predictors can interact
iv) Model slicing (plotmo
goes part way)
v) Handling missing values
vi) Logistic regression
vii) Special handling of categorical predictors
viii) Fast MARS h parameter
ix) Cross validation to determine penalty
x) Anova tables with sigma and other information.
Large Models and Execution Time
For a given set of input data, the following can increase the speed of the forward pass:
i) decreasing fast.k
ii) decreasing nk
iii) decreasing degree
iv) increasing threshold
v) increasing min.span
.
The backward pass is normally much faster than the forward pass,
unless pmethod="exhaustive"
.
Reducing nprune
reduces exhaustive search time.
One strategy is to first build a large model
and then adjust pruning parameters such as nprune
using update.earth
.
The Forward Pass
The forward pass adds terms in pairs until the first of the following conditions is met:
i) reach maximum number of terms (nterms>=nk)
ii) reach DeltaRSq threshold (DeltaRSq<thresh)
where
DeltaRSq is the difference in R-Squared caused by adding the current term pair
iii) reach max RSq (RSq>1-thresh)
iv) reach min GRSq (GRSq< -10)
.
Set trace>=1
to see the stopping condition.
The result of the forward pass is the set of terms defined by $dirs
and $cuts
.
Note that GCVs (via GRSq) are used during the forward pass only as one of the
stopping conditions and in trace prints.
Changing the GCV $penalty
does not change the knot positions.
The various stopping conditions mean that the actual number of terms
created by the forward pass may be less than nk. There are some more
reasons why the actual number of terms may be less than nk:
(i) the forward pass chucks out one "side" of the pair
if it adds nothing to the model —
but the forward pass counts terms as if they were actually created in pairs.
(ii) as a final step, the forward pass deletes linearly dependent terms, if any,
so all terms in $dirs
and $cuts
are independent.
And remember that the pruning pass will further discard terms.
The Pruning Pass
The pruning pass is handed the sets of terms created by the forward pass.
Its job is to find the subset of these terms that gives the lowest GCV.
The pruning pass works like this:
it determines the subset of terms (using pmethod
)
with the lowest RSS for each model size in 1:nprune
(see Force.xtx.prune
above for some details).
It saves the RSS and term numbers for each such subset in rss.per.subset
and prune.terms
.
It then applies the Get.crit
function with ppenalty
to rss.per.subset
to yield gcv.per.subset
.
It chooses the model with the lowest value in gcv.per.subset
,
and puts its term numbers into selected.terms
.
Finally, it runs lm
to determine the
fitted.values
, residuals
, and coefficients
, by regressing
the response y
on the selected.terms
of bx
.
Set trace>=3
to trace the pruning pass.
By default Get.crit
is earth:::get.gcv
.
Alternative Get.crit
functions can be defined.
See the source code of get.gcv
for an example.
Testing on New Data
This example demonstrates one way to train on 80% of the data and test on the remaining 20%. In practice a dataset larger than the one below should be used for splitting. Also, remember that the test set should not be used for parameter tuning — use GCVs or separate validation sets for that.
train.subset <- sample(1:nrow(trees), .8 * nrow(trees)) test.subset <- (1:nrow(trees))[-train.subset] a <- earth(Volume ~ ., data = trees[train.subset, ]) yhat <- predict(a, newdata = trees[test.subset, ]) y <- trees$Volume[test.subset] print(1 - sum((y - yhat)^2) / sum((y - mean(y))^2)) # print R-SquaredHow do I measure variable importance?
Establishing predictor importance is in general a tricky and even controversial problem.
Running plotmo
with ylim=NULL
(the default)
gives an idea of which predictors make the largest changes to the predicted value.
You can also use drop1
(assuming you are using the formula interface to earth).
Calling drop1(my.earth.model)
will delete each predictor in turn from your model,
rebuild the model from scratch each time, and calculate the GCV each time.
You will get warnings that the earth library function extractAIC.earth
is
returning GCVs instead of AICs — but that is what you want so you can
ignore the warnings.
The column labeled AIC
in the printed response
from drop1
will actually be a column of GCVs not AICs.
The Df
column is not much use in this context.
You will get lots of output from drop1
if you built your original earth
model with trace>0
.
You can set trace=0
by updating your model before calling drop1
.
Do it like this:
my.model <- update.earth(my.model, trace=0)
.
Remember that these techniques only tell you how important a variable is with the other variables already in the model. There are alternative ways of measuring variable importance (using resampling) but they are not yet implemented.
How do I know which predictors were added to the model first?
You can see the forward pass adding terms with trace=2
or higher.
But remember, pruning will remove some of the terms.
Another approach is to use
summary(my.model, decomp="none")
which will list the basis functions remaining after pruning, in the order they were added by the forward pass.
Can I get a list of the predictors actually used in the model?
Use this function:
get.used.pred.names <- function(obj) # obj is an earth object names(which(apply(obj$dirs[obj$selected.terms,,drop=FALSE],2,any)))Multiple Response Models
If y
has K columns then earth builds K simultaneous models.
Each model has the same set of basis functions (i.e. same bx
and selected.terms
)
but different coefficients (the returned coefficients
will have K columns).
The models are built and pruned as usual but with the GCVs
and RSSs averaged across all K responses.
Since earth attempts to optimize for all models simultaneously, the results will not be as "good" as building the models independently. i.e. the GCV of the combined model will not be as good as the GCVS for independent models, on the whole. However, the combined model may be a better model in other senses, depending on what you are trying to achieve.
For more details see section 4.1 of Hastie, Tibshirani, and Buja Flexible Discriminant Analysis by Optimal Scoring, JASA, December 1994 http://www-stat.stanford.edu/~hastie/Papers/fda.pdf.
Using earth with fda and mda
Earth
can be used with fda
and mda
in the mda
package. Earth will be a multiple response model,
as described above.
Use keep.fitted=TRUE
if you want to call plot.earth
later
(actually only necessary for large datasets, see the description of
keep.fitted
in fda
).
Use keepxy=TRUE
if you want to call update
or plotmo
later.
Use trace>=3
to see the call to earth
generated by fda
or mda
.
Example:
library(mda) (a <- fda(Species ~ ., data=iris, keep.fitted=TRUE, method=earth, keepxy=TRUE)) plot(a) summary(a$fit) # examine earth model embedded in fda model plot(a$fit) plotmo(a$fit, ycolumn=1, ylim=c(-1.5,1.5), clip=FALSE) plotmo(a$fit, ycolumn=2, ylim=c(-1.5,1.5), clip=FALSE)Warning and Error Messages
Earth
prints most error and warning messages without
printing the ‘call’.
If you are mystified by a warning message, try
setting options(warn=2)
and using traceback
.
Stephen Milborrow, derived from mda::mars
by Trevor Hastie and Robert Tibshirani.
Users are encouraged to send feedback — use milboATsonicPERIODnet http://www.milbo.users.sonic.net.
The primary references are
the Friedman papers. Readers may find the MARS section in Hastie, Tibshirani,
and Friedman a more accessible introduction. Faraway takes a hands-on approach,
using the ozone
data to compare mda::mars
with other techniques.
(If you use Faraway's examples with earth
instead of mars
, use $bx
instead of $x
).
Earth's pruning pass uses leaps
which is based on
techniques in Miller.
Faraway Extending the Linear Model with R http://www.maths.bath.ac.uk/~jjf23
Friedman (1991) Multivariate Adaptive Regression Splines (with discussion) Annals of Statistics 19/1, 1–141
Friedman (1993) Fast MARS Stanford University Department of Statistics, Technical Report 110 http://www-stat.stanford.edu/research/index.html
Hastie, Tibshirani, and Friedman (2001) The Elements of Statistical Learning http://www-stat.stanford.edu/~hastie/pub.htm
Miller, Alan (1990, 2nd ed. 2002) Subset Selection in Regression
format.earth
,
get.nterms.per.degree
,
get.nused.preds.per.subset
,
mars.to.earth
,
model.matrix.earth
,
ozone1
,
plot.earth.models
,
plot.earth
,
plotmo
,
predict.earth
,
reorder.earth
,
summary.earth
,
update.earth
a <- earth(Volume ~ ., data = trees) summary(a, digits = 2) # yields: # Call: # earth(formula = Volume ~ ., data = trees) # # Expression: # 27 # + 6 * pmax(0, Girth - 14) # - 3.2 * pmax(0, 14 - Girth) # + 0.61 * pmax(0, Height - 75) # # Number of cases: 31 # Selected 4 of 5 terms, and 2 of 2 predictors # Number of terms at each degree of interaction: 1 3 (additive model) # GCV: 11 RSS: 196 GRSq: 0.96 RSq: 0.98