evimp {earth} | R Documentation |
Estimate variable importances in an earth object
evimp(obj, trim=TRUE, sqrt.=FALSE)
obj
: An earth object.
trim
: If TRUE (default), delete rows in the returned matrix for variables that don't appear in any subsets.
sqrt.
: Default is FALSE.
If TRUE, take the sqrt of the GCV and RSS importances before
normalizing to 0 to 100.
This arguably gives a better indication of relative importances
because the raw importances are calculated using a sum of squares.
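As a rough illustration of what sqrt. changes, the sketch below normalizes a set of made-up raw importances with and without taking square roots. The numbers and the variable names x1 to x3 are assumptions for illustration, not output from a real model, and scale100 is a hypothetical helper, not part of earth.

```r
# Hypothetical raw importances (summed decreases in RSS); made-up numbers.
raw <- c(x1 = 80, x2 = 20, x3 = 5)

# Hypothetical helper: normalize so that the largest value is 100.
scale100 <- function(v) 100 * v / max(v)

plain  <- scale100(raw)        # analogous to sqrt.=FALSE
rooted <- scale100(sqrt(raw))  # analogous to sqrt.=TRUE

print(rbind(plain, rooted))
```

The ranking is the same in both rows, but the square-rooted values are less compressed toward zero for the weaker variables.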
A matrix showing the relative importances of the variables in the model.
There is a row for each variable.
The row name is the variable name, but with -unused
appended
if the variable does not appear in the final model.
See also the example later.
The columns of the matrix are:
col
: column index of the variable in the x argument to earth.
used
: 1 if the variable is used in the final model, else 0.
Equivalently, 0 if the row name has a -unused
suffix.
nsubsets
: variable importance using the "number of subsets" criterion.
This is the number of subsets that include the variable (see "Three Criteria" below).
gcv
: variable importance using the GCV criterion (see below).
rss
: variable importance using the RSS criterion (see below).
The rows are sorted on the nsubsets
criterion.
This means that values in the nsubsets
column decrease as you go down the column
(more accurately, they are non-increasing).
The values in the gcv
and rss
columns
are also non-increasing, except where the
gcv
or rss
ranking differs from the nsubsets
ranking.
To make the columns easier to scan by eye, there are unnamed columns (not listed above)
after the gcv
column and the rss
column.
These have a 0 where the ranking using the gcv
or rss
criteria differs from
that using the nsubsets
criterion.
In other words, there is a 0 for values that increase as you go
down the gcv
or rss
column.
Introduction to variable importance
What exactly is variable importance?
A working definition is that a variable's importance is a measure
of the effect that observed changes to the variable have on the observed response.
It is this measure of importance that evimp
tries to estimate.
Variable importance in the equation that MARS derives from the data
is not quite the same thing.
For example, if two variables are highly correlated,
MARS will usually drop one when building the model.
Both variables have the same importance in the data but
not in the MARS equation (one variable does not even appear in the equation).
A section below has a few words on how to use plotmo
to estimate variable importance in the MARS equation.
You might say that you can measure a variable's importance by changing the variable's value and measuring how the response changes. However, except in special situations, there are problems with this because:
(i) it assumes we can change the variable, which is usually not the case.
For example, in the trees
data,
we cannot simply generate a new tree of arbitrary height.
(ii) it assumes that changes to a variable occur in isolation.
In practice, a variable is usually tied to other variables,
and a change to the variable would never occur without simultaneous
changes to other variables.
For example, in the trees
data, a change to the height
is associated with a change in the girth.
[Note: this section was written in response to several emails about evimp.
Your comments would be appreciated.]
Estimating variable importance
Establishing predictor importance is in general a tricky and even controversial problem.
There is no completely reliable way to estimate the importance of the variables
in a standard MARS model,
unless you make further lengthy tests after the model is built
(lengthy tests such as leave-one-out techniques,
see the section below on building many models).
The evimp
function just makes an educated (and in practice useful)
guess as described below.
Three criteria for estimating variable importance
The evimp function uses three criteria for estimating variable importance.
1. The nsubsets
criterion counts the number of model subsets that include the variable.
Variables that are included in more subsets are considered more important.
By "subsets" we mean the subsets of terms generated by the pruning pass.
There is one subset for each model size,
and the subset is the best set of terms for that model size.
(These subsets are specified in $prune.terms
in earth's return value.)
Only subsets that are smaller than or equal in size to the final model are used
for estimating variable importance.
2. The rss
criterion first calculates the decrease in the RSS
for each subset relative to the previous subset.
(For multiple response models, RSS's are calculated over all responses.)
Then for each variable it sums these decreases over all subsets that include the variable.
Finally it scales the summed decreases so the maximum summed decrease is 100.
Variables which cause larger net decreases in the RSS are considered more important.
3. The gcv
criterion is the same, but uses the GCV instead of the RSS.
Adding a variable can increase the GCV,
i.e., adding the variable has a deleterious effect on the model.
When this happens, the variable could even have a negative total importance,
and thus appear less important than unused variables.
Note that using RSq's and GRSq's instead of RSS's and GCV's
would give identical estimates of variable importance.
(RSq and GRSq are defined in the Value section of the earth
help page.)
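The nsubsets and rss criteria described above can be sketched in base R on toy data. This is an illustration of the description, not the evimp source code. The objects dirs, prune.terms and rss.per.subset are made-up stand-ins shaped like pieces of an earth fit, and final.size assumes that the final model keeps all four terms.

```r
# Toy stand-ins (made-up numbers, not real earth output).
# dirs: one row per term, one column per variable; nonzero if the term uses it.
dirs <- rbind(c(0, 0, 0),   # term 1: intercept
              c(1, 0, 0),   # term 2: x1
              c(0, 1, 0),   # term 3: x2
              c(1, 0, 1))   # term 4: x1 and x3 (interaction)
colnames(dirs) <- c("x1", "x2", "x3")

# prune.terms: row i lists the terms in the best i-term subset (0 = padding).
prune.terms <- rbind(c(1, 0, 0, 0),
                     c(1, 2, 0, 0),
                     c(1, 2, 3, 0),
                     c(1, 2, 3, 4))
rss.per.subset <- c(100, 40, 25, 20)  # RSS of each subset
final.size <- 4                       # assumed size of the final model

nsubsets <- rss.sum <- setNames(numeric(ncol(dirs)), colnames(dirs))
for (size in 2:final.size) {
    terms <- prune.terms[size, ]
    terms <- terms[terms > 1]         # drop padding zeros and the intercept
    used <- colSums(dirs[terms, , drop = FALSE] != 0) > 0
    decrease <- rss.per.subset[size - 1] - rss.per.subset[size]
    nsubsets[used] <- nsubsets[used] + 1          # criterion 1
    rss.sum[used] <- rss.sum[used] + decrease     # criterion 2, unscaled
}
rss.importance <- 100 * rss.sum / max(rss.sum)    # scale so the maximum is 100

print(rbind(nsubsets, rss.importance))
```

Note how x3 gets credit for the whole decrease brought by the interaction term, and x1 is credited for it as well. The gcv criterion would follow the same loop with per-subset GCVs in place of rss.per.subset.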
Example
    a <- earth(O3 ~ ., data=ozone1, degree=2)
    evimp(a, trim=FALSE)

Yields the following matrix:

              col used nsubsets    gcv      rss
    temp        4    1       10 100.00 1 100.00 1
    humidity    3    1        8  12.68 1  14.78 1
    ibt         7    1        8  12.68 1  14.78 1
    doy         9    1        7  11.26 1  12.93 1
    dpg         6    1        5   6.75 1   7.84 1
    ibh         5    1        4   9.58 0  10.46 0
    vis         8    1        4   4.38 1   5.30 1
    wind        2    1        1   0.74 1   0.98 1
    vh-unused   1    0        0   0.00 1   0.00 1

The rows are sorted on nsubsets.
We see that temp is considered the most important variable,
followed by humidity, and so on.
We see that vh is unused in the final model,
and thus is given a -unused suffix and a 0 in the used column.
The col column gives the column indices of the variables
in the x argument to earth
(after factors, if any, have been expanded; none in this example).
The nsubsets
column is the number of subsets that included the corresponding variable.
For example, temp
appears in 10 subsets and humidity
in 8.
The gcv
and rss
columns are scaled so
the largest net decrease is 100.
The unnamed columns after the gcv
and rss
columns have a 0 if the corresponding criterion increases instead of decreasing
(i.e., the ranking disagrees with the nsubsets
ranking).
We see that ibh
is considered less important than dpg
using the nsubsets
criterion, but not with the gcv
and rss
criteria.
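The 0 flag for ibh can be reproduced from the gcv column of the matrix above with a short base R sketch. This only mimics the description of the unnamed columns (a 0 where a value increases going down the column). It is not necessarily how evimp computes them internally.

```r
# gcv column copied from the example matrix above, in nsubsets order.
gcv <- c(temp = 100, humidity = 12.68, ibt = 12.68, doy = 11.26, dpg = 6.75,
         ibh = 9.58, vis = 4.38, wind = 0.74, "vh-unused" = 0)

# 1 where the value does not increase relative to the row above, else 0.
# The first row has no predecessor, so it is flagged 1.
flag <- c(1L, as.integer(diff(gcv) <= 0))
names(flag) <- names(gcv)

print(cbind(gcv, flag))
```

Only ibh gets a 0, because its gcv value (9.58) is larger than that of dpg (6.75) in the row above it.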
Estimating variable importance in the MARS equation
Running plotmo
with ylim=NULL
(the default)
gives an idea of which predictors in the MARS equation
make the largest changes to the predicted value
(but only with all other predictors at their median values).
Note that there is only a loose relationship between variable importance in the MARS equation and variable importance in the data — see the Introduction section above.
Using drop1 to estimate variable importance
As an alternative to evimp, you can use drop1
(assuming you are using the formula interface to earth).
Calling drop1(my.earth.model)
will delete each predictor in turn from your model,
rebuild the model from scratch each time, and calculate the GCV each time.
You will get warnings that the earth library function extractAIC.earth
is
returning GCVs instead of AICs — but that is what you want so you can
ignore the warnings.
(You can turn off just these warnings by passing warn=FALSE to drop1.)
The column labeled AIC
in the printed response
from drop1
will actually be a column of GCVs not AICs.
The Df
column is not much use in this context.
Remember that this technique only tells you how important a variable is with the other variables already in the model. It does not tell you the effect of a variable in isolation.
You will get lots of output from drop1 if you built your original earth
model with trace>0.
You can set trace=0 by updating your model before calling drop1.
Do it like this:
    my.model <- update.earth(my.model, trace=0)
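Putting the steps of this section together, a minimal sketch might look as follows. It assumes the earth package is installed and uses the built-in trees data purely for illustration. The name fit is hypothetical.

```r
library(earth)

# Build the model with the formula interface, as drop1 requires.
fit <- earth(Volume ~ ., data = trees)

# Make sure tracing is off so that drop1 does not produce lots of output.
fit <- update(fit, trace = 0)

# Delete each predictor in turn and rebuild the model each time.
# warn=FALSE suppresses the warning about GCVs being returned instead of AICs.
d <- drop1(fit, warn = FALSE)
print(d)   # the column labeled AIC actually holds GCVs
```

A larger value in the AIC column for a row means that the GCV got worse when that predictor was dropped, given the other predictors in the model.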
Estimating variable importance by building many models
The variance of the variable importances estimated from an
earth model can be high (meaning that the estimates of variable importance
in a model built with a different realization of the data would be
different).
This variance can be averaged out by building a bagged
earth model and measuring variable importances in that (by
taking the mean of the variable importances in the many earth models
that make up the bagged model). You can do this easily using
the functions bagEarth
and
varImp
in the caret
package.
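A minimal sketch of the bagging approach, assuming the caret and earth packages are installed. The number of bootstrap models (B = 20) and the seed are arbitrary choices for illustration, and the exact layout of the varImp output may depend on the caret version.

```r
library(earth)
library(caret)

data(ozone1)
set.seed(1)   # bagging resamples the data, so fix the seed for repeatability

# Build a bagged earth model from 20 bootstrap samples.
bag <- bagEarth(O3 ~ ., data = ozone1, B = 20, degree = 2)

# Variable importances aggregated over the models in the bag.
vi <- varImp(bag)
print(vi)
```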
Measuring variable importance using Random Forests is another way to go,
independently of earth.
See the functions randomForest
and
importance
in the
randomForest
package.
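For comparison, a minimal Random Forest sketch, assuming the randomForest package is installed. It uses the ozone1 data from earth only so that the results can be set beside the evimp example. The name rf and the seed are arbitrary.

```r
library(earth)          # only for the ozone1 data
library(randomForest)

data(ozone1)
set.seed(1)             # random forests are randomized, so fix the seed

# importance=TRUE asks for the permutation-based importance measure as well.
rf <- randomForest(O3 ~ ., data = ozone1, importance = TRUE)

imp <- importance(rf)   # one row per predictor
print(imp)
```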
Remarks
This function is useful in practice but the following issues can make it misleading.
MARS models have a high variance — if the data changes a little, the set of basis terms created by the forward pass can change a lot. So estimates of predictor importance can be unreliable because they can vary with even slightly different training data.
Collinear (or otherwise related) variables can mask each other's importance, just as in linear models. This means that if two predictors are closely related, the forward pass will somewhat arbitrarily choose one over the other. The chosen predictor will incorrectly appear more important.
For interaction terms, each variable gets credit for the entire term — thus interaction terms are counted more than once and get a total higher weighting than additive terms (questionably). Each variable gets equal credit in interaction terms even though one variable in that term may be far more important than the other.
For factor predictors, importances are estimated on a per-level basis.
The evimp function should probably aggregate these over all levels.
An example of conflicting importances
(however, the results are fine with the default pmethod):
evimp(earth(mpg~., data=mtcars, pmethod="none"))
Acknowledgment
Thanks to Max Kuhn for the original evimp
code and for helpful discussions.
data(ozone1)
a <- earth(O3 ~ ., data=ozone1, degree=2)
ev <- evimp(a, trim=FALSE, sqrt.=TRUE)
plot(ev)
print(ev)