plot.earth {earth} | R Documentation |
Plot an earth object.
The plot shows model selection, cumulative distribution
of the residuals, residuals versus fitted values, and the residual QQ plot.
## S3 method for class 'earth':
plot(x = stop("no 'x' arg"), which = 1:4, ycolumn = 1,
     caption = if(do.par) NULL else "",
     col.rsq = "lightblue", col.loess = col.rsq, col.qq = col.rsq,
     col.grid = "grey", col.vline = "grey", lty.vline = 3,
     col.legend = 1, col.npreds = 1,
     nresiduals = 1000, cum.grid = "percentages",
     rlim = c(-1,-1), jitter = 0, id.n = 3,
     labels.id = rownames(residuals(x, warn=FALSE)),
     legend.pos = NULL, do.par = TRUE, main = NULL, pch = 1, ...)
x |
An earth object.
This is the only required argument.
(This argument is called "x" for consistency with the generic plot.)
|
which |
Which plots to plot. Default is 1:4, meaning all.
1) model selection (GRSq plot)
2) cumulative distribution of absolute values of residuals
3) residuals versus fitted values
4) QQ plot of residuals |
ycolumn |
Specify which column of the response to plot if the model has multiple responses.
Default is 1.
This argument does not affect the model selection plot, which is always across all responses.
TODO: there is an issue in the handling of ycolumn for multiple-level factor responses:
does ycolumn refer to the column in the observed or the predicted response?
|
caption |
Overall caption. The default value is if(do.par) NULL else "". Values are:
"string"  string
""  no caption
NULL  generate a caption from the $call component of the earth object.
|
|
For all the col arguments, 0 means don't plot the corresponding graph element. |
col.rsq |
Color of RSq line in model selection plot.
Default is "lightblue".
|
col.loess |
Color of loess line in residuals plot.
Default is col.rsq.
Generating the loess line occasionally causes warnings such as
"Warning: pseudoinverse used".
To get rid of these warnings, set col.loess=0.
|
col.qq |
Color of QQ line.
Default is col.rsq.
|
col.grid |
Color of grid lines in cumulative distribution plot. Default is "grey".
|
col.vline |
Color of vertical line at best model in model plot. Default is "grey".
|
lty.vline |
Line type of vertical line at best model in model plot. Default is 3.
|
col.legend |
Color of legend (inside plot area) of model plot.
Default is 1, meaning draw a legend.
Use 0 for no legend.
|
col.npreds |
Color of the "number of predictors" plot within the model plot.
Default is 1.
Use 0 for no "number of predictors" plot.
|
nresiduals |
Maximum number of residuals to plot.
Use -1 for all.
Default is 1000 (not all to reduce over-plotting).
A systematic sample of size nresiduals is taken but
the largest few residuals are always included.
|
cum.grid |
Specify grid on cumulative distribution graph.
Values are:
"none"  no grid on cumulative distribution plot
"grid"  add grid
"percentages"  (default) add grid and percentage labels to quantile lines.
|
rlim |
Two element vector c(min,max) specifying min and max
values on the y axis of the RSq plot.
Default is c(-1,-1).
Special value min=-1 means the minimum y axis value
is the smallest GRSq or RSq value excluding the intercept values.
Special value max=-1 means the maximum y axis value
is the largest GRSq or RSq value.
|
jitter |
Jitter applied to GRSq and RSq values to minimize over-plotting.
Default is 0, meaning no jitter.
A typical useful value is 0.01.
|
id.n |
Number of largest residuals to be labeled. Default is 3.
|
labels.id |
Residual names. Default is rownames(residuals(x, warn=FALSE)).
|
legend.pos |
NULL (default) means position legend automatically. Else specify c(x,y)
in user coordinates.
|
|
The following settings are related to par() and are included so you can override the defaults.
|
do.par |
Call par() for global settings as appropriate.
Default is TRUE,
which sets mfrow, mar=c(4,4,2,1), mgp=c(1.6,0.6,0), cex=0.7.
Set to FALSE if you want to append figures to an existing plot.
|
main |
Title of each plot.
Default is NULL, meaning generate figure headings automatically.
|
pch |
Plot character for QQ and residuals plot. Default is 1.
|
... |
Extra arguments passed to plotting functions. |
Interpreting the plot.earth graphs
For concreteness, the description below is based
on the graphs plotted by example(plot.earth).
The graphs plotted by plot.earth, apart from the Model Selection graph, are
standard tools used in residual analysis and more information
can be found in most linear regression textbooks.
One should be wary of over-interpretation of the graphs, since the residuals are measured on the training data rather than on new data. In linear models that is usually not an issue, but for flexible models like MARS the residuals measured on the training data give an optimistic view of the model's predictive ability.
Nomenclature.
The residuals are the differences between the values predicted by the model
and the corresponding response values.
In this help page the residuals are all measured on the training data.
The residual sum of squares (RSS) is the sum of the squared values of the residuals.
R-Squared (RSq, also called the coefficient of determination)
is a normalized form of the RSS,
and, depending on the model, varies from 0
(a model that always predicts the same value i.e. the mean observed response value)
to 1 (a model that perfectly predicts the responses in the training data).
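As a rough illustration of these two definitions, the following base R sketch computes the RSS and RSq by hand. It assumes a plain lm fit on R's built-in cars data as a stand-in for an earth model, so that it runs without the earth package; the variable names are illustrative only.

```r
# Stand-in model: a linear fit on the built-in 'cars' data (not an earth model).
fit <- lm(dist ~ speed, data = cars)

# Residuals measured on the training data (the sign is irrelevant once squared).
res <- cars$dist - fitted(fit)

rss <- sum(res^2)                             # residual sum of squares
tss <- sum((cars$dist - mean(cars$dist))^2)   # RSS of a model that always predicts the mean
rsq <- 1 - rss / tss                          # 0 = no better than the mean, 1 = perfect fit

print(c(RSS = rss, RSq = rsq))
```

The hand-computed value should agree with summary(fit)$r.squared.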
The Generalized Cross Validation (GCV)
is a form of the RSS penalized by the effective number of model parameters
(and divided by the number of observations).
More details can be found in the FAQ section of the earth help page.
The GRSq normalizes the GCV in the same way that the RSq normalizes
the RSS.
The GCV and GRSq are measures of the generalization ability of the model,
i.e., how well the model would predict using data not in the training set.
There is some arbitrariness in their values since the effective
number of model parameters is just an estimate in MARS models.
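The relationship between the RSS, GCV, RSq, and GRSq can be sketched in base R as follows. The formula used for the effective number of parameters is an assumption based on the description in the earth FAQ, and may differ in detail from what earth actually computes. The helper gcv_sketch, its penalty default, and the lm stand-in model are hypothetical and are not part of the earth package.

```r
# Hypothetical helper: the RSS penalized by the effective number of parameters
# and divided by the number of observations (assumed formula, see text above).
gcv_sketch <- function(rss, n, nterms, penalty = 2) {
  enp <- nterms + penalty * (nterms - 1) / 2   # assumed effective number of parameters
  rss / (n * (1 - enp / n)^2)
}

# Stand-in for a two-term model (intercept plus one term); not an earth model.
fit <- lm(dist ~ speed, data = cars)
n   <- nrow(cars)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)

rsq  <- 1 - rss / tss
gcv  <- gcv_sketch(rss, n, nterms = 2)
gcv0 <- gcv_sketch(tss, n, nterms = 1)   # intercept-only model
grsq <- 1 - gcv / gcv0                   # normalizes the GCV as RSq normalizes the RSS

print(c(RSq = rsq, GRSq = grsq))
```

Because the penalty grows with the number of terms, the GRSq in this sketch comes out below the RSq, which is the divergence described for the Model Selection graph.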
In the example Model Selection graph,
the RSq and GRSq run together at first, but diverge as the number of terms increases.
This is typical behavior, and what we are seeing is an increased penalty
being applied to the GCV as the number of model parameters increases.
The vertical gray dotted line is positioned at the maximum GRSq
and indicates that the best model has 11 terms and uses all 8 predictors
(the number of predictors is shown by the black dotted line).
The graph also shows the number of predictors and terms we would need
if we were prepared to accept a lower GRSq (use the earth parameter nprune
to trim the model).
The Cumulative Distribution graph shows the cumulative distribution of the absolute values of residuals. What we would ideally like to see is a graph that starts at 0 and shoots up quickly to 1. In the example graph, the median absolute residual is about 2.2 (look at the vertical gray line for 50%). We see that 95% of the absolute values of residuals are less than about 7.1 (look at the vertical gray line for 95%). So in the training data, 95% of the time the predicted value is within 7.1 units of the observed value.
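The quantities read off the Cumulative Distribution graph can also be computed directly. The sketch below is a base R approximation of that graph, assuming an lm fit on the built-in cars data as a stand-in for an earth model; it is not the code used by plot.earth, and its numbers are unrelated to the 2.2 and 7.1 quoted for the ozone example.

```r
# Stand-in model (not an earth model).
fit <- lm(dist ~ speed, data = cars)
abs.res <- abs(residuals(fit))

# Median and 95th percentile of the absolute residuals.
q <- quantile(abs.res, probs = c(0.50, 0.95))
print(q)

# Empirical cumulative distribution function of the absolute residuals.
cdf <- ecdf(abs.res)

# Draw the curve and the quantile lines only when a display is available.
if (interactive()) {
  plot(cdf, main = "Cumulative Distribution",
       xlab = "abs(residuals)", ylab = "proportion")
  abline(v = q, col = "grey", lty = 3)
}
```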
The Residuals vs Fitted graph shows the residual for each value of the predicted response. By comparing the scales of the axes one can get an immediate idea of the size of the residuals relative to the predicted values.
Ideally the residuals should show constant variance
i.e. the residuals should remain evenly spread out as the fitted values increase.
However, in the example graph we see heteroscedasticity: the residuals spread out in a "<" shape.
There is a decrease in the accuracy of the predictions as the predicted value increases.
To reduce the heteroscedasticity,
we could refit the model after performing a transform on the response.
A log transform, for instance, would even out the residuals:
    a1 <- earth(log(O3) ~ ., data = ozone1, degree = 2)
    plot(a1)
Transforming the data may cause other problems, such as mismatches to a known underlying physical model or difficulties in interpretation, so it's best to consult (or become) an expert on the type of data being modeled (in this case, ozone pollution data).
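The effect of a log transform on the spread of the residuals can be checked numerically. The sketch below uses simulated data with multiplicative noise and lm as a stand-in for earth; the data, the helper spread_ratio, and the choice of comparing the upper and lower halves of the fitted values are all assumptions made for illustration.

```r
set.seed(1)
n <- 500
x <- runif(n, 0, 3)
y <- exp(1 + x + rnorm(n, sd = 0.3))   # simulated: noise grows with the mean response

# Hypothetical helper: residual spread in the upper half of the fitted values
# divided by the spread in the lower half (1 means constant variance).
spread_ratio <- function(fit) {
  f <- fitted(fit)
  r <- residuals(fit)
  upper <- f > median(f)
  sd(r[upper]) / sd(r[!upper])
}

raw.fit <- lm(y ~ x)        # untransformed response
log.fit <- lm(log(y) ~ x)   # log-transformed response

ratio.raw <- spread_ratio(raw.fit)
ratio.log <- spread_ratio(log.fit)
print(c(raw = ratio.raw, log = ratio.log))
```

A ratio well above 1 corresponds to the "<" shape; a ratio near 1 corresponds to evenly spread residuals.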
The pale blue line is a loess fit.
(Readers not familiar with loess fits can think of them as fancy moving averages.)
In this instance it shows that the mean residual is more or less constant
except at low fitted values.
The end effect is possibly due to failure of the model in that region because
of smaller residuals, but cause and effect get tangled here.
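A loess line of this kind can be reproduced approximately in base R. The sketch below assumes an lm fit on the built-in cars data as a stand-in for an earth model and uses loess with its default settings, which may not match the settings used inside plot.earth.

```r
# Stand-in model (not an earth model).
fit <- lm(dist ~ speed, data = cars)
d <- data.frame(fit.val = fitted(fit), res = residuals(fit))

# Local regression of the residuals on the fitted values (default span and degree).
lo <- loess(res ~ fit.val, data = d)
smooth <- predict(lo)   # smoothed mean residual at each fitted value

print(head(data.frame(d, smooth)))

# Draw the residuals and the loess line only when a display is available.
if (interactive()) {
  plot(d$fit.val, d$res, pch = 1, xlab = "Fitted", ylab = "Residuals")
  o <- order(d$fit.val)
  lines(d$fit.val[o], smooth[o], col = "lightblue")
}
```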
Compare the residuals of the earth model to the linear model, and notice
how the pattern of residuals shows that the earth model is more successful
at modeling non-linearities in the data:
    a2 <- lm(O3 ~ ., data = ozone1)
    plot(a2, which=1)
One should always eyeball the residuals themselves rather than blindly trusting the
loess fit, which is itself an approximation.
However, in our example earth model the loess line appears reliable.
Cases 192, 193, and 226 have the largest residuals and
fall suspiciously into a separate cluster.
(If overplotting makes the labels hard to read,
reduce the number of labels with the id.n argument of plot.earth.)
As a general rule, it is worthwhile investigating cases with large residuals.
Perhaps they should be excluded when building the model.
On the other hand, it is possible that they reveal something important about the data
that could warrant changes to the model.
In our example it is also worthwhile looking at cases
with small residuals because of non-linearity in that region.
To see the example input matrix ordered on the magnitude of the residuals,
use ozone1[order(abs(a$residuals)),].
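The same ordering idiom is shown below on a stand-in model, assuming an lm fit on the built-in cars data rather than the earth model from the example.

```r
# Stand-in model (not an earth model).
fit <- lm(dist ~ speed, data = cars)

# Rows ordered from the smallest to the largest absolute residual.
ordered <- cars[order(abs(residuals(fit))), ]

# The last rows are the cases with the largest residuals.
print(tail(ordered, 3))
```

Passing decreasing = TRUE to order would list the largest residuals first instead.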
Sometimes groups of residuals appear in a series of straight lines with slopes of -1. This effect is slightly visible in the example graph. These lines usually do not indicate a problem. They are formed when a set of plotted points has the same observed value, commonly due to discretization in the measurement of the observed response.
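The reason for the slope of -1 is that, for points sharing one observed value y0, the residual equals y0 minus the fitted value. The following base R sketch checks this on simulated data with a response rounded to the nearest integer; the data and the lm stand-in model are assumptions made for illustration.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 3)
y <- round(2 + 3 * x + rnorm(n))   # simulated response, discretized by rounding

fit <- lm(y ~ x)                   # stand-in model (not an earth model)
f <- unname(fitted(fit))
r <- unname(residuals(fit))        # observed minus fitted

# Select the points that share the most common observed value.
y0 <- as.numeric(names(which.max(table(y))))
same <- y == y0

# For these points r = y0 - f, so they fall on a line with slope -1.
slope <- unname(coef(lm(r[same] ~ f[same]))[2])
print(c(y0 = y0, n.points = sum(same), slope = slope))
```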
The Normal Q-Q graph compares the distribution of the residuals to a normal distribution. If the residuals are distributed normally they will lie on the line. Following R convention, the abscissa is the normal axis and the ordinate is the residual axis; some popular books have it the other way round. In our example, we see divergence from normality in the left tail — the left tail of the distribution is fatter than that of a normal distribution. Once again, we see that cases 192, 193, and 226 have the largest residuals.
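The axis convention can be seen in the values returned by qqnorm. The sketch below assumes an lm fit on the built-in cars data as a stand-in for an earth model.

```r
# Stand-in model (not an earth model).
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)

# qq$x holds the theoretical normal quantiles (abscissa),
# qq$y holds the residuals themselves (ordinate).
qq <- qqnorm(res, plot.it = FALSE)
print(head(data.frame(theoretical = qq$x, residual = qq$y)))

# Draw the Q-Q plot and reference line only when a display is available.
if (interactive()) {
  qqnorm(res, pch = 1)
  qqline(res, col = "lightblue")
}
```

Passing datax = TRUE to qqnorm and qqline swaps the axes, giving the layout used in some textbooks.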
About plot.earth and earth-glm models
"Earth-glm" models are models created with a glm argument to earth.
In earth-glm models, much of the analysis in the above section does not apply because in these models the residuals are not assumed to have a normal distribution.
Note that the residuals plotted by plot.earth are residuals from earth's call to lm
after the pruning pass, not glm residuals.
That is, plot.earth ignores the glm part of the model, if any.
For earth-glm models, plotd can be useful.
Why doesn't plot.earth print GLM information?
It's just too much to display.
You can instead call plot on the glm.list in the earth model like this:
    data(etitanic)
    a <- earth(survived ~ ., data=etitanic, glm=list(family=binomial))
    par(mfrow=c(2,2))
    plot(a$glm.list[[1]])
I want to add lines or points to the RSq plot, and am having trouble getting my axis scaling right. Help?
Use do.par=FALSE.
With do.par=FALSE, the axis scales match the axis labels.
With do.par=TRUE, plot.earth restores the par parameters and axis scales to
what they were before calling plot.earth.
This usually means that the x- and y-axis scales are both 0 to 1.
earth, plot.earth.models, plotd, plotmo
data(ozone1)
a <- earth(O3 ~ ., data = ozone1, degree = 2)
plot(a)