plot.earth {earth}    R Documentation

Plot an "earth" object

Description

Plot an earth object. The plot shows model selection, cumulative distribution of the residuals, residuals versus fitted values, and the residual QQ plot.

Usage

## S3 method for class 'earth':
plot(x = stop("no 'x' arg"),
     which = 1:4, ycolumn = 1,
     caption = if(do.par) NULL else "",
     col.rsq = "lightblue", col.loess = col.rsq,
     col.qq = col.rsq, col.grid = "grey",
     col.vline = "grey", lty.vline = 3,
     col.legend = 1, col.npreds = 1,
     nresiduals = 1000, cum.grid = "percentages", rlim = c(-1,-1),
     jitter = 0, id.n = 3, labels.id = rownames(residuals(x, warn=FALSE)),
     legend.pos = NULL, do.par = TRUE,
     main = NULL, pch = 1, ...)

Arguments

x An earth object. This is the only required argument. (This argument is called "x" for consistency with the generic plot.)
which Which plots to plot. Default is 1:4, meaning all.
1) model selection (GRSq plot)
2) cumulative distribution of absolute values of residuals
3) residuals versus fitted values
4) QQ plot of residuals
ycolumn Specify which column of the response to plot if the model has multiple responses. Default is 1. This argument does not affect the model selection plot which is always across all responses.
TODO there is an issue in the handling of ycolumn for multiple level factor responses. Does ycolumn refer to the column in the observed or predicted response?
caption Overall caption. The default value is if(do.par) NULL else "". Values are:
"string" string
"" no caption
NULL generate a caption from the $call component of the earth object.
For all the col arguments, 0 means don't plot the corresponding graph element.
col.rsq Color of RSq line in model selection plot. Default is "lightblue".
col.loess Color of loess line in residuals plot. Default is col.rsq. Generating the loess line occasionally causes warnings such as "Warning: pseudoinverse used". To get rid of these warnings, set col.loess=0.
col.qq Color of QQ line. Default is col.rsq.
col.grid Color of grid lines in cumulative distribution plot. Default is "grey".
col.vline Color of vertical line at best model in model plot. Default is "grey".
lty.vline Line type of vertical line at best model in model plot. Default is 3.
col.legend Color of legend (inside plot area) of model plot. Default is 1, meaning draw a legend. Use 0 for no legend.
col.npreds Color of the "number of predictors" plot within the model plot. Default is 1. Use 0 for no "number of predictors" plot.
nresiduals Maximum number of residuals to plot. Use -1 for all. Default is 1000 (not all to reduce over-plotting). A systematic sample of size nresiduals is taken but the largest few residuals are always included.
cum.grid Specify grid on cumulative distribution graph. Values are:
"none" no grid on cumulative distribution plot
"grid" add grid
"percentages" (default) add grid and percentage labels to quantile lines.
rlim Two element vector c(min,max) specifying min and max values on the y axis of the RSq plot. Default is c(-1,-1).
Special value min=-1 means the minimum y axis value is the smallest GRSq or RSq value excluding the intercept values.
Special value max=-1 means the maximum y axis value is the largest GRSq or RSq value.
jitter Jitter applied to GRSq and RSq values to minimize over-plotting. Default is 0, meaning no jitter. A typical useful value is 0.01.
id.n Number of largest residuals to be labeled. Default is 3.
labels.id Residual names. Default is rownames(residuals(x, warn=FALSE)).
legend.pos NULL (default) means position legend automatically. Else specify c(x,y) in user coordinates.

The following settings are related to par() and are included so you can override the defaults.
do.par Call par() for global settings as appropriate. Default is TRUE, which sets mfrow, mar=c(4,4,2,1), mgp=c(1.6,0.6,0), cex=0.7. Set to FALSE if you want to append figures to an existing plot.
main Title of each plot. Default is NULL, meaning generate figure headings automatically.
pch Plot character for QQ and residuals plot. Default is 1.
... Extra arguments passed to plotting functions.

Note

Interpreting the plot.earth graphs

For concreteness, the description below is based on the graphs plotted by example(plot.earth). The graphs plotted by plot.earth, apart from the Model Selection graph, are standard tools used in residual analysis and more information can be found in most linear regression textbooks.

One should be wary of over-interpretation of the graphs, since the residuals are measured on the training data rather than on new data. In linear models that is usually not an issue, but for flexible models like MARS the residuals measured on the training data give an optimistic view of the model's predictive ability.

Nomenclature. The residuals are the differences between the values predicted by the model and the corresponding response values. In this help page the residuals are all measured on the training data. The residual sum of squares (RSS) is the sum of the squared values of the residuals. R-Squared (RSq, also called the coefficient of determination) is a normalized form of the RSS, and, depending on the model, varies from 0 (a model that always predicts the same value, i.e., the mean observed response value) to 1 (a model that perfectly predicts the responses in the training data). The Generalized Cross Validation (GCV) is a form of the RSS penalized by the effective number of model parameters (and divided by the number of observations). More details can be found in the FAQ section of the earth help page. The GRSq normalizes the GCV in the same way that the RSq normalizes the RSS. The GCV and GRSq are measures of the generalization ability of the model, i.e., how well the model would predict using data not in the training set. There is some arbitrariness in their values since the effective number of model parameters is just an estimate in MARS models.
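
The quantities above can be sketched in base R. This is only an illustration, not earth's own code: it assumes the GCV form from Friedman's MARS paper, RSS / (n * (1 - enp/n)^2), with the effective number of parameters enp = nterms + penalty * (nterms - 1) / 2. The polynomial lm fit is a stand-in for an earth model, and the penalty value is chosen purely for illustration.

```r
# Sketch: RSq and GRSq computed from training residuals (base R only).
set.seed(1)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
fit <- lm(y ~ poly(x, 3))             # stand-in for a fitted earth model
res <- residuals(fit)

rss      <- sum(res^2)                # residual sum of squares
rss.null <- sum((y - mean(y))^2)      # RSS of the intercept-only model
rsq      <- 1 - rss / rss.null        # RSq normalizes the RSS

# Assumed GCV form: RSS penalized by the effective number of parameters
# and divided by the number of observations.
gcv <- function(rss, enp, n) rss / (n * (1 - enp / n)^2)

nterms  <- length(coef(fit))          # 4 terms, including the intercept
penalty <- 2                          # illustrative penalty per knot
enp     <- nterms + penalty * (nterms - 1) / 2

# GRSq normalizes the GCV the same way RSq normalizes the RSS.
grsq <- 1 - gcv(rss, enp, n) / gcv(rss.null, 1, n)

cat("RSq:", rsq, " GRSq:", grsq, "\n")
```

Because the penalty grows with the number of terms, the GRSq computed this way is always below the RSq for a model with more than one term, which is why the two curves diverge in the Model Selection graph.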

In the example Model Selection graph, the RSq and GRSq run together at first, but diverge as the number of terms increases. This is typical behavior, and what we are seeing is an increased penalty being applied to the GCV as the number of model parameters increases. The vertical gray dotted line is positioned at the maximum GRSq and indicates that the best model has 11 terms and uses all 8 predictors (the number of predictors is shown by the black dotted line). The graph also shows the number of predictors and terms we would need if we were prepared to accept a lower GRSq (use the earth parameter nprune to trim the model).

The Cumulative Distribution graph shows the cumulative distribution of the absolute values of residuals. What we would ideally like to see is a graph that starts at 0 and shoots up quickly to 1. In the example graph, the median absolute residual is about 2.2 (look at the vertical gray line for 50%). We see that 95% of the absolute values of residuals are less than about 7.1 (look at the vertical gray line for 95%). So in the training data, 95% of the time the predicted value is within 7.1 units of the observed value.
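
The figures read off this graph can also be computed directly. The sketch below uses base R only, with an lm fit on the built-in cars data as a stand-in for an earth model; it is assumed that the same calls apply to the residuals of an earth object.

```r
# Sketch: quantiles of the absolute residuals, as read off the
# Cumulative Distribution graph.
fit <- lm(dist ~ speed, data = cars)  # stand-in for a fitted earth model
abs.res <- abs(residuals(fit))

# Median and 95th percentile of the absolute residuals.
q <- quantile(abs.res, probs = c(0.50, 0.95))
print(q)

# Empirical cumulative distribution with grey vertical lines at the quantiles.
plot(ecdf(abs.res), main = "Cumulative Distribution",
     xlab = "abs(Residuals)", ylab = "Proportion")
abline(v = q, col = "grey", lty = 3)
```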

The Residuals vs Fitted graph shows the residual for each value of the predicted response. By comparing the scales of the axes one can get an immediate idea of the size of the residuals relative to the predicted values.

Ideally the residuals should show constant variance, i.e., the residuals should remain evenly spread out as the fitted values increase. However, in the example graph we see heteroscedasticity: the residuals spread out in a "<" shape. There is a decrease in the accuracy of the predictions as the predicted value increases. To reduce the heteroscedasticity, we could refit the model after performing a transform on the response. A log transform, for instance, would even out the residuals:

    a1 <- earth(log(O3) ~ ., data = ozone1, degree = 2)
    plot(a1)

Transforming the data may cause other problems, such as mismatches to a known underlying physical model or difficulties in interpretation, so it's best to consult (or become) an expert on the type of data being modeled (in this case, ozone pollution data).

The pale blue line is a loess fit. (Readers not familiar with loess fits can think of them as fancy moving averages.) In this instance it shows that the mean residual is more or less constant except at low fitted values. The end effect is possibly due to failure of the model in that region because of smaller residuals, but cause and effect get tangled here. Compare the residuals of the earth model to those of the linear model, and notice how the pattern of residuals shows that the earth model is more successful at modeling non-linearities in the data:

    a2 <- lm(O3 ~ ., data = ozone1)
    plot(a2, which=1)

One should always eyeball the residuals themselves rather than blindly trusting the loess fit, which is itself an approximation. However, in our example earth model the loess line appears reliable.

Cases 192, 193, and 226 have the largest residuals and fall suspiciously into a separate cluster. (If overplotting makes the labels hard to read, reduce the number of labels with the id.n argument of plot.earth.) As a general rule, it is worthwhile investigating cases with large residuals. Perhaps they should be excluded when building the model. On the other hand, it is possible that they reveal something important about the data that could warrant changes to the model. In our example it is also worthwhile looking at cases with small residuals because of non-linearity in that region. To see the example input matrix ordered on the magnitude of the residuals, use ozone1[order(abs(a$residuals)),].

Sometimes groups of residuals appear in a series of straight lines with slopes of -1. This effect is slightly visible in the example graph. These lines usually do not indicate a problem. They are formed when a set of plotted points has the same observed value, commonly due to discretization in the measurement of the observed response.
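
The slope of -1 follows from the definition: using R's convention that a residual is the observed value minus the fitted value, all cases sharing one observed value y0 satisfy residual = y0 - fitted. The sketch below demonstrates this on simulated data with a rounded response; the data and the lm stand-in model are illustrative assumptions, not taken from the earth example.

```r
# Sketch: a discretized response produces lines of slope -1 in the
# residuals versus fitted plot.
set.seed(2)
x <- runif(200, 0, 10)
y <- round(2 + 0.5 * x + rnorm(200))  # response measured in whole units
fit <- lm(y ~ x)                      # stand-in for a fitted earth model

# Pick the most common observed value and isolate its cases.
y0   <- as.numeric(names(which.max(table(y))))
same <- y == y0

# Within that group, residual = y0 - fitted, so the slope is -1.
slope <- unname(coef(lm(residuals(fit)[same] ~ fitted(fit)[same]))[2])
print(slope)

plot(fitted(fit), residuals(fit), xlab = "Fitted", ylab = "Residuals")
points(fitted(fit)[same], residuals(fit)[same], pch = 16)
```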

The Normal Q-Q graph compares the distribution of the residuals to a normal distribution. If the residuals are distributed normally they will lie on the line. Following R convention, the abscissa is the normal axis and the ordinate is the residual axis; some popular books have it the other way round. In our example, we see divergence from normality in the left tail: the left tail of the distribution is fatter than that of a normal distribution. Once again, we see that cases 192, 193, and 226 have the largest residuals.
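
A similar graph can be drawn by hand with the base R functions qqnorm and qqline. This is a sketch using an lm fit on the built-in cars data as a stand-in model, and it is not necessarily how plot.earth draws its own Q-Q graph.

```r
# Sketch: normal Q-Q plot of residuals with the largest residuals labeled.
fit <- lm(dist ~ speed, data = cars)  # stand-in for a fitted earth model
res <- residuals(fit)

# R convention: normal quantiles on the x axis, residuals on the y axis.
qq <- qqnorm(res, main = "Normal Q-Q", ylab = "Residual Quantiles")
qqline(res, col = "lightblue")

# Label the three largest residuals (compare the id.n argument).
worst <- order(abs(res), decreasing = TRUE)[1:3]
text(qq$x[worst], qq$y[worst], labels = names(res)[worst], pos = 2)
```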

About plot.earth and earth-glm models

"Earth-glm" models are models created with a glm argument to earth.

In earth-glm models, much of the analysis in the above section does not apply because in these models the residuals are not assumed to have a normal distribution.

Note that the residuals plotted by plot.earth are residuals from earth's call to lm after the pruning pass, not glm residuals. That is, plot.earth ignores the glm part of the model, if any.

For earth-glm models, plotd can be useful.

Why doesn't plot.earth print GLM information?

It's just too much to display. You can instead call plot on the glm.list in the earth model like this:

data(etitanic)
a <- earth(survived ~ ., data=etitanic, glm=list(family=binomial))
par(mfrow=c(2,2))
plot(a$glm.list[[1]])

I want to add lines or points to the RSq plot, and am having trouble getting my axis scaling right. Help?

Use do.par=FALSE. With do.par=FALSE, the axis scales match the axis labels. With do.par=TRUE, plot.earth restores the par parameters and axis scales to what they were before calling plot.earth. This usually means that the x- and y-axis scales are both 0 to 1.

See Also

earth, plot.earth.models, plotd, plotmo

Examples

data(ozone1)
a <- earth(O3 ~ ., data = ozone1, degree = 2)
plot(a)

[Package earth version 2.3-2 Index]