plot.variable {randomSurvivalForest}    R Documentation

Plot of Ensemble Survival Effect of Variables

Description

Plot of ensemble mortality (or median survival) for each variable. Users can select between marginal and partial plots.

Usage

    plot.variable(x,
                  plots.per.page = 4,
                  granule = 5,
                  sorted = TRUE,
                  type = c("mort", "rel.freq", "surv", "time")[1],
                  partial = FALSE,
                  predictorNames = NULL,
                  npred = NULL,
                  npts = 25,
                  subset = NULL,
                  ...)

Arguments

x An object of class (rsf, grow) or (rsf, predict).
plots.per.page Integer value controlling page layout.
granule Integer value controlling whether a plot for a specific variable should be given as a boxplot or scatter plot. Larger values coerce boxplots.
sorted Should variables be sorted by importance values (only applies if importance values are available)? Default is TRUE.
type Select type of value to be plotted on the vertical axis. See details.
partial Logical. Should partial plots be created? Default is FALSE.
predictorNames Character vector of variable names. Only these variables will be plotted. Default is all.
npred Number of variables to be plotted (only applies when predictorNames=NULL). Default is all.
npts Maximum number of points used when generating partial plots for continuous variables.
subset An index vector indicating which rows should be used. Default is to use all the data.
... Further arguments passed to or from other methods.
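
The role of granule can be pictured with a small sketch. The helper plot_kind below is hypothetical and not part of the package; it assumes that a variable is shown as a boxplot when it is a factor or has no more than granule distinct values, and as a scatter plot otherwise. That rule is an assumption consistent with "larger values coerce boxplots", not a statement of the package's internals.

```r
# Hypothetical helper, not part of randomSurvivalForest: guesses the display
# used for one variable from its number of distinct values (assumed rule).
plot_kind <- function(x, granule = 5) {
  if (is.factor(x) || length(unique(x)) <= granule) "boxplot" else "scatter"
}

trt   <- c(1, 2, 1, 2, 2, 1)          # two distinct values
karno <- c(60, 70, 40, 80, 90, 50)    # six distinct values

plot_kind(trt)                  # "boxplot"
plot_kind(karno)                # "scatter"
plot_kind(karno, granule = 10)  # "boxplot": a larger granule coerces a boxplot
```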

Details

Ensemble mortality, standardized mortality (relative frequency of mortality), ensemble median survival, or estimated survival time is plotted on the vertical axis (y-value) against a variable (x-value) on the horizontal axis. The choice of y-value is controlled by type, which takes one of four values: mort is ensemble mortality, rel.freq is standardized mortality, surv is ensemble median survival, and time is estimated survival time (this last option applies only to partial plots). For continuous variables, blue points correspond to events and black points to censored observations.

Ensemble mortality values should be interpreted in terms of total number of deaths. For example, if individual i has a mortality value of 100, then if all individuals were the same as i, we would expect to find 100 deaths on average in the data. If type is set to rel.freq, then mortality values are divided by an adjusted sample size, defined as the maximum of the sample size and the maximum mortality value. The standardized mortality values no longer indicate total deaths, but instead reflect relative mortality.
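
As a minimal sketch of the standardization described above, the following base R code divides a vector of mortality values by the adjusted sample size. The mortality values are made up for illustration, and the names mort, n_adj and rel_freq are not taken from the package.

```r
# Hypothetical ensemble mortality values for n = 5 individuals.
mort <- c(0.8, 2.5, 4.0, 6.5, 1.2)
n <- length(mort)

# Adjusted sample size: the larger of n and the maximum mortality value.
n_adj <- max(n, max(mort))   # 6.5 here, since one mortality value exceeds n

# Standardized (relative) mortality.
rel_freq <- mort / n_adj
range(rel_freq)              # all values lie between 0 and 1
```

Using the maximum of the two quantities keeps every standardized value at or below 1, even when a mortality value exceeds the sample size.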

Partial plots are created when partial=TRUE. Their interpretation differs from that of marginal plots. The y-value for a variable X, evaluated at X=x, is

\tilde{f}(x) = \frac{1}{n} \sum_{i=1}^n \hat{f}(x, x_{i,O}),

where x_{i,O} represents the values of all variables other than X for individual i and \hat{f} is the predicted value. Generating partial plots can be very slow. Choosing a small value for npts can reduce computation time, as this restricts the number of distinct x values used in computing \tilde{f}.
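
The averaging in the partial value can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: f_hat is a stand-in linear predictor rather than a forest, partial_values is a hypothetical helper, and the way the grid is thinned to at most npts values is assumed.

```r
# Toy data and a stand-in predictor; in practice f_hat would be the forest's
# ensemble prediction. Everything here is illustrative.
set.seed(1)
n   <- 200
dat <- data.frame(age = runif(n, 30, 80), karno = runif(n, 20, 100))
f_hat <- function(newdata) 0.05 * newdata$age - 0.03 * newdata$karno + 5

partial_values <- function(data, var, npts = 25) {
  # Keep at most npts distinct x values, evenly spaced by rank (assumed rule).
  x_all <- sort(unique(data[[var]]))
  idx   <- unique(as.integer(seq(1, length(x_all),
                                 length.out = min(npts, length(x_all)))))
  x_pts <- x_all[idx]
  # For each grid value x, set var to x for every individual, keep all other
  # variables at their observed values, and average the predictions.
  y <- sapply(x_pts, function(x) {
    d <- data
    d[[var]] <- x
    mean(f_hat(d))
  })
  data.frame(x = x_pts, partial = y)
}

pp <- partial_values(dat, "age", npts = 10)
nrow(pp)   # 10 grid points
```

Each grid value requires a prediction for all n individuals, so the cost grows with the number of grid points, which is why a smaller npts reduces computation.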

For continuous variables, red points indicate partial values and dashed red lines represent a lowess-smoothed error bar of +/- two standard errors. The black dashed line is the lowess estimate of the partial values. For discrete variables, partial values are indicated using boxplots with whiskers extending approximately two standard errors from the mean. Standard errors are meant only as a guide and should be interpreted with caution.
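
A display of this kind for a continuous variable can be sketched with base R graphics. The prediction matrix pred is simulated, and the standard error is assumed to be the standard deviation of the individual predictions divided by the square root of n; how the package actually computes it is not stated here.

```r
set.seed(2)
n     <- 100
x_pts <- seq(30, 80, length.out = 15)

# Hypothetical n x length(x_pts) matrix: row i holds individual i's predicted
# value at each grid point (the terms averaged in a partial value).
pred <- outer(rnorm(n, sd = 2), rep(1, length(x_pts))) +
        matrix(0.05 * x_pts, n, length(x_pts), byrow = TRUE)

partial <- colMeans(pred)
se      <- apply(pred, 2, sd) / sqrt(n)   # assumed standard-error formula

# Lowess smooths of the partial values and of the +/- 2 SE limits.
fit   <- lowess(x_pts, partial)
upper <- lowess(x_pts, partial + 2 * se)
lower <- lowess(x_pts, partial - 2 * se)

plot(x_pts, partial, col = "red", pch = 16, xlab = "x", ylab = "partial value",
     ylim = range(lower$y, upper$y, partial))
lines(fit,   lty = 2, col = "black")   # lowess estimate of the partial values
lines(upper, lty = 2, col = "red")     # smoothed upper limit
lines(lower, lty = 2, col = "red")     # smoothed lower limit
```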

Because partial plots can be slow, setting type to time can greatly speed things up. Reducing npts is also worth trying.

Author(s)

Hemant Ishwaran hemant.ishwaran@gmail.com and Udaya B. Kogalur ubk2101@columbia.edu

References

H. Ishwaran and U.B. Kogalur (2007). Random survival forests for R, R News, 7(2):25-31.

J.H. Friedman (2001). Greedy function approximation: a gradient boosting machine, Ann. Statist., 29(5):1189-1232.

A. Liaw and M. Wiener (2002). Classification and regression by randomForest, R News, 2(3):18-22.

See Also

rsf, predict.rsf.

Examples

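library(randomSurvivalForest)
data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Survrsf(time,status)~., veteran, forest = TRUE, ntree = 1000)
# Marginal plots for all variables, three per page.
plot.variable(v.out, plots.per.page = 3)
# Marginal plots for selected variables only.
plot.variable(v.out, plots.per.page = 2, predictorNames = c("trt", "karno", "age"))
# Partial plots of standardized mortality for three variables (npred = 3).
plot.variable(v.out, type = "rel.freq", partial = TRUE, plots.per.page = 2, npred=3)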

## Not run: 
# Fast partial plots using 'time' type.
# Top 8 predictors from PBC data.
data(pbc, package = "randomSurvivalForest") 
pbc.out <- rsf(Survrsf(days,status)~., pbc, ntree = 1000, forest = TRUE)
plot.variable(pbc.out, type = "time", partial = TRUE, npred=8)
## End(Not run)


[Package randomSurvivalForest version 3.2.3 Index]