rsf.default {randomSurvivalForest}    R Documentation
Ishwaran and Kogalur's Random Survival Forests algorithm for right-censored survival data (Ishwaran and Kogalur, 2006). This is a direct extension of Breiman's Random Forests method (Breiman, 2001) to survival analysis settings. The algorithm uses a binary recursive tree growing procedure with a choice of splitting rules for growing an ensemble cumulative hazard function. An out-of-bag (OOB) estimate of Harrell's concordance index (Harrell et al., 1982) is provided for assessing prediction. Importance values for predictors can also be computed, and prediction on test data is available. Note that this is the default generic method for the package.
## Default S3 method:
rsf(formula, data = NULL, ntree = 1000, mtry = NULL, nodesize = NULL,
    splitrule = c("logrank", "conserve", "logrankscore", "logrankapprox")[1],
    importance = TRUE, big.data = FALSE, predictorWt = NULL,
    forest = FALSE, do.trace = FALSE, proximity = FALSE, seed = NULL,
    ntime = NULL, add.noise = FALSE, ...)
formula: A symbolic description of the model to be fit. The details of model specification are given below.

data: Data frame containing the predictors (variables) in the model. Missing values (NA) are not encouraged; a record is removed entirely if even one of the entries named in formula is NA.

ntree: Number of trees to grow. This should not be set too small, so that every input row gets predicted at least a few times.

mtry: Number of predictors randomly sampled as candidates at each split. The default is sqrt(p), where p equals the number of predictors.

nodesize: Minimum number of deaths (with unique event times) in a terminal node. Default is min(3, ndead). Larger values cause smaller trees to be grown, so the algorithm takes less time.

splitrule: Splitting rule used for splitting nodes in growing the survival tree. Possible values are “logrank”, “conserve”, “logrankscore” and “logrankapprox”. Default is “logrank”. See details below.

importance: Logical. Should the importance of predictors be estimated? Default is TRUE.

big.data: Logical. Set to TRUE when the number of predictors p is very large, or the data itself is very large. See details below.

predictorWt: Vector of positive weights, where entry k is the likelihood of selecting predictor k as a candidate for splitting. Default is uniform weights. The vector must be of dimension p, where p equals the number of predictors.

forest: Logical. Should the forest object be returned? It is required for prediction on new data. Default is FALSE.

do.trace: Logical. Should trace output be enabled? Default is FALSE. An integer value can also be passed; a positive value causes output to be printed every do.trace iterations.

proximity: Logical. Should the proximity measure between observations be calculated? This creates an n x n matrix, which can be huge. Default is FALSE.

seed: Seed for the random number generator. Must be a negative integer (the R wrapper handles incorrectly set seed values).

ntime: Maximum number of distinct time points considered when evaluating the ensemble. Default equals the number of distinct event times.

add.noise: Logical. Should a noise variable be added? Default is FALSE.

...: Further arguments passed to or from other methods.
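As a quick illustration of predictorWt, the following hedged sketch upweights one predictor; the choice of predictors is illustrative only, and the weight vector is assumed to follow the order of the predictors in the formula:

data(veteran, package = "randomSurvivalForest")
wt <- c(5, 1, 1)   # assumption: weights follow predictor order (favor karno)
wt.out <- rsf(Survrsf(time, status) ~ karno + diagtime + age,
              data = veteran, ntree = 100, predictorWt = wt)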
The default “logrank” splitting rule grows trees by splitting nodes so as to maximize the log-rank test statistic (LeBlanc and Crowley, 1993). The “conserve” splitting rule splits nodes by finding daughters closest to the conservation-of-events principle (the cumulative hazard summing to the total number of deaths; see Naftel, Blackstone and Turner, 1985). The conservation-of-events rule is less susceptible to the edge-effect bias sometimes seen with log-rank splitting. The “logrankscore” splitting rule uses a standardized log-rank statistic (Hothorn and Lausen, 2003). The “logrankapprox” splitting rule splits nodes using an approximation to the log-rank test (suggested by Michael LeBlanc; see also Cox and Oakes, page 105).
All four rules often yield roughly the same prediction error performance, but users are encouraged to try all of them in any given example. The “logrankapprox” splitting rule is almost always fastest, especially with large data sets; after that, “conserve” is often second fastest. For very large data sets, discretizing continuous predictors and/or the observed survival times can greatly reduce computation time. Discretization does not have to be overly granular for substantial gains to be seen.
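To make the comparison concrete, here is a hedged sketch that fits the same data under each rule and records the OOB error; the small ntree value is only to keep the run short:

data(veteran, package = "randomSurvivalForest")
rules <- c("logrank", "conserve", "logrankscore", "logrankapprox")
errs <- sapply(rules, function(r) {
  out <- rsf(Survrsf(time, status) ~ ., data = veteran,
             ntree = 100, splitrule = r, seed = -1)
  out$err.rate[out$ntree]   # OOB error for the full ensemble
})
print(errs)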
A typical formula has the form Survrsf(time, censoring) ~ terms, where time is survival time and censoring is a binary censoring indicator. Note that censoring must be coded as 0=censored and 1=death (event), and time must be strictly positive.
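For instance, with the veteran data (column names assumed as in the examples below):

v.f <- Survrsf(time, status) ~ karno + age   # two named predictors
v.f.all <- Survrsf(time, status) ~ .         # all remaining columns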
Predictors which are encoded as factors will be coerced into dummy variables. These dummy variables will be automatically labelled using the original predictor name. For example, if marital status is a predictor named “marital” encoded as a factor with levels “S”, “M” and “D”, two new dummy variables will be created labeled “maritalM” and “maritalS”.
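The following plain-R sketch illustrates only the naming convention for these dummy variables (the package performs the coercion internally; this is not package code):

marital <- factor(c("S", "M", "D", "M"), levels = c("D", "M", "S"))
colnames(model.matrix(~ marital))   # "(Intercept)" "maritalM" "maritalS"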
For very large data sets, or data with a large number of predictors, users should consider setting the logical flag big.data to TRUE. This bypasses the large overhead R incurs in creating design matrices and parsing the formula. Users should be aware, however, that predictors are not processed and are interpreted as-is when this option is turned on. Think of the data frame as containing the time and censoring information, with the rest of the data being the pre-processed design matrix, when this option is on. Side effects are that factors are not transformed to dummy variables (in fact they are coerced to NAs), and transformations used in the formula (such as logs) are ignored.
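A hedged sketch of the pre-processing this implies; the column selection is an illustrative assumption, the point being that the data frame must already be a numeric design matrix plus the time and censoring columns:

data(veteran, package = "randomSurvivalForest")
num.vars <- veteran[, c("karno", "diagtime", "age")]   # numeric columns only
big.df <- data.frame(time = veteran$time, status = veteran$status, num.vars)
big.out <- rsf(Survrsf(time, status) ~ ., data = big.df,
               ntree = 100, big.data = TRUE)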
An object of class (rsf, grow), which is a list with the following components:
call: The original call to rsf.

formula: The formula used in the call.

n: Sample size of the data.

ndead: Number of deaths.

ntree: Number of trees grown.

mtry: Number of predictors randomly selected for splitting at each node.

nodesize: Minimum size of terminal nodes.

splitrule: Splitting rule used.

Time: Vector of length n recording survival times.

Cens: Vector of length n recording censoring information (0=censored, 1=death).

timeInterest: Sorted unique event times. Ensemble values are given for these time points only.

predictorNames: Character vector of the predictor names used in growing the forest.

predictorWt: Vector of positive weights used for randomly sampling predictors for splitting.

predictors: Matrix of predictors used to grow the forest.

ensemble: Matrix of the ensemble cumulative hazard function (CHF), with each row corresponding to an individual's CHF evaluated at each of the time points in timeInterest.

oob.ensemble: Same as ensemble, but based on the OOB cumulative hazard function.

mortality: Vector of length n, with each value representing the estimated ensemble mortality for an individual in the data. Ensemble mortality values should be interpreted in terms of total number of deaths.

oob.mortality: Same as mortality, but based on oob.ensemble.

err.rate: Vector of length ntree containing OOB error rates for the ensemble, the b-th element being the error rate for the ensemble formed using the first b trees. The error rate is 1-C, where C is Harrell's concordance index. Error rates lie between 0 and 1, with 0.5 the benchmark value of a procedure based on random guessing; a value of 0 is perfect.

leaf.count: Number of terminal nodes for each tree in the forest. Vector of length ntree.

importance: Importance measure for each predictor. For each predictor, this is the difference between the OOB error rate when the predictor is randomly permuted and the OOB error rate without any permutation (i.e. the ntree-th component of err.rate). Large positive values indicate informative predictors, whereas small or negative values indicate predictors unlikely to be informative.

forest: If forest=TRUE, the forest object is returned. This object can then be used for prediction with new data sets.

proximity: If proximity=TRUE, a matrix of dimension n x n of proximity measures among the inputs is calculated (based on the frequency with which pairs of data points occupy the same terminal nodes). The value returned is a vector of the lower triangle of the matrix. Use plot.proximity() to extract this information.
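As a quick orientation to these components, a hedged sketch (component names as documented above):

data(veteran, package = "randomSurvivalForest")
out <- rsf(Survrsf(time, status) ~ ., data = veteran, ntree = 100)
tail(out$err.rate, 1)    # OOB error (1 - Harrell's C) of the full ensemble
summary(out$leaf.count)  # distribution of terminal node counts across trees
out$importance           # permutation importance for each predictor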
The key deliverable is the matrix ensemble, which contains the estimated ensemble cumulative hazard function for each individual evaluated at a set of distinct time points (an OOB ensemble, oob.ensemble, is also returned). The vector mortality (likewise oob.mortality) is a weighted sum over the columns of ensemble, weighted by the number of individuals at risk at the different time points. Entry i of the vector represents the estimated total mortality of individual i, in terms of total number of deaths. In other words, if individual i has a mortality value of 100, then if all individuals had the same predictors as i, there would be on average 100 deaths in the dataset. Different R wrappers are provided with the package to aid in interpreting the ensemble.
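The weighted-sum definition can be checked directly from the returned components. This is a hedged sketch of the documented definition only; the package's internal computation may differ in details such as tie handling:

data(veteran, package = "randomSurvivalForest")
out <- rsf(Survrsf(time, status) ~ ., data = veteran, ntree = 100)
at.risk <- sapply(out$timeInterest, function(t) sum(out$Time >= t))
mort.check <- as.vector(out$ensemble %*% at.risk)   # sum_j Y(t_j) * H_i(t_j)
head(cbind(mort.check, out$mortality))  # the two columns should track closely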
Hemant Ishwaran hemant.ishwaran@gmail.com and Udaya B. Kogalur ubk2101@columbia.edu
H. Ishwaran and U.B. Kogalur (2006). Random Survival Forests. Cleveland Clinic Technical Report.
L. Breiman (2001). Random forests. Machine Learning, 45:5-32.
F.E. Harrell et al. (1982). Evaluating the yield of medical tests. J. Amer. Med. Assoc., 247:2543-2546.
M. LeBlanc and J. Crowley (1993). Survival trees by goodness of split. J. Amer. Stat. Assoc., 88:457-467.
D.C. Naftel, E.H. Blackstone and M.E. Turner (1985). Conservation of events. Unpublished notes.
T. Hothorn and B. Lausen (2003). On the exact distribution of maximally selected rank statistics. Computational Statistics & Data Analysis, 43:121-137.
D.R. Cox and D. Oakes (1988). Analysis of Survival Data. Chapman and Hall.
A. Liaw and M. Wiener (2002). Classification and regression by randomForest. R News, 2:18-22.
plot.ensemble, plot.variable, plot.error, plot.proximity, predict.rsf, print.rsf, find.interaction, pmml_to_rsf, rsf_to_pmml, Survrsf.
# Example 1: Veteran's Administration lung cancer trial from
# Kalbfleisch & Prentice. Randomized trial of two treatment
# regimens for lung cancer. Minimal argument call. Print
# results, then plot error rate and importance values.

data(veteran, package = "randomSurvivalForest")
veteran.out <- rsf(Survrsf(time, status) ~ ., data = veteran)
print.rsf(veteran.out)
plot.error(veteran.out)

# Example 2: Richer argument call.
# Note that the forest option is set to TRUE to illustrate
# how one might use 'rsf' for prediction (see 'predict.rsf'
# for more details).

data(veteran, package = "randomSurvivalForest")
veteran.f <- as.formula(Survrsf(time, status) ~ .)
ntree <- 500
mtry <- 2
nodesize <- 3
splitrule <- "logrank"
proximity <- TRUE
forest <- TRUE
seed <- -1
ntime <- NULL
do.trace <- 25
veteran2.out <- rsf(veteran.f, veteran, ntree, mtry, nodesize, splitrule,
                    proximity = proximity, forest = forest, seed = seed,
                    ntime = ntime, do.trace = do.trace)
print.rsf(veteran2.out)
plot.proximity(veteran2.out)

# Take a peek at the forest ...
head(veteran2.out$forest$nativeArray)

# Partial plot of mortality for sorted predictors.
# Set partial to FALSE for a faster plot.
plot.variable(veteran2.out, 3, partial = TRUE)

# Example 3: Veteran data again. Look specifically at the
# Karnofsky performance score. Compare to Kaplan-Meier.
# Assumes the "survival" library is loaded.

if (library("survival", logical.return = TRUE)) {
  data(veteran, package = "randomSurvivalForest")
  veteran3.out <- rsf(Survrsf(time, status) ~ karno, veteran, ntree = 1000)
  plot.ensemble(veteran3.out)
  par(mfrow = c(1, 1))
  plot(survfit(Surv(time, status) ~ karno, data = veteran))
}

# Example 4: Primary biliary cirrhosis (PBC) of the liver.
# Data found in Appendix D.1 of Fleming and Harrington, Counting
# Processes and Survival Analysis, Wiley, 1991 (the only differences
# are that age is in days, and the sex and stage variables are not
# missing for observations 313-418).

data(pbc, package = "randomSurvivalForest")
pbc.out <- rsf(Survrsf(days, status) ~ ., pbc, ntree = 1000)
print(pbc.out)

# Example 5: Compare Cox regression to Random Survival Forests
# for the PBC data. Compute an OOB estimate of Harrell's concordance
# index for Cox regression using B = 100 bootstrap draws.
# Assumes the "Hmisc" and "survival" libraries are loaded.

if (library("survival", logical.return = TRUE)
    & library("Hmisc", logical.return = TRUE)) {
  data(pbc, package = "randomSurvivalForest")
  pbc2.out <- rsf(Survrsf(days, status) ~ ., pbc, mtry = 2, ntree = 1000)
  B <- 100
  cox.err <- rep(NA, B)
  cox.f <- as.formula(Surv(days, status) ~ .)
  pbc.data <- pbc[apply(is.na(pbc), 1, sum) == 0, ]  ## remove NA's
  cat("Out-of-bag Cox Analysis ...", "\n")
  for (b in 1:B) {
    cat("Cox bootstrap", b, "\n")
    bag.sample <- sample(1:dim(pbc.data)[1], dim(pbc.data)[1],
                         replace = TRUE)
    oob.sample <- setdiff(1:dim(pbc.data)[1], bag.sample)
    train <- pbc.data[bag.sample, ]
    test <- pbc.data[oob.sample, ]
    ## coxph() can fail on a bootstrap draw, so trap errors
    cox.out <- tryCatch({coxph(cox.f, train)}, error = function(ex) {NULL})
    if (is.list(cox.out)) {
      cox.predict <- predict(cox.out, test)
      cox.err[b] <- rcorr.cens(cox.predict,
                               Surv(pbc.data$days[oob.sample],
                                    pbc.data$status[oob.sample]))[1]
    }
  }
  cat("Error rates:", "\n")
  cat("Random Survival Forests:", pbc2.out$err.rate[pbc2.out$ntree], "\n")
  cat("         Cox Regression:", mean(cox.err, na.rm = TRUE), "\n")
}

# Example 6: Using an external data set.
## Not run:
file.in <- "other.data"
other.data <- read.table(file.in, header = TRUE)
rsf.f <- as.formula(Survrsf(time, status) ~ .)
rsf.out <- rsf(formula = rsf.f, data = other.data)
## End(Not run)