rsf {randomSurvivalForest} | R Documentation |
rsf implements Ishwaran and Kogalur's Random Survival Forests algorithm for right censored survival data. The algorithm uses a binary recursive tree growing procedure with different splitting rules for growing an ensemble cumulative hazard function. An out-of-bag (OOB) estimate of Harrell's concordance index is provided for assessing prediction.
rsf(formula = formula(data), data = sys.parent(), ntree = 100, mtry = NULL, nodesize = NULL, ntime = NULL, splitrule = c("conserve", "logrank"), proximity = FALSE, seed = NULL, ntime.upper = 1000, trace = FALSE, ...)
formula |
A symbolic description of the model that is to be fit. The details of model specification are given below. |
data |
Data frame containing the predictors (variables) in the model. Missing NA values are not encouraged. These are dealt with by removing the entire record if even one of its entries is NA. |
ntree |
Number of trees to grow. This should not be set to a number too small, in order to ensure that every input row gets predicted at least a few times. |
mtry |
Number of predictors randomly sampled as candidates at each split. The default is sqrt(p) where p is the number of predictors. |
nodesize |
Minimum number of deaths (must have unique event
times) in a terminal node. Default equals
min(3,ndead ). Larger values cause
smaller trees to be grown. The algorithm thus takes less time. |
ntime |
Maximum number of distinct time points considered for evaluating ensemble. Default equals number of distinct events. |
splitrule |
Splitting rule used for splitting nodes in growing the survival tree. Possible values are “logrank” and “conserve”. Default value is “conserve”. See details below. |
proximity |
Logical. Should proximity measure between
observations be calculated? Creates an n xn
matrix (which can be huge). Default is FALSE. |
seed |
Seed for random number generator. Must be a negative integer (the R wrapper handles incorrectly set seed values). |
ntime.upper |
Upper limit allowable for ntime . |
trace |
Logical. Should trace output be enabled? Default is FALSE. |
... |
Further arguments passed to or from other methods. |
A typical formula has the form Survrsf(time, censoring) ~ terms
where time
is survival time and censoring
is a binary
censoring indicator. Note that censoring must be coded as 0=censored
and 1=death (event). Unlike standard formula calls in R, a
specification of the form first:second
, used to indicate terms
obtained by taking the interactions of all terms in first
with
all terms in second
, will be interpreted by rsf to simply
mean first+second
. This is because interactions are implicit
in growing random forests. Attempts to coerce terms to include higher
order terms using functions from other packages like bs()
or
poly()
are also fruitless. A formula such as
Survrsf(time, censoring) ~ bs(term1, df=5) + term2
is interpreted
exactly the same as Survrsf(time, censoring) ~ term1 + term2
.
Predictors which are encoded as factors will be coerced into dummy variables. These dummy variables will be automatically labelled using the original predictor name. For example, if marital status is a predictor named “marital” encoded as a factor with levels “S”, “M” and “D”, two new dummy variables will be created labeled “maritalM” and “maritalS”.
The “logrank” splitting rule grows trees by splitting nodes by maximization of the log-rank test statistic. The default splitting rule is “conserve” which splits nodes by finding daughters closest to the conservation of events principle (the cumulative hazard summing to total number of deaths; see references below). The conservation of events splitting rule is less susceptible to an edge-effect bias as sometimes seen with the log-rank splitting rule. Both rules often yield roughly the same prediction error performance, although “logrank” is sometimes slightly better if proportional hazards is not too severely violated. The “conserve” splitting rule is faster, especially with larger data sets.
For very large data sets, discretizing continuous predictors can greatly speed up computational times. Discretization does not have to be overly granular for gains to be seen.
An object of class randomSurvivalForest
, which is a list with the
following components:
call |
The original call to rsf. |
formula |
The original formula used in calling rsf. |
terms |
The terms used in formula. |
n |
Sample size of the data. |
ndead |
Number of deaths. |
ntree |
Number of trees grown. |
mtry |
Number of predictors sampled for splitting at each node. |
nodesize |
Minimum size of terminal nodes. |
splitrule |
Splitting rule used. |
Time |
Vector of length n recording survival times. |
Died |
Vector of length n recording censoring
information (0=censored, 1=death). |
Time.unique |
Sorted unique event times. Ensemble values are given for these time points only. |
prednames |
A character vector of predictor names used to grow the forest. |
predictors |
Matrix of predictors used to grow the forest. |
ensemble |
A matrix of the ensemble cumulative hazard
function with each row corresponding to an individual's CHF
evaluated at each of the time points in Time.unique . |
mortality |
A vector of length n , with each value
representing the estimated ensemble mortality for an
individual in the data. Ensemble mortality values should
be interpreted in terms of total number of deaths. |
err.rate |
Vector of length ntree containing OOB error
rates for the ensemble, with the b-th element being the error
rate for the ensemble formed using the first b trees. Error
rate is 1-C, where C is Harrell's concordance index. Error
rates are between 0 and 1, with 0.5 representing the benchmark
value of a procedure based on random guessing. A value of 0
is perfect. |
leaf.count |
Number of terminal nodes for each tree in the forest. Vector
of length ntree . |
proximity |
If proximity =TRUE, a matrix
of dimension n xn of proximity measures among the input is
calculated (based on the frequency that pairs of data points
are in the same terminal nodes). Value returned is a vector of
the lower diagonal of the matrix. See the call
plot.proximity() for how to extract information. |
The key deliverable is the matrix ensemble
which contains the
estimated ensemble cumulative hazard function for each individual
evaluated at a set of distinct time points. The vector
mortality
is a weighted sum over the columns of
ensemble
, weighted by the number of individuals at risk at
the different time points. Entry i
of the vector represents
the estimated total mortality of individual i
in terms of
total number of deaths. In other words, if i
has a mortality
value of 100, then if all individuals had the same predictor as
i
, there would be on average 100 deaths in the dataset.
Different R wrappers are provided with the package to aid in interpreting the ensemble.
Hemant Ishwaran hemant.ishwaran@gmail.com and Udaya B. Kogalur ubk2101@columbia.edu
H. Ishwaran and Udaya B. Kogalur (2006). Random Survival Forests. Cleveland Clinic Technical Report.
L. Breiman (2001). Random forests, Machine Learning, 45:5-32.
F.E. Harrell et al. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.
D.C. Naftel, E.H. Blackstone and M.E. Turner (1985). Conservation of events, unpublished notes.
print.rsf
,
plot.ensemble
,
plot.variable
,
plot.error
,
plot.proximity
Survrsf
.
# Example 1: Minimal argument call using included sample data set. data(veteran, package = "randomSurvivalForest") veteran.out <- rsf(Survrsf(time, status)~., data = veteran) print.rsf(veteran.out) plot.error(veteran.out) # Example 2: Full argument call using included sample data set. veteran.f <- as.formula(Survrsf(time, status)~.) data(veteran) ntree <- 100 mtry <- 2 nodesize <- 3 ntime <- NULL splitrule <- "logrank" proximity <- FALSE seed <- -1 ntime.upper <- 1000 trace <- TRUE veteran2.out <- rsf(veteran.f, veteran, ntree, mtry, nodesize, ntime, splitrule, proximity, seed, ntime.upper, trace) print.rsf(veteran2.out) # Example 3: Veteran's Administration lung cancer trial from # Kalbfleisch & Prentice. Randomized trial of two treatment # regimens for lung cancer. Assumes "survival" # library is loaded. This is a standard survival analysis # data set, included in the package. if (library("survival", logical.return = TRUE)) { data(veteran, package = "randomSurvivalForest") veteran3.out <- rsf(Survrsf(time, status)~ celltype, veteran, ntree = 1000) plot.ensemble(veteran3.out) par(mfrow = c(1,1)) plot(survfit(Surv(time, status) ~ celltype, data = veteran)) } # Example 4: Primary biliary cirrhosis (PBC) of the liver. # Data found in Appendix D.1 of Fleming and Harrington, Counting # Processes and Survival Analysis, Wiley, 1991 (only differences # are that age is in days and sex and stage variables are not # missing for observations 313-418). This is a sample data # set included in the package. data(pbc, package = "randomSurvivalForest") pbc.out <- rsf(Survrsf(days,status)~., pbc, ntree = 1000) print.rsf(pbc.out) plot.variable(pbc.out) ### par(ask = TRUE) might be useful here # Example 5: Compare Cox regression to Random Survival Forests # using the PBC data. Compute OOB estimate of Harrell's # concordance index (B = 100). Assumes "Hmisc" and "survival" # libraries are loaded. if (library("survival",logical.return = TRUE) & library("Hmisc",logical.return = TRUE)) { data(pbc, package = "randomSurvivalForest") pbc2.out <- rsf(Survrsf(days,status)~., pbc, mtry = 2, ntree = 1000) B <- 100 cox.err <- 0 cox.f <- as.formula(Surv(days,status)~.) pbc.data <- pbc[apply(is.na(pbc), 1, sum) == 0,] ##remove NA's cat("Out-of-bag Cox Analysis ...", "\n") for (b in 1:B){ cat("Cox bootstrap", b, "\n") bag.sample <- sample(1:dim(pbc.data)[1], dim(pbc.data)[1], replace = TRUE) oob.sample <- setdiff(1:dim(pbc.data)[1],bag.sample) train <- pbc.data[bag.sample,] test <- pbc.data[oob.sample,] cox.out <- coxph(cox.f, train) cox.predict <- predict(cox.out, test) cox.err <- cox.err + rcorr.cens(cox.predict, Surv(pbc.data$days[oob.sample], pbc.data$status[oob.sample]) )[1] } cat("Error rates:", "\n") cat("Random Survival Forests:", pbc2.out$err.rate[pbc2.out$ntree], "\n") cat(" Cox Regression:", cox.err[1]/B, "\n") } # Example 6: Using an external data set. ## Not run: file.in <- "other.data" other.data <- read.table(file.in, header = TRUE) rsf.f <- as.formula(Survrsf(time, status)~.) rsf.out <- rsf(formula = rsf.f, data = other.data) ## End(Not run)