rsf.default {randomSurvivalForest}                        R Documentation

Random Survival Forest Entry Point

Description

Ishwaran and Kogalur's Random Survival Forests algorithm for right-censored survival data (Ishwaran and Kogalur, 2006). This is a direct extension of Breiman's Random Forests method (Breiman, 2001) to survival analysis settings. The algorithm uses a binary recursive tree-growing procedure with different splitting rules for growing an ensemble cumulative hazard function. An out-of-bag (OOB) estimate of Harrell's concordance index (Harrell, 1982) is provided for assessing prediction accuracy. Importance values for predictors can also be computed, and prediction on test data is available. Note that this is the default generic method for the package.

Usage

## Default S3 method:
rsf(formula,
    data = NULL,
    ntree = 1000,
    mtry = NULL,
    nodesize = NULL,
    splitrule = c("logrank", "conserve", "logrankscore", "logrankapprox")[1],
    importance = TRUE,
    predictorWt = NULL,
    forest = FALSE,
    do.trace = FALSE,
    proximity = FALSE, 
    seed = NULL,
    ntime = NULL,
    add.noise = FALSE,
    ...)

Arguments

formula A symbolic description of the model that is to be fit. The details of model specification are given below.
data Data frame containing the predictors (variables) in the model. Missing values (NA) are discouraged; these are dealt with by removing the entire record if even one of its entries is NA (this applies only to entries for variables specifically called in formula).
ntree Number of trees to grow. This should not be set too small, in order to ensure that every input row gets predicted at least a few times.
mtry Number of predictors randomly sampled at each split. The default is sqrt(p) where p equals the number of predictors.
nodesize Minimum number of deaths (with unique event times) in a terminal node. Default equals min(3, ndead). Larger values cause smaller trees to be grown, so the algorithm runs faster.
splitrule Splitting rule used for splitting nodes in growing the survival tree. Possible values are “logrank”, “conserve”, “logrankscore” and “logrankapprox”. Default value is “logrank”. See details below.
importance Logical. Should importance of predictors be estimated? Default is TRUE.
predictorWt Vector of positive weights, where entry k represents the likelihood of selecting predictor k as a candidate for splitting. Default is to use uniform weights. Vector must be of dimension p, where p equals the number of predictors. See the sketch following this list.
forest Logical. Should the forest object be returned? Used for prediction on new data. Default is FALSE.
do.trace Logical. Should trace output be enabled? Default is FALSE. Integer values can also be passed: a positive value causes output to be printed every do.trace iterations.
proximity Logical. Should the proximity measure between observations be calculated? This creates an n x n matrix, which can be huge. Default is FALSE.
seed Seed for random number generator. Must be a negative integer (the R wrapper handles incorrectly set seed values).
ntime Maximum number of desired distinct time points considered for evaluating ensemble. Default equals number of distinct events.
add.noise Logical. Should noise variable be added? Default is FALSE.
... Further arguments passed to or from other methods.
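
As a quick sketch of how the sizing and weighting arguments fit together (the settings below, including the doubled weight on the first predictor, are arbitrary choices for illustration):

data(veteran, package = "randomSurvivalForest")
p <- ncol(veteran) - 2        # number of predictors, excluding time and status
wt <- rep(1, p)               # uniform weights, as in the default
wt[1] <- 2                    # first predictor twice as likely a candidate
wt.out <- rsf(Survrsf(time, status) ~ ., data = veteran,
              ntree = 500, nodesize = 3, predictorWt = wt)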

Details

The default rule, the “logrank” splitting rule, grows trees by splitting nodes by maximization of the log-rank test statistic (LeBlanc and Crowley, 1993). The “conserve” splitting rule splits nodes by finding daughters closest to the conservation of events principle (the cumulative hazard summing to the total number of deaths; see Naftel, Blackstone and Turner, 1985). The conservation of events splitting rule is less susceptible to an edge-effect bias sometimes seen with the log-rank splitting rule. The “logrankscore” splitting rule uses a standardized log-rank statistic (Hothorn and Lausen, 2003). The “logrankapprox” splitting rule splits nodes using an approximation to the log-rank test (suggested by Michael LeBlanc; also see Cox and Oakes, page 105).
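
To make the log-rank rule concrete, the following sketch scores candidate cutpoints of a single predictor with the log-rank statistic from the survival package. This illustrates the splitting criterion only; it is not the package's internal implementation.

# Log-rank statistic for candidate splits of 'karno' (illustration only).
library(survival)
data(veteran, package = "randomSurvivalForest")
cuts <- head(sort(unique(veteran$karno)), -1)  # keep both daughters nonempty
stats <- sapply(cuts, function(cc)
    survdiff(Surv(time, status) ~ I(karno <= cc), data = veteran)$chisq)
cuts[which.max(stats)]  # cutpoint the log-rank rule would favor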

All four rules often yield roughly the same prediction error performance, but users are encouraged to try all methods on any given example. The “logrankapprox” splitting rule is almost always fastest, especially with large data sets. After that, “conserve” is often second fastest. For very large data sets, discretizing continuous predictors and/or the observed survival times can greatly reduce computation time. The discretization does not have to be overly granular for substantial gains to be seen.
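
A hypothetical pre-processing step along these lines, rounding survival times up to 10-day blocks and binning age (the coarseness chosen here is arbitrary):

data(veteran, package = "randomSurvivalForest")
veteran$time <- ceiling(veteran$time / 10) * 10           # keeps times > 0
veteran$age <- as.numeric(cut(veteran$age, breaks = 5))   # 5 coarse bins
coarse.out <- rsf(Survrsf(time, status) ~ ., data = veteran)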

A typical formula has the form Survrsf(time, censoring) ~ terms, where time is survival time and censoring is a binary censoring indicator. Note that censoring must be coded as 0=censored and 1=death (event) and time must be strictly positive.
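
For instance, the lung data from the survival package codes its status variable as 1=censored and 2=dead, so it must be recoded to the required 0/1 convention before calling rsf (a minimal sketch, assuming the survival package is installed):

data(lung, package = "survival")
lung$status <- lung$status - 1   # recode to 0=censored, 1=death
lung.out <- rsf(Survrsf(time, status) ~ age + sex + ph.karno, data = lung)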

Predictors which are encoded as factors will be coerced into dummy variables. These dummy variables will be automatically labelled using the original predictor name. For example, if marital status is a predictor named “marital” encoded as a factor with levels “S”, “M” and “D”, two new dummy variables will be created labeled “maritalM” and “maritalS”.
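
The same coding can be reproduced in base R with model.matrix; rsf performs the coercion internally, so the snippet below only illustrates the labeling:

marital <- factor(c("S", "M", "D", "M"))   # levels D, M, S; D is baseline
model.matrix(~ marital)[, -1]              # columns maritalM and maritalS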

Value

An object of class (rsf, grow), which is a list with the following components:

call The original call to rsf.
formula The formula used in the call.
n Sample size of the data.
ndead Number of deaths.
ntree Number of trees grown.
mtry Number of predictors randomly selected for splitting at each node.
nodesize Minimum size of terminal nodes.
splitrule Splitting rule used.
Time Vector of length n recording survival times.
Cens Vector of length n recording censoring information (0=censored, 1=death).
timeInterest Sorted unique event times. Ensemble values are given for these time points only.
predictorNames A character vector of the predictor names used in growing the forest.
predictorWt Vector of positive weights used for randomly sampling predictors for splitting.
predictors Matrix of predictors used to grow the forest.
ensemble A matrix of the ensemble cumulative hazard function with each row corresponding to an individual's CHF evaluated at each of the time points in timeInterest.
oob.ensemble Same as ensemble, but based on the OOB cumulative hazard function.
mortality A vector of length n, with each value representing the estimated ensemble mortality for an individual in the data. Ensemble mortality values should be interpreted in terms of total number of deaths.
oob.mortality Same as mortality, but based on oob.ensemble.
err.rate Vector of length ntree containing OOB error rates for the ensemble, with the b-th element being the error rate for the ensemble formed using the first b trees. Error rate is 1-C, where C is Harrell's concordance index. Error rates are between 0 and 1, with 0.5 representing the benchmark value of a procedure based on random guessing. A value of 0 is perfect.
leaf.count Number of terminal nodes for each tree in the forest. Vector of length ntree.
importance Importance measure for each predictor. For each predictor this is the difference between the OOB error rate when the predictor is randomly permuted and the OOB error rate without any permutation (i.e. the final, ntree-th, component of err.rate). Large positive values indicate informative variables, whereas small or negative values indicate predictors unlikely to be informative. See the sketch following this list.
forest If forest=TRUE, the forest object is returned. This object can then be used for prediction with new data sets.
proximity If proximity=TRUE, an n x n matrix of proximity measures among the input observations is calculated (based on the frequency with which pairs of data points occupy the same terminal node). The value returned is a vector containing the lower triangle of the matrix. Use plot.proximity() to extract this information.
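
A minimal sketch of how the err.rate and importance components might be inspected (component names as documented above; importance is assumed here to be a vector named by predictor):

data(veteran, package = "randomSurvivalForest")
v.out <- rsf(Survrsf(time, status) ~ ., data = veteran)
plot(1:v.out$ntree, v.out$err.rate, type = "l",
     xlab = "Number of trees", ylab = "OOB error rate (1 - C)")
abline(h = 0.5, lty = 2)                   # random-guessing benchmark
sort(v.out$importance, decreasing = TRUE)  # most informative first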

Note

The key deliverable is the matrix ensemble which contains the estimated ensemble cumulative hazard function for each individual evaluated at a set of distinct time points (an OOB ensemble, oob.ensemble, is also returned). The vector mortality (likewise oob.mortality) is a weighted sum over the columns of ensemble, weighted by the number of individuals at risk at the different time points. Entry i of the vector represents the estimated total mortality of individual i in terms of total number of deaths. In other words, if i has a mortality value of 100, then if all individuals had the same predictors as i, there would be on average 100 deaths in the dataset.
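
A sketch of the weighted-sum construction just described, using the object v.out from the sketch above. This follows the verbal definition only; the package's exact internal weighting and normalization may differ.

# Number of individuals at risk at each time point in timeInterest.
n.risk <- sapply(v.out$timeInterest, function(tt) sum(v.out$Time >= tt))
# Weighted sum over the columns of the ensemble CHF matrix.
mort.sketch <- as.vector(v.out$ensemble %*% n.risk)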

Different R wrappers are provided with the package to aid in interpreting the ensemble.

Author(s)

Hemant Ishwaran hemant.ishwaran@gmail.com and Udaya B. Kogalur ubk2101@columbia.edu

References

H. Ishwaran and U.B. Kogalur (2006). Random survival forests, Cleveland Clinic Technical Report.

L. Breiman (2001). Random forests, Machine Learning, 45:5-32.

F.E. Harrell et al. (1982). Evaluating the yield of medical tests, J. Amer. Med. Assoc., 247:2543-2546.

M. LeBlanc and J. Crowley (1993). Survival trees by goodness of split, J. Amer. Stat. Assoc., 88:457-467.

D.C. Naftel, E.H. Blackstone and M.E. Turner (1985). Conservation of events, unpublished notes.

T. Hothorn and B. Lausen (2003). On the exact distribution of maximally selected rank statistics, Computational Statistics & Data Analysis, 43:121-137.

D.R. Cox and D. Oakes (1988). Analysis of Survival Data, Chapman and Hall.

A. Liaw and M. Wiener (2002). Classification and regression by randomForest, R News, 2:18-22.

See Also

plot.ensemble, plot.variable, plot.error, plot.proximity, predict.rsf, print.rsf, pmml_to_rsf, rsf_to_pmml, Survrsf.

Examples

# Example 1:  Veteran's Administration lung cancer trial from
# Kalbfleisch & Prentice.  Randomized trial of two treatment
# regimens for lung cancer.  Minimal argument call.  Print
# results, then plot error rate and importance values.

data(veteran, package = "randomSurvivalForest")
veteran.out <- rsf(Survrsf(time, status)~., data = veteran)
print.rsf(veteran.out)
plot.error(veteran.out)

# Example 2:  Richer argument call.
# Note that forest option is set to true to illustrate 
# how one might use 'rsf' for prediction (see 'rsf.predict'
# for more details).

data(veteran, package = "randomSurvivalForest")
veteran.f <- as.formula(Survrsf(time, status)~.)
ntree <- 500
mtry <- 2
nodesize <- 3
splitrule <- "logrank"
proximity <- TRUE
forest <- TRUE
seed <- -1
ntime <- NULL
do.trace <- 25
veteran2.out <- rsf(veteran.f, veteran, ntree, mtry, nodesize, splitrule,
                    proximity = proximity,
                    forest = forest,
                    seed = seed,
                    ntime = ntime,
                    do.trace = do.trace)
print.rsf(veteran2.out)
plot.proximity(veteran2.out)

# Take a peek at the forest ...
head(veteran2.out$forest$nativeArray)

# Partial plot of mortality for sorted predictors.
# Set partial to FALSE for a faster plot.
plot.variable(veteran2.out, 3, partial = TRUE)

# Example 3:  Veteran data again. Look specifically at
# Karnofsky performance score.  Compare to Kaplan-Meier.
# Assumes "survival" library is loaded.

if (library("survival", logical.return = TRUE))
{
        data(veteran, package = "randomSurvivalForest")
        veteran3.out <- rsf(Survrsf(time, status)~karno,
                       veteran,
                       ntree = 1000)
        plot.ensemble(veteran3.out)
        par(mfrow = c(1,1))
        plot(survfit(Surv(time, status)~karno, data = veteran))
}

# Example 4:  Primary biliary cirrhosis (PBC) of the liver.
# Data found in Appendix D.1 of Fleming and Harrington, Counting
# Processes and Survival Analysis, Wiley, 1991 (only differences
# are that age is in days and sex and stage variables are not
# missing for observations 313-418).  

data(pbc, package = "randomSurvivalForest") 
pbc.out <- rsf(Survrsf(days,status)~., pbc, ntree = 1000)
print(pbc.out)

# Example 5:  Compare Cox regression to Random Survival Forests
# for PBC data.  Compute OOB estimate of Harrell's concordance 
# index for Cox regression using B = 100 bootstrap draws.
# Assumes "Hmisc" and "survival" libraries are loaded. 

if (library("survival", logical.return = TRUE) 
    & library("Hmisc", logical.return = TRUE))
{
    data(pbc, package = "randomSurvivalForest")
    pbc2.out <- rsf(Survrsf(days,status)~., pbc, mtry = 2, ntree = 1000)
    B <- 100 
    cox.err <- rep(NA, B) 
    cox.f <- as.formula(Surv(days,status)~.)  
    pbc.data <- pbc[apply(is.na(pbc), 1, sum) == 0,] ##remove NA's 
    cat("Out-of-bag Cox Analysis ...", "\n")
    for (b in 1:B) {
        cat("Cox bootstrap", b, "\n") 
        bag.sample <- sample(1:dim(pbc.data)[1],
                             dim(pbc.data)[1],
                             replace = TRUE) 
        oob.sample <- setdiff(1:dim(pbc.data)[1], bag.sample)
        train <- pbc.data[bag.sample,]
        test <- pbc.data[oob.sample,]
        cox.out <- tryCatch({coxph(cox.f, train)}, error = function(ex){NULL})
        if (is.list(cox.out)) {
          cox.predict <- predict(cox.out, test)
          cox.err[b] <- rcorr.cens(cox.predict, 
                 Surv(pbc.data$days[oob.sample], pbc.data$status[oob.sample]))[1]
        }
     }
     cat("Error rates:", "\n")
     cat("Random Survival Forests:", pbc2.out$err.rate[pbc2.out$ntree], "\n")
     cat("         Cox Regression:", mean(cox.err, na.rm = TRUE), "\n")
}

# Example 6:  Using an external data set.
## Not run: 
file.in <- "other.data"
other.data <- read.table(file.in, header = TRUE)
rsf.f <- as.formula(Survrsf(time, status)~.)
rsf.out <- rsf(formula = rsf.f, data = other.data) 
## End(Not run)
