cforest {party}                                                R Documentation

Random Forest

Description

An implementation of the random forest and bagging ensemble algorithms utilizing conditional inference trees as base learners.

Usage

cforest(formula, data = list(), subset = NULL, weights = NULL, 
        controls = cforest_unbiased(),
        xtrafo = ptrafo, ytrafo = ptrafo, scores = NULL)
varimp(object, mincriterion = 0.0)
proximity(object)

Arguments

formula a symbolic description of the model to be fit.
data a data frame containing the variables in the model.
subset an optional vector specifying a subset of observations to be used in the fitting process.
weights an optional vector of weights to be used in the fitting process. Non-negative integer-valued weights are allowed as well as non-negative real weights. Observations are sampled (with or without replacement) according to probabilities weights / sum(weights). The fraction of observations to be sampled (without replacement) is computed based on the sum of the weights if all weights are integer-valued, and otherwise based on the number of weights greater than zero.
controls an object of class ForestControl-class, which can be obtained using cforest_control (and its convenience interfaces cforest_classical and cforest_unbiased).
xtrafo a function to be applied to all input variables. By default, the ptrafo function is applied.
ytrafo a function to be applied to all response variables. By default, the ptrafo function is applied.
scores an optional named list of scores to be attached to ordered factors.
object an object as returned by cforest.
mincriterion the value of the test statistic or 1 - p-value that must be exceeded in order to make use of a split. See ctree_control.

Details

This implementation of the random forest (and bagging) algorithm differs from the reference implementation in randomForest with respect to the base learner used and the aggregation scheme applied.

Conditional inference trees, see ctree, are fitted to each of the ntree (defined via cforest_control) bootstrap samples of the learning sample. Many hyperparameters can be controlled; see cforest_control. Do not change settings you do not fully understand.
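
As a minimal sketch, a control object with a different number of trees and of inputs sampled per node might be set up as follows; the data set and the values shown are purely illustrative, not recommendations.

    ## illustrative values only; cforest_unbiased() keeps the unbiased defaults
    ctrl <- cforest_unbiased(ntree = 100, mtry = 3)
    cf <- cforest(Species ~ ., data = iris, controls = ctrl)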

The aggregation scheme works by averaging observation weights extracted from each of the ntree trees and NOT by averaging predictions directly. See Hothorn et al. (2004) for a description. Predictions can be computed using predict. For observations with zero weights, predictions are computed from the fitted tree when newdata = NULL.
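
A minimal sketch of obtaining aggregated predictions, assuming the small illustrative forest cf fitted in the sketch above:

    predict(cf, OOB = TRUE)              ## out-of-bag predictions (newdata = NULL)
    predict(cf, newdata = iris[1:5, ])   ## predictions for new observations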

Ensembles of conditional inference trees have not yet been extensively tested, so this routine is meant for the expert user only and its current state is rather experimental. However, it handles some tasks that randomForest cannot, for example fitting forests to censored response variables or to multivariate and ordered responses.

By default, unbiased trees (see Strobl et al., 2007) are used and five inputs are randomly examined for possible splits in each node (mtry is a hyperparameter and should be set deliberately by the user). The defaults can be changed by passing hyperparameters via the cforest_control function to the controls argument of cforest. Function varimp can be used to compute variable importance measures similar to those computed by importance in package randomForest.
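
A brief sketch of computing variable importances, again assuming the illustrative forest cf from above:

    vi <- varimp(cf)                     ## variable importances
    sort(vi, decreasing = TRUE)          ## larger values indicate more important inputs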

The proximity matrix is an n x n matrix P with P_ij equal to the fraction of trees in which observations i and j are elements of the same terminal node (when both i and j had non-zero weights in the same bootstrap sample).
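
A short sketch of inspecting the proximity matrix, once more assuming the illustrative forest cf from above:

    P <- proximity(cf)
    dim(P)           ## n x n, here 150 x 150 for the iris data
    P[1:3, 1:3]      ## pairwise fractions for the first three observations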

Value

An object of class RandomForest-class.

References

Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5–32.

Torsten Hothorn, Berthold Lausen, Axel Benner and Martin Radespiel-Tröger (2004). Bagging Survival Trees. Statistics in Medicine, 23(1), 77–91.

Torsten Hothorn, Peter Bühlmann, Sandrine Dudoit, Annette Molinaro and Mark J. van der Laan (2006). Survival Ensembles. Biostatistics, 7(3), 355–373.

Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis and Torsten Hothorn (2007). Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics, 8, 25. http://www.biomedcentral.com/1471-2105/8/25/abstract

Examples


    ### honest (i.e., out-of-bag) cross-classification of
    ### true vs. predicted classes
    table(mammoexp$ME, predict(cforest(ME ~ ., data = mammoexp, 
                               controls = cforest_classical(ntree = 50)),
                               OOB = TRUE))

    ### fit forest to censored response
    if (require("ipred")) {

        data("GBSG2", package = "ipred")
        bst <- cforest(Surv(time, cens) ~ ., data = GBSG2, 
                       controls = cforest_classical(ntree = 50))

        ### estimate conditional Kaplan-Meier curves
        treeresponse(bst, newdata = GBSG2[1:2,], OOB = TRUE)

        ### if you can't resist looking at individual trees ...
        party:::prettytree(bst@ensemble[[1]], names(bst@data@get("input")))
    }

    ### proximity, see ?randomForest
    iris.cf <- cforest(Species ~ ., data = iris, 
                       controls = cforest_unbiased(mtry = 2))
    iris.mds <- cmdscale(1 - proximity(iris.cf), eig = TRUE)
    op <- par(pty="s")
    pairs(cbind(iris[,1:4], iris.mds$points), cex = 0.6, gap = 0, 
          col = c("red", "green", "blue")[as.numeric(iris$Species)],
          main = "Iris Data: Predictors and MDS of Proximity Based on cforest")
    par(op)

