bark {bark}        R Documentation

Bayesian Additive Regression Kernels

Description

BARK is a Bayesian sum-of-kernels model.
For numeric response y, we have y = f(x) + e, where e ~ N(0,sigma^2).
For a binary response y, P(Y=1 | x) = F(f(x)), where F denotes the standard normal cdf (probit link).

In both cases, f is the sum of many Gaussian kernel functions, with an approximated Cauchy process as the prior distribution for f. The goal is very flexible inference for the unknown function.

Feature selection can be achieved through inference on the scale parameters in the Gaussian kernels. BARK accepts four types of prior distribution, e, d, se, and sd, enabling either soft shrinkage or hard shrinkage of the scale parameters.
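
To make the kernel form concrete, here is a minimal sketch in R (illustrative only; the names gauss.kernel, chi, and lambda are chosen here, and the package's internal parameterization may differ in detail):

# One Gaussian kernel centered at chi, with per-dimension scale
# parameters lambda (illustrative sketch, not the package internals).
gauss.kernel <- function(x, chi, lambda) {
  exp(-sum(lambda * (x - chi)^2))
}
# f(x) is a weighted sum of many such kernels. Shrinking a scale
# lambda[k] to zero removes variable k from the kernel; the selection
# priors (se, sd) exploit this for hard shrinkage and feature selection.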

Usage

bark(x.train, y.train, x.test = matrix(0, 0, 0),
     type = "se", classification = FALSE,
     keepevery = 100, nburn = 100, nkeep = 100,
     printevery = 1000, keeptrain = FALSE,
     fixed = list(),
     tune = list(lstep=0.5, frequL=.2,
                 dpow=1, upow=0, varphistep=.5, phistep=1),
     theta = list()
    )

Arguments

x.train Explanatory variables for training (in sample) data.
Must be a matrix of doubles, with (as usual) rows corresponding to observations and columns to variables.
y.train Dependent variable for training (in sample) data.
If y is numeric, a continuous response model is fit (normal errors).
If y is logical (or has only the values 0 and 1), a binary response model with a probit link is fit.
x.test Explanatory variables for test (out of sample) data.
Should have same structure as x.train.
type BARK type: e, d, se, or sd; the default is se.
e: BARK with equal weights.
d: BARK with different weights.
se: BARK with selection and equal weights.
sd: BARK with selection and different weights.
classification Logical, indicating a classification (TRUE) or regression (FALSE) problem.
keepevery Every keepevery-th draw is kept to be returned to the user.
nburn Number of MCMC iterations (nburn*keepevery) to be treated as burn in.
nkeep Number of MCMC draws kept for posterior inference, i.e. nkeep*keepevery iterations after the burn in (see the sketch after this argument list).
printevery As the MCMC runs, a message is printed every printevery draws.
keeptrain Logical, whether to keep results for training samples.
fixed A list of fixed hyperparameters, using the default values if not specified.
alpha = 1: stable index, must be 1.
eps = 0.5: approximation parameter.
gam = 5: intensity parameter.
la = 1: first argument of the gamma prior on kernel scales.
lb = 2: second argument of the gamma prior on kernel scales.
pbetaa = 1: first argument of the beta prior on plambda.
pbetab = 1: second argument of the beta prior on plambda.
n: number of training samples, automatically generated.
p: number of explanatory variables, automatically generated.
meanJ: the expected number of kernels, automatically generated.
tune A list of tuning parameters, not expected to change.
lstep: the stepsize of the lognormal random walk on lambda.
frequL: the frequency to update L.
dpow: the power on the death step.
upow: the power on the update step.
varphistep: the stepsize of the lognormal random walk on varphi.
phistep: the stepsize of the lognormal random walk on phi.
theta A list of starting values for the parameter theta; the defaults are used if nothing is given.
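
As a worked sketch of the iteration bookkeeping implied by keepevery, nburn, and nkeep (assuming one kept draw per keepevery iterations, per the argument descriptions above):

# MCMC bookkeeping under the default settings.
keepevery <- 100; nburn <- 100; nkeep <- 100
burnin.iterations <- nburn * keepevery                    # 10000 discarded
kept.iterations   <- nkeep * keepevery                    # 10000 more run
total.iterations  <- burnin.iterations + kept.iterations  # 20000 in total
# Only nkeep = 100 thinned draws are returned to the user.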

Details

BARK is a Bayesian MCMC method. At each MCMC iteration, we produce a draw from the joint posterior distribution, i.e. a full configuration of regression coefficients, kernel locations, kernel parameters, etc.

Thus, unlike many other modelling methods in R, we do not produce a single model object from which fits and summaries may be extracted. The output consists of values f*(x) (and sigma* in the numeric case), where * denotes a particular draw. The x is either a row from the training data (x.train) or the test data (x.test).
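
Pointwise posterior summaries can therefore be computed directly from the returned draws. A minimal sketch, assuming fit is the list returned by bark with a non-empty x.test:

# fit <- bark(x.train, y.train, x.test)   # fitted beforehand
# Rows of yhat.test index test points; columns index posterior draws.
post.mean <- rowMeans(fit$yhat.test)            # same as fit$yhat.test.mean
post.band <- apply(fit$yhat.test, 1, quantile,
                   probs = c(0.025, 0.975))     # 95% pointwise intervals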

Value

bark returns a list, including:

fixed Fixed hyperparameters.
tune Tuning parameters used.
theta.last The last set of parameters from the posterior draw.
theta.nvec A matrix with nrow(x.train)+1 rows and nkeep columns, recording the number of kernels at each training sample.
theta.varphi A matrix with nrow(x.train)+1 rows and nkeep columns, recording the precision in the normal gamma prior distribution for the regression coefficients.
theta.beta A matrix with nrow(x.train)+1 rows and nkeep columns, recording the regression coefficients.
theta.lambda A matrix with ncol(x.train) rows and nkeep columns, recording the kernel scale parameters.
theta.phi A vector of length nkeep, recording the precision of the regression Gaussian noise (equal to 1 in the classification case).
yhat.train A matrix with nrow(x.train) rows and nkeep columns. Each column corresponds to a draw f* from the posterior of f and each row corresponds to a row of x.train. The (i,j) value is f*(x_i) for the j-th kept draw of f and the i-th row of x.train.
For classification problems, this is the value of the expectation of the underlying normal random variable (see the probability sketch after this list).
Burn-in is dropped.
yhat.test Same as yhat.train but now the x's are the rows of the test data.
yhat.train.mean train data fits = row mean of yhat.train.
yhat.test.mean test data fits = row mean of yhat.test.
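
For classification fits, the yhat draws are the latent normal means f(x), so posterior class probabilities follow from the probit link. A hedged sketch, assuming fit.cls was obtained with classification=TRUE:

# P(Y = 1 | x) = pnorm(f(x)) under the probit link.
prob.draws <- pnorm(fit.cls$yhat.test)      # probability for each draw
prob.mean  <- rowMeans(prob.draws)          # posterior mean probability
pred.class <- as.numeric(prob.mean > 0.5)   # 0/1 class predictions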

Note

The code is not efficient in computing the kernel matrix during the updating steps. This is expected to be fixed in the next version.

References

Ouyang, Zhi (2008). Bayesian Additive Regression Kernels. Ph.D. dissertation, Duke University, page 58.
Available at: http://stat.duke.edu/people/theses/OuyangZ.html

Examples

##Simulate regression example
# Friedman 2 data set, 200 noisy training, 1000 noise free testing
# Out of sample MSE in SVM (default RBF): 6500 (sd. 1600)
# Out of sample MSE in BART (default):    5300 (sd. 1000)
traindata <- sim.Friedman2(200, sd=125)
testdata <- sim.Friedman2(1000, sd=0)
fit.bark.d <- bark(traindata$x, traindata$y, testdata$x, classification=FALSE, type="d")
boxplot(as.data.frame(fit.bark.d$theta.lambda))
mean((fit.bark.d$yhat.test.mean-testdata$y)^2)

##Simulate classification example
# Circle 5 with 2 signal and 3 noisy dimensions
# Out of sample error rate in SVM (default RBF): 0.110 (sd. 0.02)
# Out of sample error rate in BART (default):    0.065 (sd. 0.02)
traindata <- sim.Circle(200, dim=5)
testdata <- sim.Circle(1000, dim=5)
fit.bark.se <- bark(traindata$x, traindata$y, testdata$x, classification=TRUE, type="se")
boxplot(as.data.frame(fit.bark.se$theta.lambda))
mean((fit.bark.se$yhat.test.mean>0)!=testdata$y)

