ssden {gss}R Documentation

Estimating Probability Density Using Smoothing Splines

Description

Estimate probability densities using smoothing spline ANOVA models with cubic spline, linear spline, or thin-plate spline marginals for numerical variables. The symbolic model specification via formula follows the same rules as in lm, but with the response missing.

Usage

ssden(formula, type="cubic", data=list(), alpha=1.4, weights=NULL,
      subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL,
      domain=as.list(NULL), quadrature=NULL, ext=.05, order=2,
      prec=1e-7, maxiter=30)

Arguments

formula Symbolic description of the model to be fit.
type Type of numerical marginals to be used. Supported are type="cubic" for cubic spline marginals, type="linear" for linear spline marginals, and type="tp" for thin-plate spline marginals.
data Optional data frame containing the variables in the model.
alpha Parameter defining cross-validation score for smoothing parameter selection.
weights Optional vector of bin-counts for histogram data.
subset Optional vector specifying a subset of observations to be used in the fitting process.
na.action Function which indicates what should happen when the data contain NAs.
id.basis Index of observations to be used as "knots."
nbasis Number of "knots" to be used. Ignored when id.basis is specified.
seed Seed to be used for the random generation of "knots." Ignored when id.basis is specified.
domain Data frame specifying marginal support of density.
quadrature Quadrature for calculating integral. Mandatory for type="tp".
ext For cubic spline and linear spline marginals, this option specifies how far to extend the domain beyond the minimum and the maximum as a percentage of the range. The default ext=.05 specifies marginal domains of lengths 110 percent of their respective ranges. Evaluation outside of the domain will result in an error. Ignored if type="tp" or domain are specified.
order For thin-plate spline marginals, this option specifies the order of the marginal penalties. Ignored if type="cubic" or type="linear" are specified.
prec Precision requirement for internal iterations.
maxiter Maximum number of iterations allowed for internal iterations.

Details

The model specification via formula is for the log density. For example, ~x1*x2 prescribes a model of the form

log f(x1,x2) = g_{1}(x1) + g_{2}(x2) + g_{12}(x1,x2) + C

with the terms denoted by "x1", "x2", and "x1:x2"; the constant is determined by the fact that a density integrates to one.

The selective term elimination may characterize (conditional) independence structures between variables. For example, ~x1*x2+x1*x3 yields the conditional independence of x2 and x3 given x1. Currently, up to four variables are supported.

Parallel to those in a ssanova object, the model terms are sums of unpenalized and penalized terms. Attached to every penalized term there is a smoothing parameter, and the model complexity is largely determined by the number of smoothing parameters.

The selection of smoothing parameters is through a cross-validation mechanism described in the references, with a parameter alpha; alpha=1 is "unbiased" for the minimization of Kullback-Leibler loss but may yield severe undersmoothing, whereas larger alpha yields smoother estimates.

A subset of the observations are selected as "knots." Unless specified via id.basis or nbasis, the subset size is determined by max(30,10n^(2/9)), which is appropriate for type="cubic" but not necessarily for type="linear" or type="tp".

Value

ssden returns a list object of class "ssden".
dssden and cdssden can be used to evaluate the estimated joint density and conditional density. pssden, qssden, cpssden, and cqssden can be used to evaluate (conditional) cdf and quantiles.

Note

For type="cubic" and type="linear", the quadrature will be generated if not provided by the user. The default quadrature in 1-D is the 200-point Gauss-Legendre formula on the domain. The default quadratures on 2-D, 3-D, and 4-D cubes are selected delayed Smolyak cubatures with 449, 2527, and 13697 points, on properly scaled product domains. See gauss.quad and smolyak.quad.

Author(s)

Chong Gu, chong@stat.purdue.edu

References

Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811–826.

Gu, C. (2002), Smoothing Spline ANOVA Models. New York: Springer-Verlag.

See Also

dssden and cdssden.

Examples

## 1-D estimate: Buffalo snowfall
data(buffalo)
buff.fit <- ssden(~buffalo,domain=data.frame(buffalo=c(0,150)))
plot(xx<-seq(0,150,len=101),dssden(buff.fit,xx),type="l")
plot(xx,pssden(buff.fit,xx),type="l")
plot(qq<-seq(0,1,len=51),qssden(buff.fit,qq),type="l")
## Clean up
## Not run: 
rm(buffalo,buff.fit,xx,qq)
dev.off()
## End(Not run)

## 2-D with triangular domain: AIDS incubation
data(aids)
## rectangular quadrature
quad.pt <- expand.grid(incu=((1:40)-.5)/40*100,infe=((1:40)-.5)/40*100)
quad.pt <- quad.pt[quad.pt$incu<=quad.pt$infe,]
quad.wt <- rep(1,nrow(quad.pt))
quad.wt[quad.pt$incu==quad.pt$infe] <- .5
quad.wt <- quad.wt/sum(quad.wt)*5e3
## additive model (pre-truncation independence)
aids.fit <- ssden(~incu+infe,data=aids,subset=age>=60,
                  domain=data.frame(incu=c(0,100),infe=c(0,100)),
                  quad=list(pt=quad.pt,wt=quad.wt))
## conditional (marginal) density of infe
jk <- cdssden(aids.fit,xx<-seq(0,100,len=51),data.frame(incu=50))
plot(xx,jk$pdf,type="l")
## conditional (marginal) quantiles of infe (TIME-CONSUMING)
## Not run: 
cqssden(aids.fit,c(.05,.25,.5,.75,.95),data.frame(incu=50),jk$int)
## End(Not run)
## Clean up
## Not run: 
rm(aids,quad.pt,quad.wt,aids.fit,jk,xx)
dev.off()
## End(Not run)

[Package gss version 0.9-3 Index]