ssden {gss} | R Documentation |
Estimate probability densities using smoothing spline ANOVA models.
The symbolic model specification via formula follows the same rules as in lm, but with the response missing.
ssden(formula, type=NULL, data=list(), alpha=1.4, weights=NULL, subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL, domain=as.list(NULL), quadrature=NULL, prec=1e-7, maxiter=30)
formula |
Symbolic description of the model to be fit. |
type |
List specifying the type of spline for each variable.
See mkterm for details. |
data |
Optional data frame containing the variables in the model. |
alpha |
Parameter defining cross-validation score for smoothing parameter selection. |
weights |
Optional vector of bin-counts for histogram data. |
subset |
Optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
Function which indicates what should happen when the data contain NAs. |
id.basis |
Index of observations to be used as "knots." |
nbasis |
Number of "knots" to be used. Ignored when id.basis is specified. |
seed |
Seed to be used for the random generation of "knots."
Ignored when id.basis is specified. |
domain |
Data frame specifying marginal support of density. |
quadrature |
Quadrature for calculating integral. Mandatory if variables other than factors or numerical vectors are involved. |
prec |
Precision requirement for internal iterations. |
maxiter |
Maximum number of iterations allowed for internal iterations. |
The model specification via formula is for the log density. For example, ~x1*x2 prescribes a model of the form
log f(x1,x2) = g_{1}(x1) + g_{2}(x2) + g_{12}(x1,x2) + C
with the terms denoted by "x1", "x2", and "x1:x2"; the constant is determined by the fact that a density integrates to one.
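As a small illustration, the term labels implied by such a formula can be listed with base R's terms(), which does not depend on gss. The names f and lab are used only for this sketch.

```r
# Expand the one-sided formula and list the model terms it implies.
f <- ~ x1 * x2
lab <- attr(terms(f), "term.labels")
print(lab)
```

Each label corresponds to one of the g terms in the log density; the constant C has no label because it is fixed by normalization.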
Selective term elimination may characterize (conditional) independence structures between variables. For example, ~x1*x2+x1*x3 yields the conditional independence of x2 and x3 given x1.
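The conditional independence can be read off the model form. A brief derivation follows, using the same g notation as the log density above; the symbols a and b are introduced here only for illustration, and the integrals become sums for factors.

```latex
\begin{aligned}
\log f(x_1,x_2,x_3) &= g_1(x_1) + g_2(x_2) + g_3(x_3)
                       + g_{12}(x_1,x_2) + g_{13}(x_1,x_3) + C, \\
f(x_1,x_2,x_3) &= a(x_1,x_2)\, b(x_1,x_3), \\
a(x_1,x_2) &= \exp\{C + g_1(x_1) + g_2(x_2) + g_{12}(x_1,x_2)\}, \qquad
b(x_1,x_3) = \exp\{g_3(x_3) + g_{13}(x_1,x_3)\}, \\
f(x_2,x_3 \mid x_1) &= \frac{a(x_1,x_2)}{\int a(x_1,u)\,du}
                       \cdot \frac{b(x_1,x_3)}{\int b(x_1,v)\,dv}.
\end{aligned}
```

Because no term involves x2 and x3 together, the conditional density factors into a function of (x1, x2) times a function of (x1, x3).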
Parallel to those in an ssanova object, the model terms are sums of unpenalized and penalized terms. Attached to every penalized term there is a smoothing parameter, and the model complexity is largely determined by the number of smoothing parameters.
The selection of smoothing parameters is through a cross-validation mechanism described in the references, with a parameter alpha; alpha=1 is "unbiased" for the minimization of Kullback-Leibler loss but may yield severe undersmoothing, whereas larger alpha yields smoother estimates.
A subset of the observations is selected as "knots." Unless specified via id.basis or nbasis, the number of "knots" q is determined by max(30, 10n^{2/9}), which is appropriate for the default cubic splines for numerical vectors.
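The default knot count is easy to tabulate. The sketch below evaluates the formula in base R; the function name default_nbasis is hypothetical, and rounding up to an integer is an assumption, since the text does not say how a fractional value is handled.

```r
# Default number of knots, q = max(30, 10 * n^(2/9)).
# Assumption: the value is rounded up to an integer.
default_nbasis <- function(n) {
  max(30, ceiling(10 * n^(2 / 9)))
}

# Small samples hit the floor of 30; the count grows slowly with n.
sapply(c(100, 1000, 10000), default_nbasis)
```

The slow growth in n is what keeps the computation manageable for large samples.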
ssden returns a list object of class "ssden". dssden and cdssden can be used to evaluate the estimated joint density and conditional density; pssden, qssden, cpssden, and cqssden can be used to evaluate (conditional) cdf and quantiles. The method project.ssden can be used to calculate the Kullback-Leibler projection for model selection.
Default quadrature will be constructed for up to 4 numerical vectors on a hypercube, then the outer product with factor levels will be taken if factors are involved. The sides of the hypercube are specified by domain; for domain$x missing, the default is c(min(x),max(x))+c(-1,1)*(max(x)-min(x))*.05.
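The default range described above extends the observed range by 5% on each side. A minimal base-R sketch, with the hypothetical helper name default_domain and made-up data:

```r
# Default marginal support: observed range padded by 5% of its width per side.
default_domain <- function(x) {
  c(min(x), max(x)) + c(-1, 1) * (max(x) - min(x)) * 0.05
}

x <- c(2, 5, 12)      # illustrative values, not from the package
default_domain(x)
```

When the true support is known, for example a variable that cannot be negative, supplying domain explicitly avoids placing mass outside it.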
On a 1-D interval, the quadrature is the 200-point Gauss-Legendre formula returned from gauss.quad. For 2, 3, or 4 numerical vectors, delayed Smolyak cubatures from smolyak.quad with 449, 2527, and 13697 points are used on cubes with the marginals properly transformed; see Gu and Wang (2003) for the marginal transformations.
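When a custom quadrature is needed, it is supplied as a list of points and weights. The sketch below builds one on a 1-D interval with a plain midpoint rule, which is a simpler stand-in and not the Gauss-Legendre formula used by default. The helper name midpoint_quad is hypothetical, and the list(pt=, wt=) layout with one data-frame column per variable is assumed from the AIDS example on this page.

```r
# Midpoint rule on [lo, hi] with m equal cells (not Gauss-Legendre).
midpoint_quad <- function(lo, hi, m = 200, name = "x") {
  h <- (hi - lo) / m
  pt <- data.frame(lo + (seq_len(m) - 0.5) * h)
  names(pt) <- name               # assumed: column named after the variable
  list(pt = pt, wt = rep(h, m))
}

quad <- midpoint_quad(0, 150, name = "buffalo")

# Sanity checks: the weights sum to the interval length, and a known
# density is integrated to nearly its exact mass on the interval.
sum(quad$wt)
approx_mass <- sum(quad$wt * dnorm(quad$pt$buffalo, mean = 80, sd = 20))
exact_mass <- pnorm(150, 80, 20) - pnorm(0, 80, 20)
c(approx = approx_mass, exact = exact_mass)
```

Checking a candidate quadrature against a density with known mass, as done here, is a cheap way to catch mis-scaled weights before fitting.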
The results may vary from run to run. For consistency, specify id.basis or set seed.
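One way to get repeatable fits is to draw the knot indices once and reuse them. The sketch below does this on simulated data; the seed value, the data, and the choice of 30 knots are arbitrary assumptions for illustration, and the fitting step runs only if gss is installed.

```r
set.seed(5732)                      # arbitrary seed, for illustration only
dat <- data.frame(x = rnorm(100))   # simulated data, not from the package

# Draw 30 observation indices once and keep them.
knots <- sort(sample(nrow(dat), 30))

# Passing the same indices on every call removes the random knot selection.
if (requireNamespace("gss", quietly = TRUE)) {
  fit1 <- gss::ssden(~x, data = dat, id.basis = knots)
  fit2 <- gss::ssden(~x, data = dat, id.basis = knots)
  xx <- seq(-2, 2, length.out = 5)
  print(all.equal(gss::dssden(fit1, xx), gss::dssden(fit2, xx)))
}
```

Storing the index vector alongside the analysis script makes the fit reproducible even if the random number generator changes between R versions.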
Chong Gu, chong@stat.purdue.edu
Gu, C. (2002), Smoothing Spline ANOVA Models. New York: Springer-Verlag.
Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811–826.
## 1-D estimate: Buffalo snowfall
data(buffalo)
buff.fit <- ssden(~buffalo,domain=data.frame(buffalo=c(0,150)))
plot(xx<-seq(0,150,len=101),dssden(buff.fit,xx),type="l")
plot(xx,pssden(buff.fit,xx),type="l")
plot(qq<-seq(0,1,len=51),qssden(buff.fit,qq),type="l")
## Clean up
## Not run:
rm(buffalo,buff.fit,xx,qq)
dev.off()
## End(Not run)

## 2-D with triangular domain: AIDS incubation
data(aids)
## rectangular quadrature
quad.pt <- expand.grid(incu=((1:40)-.5)/40*100,infe=((1:40)-.5)/40*100)
quad.pt <- quad.pt[quad.pt$incu<=quad.pt$infe,]
quad.wt <- rep(1,nrow(quad.pt))
quad.wt[quad.pt$incu==quad.pt$infe] <- .5
quad.wt <- quad.wt/sum(quad.wt)*5e3
## additive model (pre-truncation independence)
aids.fit <- ssden(~incu+infe,data=aids,subset=age>=60,
                  domain=data.frame(incu=c(0,100),infe=c(0,100)),
                  quad=list(pt=quad.pt,wt=quad.wt))
## conditional (marginal) density of infe
jk <- cdssden(aids.fit,xx<-seq(0,100,len=51),data.frame(incu=50))
plot(xx,jk$pdf,type="l")
## conditional (marginal) quantiles of infe (TIME-CONSUMING)
## Not run:
cqssden(aids.fit,c(.05,.25,.5,.75,.95),data.frame(incu=50),jk$int)
## End(Not run)
## Clean up
## Not run:
rm(aids,quad.pt,quad.wt,aids.fit,jk,xx)
dev.off()
## End(Not run)

## One factor plus one vector
data(gastric)
gastric$trt
fit <- ssden(~futime*trt,data=gastric)
## conditional density
cdssden(fit,c("1","2"),cond=data.frame(futime=150))
## conditional quantiles
cqssden(fit,c(.05,.25,.5,.75,.95),data.frame(trt="1"))
## Clean up
## Not run:
rm(gastric,fit)