featureSignif {feature} | R Documentation |
Identify significant features of kernel density estimates of 1- to 4-dimensional data. This function has an interactive and a non-interactive mode.
featureSignif(x, bw, xlab, ylab, zlab, xlim, ylim, zlim, addData=FALSE, scaleData=FALSE, addDataNum=1000, addKDE=TRUE, jitterRug=TRUE, signifLevel=0.05, addSignifGradRegion=FALSE, addSignifGradData=FALSE, addSignifCurvRegion=FALSE, addSignifCurvData=FALSE, plotSiZer=FALSE, logbwSiZer=TRUE, addAxes3d=TRUE, densCol, dataCol="black", gradCol="green4", curvCol="blue", axisCol="black", bgCol="white", dataAlpha=0.1, gradDataAlpha=0.3, gradRegionAlpha=0.2, curvDataAlpha=0.3, curvRegionAlpha=0.3, gridsize, gridsizeSiZer)
x |
data matrix |
bw |
bandwidth(s) - see below for details on how to specify bandwidths |
xlim, ylim, zlim |
x-, y-, z-axis limits |
xlab, ylab, zlab |
x-, y-, z-axis labels |
scaleData |
flag for scaling the data i.e. transforming to unit variance for each dimension. Default is FALSE. |
addData |
flag for display of the data. Default is FALSE. |
addDataNum |
maximum number of data points plotted in displays. Default is 1000. |
addKDE |
flag for display of kernel density estimates. Default is TRUE. Not available for 4-d data. |
jitterRug |
flag for jittering of rug-plot for univariate data display. Default is TRUE. |
addSignifGradRegion |
flag for display of significant gradient regions. Default is FALSE. Not available for 4-d data. |
addSignifGradData |
flag for display of significant gradient data points. Default is FALSE. |
addSignifCurvRegion |
flag for display of significant curvature regions. Default is FALSE. Not available for 4-d data. |
addSignifCurvData |
flag for display of significant curvature data points. Default is FALSE. |
plotSiZer |
flag for displaying 1-d gradient SiZer map. Default is FALSE. |
logbwSiZer |
flag for displaying log bandwidths in SiZer map. Default is TRUE. |
addAxes3d |
flag for displaying axes in 3-d displays. Default is TRUE. |
signifLevel |
significance level. Default is 0.05. |
densCol |
colour of density estimate curve. Default for 1-d data is "DarkOrange", for 2-d data is heat.colors(1000), for 3-d data is heat.colors(5). |
dataCol |
colour of data points. Default is "black". |
gradCol |
colour of significant gradient regions/points. Default is "green4". |
curvCol |
colour of significant curvature regions/points. Default is "blue". |
axisCol |
colour of axes. Default is "black". |
bgCol |
colour of background. Default is "white". |
dataAlpha |
alpha-blending transparency value for data points. |
gradDataAlpha |
alpha-blending transparency value for significant gradient data points. |
gradRegionAlpha |
alpha-blending transparency value for significant gradient regions. |
curvDataAlpha |
alpha-blending transparency value for significant curvature data points. |
curvRegionAlpha |
alpha-blending transparency value for significant curvature regions. |
gridsize |
vector of the number of grid points in each direction. |
gridsizeSiZer |
number of x-axis grid points for SiZer map. Default is 101. |
Feature significance is based on significance testing of the gradient (first derivative) and curvature (second derivative) of a kernel density estimate. This was developed for 1-d data by Chaudhuri & Marron (1999), for 2-d data by Godtliebsen, Marron & Chaudhuri (2002), and for 3-d and 4-d data by Duong, Cowling, Koch & Wand (2007).
The test statistic for gradient testing at a point x is
W(x) = || hat{grad f}(x; H)||^2
where hat{grad f}(x; H) is the kernel estimate of the gradient of f(x) with bandwidth H, and ||.|| is the Euclidean norm. W(x) is approximately chi-squared distributed with d degrees of freedom, where d is the dimension of the data.
The test statistic for curvature is analogous to that for gradient testing:
W2(x) = ||vech hat{curv f}(x; H)||^2
where hat{curv f}(x; H) is the kernel estimate of the curvature of f(x), and vech is the vector-half operator. W2(x) is approximately chi-squared distributed with d(d+1)/2 degrees of freedom.
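As a rough illustration of the two reference distributions, the following base R sketch tabulates the degrees of freedom and the pointwise chi-squared critical values for d = 1 to 4 at the default signifLevel of 0.05. It uses only qchisq from the stats package. The variable names are illustrative, and the values are unadjusted pointwise quantiles, so they are not necessarily the thresholds that featureSignif applies internally.

```r
# Pointwise critical values for the gradient and curvature tests (illustrative only)
signifLevel <- 0.05
d <- 1:4                          # data dimension

df_grad <- d                      # gradient test: d degrees of freedom
df_curv <- d * (d + 1) / 2        # curvature test: d(d+1)/2 degrees of freedom

crit <- data.frame(
  d         = d,
  df_grad   = df_grad,
  crit_grad = qchisq(1 - signifLevel, df = df_grad),
  df_curv   = df_curv,
  crit_curv = qchisq(1 - signifLevel, df = df_curv)
)
print(crit)
```

For 4-d data the curvature test has 10 degrees of freedom, so its critical value is considerably larger than that of the gradient test.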
Since this is a situation with many dependent hypothesis tests, we use a multiple comparison or simultaneous test to control the overall level of significance. We use a Hochberg-type procedure. See Hochberg (1988) and Duong, Cowling, Koch & Wand (2007).
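To give a feel for what a Hochberg adjustment does, here is a minimal sketch using p.adjust from the stats package on a handful of made-up p-values. This is the classical step-up procedure of Hochberg (1988); the "Hochberg-type" procedure used inside featureSignif may differ in its details, so treat this only as an illustration of the idea.

```r
# Made-up p-values from five pointwise tests (not output of featureSignif)
p <- c(0.001, 0.012, 0.03, 0.04, 0.2)
signifLevel <- 0.05

# Hochberg step-up adjustment from the stats package
p_adj <- p.adjust(p, method = "hochberg")

signif_pointwise <- p <= signifLevel      # no multiplicity control
signif_hochberg  <- p_adj <= signifLevel  # controls the overall level

print(data.frame(p, p_adj, signif_pointwise, signif_hochberg))
```

Here four of the five tests are significant pointwise, but only the two smallest p-values remain significant after the adjustment.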
If bw is not specified, then a range of possible bandwidths is automatically calculated.
For univariate data, bw can be either a scalar or a vector. With the former, a KDE is computed with this scalar bandwidth. The latter is interpreted as a range of bandwidths.
For multivariate data, bw can be either a vector or a matrix. With the former, a KDE is computed with this vector bandwidth. The latter is interpreted as a range of bandwidths, with the first row giving the minimum values and the second row the maximum values.
If a range of bandwidths is supplied, the function goes into interactive mode. If a single bandwidth is supplied, it goes into non-interactive mode.
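The bandwidth rules above can be summarised in a small helper. The function bw_mode below is hypothetical and is not part of the package; it only restates the documented rules in base R, and it ignores the plotSiZer option, which forces non-interactive mode.

```r
# Hypothetical helper (not part of the feature package): which mode would a
# given bw specification lead to, according to the rules described above?
bw_mode <- function(bw = NULL, d) {
  if (is.null(bw)) return("interactive")   # range is calculated automatically
  if (d == 1) {
    # univariate: scalar = single bandwidth, longer vector = range
    if (length(bw) == 1) "non-interactive" else "interactive"
  } else {
    # multivariate: vector = single bandwidth, 2-row matrix = range
    if (is.matrix(bw)) "interactive" else "non-interactive"
  }
}

bw_mode(0.1, d = 1)                          # scalar, 1-d
bw_mode(c(0.05, 0.5), d = 1)                 # range, 1-d
bw_mode(c(4.5, 0.37), d = 2)                 # vector, 2-d
bw_mode(rbind(c(1, 0.1), c(5, 0.9)), d = 2)  # rows are min and max, 2-d
```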
Returns a list with the following fields:
x - data matrix
bw - vector of bandwidths
fhat - kernel density estimate on a grid (output from drvkde)
grad - logical array indicating significant gradient (if addSignifGradRegion=TRUE)
curv - logical array indicating significant curvature (if addSignifCurvRegion=TRUE)
gradData - logical vector indicating significant gradient (if addSignifGradData=TRUE)
curvData - logical vector indicating significant curvature (if addSignifCurvData=TRUE)
In the interactive case, the return values are based on the last bandwidths chosen before the interactive session was ended. In the non-interactive case, the return values are based on the specified bandwidth.
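One possible use of the returned logical vectors is to pull out the data points flagged as significant. The sketch below works on a stand-in list with made-up values rather than on real featureSignif output, and it assumes that the entries of gradData line up with the rows of x.

```r
# Stand-in for a featureSignif() result on 2-d data; all values are made up
set.seed(1)
fs <- list(
  x        = cbind(rnorm(6), rnorm(6)),
  gradData = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Rows of x flagged as lying in a significant gradient region
# (assumes gradData[i] refers to row i of x)
signif_points <- fs$x[fs$gradData, , drop = FALSE]

n_signif    <- nrow(signif_points)   # number of flagged points
prop_signif <- mean(fs$gradData)     # proportion of flagged points
print(c(n = n_signif, proportion = prop_signif))
```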
For 1-d data, the gradient SiZer map of Chaudhuri & Marron (1999) is implemented. If this option is selected, the function automatically goes into non-interactive mode. The horizontal axis is the data axis and the vertical axis is the bandwidth. It returns a list with the following fields:
x.grid - vector of grid points
bw - vector of bandwidths at grid points
SiZer - matrix (rows = grid points, columns = bandwidths) for the SiZer map: 3 = decreasing gradient (red), 2 = increasing gradient (blue), 1 = zero gradient (purple), 0 = sparse region (grey).
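The integer codes in the SiZer matrix can be turned into readable labels with base R. The matrix below is a small made-up example rather than real output, and the names code_labels and code_cols are illustrative; only the code-to-colour mapping is taken from the description above.

```r
# Made-up 4 x 3 SiZer-style matrix (rows = grid points, columns = bandwidths)
sizer <- matrix(c(0, 2, 3, 1,
                  2, 2, 3, 3,
                  1, 1, 1, 0), nrow = 4)

# Labels and colours for codes 0, 1, 2, 3 as documented above
code_labels <- c("sparse", "zero", "increasing", "decreasing")
code_cols   <- c("grey", "purple", "blue", "red")

# How many cells fall into each category
counts <- table(factor(sizer, levels = 0:3, labels = code_labels))
print(counts)

# Uncomment to draw the map with the documented colours:
# image(sizer, col = code_cols, breaks = seq(-0.5, 3.5, by = 1))
```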
Chaudhuri, P. and Marron, J.S. (1999) SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.
Duong, T., Cowling, A., Koch, I. and Wand, M.P. (2007) Feature significance for multivariate kernel density estimation. Submitted.
Godtliebsen, F., Marron, J.S. and Chaudhuri, P. (2002) Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-22.
Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.
Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman and Hall.
bkde (in package 'KernSmooth'), bkde2D (in package 'KernSmooth'), density
## Non-interactive examples

library(feature)  # provides featureSignif() and the earthquake data

## Univariate example
data(earthquake)
eq3 <- -log10(-earthquake[,3])
featureSignif(eq3, addSignifGradRegion=TRUE, xlab="-log(-depth)", bw=0.1)
featureSignif(eq3, plotSiZer=TRUE, xlab="-log(-depth)")

## Bivariate example
library(MASS)
data(geyser)
fs <- featureSignif(geyser, addSignifGradRegion=TRUE,
                    addSignifCurvRegion=TRUE, bw=c(4.5, 0.37))
names(fs)

## Not run:
## Trivariate example
data(earthquake)
earthquake[,3] <- -log10(-earthquake[,3])
featureSignif(earthquake, scaleData=TRUE, addData=TRUE,
              bw=c(0.0381, 0.0381, 0.0442))
featureSignif(earthquake, addKDE=FALSE, scaleData=TRUE,
              addSignifGradRegion=TRUE, addSignifCurvRegion=TRUE,
              bw=c(0.0381, 0.0381, 0.0442),
              xlim=c(0.4,0.5), ylim=c(0.4,0.5), zlim=c(0.8,0.9))
## End(Not run)

## Not run:
## Interactive examples
library(MASS)
data(geyser)
duration <- geyser$duration

## Univariate example
featureSignif(duration)
featureSignif(duration, addData=TRUE)
featureSignif(duration, addSignifGradRegion=TRUE, addSignifGradData=TRUE)
featureSignif(duration, addSignifCurvRegion=TRUE, addSignifCurvData=TRUE)

## Bivariate example
## bandwidth ranges: h1 in c(1, 5), h2 in c(0.1, 0.9)
featureSignif(geyser, addData=TRUE, addSignifGradRegion=TRUE,
              addSignifGradData=TRUE, bw=rbind(c(1, 0.1), c(5, 0.9)))

## Trivariate example
data(earthquake)
earthquake$depth <- -log10(-earthquake$depth)
featureSignif(earthquake, addSignifGradRegion=TRUE, scaleData=TRUE)
## End(Not run)