featureSignif {feature} | R Documentation |
Identify significant features of kernel density estimates of 1- to 4-dimensional data. This function has an interactive and a non-interactive mode.
featureSignif(x, bw, xlab, ylab, zlab, xlim, ylim, zlim, addData=FALSE, scaleData=FALSE, addDataNum=1000, addKDE=TRUE, signifLevel=0.05, plotFS=TRUE, addAxes3d=TRUE, addSignifGradRegion=FALSE, addSignifGradData=FALSE, addSignifCurvRegion=FALSE, addSignifCurvData=FALSE, plotSiZer=FALSE, logbwSiZer=TRUE, bwSiZer, densCol, dataCol="black", gradCol="green4", curvCol="blue", axisCol="black", bgCol="white", jitterRug=TRUE, dataAlpha=0.1, gradDataAlpha=0.3, gradRegionAlpha=0.2, curvDataAlpha=0.3, curvRegionAlpha=0.3, gridsize, gridsizeSiZer)
x |
data matrix |
bw |
bandwidth(s) - see below for details on how to specify bandwidths |
xlim, ylim, zlim |
x-, y-, z-axis limits |
xlab, ylab, zlab |
x-, y-, z-axis labels |
scaleData |
flag for scaling the data i.e. transforming to unit variance for each dimension. Default is FALSE. |
addData |
flag for display of the data. Default is FALSE. |
addDataNum |
maximum number of data points plotted in displays. Default is 1000. |
addKDE |
flag for display of kernel density estimates. Default is TRUE. Not available for 4-d data. |
plotFS |
flag for plotting. Default is TRUE. |
jitterRug |
flag for jittering of rug-plot for univariate data display. Default is TRUE. |
addSignifGradRegion |
flag for display of significant gradient regions. Default is FALSE. Not available for 4-d data. |
addSignifGradData |
flag for display of significant gradient data points. Default is FALSE. |
addSignifCurvRegion |
flag for display of significant curvature regions. Default is FALSE. Not available for 4-d data. |
addSignifCurvData |
flag for display of significant curvature data points. Default is FALSE. |
plotSiZer |
flag for displaying 1-d gradient SiZer map. Default is FALSE. |
logbwSiZer |
flag for displaying log bandwidths in SiZer map. Default is TRUE. |
bwSiZer |
range of bandwidths for SiZer map. |
addAxes3d |
flag for displaying axes in 3-d displays. Default is TRUE. |
signifLevel |
significance level. Default is 0.05. |
densCol |
colour of density estimate curve. Default for 1-d data is "DarkOrange", for 2-d data is heat.colors(1000), for 3-d data is heat.colors(5). |
dataCol |
colour of data points. Default is "black". |
gradCol |
colour of significant gradient regions/points. Default is "green4". |
curvCol |
colour of significant curvature regions/points. Default is "blue". |
axisCol |
colour of axes. Default is "black". |
bgCol |
colour of background. Default is "white". |
dataAlpha |
alpha-blending transparency value for data points. |
gradDataAlpha |
alpha-blending transparency value for significant gradient data points. |
gradRegionAlpha |
alpha-blending transparency value for significant gradient regions. |
curvDataAlpha |
alpha-blending transparency value for significant curvature data points. |
curvRegionAlpha |
alpha-blending transparency value for significant curvature regions. |
gridsize |
vector of the number of grid points in each direction. |
gridsizeSiZer |
number of x- and y-axis grid points for SiZer map. Default is c(101,101). |
Feature significance is based on significance testing of the gradient (first derivative) and curvature (second derivative) of a kernel density estimate. This was developed for 1-d data by Chaudhuri & Marron (1999), for 2-d data by Godtliebsen, Marron & Chaudhuri (2002), and for 3-d and 4-d data by Duong, Cowling, Koch & Wand (2007).
The test statistic for gradient testing at a point x is
W(x) = || hat{grad f}(x; H)||^2
where hat{grad f}(x; H) is the kernel estimate of the gradient of f(x) with bandwidth H, and ||.|| is the Euclidean norm. W(x) is approximately chi-squared distributed with d degrees of freedom, where d is the dimension of the data.
The analogous test statistic for curvature is
W2(x) = ||vech hat{curv f}(x; H)||^2
where hat{curv f}(x; H) is the kernel estimate of the curvature of f(x), and vech is the vector-half operator. W2(x) is approximately chi-squared distributed with d(d+1)/2 degrees of freedom.
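The degrees of freedom quoted above translate directly into chi-squared reference values. The following is a minimal base R sketch, not code taken from the package: it tabulates the unadjusted, single-test critical values for d = 1 to 4 at a level of 0.05. The names alpha and critTable, and the observed value 7.2, are illustrative assumptions.

```r
# Reference distributions for a single (unadjusted) test at level alpha.
# d is the dimension of the data.
alpha <- 0.05
d <- 1:4
gradDf <- d                 # gradient statistic W(x): d degrees of freedom
curvDf <- d * (d + 1) / 2   # curvature statistic W2(x): d(d+1)/2 degrees of freedom

critTable <- data.frame(
  d        = d,
  gradDf   = gradDf,
  gradCrit = qchisq(1 - alpha, df = gradDf),
  curvDf   = curvDf,
  curvCrit = qchisq(1 - alpha, df = curvDf)
)
print(critTable)

# Upper-tail p-value for a hypothetical observed gradient statistic in 2-d
pW <- pchisq(7.2, df = 2, lower.tail = FALSE)
```

These thresholds apply to one test in isolation. They do not include the simultaneous-test adjustment that the function applies across grid points.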
Since this is a situation with many dependent hypothesis tests, we use a multiple comparison or simultaneous test to control the overall level of significance. We use a Hochberg-type procedure. See Hochberg (1988) and Duong, Cowling, Koch & Wand (2007).
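As an illustration of the multiple-comparison step, the standard Hochberg (1988) step-up adjustment is available in base R through p.adjust. The sketch below uses made-up p-values. It shows the textbook procedure only; the "Hochberg-type" variant used inside the package may differ in detail.

```r
# Hypothetical p-values from pointwise tests at five grid points
# (illustrative numbers, not output from featureSignif).
pvals <- c(0.001, 0.012, 0.020, 0.045, 0.300)
signifLevel <- 0.05

# Hochberg (1988) step-up adjusted p-values from the stats package
padj <- p.adjust(pvals, method = "hochberg")

# A grid point is declared significant if its adjusted p-value
# does not exceed the overall significance level.
isSignif <- padj <= signifLevel
print(data.frame(pvals, padj, isSignif))
```

Four of the raw p-values fall below 0.05, but only the two smallest remain significant after adjustment, which is the effect of controlling the overall level.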
If bw is not specified, then a range of possible bandwidths is automatically calculated.
For univariate data, bw can be either a scalar or a vector. With the former, a KDE is computed with this scalar bandwidth. The latter is interpreted as a range of bandwidths.
For multivariate data, bw can be either a vector or a matrix. With the former, a KDE is computed with this vector bandwidth. The latter is interpreted as a range of bandwidths, where the first row contains the minimum values and the second row the maximum values.
If a range of bandwidths is supplied, the function goes into interactive mode. If a single bandwidth is supplied, it goes into non-interactive mode.
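The bandwidth rules above can be summarised in a small helper. bwMode is a hypothetical function written for illustration only. It is not part of the package, it only restates the rules in this section, and it ignores plotSiZer, which forces non-interactive mode.

```r
# Hypothetical helper: report which mode the rules above imply
# for a given data object x and bandwidth argument bw.
bwMode <- function(x, bw) {
  # Dimension of the data: a plain vector is treated as univariate
  d <- if (is.null(dim(x))) 1L else ncol(x)

  # No bandwidth given: a range is calculated automatically
  if (missing(bw)) return("interactive")

  if (d == 1L) {
    # Univariate: scalar = single bandwidth, vector = range
    if (length(bw) == 1L) "non-interactive" else "interactive"
  } else {
    # Multivariate: vector = single bandwidth, matrix = range (min row, max row)
    if (is.matrix(bw)) "interactive" else "non-interactive"
  }
}

x1 <- rnorm(50)                    # univariate toy data
x2 <- matrix(rnorm(100), ncol = 2) # bivariate toy data

bwMode(x1, 0.1)                            # single scalar bandwidth
bwMode(x1, c(0.05, 1.3))                   # range of bandwidths
bwMode(x2, c(4.5, 0.37))                   # single vector bandwidth
bwMode(x2, rbind(c(1, 0.1), c(5, 0.9)))    # rows give minimum and maximum values
```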
Returns a list with the following fields:
x - data matrix
bw - vector of bandwidths
fhat - kernel density estimate on a grid (output from drvkde)
grad - logical array (grid) indicating significant gradient
curv - logical array (grid) indicating significant curvature
gradData, gradDataPoints - logical vector indicating the data points with significant gradient, and those data points themselves
curvData - logical vector indicating the data points with significant curvature
In the interactive case, the return values are
based on the last bandwidths chosen before the interactive session
was ended. In the non-interactive case, the return values are based on the
specified bandwidth.
For 1-d data, the gradient SiZer map of Chaudhuri & Marron (1999) is implemented. If this option is selected, the function automatically goes into non-interactive mode. The horizontal axis is the data axis and the vertical axis is the bandwidth. It returns a list with the following fields:
x.grid - vector of grid points
bw - vector of bandwidths at grid points
SiZer - matrix (rows = grid points, columns = bandwidths) for SiZer map: 3 = decreasing gradient (red), 2 = increasing gradient (blue), 1 = zero gradient (purple), 0 = sparse region (grey).
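The integer coding of the SiZer matrix can be decoded with a simple lookup. The sketch below uses a small made-up matrix in place of real output. The names sizerLabels, sizerCols and toy are illustrative assumptions, and only the code-to-meaning mapping comes from the description above.

```r
# Lookup tables for the SiZer codes described above
sizerLabels <- c("0" = "sparse region", "1" = "zero gradient",
                 "2" = "increasing gradient", "3" = "decreasing gradient")
sizerCols   <- c("0" = "grey", "1" = "purple", "2" = "blue", "3" = "red")

# Made-up SiZer matrix: 6 grid points (rows) by 2 bandwidths (columns)
toy <- matrix(c(0, 2, 2, 1, 3, 3,
                0, 1, 2, 1, 1, 3), nrow = 6)

# Translate the codes into readable labels and colours, keeping the matrix shape
labelMat <- matrix(sizerLabels[as.character(toy)], nrow = nrow(toy))
colMat   <- matrix(sizerCols[as.character(toy)], nrow = nrow(toy))

# Number of grid points in each category, per bandwidth (column)
counts <- apply(toy, 2, function(col) table(factor(col, levels = 0:3)))
print(counts)
```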
Chaudhuri, P. and Marron, J.S. (1999) SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.
Duong, T., Cowling, A., Koch, I., Wand, M.P. (2007) Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis, In press.
Hochberg, Y. (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.
Godtliebsen, F., Marron, J.S. and Chaudhuri, P. (2002) Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-22.
Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman and Hall.
## Univariate example
data(earthquake)
eq3 <- -log10(-earthquake[,3])
featureSignif(eq3, addSignifGradRegion=TRUE, xlab="-log(-depth)", bw=0.1)
featureSignif(eq3, plotSiZer=TRUE, xlab="-log(-depth)")
featureSignif(eq3, plotSiZer=TRUE, xlab="-log(-depth)", bwSiZer=c(0.05, 1.3), gridsizeSiZer=c(401, 101))

## Bivariate example
library(MASS)
data(geyser)
fs <- featureSignif(geyser, addSignifCurvRegion=TRUE, bw=c(4.5, 0.37), plotFS=FALSE)
plot(fs, addSignifCurvRegion=TRUE)

## Trivariate example
data(earthquake)
earthquake[,3] <- -log10(-earthquake[,3])
featureSignif(earthquake, scaleData=TRUE, addData=TRUE, bw=c(0.0381, 0.0381, 0.0442))
featureSignif(earthquake, addKDE=FALSE, scaleData=TRUE, addSignifGradRegion=TRUE, addSignifCurvRegion=TRUE, bw=c(0.0381, 0.0381, 0.0442), xlim=c(0.4,0.5), ylim=c(0.4,0.5), zlim=c(0.8,0.9))

## Not run:
## Interactive examples
library(MASS)
data(geyser)
duration <- geyser$duration

## Univariate example
featureSignif(duration)
featureSignif(duration, addData=TRUE)
featureSignif(duration, addSignifGradRegion=TRUE, addSignifGradData=TRUE)
featureSignif(duration, addSignifCurvRegion=TRUE, addSignifCurvData=TRUE)

## Bivariate example
featureSignif(geyser, addData=TRUE, addSignifGradRegion=TRUE, addSignifGradData=TRUE, bw=rbind(c(1, 0.1), c(5, 0.9)))
## bandwidths ranges: h1 in c(1, 5), h2 in c(0.1, 0.9)

## Trivariate example
data(earthquake)
earthquake$depth <- -log10(-earthquake$depth)
featureSignif(earthquake, addSignifGradRegion=TRUE, scaleData=TRUE)
## End(Not run)