np {np} | R Documentation |
This package provides a variety of nonparametric and semiparametric kernel methods that seamlessly handle a mix of continuous, unordered, and ordered factor datatypes (unordered and ordered factors are often referred to as 'nominal' and 'ordinal' categorical variables respectively).
Bandwidth selection is a key aspect of sound nonparametric and
semiparametric kernel estimation. np is designed from the
ground up to make bandwidth selection the focus of attention. To this
end, one typically begins by creating a 'bandwidth object' which
embodies all aspects of the method, including specific kernel
functions, data names, datatypes, and the like. One then passes these
bandwidth objects to other functions, and those functions can grab the
specifics from the bandwidth object, thereby removing potential
inconsistencies and unnecessary repetition.
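The two-step workflow described above might be sketched as follows. This is a minimal illustration on simulated data, assuming the np package is installed; the names x, y, bw, and fit are illustrative only.

```r
library(np)

set.seed(42)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

# Step 1: compute the bandwidth object. The regression type and the
# bandwidth selection method are set explicitly here for clarity.
bw <- npregbw(y ~ x, regtype = "lc", bwmethod = "cv.ls")

# Step 2: pass the bandwidth object to the estimator, which takes the
# kernel type, data, and variable types from it.
fit <- npreg(bws = bw)

summary(fit)
```

Because the estimator receives everything through bw, the kernel and data specifications are stated once rather than repeated in each call.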
There are two ways in which you can interact with functions in
np: i) using data frames, or ii) using a formula
interface, where appropriate.
To some, it may be natural to use the data frame interface. The R
data.frame function preserves a variable's type once it
has been cast (unlike cbind, which we avoid for this
reason). If you find this most natural for your project, you first
create a data frame, casting each variable according to its type (i.e.,
one of continuous (default), factor, or ordered).
You would then simply pass this data frame to the appropriate np
function, for example npudensbw(dat=data).
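The difference between data.frame and cbind can be seen with base R alone. The sketch below uses made-up variables (income, region, rating) purely for illustration.

```r
set.seed(1)
n <- 50
income <- rnorm(n)                                             # continuous (default)
region <- factor(sample(c("east", "west"), n, replace = TRUE)) # unordered factor
rating <- ordered(sample(1:3, n, replace = TRUE))              # ordered factor

# data.frame keeps each column's type as cast.
dat <- data.frame(income, region, rating)
sapply(dat, function(v) class(v)[1])

# cbind coerces everything to one numeric matrix, so the factor
# information is lost (factors are reduced to their integer codes).
m <- cbind(income, region, rating)
is.numeric(m)
is.factor(m[, "region"])

# With np loaded, the data frame would then be passed on as, e.g.,
# npudensbw(dat = dat)
```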
To others, however, it may be natural to use the formula interface
that is used for the regression examples, among others. For
nonparametric regression functions such as npreg, you
would proceed as you would using lm (e.g., bw <-
npregbw(y~factor(x1)+x2)), except that you would of course not need to
specify, e.g., polynomials in variables, interaction terms, or create
a number of dummy variables for a factor. Every function in np
supports both interfaces, where appropriate.
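As a hedged sketch of the two interfaces side by side, the following fits the same nonparametric regression once through a formula and once through a data frame. It assumes np is installed; the data are simulated, and the names X, bw.f, and bw.df are illustrative.

```r
library(np)

set.seed(123)
n  <- 100
x1 <- rbinom(n, 1, 0.5)   # binary regressor, to be treated as a factor
x2 <- rnorm(n)            # continuous regressor
y  <- 1 + x1 + x2 + x1 * x2 + rnorm(n, sd = 0.5)

# Formula interface: no interaction or dummy terms are written out,
# even though the simulated relationship contains an interaction.
bw.f <- npregbw(y ~ factor(x1) + x2)
fit  <- npreg(bws = bw.f)

# Data frame interface: the same variables, cast by type in a data frame.
X <- data.frame(x1 = factor(x1), x2 = x2)
bw.df <- npregbw(xdat = X, ydat = y)

summary(fit)
```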
Note that if your factor is in fact a character string such as, say,
X being either "MALE" or "FEMALE", np will handle
this directly, i.e., there is no need to map the string values into
unique integers such as (0,1). Once the user casts a variable as a
particular datatype (i.e., factor, ordered, or continuous (default)),
all subsequent methods automatically detect the type and use the
appropriate kernel function and method.
All estimation methods are fully multivariate, i.e., there are no limitations on the number of variables one can model (or number of observations for that matter). Execution time for most routines does, however, grow faster than linearly in the number of observations (a single evaluation of a cross-validation objective typically involves on the order of n^2 kernel evaluations for n observations), and it also increases with the number of variables involved.
Nonparametric methods include unconditional density (distribution), conditional density (distribution), regression, mode, and quantile estimators along with gradients where appropriate, while semiparametric methods include single index, partially linear, and smooth (i.e., varying) coefficient models.
A number of tests are included such as consistent specification tests for parametric regression and quantile regression models along with tests of significance for nonparametric regression.
A variety of bootstrap methods for computing standard errors, nonparametric confidence bounds, and bias-corrected bounds are implemented.
A variety of bandwidth methods are implemented including fixed, nearest-neighbor, and adaptive nearest-neighbor.
A variety of data-driven methods of bandwidth selection are implemented, while the user can specify their own bandwidths should they so choose (either a raw bandwidth or scaling factor).
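The bandwidth options above might be exercised as in the following sketch. It assumes np is installed and that the npregbw arguments bws, bandwidth.compute, bwscaling, and bwtype behave as described on the npregbw help page; the numeric values are arbitrary illustrations, not recommendations.

```r
library(np)

set.seed(42)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

# (a) User-supplied raw bandwidth: no numerical search is performed.
bw.raw <- npregbw(y ~ x, bws = 0.1, bandwidth.compute = FALSE)

# (b) User-supplied scale factor: bwscaling = TRUE tells np to interpret
#     the supplied value as a scaling factor rather than a raw bandwidth.
bw.sf <- npregbw(y ~ x, bws = 1.06, bwscaling = TRUE,
                 bandwidth.compute = FALSE)

# (c) Data-driven selection with an adaptive nearest-neighbor bandwidth.
bw.nn <- npregbw(y ~ x, bwtype = "adaptive_nn")

summary(bw.raw)
summary(bw.nn)
```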
A flexible plotting utility, npplot, facilitates
graphing of multivariate objects. An example for creating postscript
graphs using the npplot utility and pulling this into a
LaTeX document is provided.
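One way such a postscript graph might be produced is sketched below. This is not the package's own example; it assumes np is installed, uses simulated data, and writes to a hypothetical file name in a temporary directory.

```r
library(np)

set.seed(42)
n <- 100
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.25)

bw <- npregbw(y ~ x)

# Hypothetical output file; any writable path would do.
ps.file <- file.path(tempdir(), "npreg-fit.ps")

# Open a postscript device, draw the plot from the bandwidth object,
# then close the device so the file is written out.
postscript(file = ps.file, horizontal = FALSE, onefile = FALSE,
           paper = "special", width = 6, height = 4)
npplot(bws = bw)
dev.off()

file.exists(ps.file)
```

The resulting file could then be included in a LaTeX document with \includegraphics from the graphicx package.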
The function npksum allows users to create or implement
their own kernel estimators or tests should they so desire.
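As an illustration of building an estimator from kernel sums, the sketch below computes a univariate kernel density estimate with npksum and compares it with a direct base R calculation. It assumes np is installed and that npksum uses its default second order Gaussian kernel; the rule-of-thumb bandwidth h is chosen only for illustration.

```r
library(np)

set.seed(42)
n <- 500
x <- rnorm(n)
x.eval <- seq(-3, 3, length.out = 25)

# Illustrative rule-of-thumb bandwidth (not data-driven selection by np).
h <- 1.06 * sd(x) * n^(-1/5)

# npksum returns the raw sum of kernel weights at each evaluation point;
# dividing by n * h turns it into a density estimate.
ks <- npksum(txdat = x, exdat = x.eval, bws = h)$ksum
dens.ksum <- as.numeric(ks) / (n * h)

# The same estimator written out directly with a Gaussian kernel.
dens.direct <- sapply(x.eval, function(x0) mean(dnorm((x0 - x) / h)) / h)

max(abs(dens.ksum - dens.direct))
```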
The underlying functions are written in C for computational
efficiency. Despite this, due to their nature, data-driven bandwidth
selection methods involving multivariate numerical search can be
time-consuming, particularly for large datasets. A version of this
package using the Rmpi wrapper is under development that allows
one to deploy this software in a clustered computing environment to
facilitate computation involving large datasets.
To cite the np package, type citation("np") from within R for details.
The kernel methods in np employ the so-called 'generalized
product kernels' found in Hall, Racine, and Li (2004), Li and Racine
(2003), Li and Racine (2004), Li and Racine (2007), Ouyang, Li,
and Racine (2006), and Racine and Li (2004), among others. For
details on a particular method, kindly refer to the original
references listed above.
We briefly describe the particulars of various univariate kernels used
to generate the generalized product kernels that underlie the
kernel estimators implemented in the np package. In a nutshell,
the generalized kernel functions that underlie the kernel estimators
in np are formed by taking the product of univariate kernels
such as those listed below. When you cast your data as a particular
type (continuous, factor, or ordered factor) in a data frame or
formula, the routines will automatically recognize the type of
variable being modelled and use the appropriate kernel type for each
variable in the resulting estimator.
So, if you had two variables, x1[i] and
x2[i], and x1[i] was continuous while
x2[i] was, say, binary (0/1), and you created a data
frame of the form X <- data.frame(x1,factor(x2)), then the
kernel function used by np would be
K(.)=k(.)*l(.) where the
particular kernel functions k(.) and
l(.) would be, say, the second order Gaussian
(ckertype="gaussian") and Aitchison and Aitken
(ukertype="aitchisonaitken") kernels by default, respectively.
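The product K(.)=k(.)*l(.) can be written out in base R as a rough sketch. The code below is not np's internal implementation; it hand-codes a second order Gaussian kernel and the Aitchison and Aitken (1976) kernel, and the bandwidths h and lambda are arbitrary illustrative values rather than data-driven choices.

```r
# Second order Gaussian kernel for the continuous variable.
k_gauss <- function(z) dnorm(z)

# Aitchison and Aitken kernel for an unordered variable with n_cat
# categories: weight 1 - lambda on a match, lambda / (n_cat - 1) otherwise.
l_aa <- function(xi, x, lambda, n_cat) {
  ifelse(xi == x, 1 - lambda, lambda / (n_cat - 1))
}

# Generalized product kernel K(.) = k(.) * l(.).
K_prod <- function(x1i, x2i, x1, x2, h, lambda, n_cat = 2) {
  k_gauss((x1i - x1) / h) * l_aa(x2i, x2, lambda, n_cat)
}

set.seed(7)
n  <- 200
x1 <- rnorm(n)             # continuous
x2 <- rbinom(n, 1, 0.4)    # binary (0/1)

h      <- 0.4              # illustrative bandwidth for x1
lambda <- 0.1              # illustrative bandwidth for x2

# Joint density estimate at the point (x1_0, x2_0).
f_hat <- function(x1_0, x2_0) {
  mean(K_prod(x1, x2, x1_0, x2_0, h, lambda)) / h
}

f_hat(0, 1)
```

Setting lambda to 0 reduces l(.) to an indicator function, so the estimate then uses only the observations whose x2 matches the evaluation point.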
Note that higher order continuous kernels (i.e., fourth, sixth, and eighth order) are derived from the second order kernels given above (see Li and Racine (2007) for details).
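To make the idea of kernel order concrete, the sketch below checks the moments of the standard fourth order Gaussian kernel, (3 - z^2)/2 times the standard normal density, by numerical integration. That np's internal fourth order kernel takes exactly this form is an assumption here; the function names k2, k4, and kernel_moment are illustrative.

```r
# Second order Gaussian kernel and the standard fourth order kernel
# derived from it.
k2 <- function(z) dnorm(z)
k4 <- function(z) (3 - z^2) / 2 * dnorm(z)

# j-th moment of a kernel, computed by numerical integration.
kernel_moment <- function(k, j) {
  integrate(function(z) z^j * k(z), -Inf, Inf)$value
}

# Both kernels integrate to one.
kernel_moment(k2, 0)
kernel_moment(k4, 0)

# The second moment is nonzero for k2 but vanishes for k4, which is
# what makes k4 a fourth order kernel.
kernel_moment(k2, 2)
kernel_moment(k4, 2)

# The first nonvanishing even moment of k4 is the fourth.
kernel_moment(k4, 4)
```

Note that a fourth order kernel takes negative values for some z, so density estimates built from it are not guaranteed to be non-negative.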
For particulars on any given method, kindly see the references listed for the method in question.
Tristen Hayfield <hayfield@phys.ethz.ch>, Jeffrey S. Racine <racinej@mcmaster.ca>
Maintainer: Jeffrey S. Racine <racinej@mcmaster.ca>
We are grateful to John Fox and Achim Zeileis for their valuable input and encouragement. We would like to gratefully acknowledge support from the Natural Sciences and Engineering Research Council of Canada (NSERC:www.nserc.ca), the Social Sciences and Humanities Research Council of Canada (SSHRC:www.sshrc.ca), and the Shared Hierarchical Academic Research Computing Network (SHARCNET:www.sharcnet.ca).
Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method,” Biometrika, 63, 413-420.
Hall, P., J.S. Racine, and Q. Li (2004), “Cross-validation and the estimation of conditional probability densities,” Journal of the American Statistical Association, 99, 1015-1026.
Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data,” Journal of Multivariate Analysis, 86, 266-292.
Li, Q. and J.S. Racine (2004), “Cross-validated local linear nonparametric regression,” Statistica Sinica, 14, 485-512.
Ouyang, D., Q. Li, and J.S. Racine (2006), “Cross-validation and the estimation of probability distributions with categorical data,” Journal of Nonparametric Statistics, 18, 69-100.
Racine, J.S. and Q. Li (2004), “Nonparametric estimation of regression functions with both categorical and continuous data,” Journal of Econometrics, 119, 99-130.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Pagan, A. and A. Ullah (1999), Nonparametric Econometrics, Cambridge University Press.
Scott, D.W. (1992), Multivariate Density Estimation. Theory, Practice and Visualization, New York: Wiley.
Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.
Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions,” Biometrika, 68, 301-309.