yai {yaImpute}    R Documentation
Given a set of observations, yai
1) separates the observations
into reference and target observations, 2) applies the
specified method to project the X-variables into a Euclidean space (not
always, see argument method
), and 3) finds the k-nearest
neighbors within the reference observations and between the reference
and target observations. An alternative approach using randomForest
classification and regression trees is provided for steps 2 and 3.
Target observations are those with values for the X-variables but
not for the Y-variables, while reference observations are those
with no missing values for the X- and Y-variables (see Details for the
exception).
yai(x=NULL, y=NULL, data=NULL, k=1, noTrgs=FALSE, noRefs=FALSE,
    nVec=NULL, pVal=.05, method="msn", ann=TRUE, mtry=NULL, ntree=500,
    rfMode="buildClasses")
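A minimal sketch of the reference/target split described above; it reuses the iris setup from the Examples section below (the choice of 50 reference rows is illustrative):

require(yaImpute)
data(iris)
refs <- sample(rownames(iris), 50)   # rows that will serve as references
x <- iris[, 1:2]                     # X-variables for all observations
y <- iris[refs, 3:4]                 # Y-variables only for the references
fit <- yai(x=x, y=y)                 # rows of x without Y-values become targets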
x |
1) a matrix or data frame containing the X-variables for all
observations, with row names used as the identification for the observations, or 2) a
one-sided formula defining the X-variables as a linear formula. If
a formula is coded for x , one must be used for y as well, if
needed. |
y |
1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula. |
data |
when x and y are formulas, data is a data frame or
matrix that contains all the variables, with row names used as the identification for the observations.
The observations are split by yai into two sets (see the sketch following this argument list for an example of the formula interface). |
k |
the number of nearest neighbors; default is 1. |
noTrgs |
when TRUE, skip finding neighbors for target observations. |
noRefs |
when TRUE, skip finding neighbors for reference observations. |
nVec |
number of canonical vectors to use (methods msn and msn2 ),
or the number of independent X-variables in the reference data when method
mahalanobis is used. When NULL, the number is set by the function. |
pVal |
significance level for canonical vectors, used when method is
msn or msn2 . |
method |
the strategy used for finding neighbors; the
options are quoted key words (see Details); those discussed in this help
page include euclidean , raw , mahalanobis , ica , msn , msn2 , gnn ,
and randomForest . |
ann |
TRUE if ann is used to find neighbors, FALSE if a slow search is used. |
mtry |
the number of X-variables picked at random when method is randomForest ;
see randomForest . The default is sqrt(number of X-variables). |
ntree |
the number of classification and regression trees when method is randomForest .
When more than one Y-variable is used, the trees are divided among the variables.
Alternatively, ntree can be a vector of values corresponding to each Y-variable. |
rfMode |
when rfMode is buildClasses and method is randomForest , continuous variables
are internally converted to classes, forcing randomForest to build classification trees for
the variable. Otherwise, regression trees are built, provided your version of
randomForest is newer than 4.5-18 . |
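A sketch of the formula interface and the randomForest-specific arguments; the column selections and tuning values below are illustrative assumptions based on the iris data:

require(yaImpute)
data(iris)
refs <- sample(rownames(iris), 50)
dat <- iris
dat[!(rownames(dat) %in% refs), 3:4] <- NA   # targets: Y-variables set to missing
# x and y given as one-sided formulas, data holds all variables
msn <- yai(x=~Sepal.Length+Sepal.Width, y=~Petal.Length+Petal.Width,
           data=dat, method="msn")
# the randomForest method with its tuning arguments
rf  <- yai(x=~Sepal.Length+Sepal.Width, y=~Petal.Length+Petal.Width,
           data=dat, method="randomForest", mtry=1, ntree=200,
           rfMode="buildClasses")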
See the paper at http://www.jstatsoft.org/v23/i10 (it includes examples).
The following information is in addition to the content in the papers.
You need not have any Y-variables to run yai for the following methods:
euclidean , raw , mahalanobis , ica , and randomForest (in which case
unsupervised classification is performed). However, normally yai
classifies reference observations as those with no missing values for
X- and Y-variables, and target observations as those with values for
X-variables and missing data for Y-variables. When Y is NULL (there are
no Y-variables), all the observations are considered references. See
newtargets for an example of how to use yai in this situation, and the
sketch following this paragraph.
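A minimal sketch of the unsupervised case (y left out; the use of the iris measurements here is an illustrative assumption):

require(yaImpute)
data(iris)
x <- iris[, 1:4]   # X-variables only, no Y-variables
# with no Y-variables, all observations are treated as references
unsup <- yai(x=x, method="mahalanobis", k=2)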
An object of class yai , which is a list with the following tags:
call |
the call. |
yRefs, xRefs |
matrices of the X- and Y-variables for just the reference observations (unscaled). The scale factors are attached as attributes. |
obsDropped |
a list of the row names for observations dropped for various reasons (missing data). |
trgRows |
a list of the row names for target observations as a subset of all observations. |
xall |
the X-variables for all observations. |
cancor |
returned from cancor function when method msn or
msn2 is used (NULL otherwise). |
ccaVegan |
an object of class cca (from package vegan) when method gnn is used. |
ftest |
a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise). |
yScale, xScale |
scale data used on yRefs and xRefs as needed. |
k |
the value of k. |
pVal |
as input; only used when method msn or msn2 is used. |
projector |
NULL when not used. For methods msn, msn2, gnn and mahalanobis, this is a matrix that projects normalized X-variables into a space suitable for computing Euclidean distances. |
nVec |
number of canonical vectors used (methods msn and msn2 ),
or number of independent X-variables in the reference data when method
mahalanobis is used. |
method |
as input, the method used. |
ranForest |
a list of the forests if method randomForest is used. There is
one forest for each Y-variable, or just one forest when there are no
Y-variables. |
ICA |
a list of information from fastICA
when method ica is used. |
ann |
the value of ann, TRUE when ann is used, FALSE otherwise. |
xlevels |
NULL if no factors are used as predictors; otherwise a list
of predictors that have factors and their levels (see lm ). |
neiDstTrgs |
a data frame of distances between a target (identified by its row name) and the k references. There are k columns. |
neiIdsTrgs |
a data frame of reference identifications that correspond to neiDstTrgs. |
neiDstRefs, neiIdsRefs |
counterparts for references. |
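A minimal sketch of inspecting some of these tags after a fit; it reuses the iris setup from the Examples section below:

require(yaImpute)
data(iris)
refs <- sample(rownames(iris), 50)
msn <- yai(x=iris[,1:2], y=iris[refs,3:4])   # as in the Examples below
names(msn)            # the tags listed above
msn$method            # the method used ("msn" by default)
msn$nVec              # number of canonical vectors retained
head(msn$neiDstTrgs)  # distances from each target to its k nearest references
head(msn$neiIdsTrgs)  # identities (row names) of those reference neighbors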
Nicholas L. Crookston ncrookston@fs.fed.us
Andrew O. Finley finleya@msu.edu
require (yaImpute)

data(iris)

# set the random number seed so that example results are consistent
# normally, leave out this command
set.seed(12345)

# form some test data, y's are defined only for reference
# observations.
refs=sample(rownames(iris),50)
x <- iris[,1:2]      # Sepal.Length Sepal.Width
y <- iris[refs,3:4]  # Petal.Length Petal.Width

# build yai objects using 2 methods
msn <- yai(x=x,y=y)
mal <- yai(x=x,y=y,method="mahalanobis")

# running the following examples will load packages vegan
# and randomForest, and is more complicated.

data(MoscowMtStJoe)

# convert polar slope and aspect measurements to cartesian
# (which is the same as Stage's (1976) transformation).
polar <- MoscowMtStJoe[,40:41]
polar[,1] <- polar[,1]*.01      # slope proportion
polar[,2] <- polar[,2]*(pi/180) # aspect radians
cartesian <- t(apply(polar,1,function (x)
               {return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
colnames(cartesian) <- c("xSlAsp","ySlAsp")
x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
y <- MoscowMtStJoe[,1:35]

mal <- yai(x=x, y=y, method="mahalanobis", k=1)
gnn <- yai(x=x, y=y, method="gnn", k=1)
msn <- yai(x=x, y=y, method="msn", k=1)

plot(mal,vars=yvars(mal)[1:16])

# reduce the plant community data for randomForest.
yba  <- MoscowMtStJoe[,1:17]
ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax

rf <- yai(x=x, y=ybaB, method="randomForest", k=1)

# build the imputations for the original y's
rforig <- impute(rf,ancillaryData=y)

# compare the results
compare.yai(mal,gnn,msn,rforig)
plot(compare.yai(mal,gnn,msn,rforig))

# build another randomForest case forcing regression
# to be used for continuous variables. The answers differ,
# but neither is clearly better than the other.
rf2 <- yai(x=x, y=ybaB, method="randomForest", rfMode="regression")
rforig2 <- impute(rf2,ancillaryData=y)
compare.yai(rforig2,rforig)