yai {yaImpute} R Documentation

Find K nearest neighbors

Description

Given a set of observations, yai 1) separates the observations into reference and target observations, 2) applies the specified method to project the X-variables into a Euclidean space (not all methods do this; see argument method), and 3) finds the k nearest neighbors within the reference observations and between the reference and target observations. An alternative method using randomForest classification and regression trees is provided for steps 2 and 3.
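The steps above can be sketched in base R. This is only an illustration on made-up data, not the code yai runs: it assumes a euclidean-style method, assumes the X-variables are normalized with the means and standard deviations of the reference observations, and uses a brute-force search. The names xRef, xTrg, and nearest are hypothetical.

```r
# Toy data (made-up values): 5 reference rows and 2 target rows, two X-variables.
xRef <- matrix(c(1, 2, 3, 4, 5,
                 10, 20, 30, 40, 50), ncol = 2,
               dimnames = list(paste0("r", 1:5), c("x1", "x2")))
xTrg <- matrix(c(2.1, 4.6,
                 22, 47), ncol = 2,
               dimnames = list(c("t1", "t2"), c("x1", "x2")))

# Step 2 (assumed form): normalize with the reference means and standard deviations.
ctr  <- colMeans(xRef)
sdv  <- apply(xRef, 2, sd)
zRef <- scale(xRef, center = ctr, scale = sdv)
zTrg <- scale(xTrg, center = ctr, scale = sdv)

# Step 3: brute-force search for the k nearest references of each target.
k <- 2
nearest <- t(apply(zTrg, 1, function(z) {
  d <- sqrt(colSums((t(zRef) - z)^2))  # distance from this target to every reference
  names(sort(d))[seq_len(k)]           # row names of the k closest references
}))
nearest
```

Each row of nearest corresponds to one target and lists its reference neighbors from closest to farthest.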

Usage

yai(x=NULL,y=NULL,data=NULL,k=1,noTrgs=FALSE,noRefs=FALSE,
    nVec=NULL,pVal=.05,method="msn",mtry=NULL,ntree=500,ann=TRUE)

Arguments

x 1) a matrix or data frame containing the X-variables for all observations, where the row names identify the observations, or 2) a one-sided formula defining the X-variables as a linear formula. If x is given as a formula, y must also be given as a formula when Y-variables are needed.
y 1) a matrix or data frame containing the Y-variables for the reference observations, or 2) a one-sided formula defining the Y-variables as a linear formula.
data when x and y are formulas, then data is a data frame or matrix that contains all the variables. The observations are split by yai into two sets. Reference observations are those with no missing values for X- and Y-variables. Target observations are those with values for X-variables and NAs for Y-variables.
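The split rule described for data can be illustrated with complete.cases. This is a sketch on made-up data, assuming that any missing Y-value makes an observation a target and that observations with missing X-values are dropped; yai's own bookkeeping may differ. The names dat, refRows, trgRows, and dropped are hypothetical.

```r
# Made-up data: two X-variables and one Y-variable; row names identify observations.
dat <- data.frame(x1 = c(1.0, 2.0, 3.0, NA, 5.0),
                  x2 = c(10, 20, 30, 40, 50),
                  y1 = c(0.5, NA, 1.5, 2.5, NA),
                  row.names = paste0("obs", 1:5))
xCols <- c("x1", "x2")
yCols <- "y1"

hasX <- complete.cases(dat[, xCols, drop = FALSE])
hasY <- complete.cases(dat[, yCols, drop = FALSE])

refRows <- rownames(dat)[hasX & hasY]    # complete X and Y: references
trgRows <- rownames(dat)[hasX & !hasY]   # complete X, missing Y: targets
dropped <- rownames(dat)[!hasX]          # missing X: assumed unusable
```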
k the number of nearest neighbors; default is 1.
noTrgs when TRUE, skip finding neighbors for target observations.
noRefs when TRUE, skip finding neighbors for reference observations.
nVec number of canonical vectors to use (methods msn and msn2), or number of independent X-variables in the reference data when method mahalanobis is used. When NULL, the number is set by the function.
pVal significance level for canonical vectors, used when method is msn or msn2.
method the strategy used for finding neighbors; the options are the quoted keywords (see Details):
  • euclidean - distance is computed in a normalized X space.
  • raw - like euclidean, except no normalization is done.
  • mahalanobis - distance is computed in its namesake's space.
  • ica - like mahalanobis, but based on Independent Component Analysis using package fastICA.
  • msn - distance is computed in a projected canonical space.
  • msn2 - like msn, but with variance weighting (canonical regression rather than correlation).
  • gnn - distance is computed using a projected ordination of Xs found using canonical correspondence analysis (cca from package vegan).
  • randomForest - distance is one minus the proportion of randomForest trees where a target observation is in the same terminal node as a reference observation (see randomForest).
mtry the number of X-variables picked at random, see randomForest documentation, default is sqrt(number of X-variables).
ntree the number of classification and regression trees in the randomForest. When more than one Y-variable is used, the trees are divided among the variables. Alternatively, ntree can be a vector of values corresponding to each Y-variable.
ann TRUE if ann is used to find neighbors, FALSE if a slow search is used.
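The randomForest distance listed under method (one minus the proportion of trees in which a target shares a terminal node with a reference) can be worked through by hand. The sketch below uses made-up terminal-node ids rather than a fitted forest; refNodes, trgNodes, and rfDist are hypothetical names, and the code only illustrates the arithmetic.

```r
# Hypothetical terminal-node ids: rows are reference observations, columns are trees.
refNodes <- matrix(c(3, 5, 2, 7,
                     3, 6, 2, 8,
                     4, 5, 1, 7), nrow = 3, byrow = TRUE,
                   dimnames = list(c("r1", "r2", "r3"), paste0("tree", 1:4)))
# Terminal-node ids of one target observation in the same four trees.
trgNodes <- c(tree1 = 3, tree2 = 5, tree3 = 2, tree4 = 8)

# Proportion of trees in which the target shares a terminal node with each reference.
shared <- rowMeans(sweep(refNodes, 2, trgNodes, FUN = "=="))

# Distance is one minus that proportion.
rfDist <- 1 - shared
rfDist
```

Here r1 and r2 each share a node with the target in three of four trees, so both are at distance 0.25, while r3 shares a node in only one tree and is at distance 0.75.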

Details

See ./../doc/yaImputePaper.pdf or this alternate http://forest.moscowfsl.wsu.edu/gems/yaImputePaper.pdf
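As a rough illustration of what "projecting into a Euclidean space" means for the mahalanobis option, the sketch below builds a projection matrix from the inverse Cholesky factor of the covariance matrix and checks that Euclidean distance in the projected space matches the Mahalanobis distance. This is one standard construction on simulated data, assumed here for illustration; it is not necessarily how yai builds its projector, and the names S, P, and z are hypothetical.

```r
set.seed(1)
# Simulated, correlated X-variables (made-up data).
x <- cbind(a = rnorm(20), b = rnorm(20))
x[, "b"] <- x[, "b"] + 0.8 * x[, "a"]

S <- cov(x)

# chol(S) returns upper-triangular R with S = t(R) %*% R,
# so P = solve(R) satisfies P %*% t(P) = solve(S).
P <- solve(chol(S))
z <- x %*% P   # projected observations

# Euclidean distance between rows 1 and 2 in the projected space ...
dEuclid <- sqrt(sum((z[1, ] - z[2, ])^2))
# ... compared with the Mahalanobis distance (mahalanobis() returns the square).
dMahal <- sqrt(mahalanobis(x[1, ], center = x[2, ], cov = S))

all.equal(dEuclid, dMahal)
```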

Value

An object of class yai, which is a list with the following tags:

call the call.
yRefs, xRefs matrices of the Y- and X-variables for just the reference observations (unscaled). The scale factors are attached as attributes.
obsDropped a list of the row names for observations dropped for various reasons (missing data).
trgRows a list of the row names for target observations as a subset of all observations.
xall the X-variables for all observations.
cancor returned from cancor function when method msn or msn2 is used (NULL otherwise).
ccaVegan an object of class cca (from package vegan) when method gnn is used.
ftest a list containing partial F statistics and a vector of Pr>F (pgf) corresponding to the canonical correlation coefficients when method msn or msn2 is used (NULL otherwise).
yScale, xScale scale data used on yRefs and xRefs as needed.
k the value of k.
pVal as input; only used when method msn or msn2 is used.
projector NULL when not used. For methods msn, msn2, gnn and mahalanobis, this is a matrix that projects normalized X-variables into a space suitable for computing Euclidean distances.
nVec number of canonical vectors used (methods msn and msn2), or number of independent X-variables in the reference data when method mahalanobis is used.
method as input, the method used.
ranForest a list of the forests if method randomForest is used. There is one forest for each Y-variable, or just one forest when there are no Y-variables.
ICA a list of information from fastICA when method ica is used.
ann the value of ann, TRUE when ann is used, FALSE otherwise.
xlevels NULL if no factors are used as predictors; otherwise a list of predictors that have factors and their levels (see lm).
neiDstTrgs a data frame of distances between a target (identified by its row name) and the k references. There are k columns.
neiIdsTrgs a data frame of reference identifications that correspond to neiDstTrgs.
neiDstRefs, neiIdsRefs counterparts for references.
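To show how neiIdsTrgs relates to yRefs, the sketch below looks up the Y-values of each target's nearest reference in a hand-built list that mimics the relevant tags. The values and the column names Id.k1 and Dst.k1 are assumptions made for illustration, and the list is not a real yai object; in practice the package's impute function (used in the Examples) does this work.

```r
# Hand-built stand-in for a few tags of a yai object (all values are made up).
fit <- list(
  yRefs = matrix(c(10, 20, 30,
                   1, 2, 3), ncol = 2,
                 dimnames = list(c("r1", "r2", "r3"), c("y1", "y2"))),
  neiIdsTrgs = data.frame(Id.k1 = c("r2", "r3"),
                          row.names = c("t1", "t2"),
                          stringsAsFactors = FALSE),
  neiDstTrgs = data.frame(Dst.k1 = c(0.14, 0.32),
                          row.names = c("t1", "t2"))
)

# For k = 1, each target takes the Y-values of its single nearest reference.
imputed <- fit$yRefs[fit$neiIdsTrgs[, 1], , drop = FALSE]
rownames(imputed) <- rownames(fit$neiIdsTrgs)
imputed
```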

Author(s)

Nicholas L. Crookston ncrookston@fs.fed.us
Andrew O. Finley afinley@stat.umn.edu

Examples

    
    library(yaImpute)
    
    # running these examples will load packages vegan and randomForest
    
    data(MoscowMtStJoe)
    
    # convert polar slope and aspect measurements to cartesian
    # (which is the same as Stage's (1976) expression).
    
    polar <- MoscowMtStJoe[,40:41]
    polar[,1] <- polar[,1]*.01      # slope proportion
    polar[,2] <- polar[,2]*(pi/180) # aspect radians
    cartesian <- t(apply(polar,1,function (x)
                   {return (c(x[1]*cos(x[2]),x[1]*sin(x[2]))) }))
    colnames(cartesian) <- c("xSlAsp","ySlAsp")
    x <- cbind(MoscowMtStJoe[,37:39],cartesian,MoscowMtStJoe[,42:64])
    y <- MoscowMtStJoe[,1:35]
    
    mal <- yai(x=x, y=y, method="mahalanobis", k=1)
    gnn <- yai(x=x, y=y, method="gnn", k=1)
    msn <- yai(x=x, y=y, method="msn", k=1)
    
    plot(mal)
    
    # reduce the plant community data for randomForest.
    yba  <- MoscowMtStJoe[,1:17]
    ybaB <- whatsMax(yba,nbig=7)  # see help on whatsMax
    
    rf <- yai(x=x, y=ybaB, method="randomForest", k=1)
    
    # build the imputations for the original y's
    rforig <- impute(rf,ancillaryData=y)
    
    # compare the results
    compare.yai(mal,gnn,msn,rforig)
    plot(compare.yai(mal,gnn,msn,rforig))
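The polar-to-cartesian conversion of slope and aspect at the start of the examples can also be written without apply. The sketch below uses made-up slope and aspect values rather than MoscowMtStJoe, and the names slopePct and aspectDeg are hypothetical; it is meant only to make the transformation easy to check by hand.

```r
# Made-up inputs: slope in percent, aspect in degrees.
slopePct  <- c(10, 50, 100)
aspectDeg <- c(0, 90, 180)

slope  <- slopePct * 0.01        # slope proportion
aspect <- aspectDeg * (pi / 180) # aspect in radians

# Vectorized conversion to cartesian coordinates.
cartesian <- cbind(xSlAsp = slope * cos(aspect),
                   ySlAsp = slope * sin(aspect))
cartesian
```

The length of each cartesian vector, sqrt(xSlAsp^2 + ySlAsp^2), recovers the slope proportion, which is a quick way to confirm the conversion.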
    

[Package yaImpute version 0.0-3 Index]