maxsel.test {exactmaxsel}    R Documentation

Test of independence based on maximally selected statistics

Description

The function maxsel.test computes the probability that the maximally selected criterion is less than or equal to the value observed in the data, under the null hypothesis of no association between X and Y, given the numbers of observations with Y=0, Y=1, X=1,...,X=K. The candidate binary splits over which the criterion is maximized depend on type (see Details). If p denotes the output of maxsel.test, 1-p can be interpreted as the p-value of an independence test.
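To make "maximally selected criterion" concrete, the following base R sketch computes the largest chi-square statistic over all single-cutpoint splits of a 2 x K contingency table. The helper max_chi2_ord is hypothetical and not part of exactmaxsel; it assumes the Pearson chi-square statistic without continuity correction, which may differ from the exact definition used inside the package. It yields only the observed maximum, not its distribution under the null hypothesis.

```r
# Hypothetical helper (not part of exactmaxsel): largest chi-square statistic
# over all single-cutpoint splits of a 2 x K contingency table.
max_chi2_ord <- function(tab) {
  K <- ncol(tab)
  stats <- sapply(seq_len(K - 1), function(k) {
    # Collapse columns 1..k versus k+1..K into a 2 x 2 table
    left  <- rowSums(tab[, seq_len(k), drop = FALSE])
    right <- rowSums(tab[, (k + 1):K, drop = FALSE])
    collapsed <- cbind(left, right)
    # Pearson chi-square without continuity correction (an assumption)
    suppressWarnings(chisq.test(collapsed, correct = FALSE)$statistic)
  })
  max(stats)
}

# Illustrative 2 x 3 table: rows are Y=0,1, columns are X=1,2,3
tab <- matrix(c(8, 10, 40,
                13, 15, 4), nrow = 2, byrow = TRUE)
max_chi2_ord(tab)
```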

Usage

maxsel.test(x, y=NULL, type, statistic)

Arguments

x a numeric vector of length n giving the values of the variable X for the considered n observations. The classes must be coded as 1,...,K. Alternatively, x can be a 2 x K matrix corresponding to a contingency table, where the two rows are for the values of Y (Y=0,1) and the K columns are for the values of X (X=1,...,K). In this case, y must be set to y=NULL.
y a numeric vector of length n giving the class (response variable Y) of the considered observations. The classes must be coded as 0 and 1. If x is a contingency table, y must be set to y=NULL.
type the type of the considered binary splits. type="ord" corresponds to an ordinal X variable, type="cat" corresponds to a categorical X variable with unordered categories, type="ord2" corresponds to an ordinal X variable with 2 cutpoints (non-monotonic association).
statistic the association measure used as criterion to select the best split. Currently, only statistic="chi2" (chi-square statistic) and statistic="gini" (the Gini gain from machine learning) are implemented.
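The two input forms for x are related by a simple cross-tabulation. As a minimal sketch in base R, data vectors coded as described above can be turned into a 2 x K matrix with rows for Y=0,1 and columns for X=1,...,K. The seed and the variable names tab and x_tab are illustrative assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
K <- 4
x <- sample(K, 30, replace = TRUE)          # X coded as 1,...,K
y <- sample(c(0, 1), 30, replace = TRUE)    # Y coded as 0 and 1

# Fix the factor levels so that empty categories still get a row or column
tab <- table(factor(y, levels = 0:1), factor(x, levels = seq_len(K)))

# Plain 2 x K matrix: row 1 is Y=0, row 2 is Y=1
x_tab <- matrix(as.vector(tab), nrow = 2, ncol = K)
x_tab
```

A matrix built this way has the layout that the x argument expects when y=NULL.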

Details

The argument type determines the set of candidate binary splits. For example, consider a variable X with the possible values {1,2,3,4}. If type="ord", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4} and {1,2,3}{4}. If type="cat", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4}, {1,2,3}{4}, {1,2,4}{3}, {1,4}{2,3}, {1,3,4}{2}, {1,3}{2,4}. If type="ord2", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4}, {1,2,3}{4}, {1,2,4}{3}, {1,4}{2,3}, {1,3,4}{2}.
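The three split families can be enumerated for any K with a few lines of base R. The helper candidate_splits below is a hypothetical illustration, not a function of exactmaxsel; each split is represented by one of its two sides, the other side being the complement in {1,...,K}.

```r
# Hypothetical helper (not part of exactmaxsel): list the candidate binary
# splits of categories 1..K, each represented by one side of the split.
candidate_splits <- function(K, type = c("ord", "cat", "ord2")) {
  type <- match.arg(type)
  stopifnot(K >= 2)
  cats <- seq_len(K)
  # One cutpoint: {1..k} versus {k+1..K}
  one_cut <- lapply(seq_len(K - 1), function(k) cats[cats <= k])
  if (type == "ord") return(one_cut)
  if (type == "ord2") {
    # Two cutpoints: a middle block {i..j} with 1 < i <= j < K versus the rest
    two_cut <- list()
    if (K >= 3) {
      for (i in 2:(K - 1)) for (j in i:(K - 1)) {
        two_cut[[length(two_cut) + 1]] <- i:j
      }
    }
    return(c(one_cut, two_cut))
  }
  # Unordered categories: every proper subset that contains category 1
  # (fixing category 1 on one side avoids counting a split twice)
  subsets <- list()
  others <- cats[-1]
  for (code in 0:(2^(K - 1) - 2)) {
    # Binary digits of 'code' decide which other categories join category 1
    pick <- (code %/% 2^(seq_len(K - 1) - 1)) %% 2 == 1
    subsets[[length(subsets) + 1]] <- c(1L, others[pick])
  }
  subsets
}

# Number of candidate splits for K = 4 under each type
sapply(c("ord", "cat", "ord2"), function(t) length(candidate_splits(4, t)))
```

For K=4 this gives 3, 7 and 6 candidate splits for "ord", "cat" and "ord2" respectively. In general there are K-1 splits for "ord" and 2^(K-1)-1 for "cat", so the unordered case grows quickly with K.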

Value

The probability that the maximally selected criterion is less than or equal to the value observed in the data, under the null hypothesis of no association between X and Y, given the numbers of observations with Y=0, Y=1, X=1,...,X=K.
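Since the returned value is a lower-tail probability, a p-value is obtained by taking the complement. The sketch below uses only the documented call signature and is guarded so that it does nothing when exactmaxsel is not installed; the table and the names p and p_value are illustrative assumptions. Note that 1-p is the probability of a strictly larger criterion, so for a discrete distribution it may be slightly smaller than the probability of a value at least as large as the observed one.

```r
# Illustrative 2 x 3 contingency table: rows are Y=0,1, columns are X=1,2,3
tab <- matrix(c(8, 10, 40,
                13, 15, 4), nrow = 2, byrow = TRUE)

# Run only if the package is available
if (requireNamespace("exactmaxsel", quietly = TRUE)) {
  # p = P(maximally selected criterion <= observed value) under the null
  p <- exactmaxsel::maxsel.test(x = tab, y = NULL,
                                type = "ord", statistic = "chi2")
  # Complement, interpreted as the p-value of the independence test
  p_value <- 1 - p
  print(p_value)
}
```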

Author(s)

Anne-Laure Boulesteix (http://www.slcmsr.net/boulesteix)

References

A.-L. Boulesteix (2006), Maximally selected chi-square statistics for ordinal variables, Biometrical Journal 48:451-462.

A.-L. Boulesteix (2006), Maximally selected chi-square statistics and binary splits of nominal variables, Biometrical Journal 48:838-848.

C. Strobl, A.-L. Boulesteix and T. Augustin (2007), Unbiased split selection for classification trees based on the Gini index, Computational Statistics and Data Analysis (in press).

A.-L. Boulesteix and C. Strobl (2006), Maximally selected chi-square statistics and umbrella orderings, Computational Statistics and Data Analysis (in press).

See Also

maxsel.

Examples

# Load the exactmaxsel package
library(exactmaxsel)

# First case: x and y are data vectors
# Simulate x and y
x<-sample(4,30,replace=TRUE)
y<-sample(c(0,1),30,replace=TRUE)

maxsel.test(x=x,y=y,type="ord",statistic="chi2")
maxsel.test(x=x,y=y,type="cat",statistic="gini")

# Second case: x is a 2 x K contingency table (here K=3), y=NULL.
# The six counts fill a 2 x 3 matrix row by row.
x<-matrix(c(8,10,40,13,15,4),2,3,byrow=TRUE)
maxsel.test(x=x,y=NULL,type="ord",statistic="chi2")
maxsel.test(x=x,y=NULL,type="cat",statistic="gini")



[Package exactmaxsel version 1.0-2 Index]