maxsel.test {exactmaxsel} | R Documentation |
The function maxsel.test computes the probability that the maximally selected criterion is less than or equal to the value observed from the data, under the null hypothesis of no association between X and Y, given the numbers of observations with Y=0, Y=1, X=1,...,X=K.
The candidate binary splits over which the criterion is maximized depend on type (see details). If p denotes the output of the function maxsel.test, then 1-p may be interpreted as the p-value of an independence test.
maxsel.test(x, y=NULL, type, statistic)
x | a numeric vector of length n giving the values of the variable X for the n observations considered. The classes must be coded as 1,...,K. Alternatively, x can be a 2 x K matrix corresponding to a contingency table, where the two rows correspond to the values of Y (Y=0,1) and the K columns to the values of X (X=1,...,K). In this case, y must be set to y=NULL. |
y | a numeric vector of length n giving the class (response variable Y) of the observations considered. The classes must be coded as 0 and 1. If x is a contingency table, y must be set to y=NULL. |
type | the type of the binary splits considered. type="ord" corresponds to an ordinal X variable, type="cat" corresponds to a categorical X variable with unordered categories, and type="ord2" corresponds to an ordinal X variable with 2 cutpoints (non-monotonic association). |
statistic | the association measure used as the criterion to select the best split. Currently, only statistic="chi2" (chi-square statistic) and statistic="gini" (the Gini gain from machine learning) are implemented. |
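To make the two criteria concrete, the following sketch computes the chi-square statistic and a Gini gain for one binary split of a 2 x K contingency table. The helper split_stats() is hypothetical and not part of exactmaxsel. The Gini gain is written with the usual binary impurity 2p(1-p), which is an assumption; the package's internal definition may differ, for example by a constant factor.

```r
# split_stats() is a hypothetical helper, not part of exactmaxsel.
# tab:  2 x K contingency table (row 1: Y=0, row 2: Y=1)
# left: categories of X (column indices) forming the left side of the split
split_stats <- function(tab, left) {
  right <- setdiff(seq_len(ncol(tab)), left)
  # Collapse the 2 x K table into the 2 x 2 table induced by the split
  t2 <- cbind(rowSums(tab[, left, drop = FALSE]),
              rowSums(tab[, right, drop = FALSE]))
  n  <- sum(t2)
  r  <- rowSums(t2)   # numbers of observations with Y=0 and Y=1
  cs <- colSums(t2)   # numbers of observations on the left and right side
  # Chi-square statistic of a 2 x 2 table (NaN if a margin is zero)
  chi2 <- n * (t2[1, 1] * t2[2, 2] - t2[1, 2] * t2[2, 1])^2 /
    (prod(r) * prod(cs))
  # Gini impurity of a vector of (Y=0, Y=1) counts, assumed form 2p(1-p)
  gini <- function(counts) {
    p <- counts[2] / sum(counts)
    2 * p * (1 - p)
  }
  # Gini gain: impurity of the parent minus weighted impurity of the children
  gain <- gini(r) - (cs[1] / n) * gini(t2[, 1]) - (cs[2] / n) * gini(t2[, 2])
  c(chi2 = unname(chi2), gini = unname(gain))
}

# Example: split {1} versus {2,3} of a 2 x 3 table
tab <- matrix(c(8, 10, 40, 13, 15, 4), 2, 3, byrow = TRUE)
split_stats(tab, left = 1)
```

Maximizing either value over the candidate splits allowed by type gives the maximally selected criterion whose null distribution maxsel.test evaluates.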
As an illustration of the candidate splits, consider a variable X with the possible values {1,2,3,4}.
If type="ord", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4} and {1,2,3}{4}.
If type="cat", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4}, {1,2,3}{4}, {1,2,4}{3}, {1,4}{2,3}, {1,3,4}{2}, {1,3}{2,4}.
If type="ord2", the set of candidate splits consists of {1}{2,3,4}, {1,2}{3,4}, {1,2,3}{4}, {1,2,4}{3}, {1,4}{2,3}, {1,3,4}{2}.
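The three sets of candidate splits can be enumerated for any K with a few lines of base R. This is only a sketch: candidate_splits() is a hypothetical helper, not a function of exactmaxsel, and each split is represented by one of its two sides (the other side is the complement in 1,...,K).

```r
# candidate_splits() is a hypothetical helper, not part of exactmaxsel.
# Returns a list of integer vectors; each vector is one side of a binary split.
candidate_splits <- function(K, type = c("ord", "cat", "ord2")) {
  type <- match.arg(type)
  cats <- seq_len(K)
  if (type == "ord") {
    # One cutpoint: {1,...,k} versus {k+1,...,K}
    return(lapply(seq_len(K - 1), function(k) seq_len(k)))
  }
  if (type == "ord2") {
    # Up to two cutpoints: a contiguous block {a,...,b} versus the rest.
    # Restricting to b < K avoids listing a split and its complement twice.
    splits <- list()
    for (b in seq_len(K - 1)) {
      for (a in seq_len(b)) {
        splits[[length(splits) + 1L]] <- a:b
      }
    }
    return(splits)
  }
  # type == "cat": every proper subset that contains category 1.
  # The remaining K-1 categories are switched on or off by the binary
  # digits of m; the full set (all digits 1) is excluded.
  n_splits <- 2^(K - 1) - 1
  lapply(seq_len(n_splits) - 1, function(m) {
    in_side <- (m %/% 2^(seq_len(K - 1) - 1)) %% 2 == 1
    c(1L, cats[-1][in_side])
  })
}

# Number of candidate splits for K = 4: ord = 3, cat = 7, ord2 = 6
sapply(c("ord", "cat", "ord2"), function(t) length(candidate_splits(4, t)))
```

For K=4 the counts agree with the lists above; in general there are K-1 splits for "ord", 2^(K-1)-1 for "cat" and K(K-1)/2 for "ord2".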
The probability that the maximally selected criterion is less than or equal to the value observed from the data, under the null hypothesis of no association between x and y, given the numbers of observations with Y=0, Y=1, X=1,...,X=K.
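The returned probability can be checked roughly by simulation. The sketch below is not the exact method used by maxsel.test; it is a Monte Carlo approximation for type="ord" with the chi-square criterion, assuming that the null distribution given the numbers of observations with Y=0, Y=1, X=1,...,X=K corresponds to random permutations of y. The helper max_chi2_ord() and the number of permutations B are illustrative choices, not part of exactmaxsel.

```r
# max_chi2_ord() is a hypothetical helper, not part of exactmaxsel.
# Maximal chi-square statistic over the splits {1,...,k} versus {k+1,...,K}.
max_chi2_ord <- function(x, y, K) {
  n  <- as.numeric(length(y))
  n1 <- as.numeric(sum(y == 1))
  n0 <- n - n1
  stats <- sapply(seq_len(K - 1), function(k) {
    left <- x <= k
    nL <- as.numeric(sum(left))
    nR <- n - nL
    # Degenerate split or response: no association can be measured
    if (nL == 0 || nR == 0 || n0 == 0 || n1 == 0) return(0)
    a  <- as.numeric(sum(left & y == 1))  # Y=1 on the left side
    b  <- nL - a                          # Y=0 on the left side
    cc <- n1 - a                          # Y=1 on the right side
    d  <- nR - cc                         # Y=0 on the right side
    n * (a * d - b * cc)^2 / (nL * nR * n0 * n1)
  })
  max(stats)
}

set.seed(1)
x <- sample(4, 30, replace = TRUE)
y <- sample(c(0, 1), 30, replace = TRUE)

obs <- max_chi2_ord(x, y, K = 4)

# Permuting y keeps all the conditioning counts fixed
B <- 2000
perm <- replicate(B, max_chi2_ord(x, sample(y), K = 4))

# Approximate P(maximally selected chi-square <= observed value)
p_hat <- mean(perm <= obs + 1e-9)
p_hat
```

Under these assumptions p_hat should lie close to the output of maxsel.test(x=x, y=y, type="ord", statistic="chi2") for large B, and 1-p_hat approximates the corresponding p-value.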
Anne-Laure Boulesteix (http://www.slcmsr.net/boulesteix)
A.-L. Boulesteix (2006), Maximally selected chi-square statistics for ordinal variables, Biometrical Journal 48:451-462.
A.-L. Boulesteix (2006), Maximally selected chi-square statistics and binary splits of nominal variables, Biometrical Journal 48:838-848.
C. Strobl, A.-L. Boulesteix and T. Augustin (2007), Unbiased split selection for classification trees based on the Gini index, Computational Statistics and Data Analysis (in press).
A.-L. Boulesteix and C. Strobl (2006), Maximally selected chi-square statistics and umbrella orderings, Computational Statistics and Data Analysis (in press).
# Load the exactmaxsel package
library(exactmaxsel)

# First case: x and y are data vectors
# Simulate x and y
x <- sample(4, 30, replace=TRUE)
y <- sample(c(0,1), 30, replace=TRUE)
maxsel.test(x=x, y=y, type="ord", statistic="chi2")
maxsel.test(x=x, y=y, type="cat", statistic="gini")

# Second case: x is a contingency table, y=NULL.
# The six counts fill a 2 x 3 table (rows: Y=0,1; columns: X=1,2,3)
x <- matrix(c(8,10,40,13,15,4), 2, 3, byrow=TRUE)
maxsel.test(x=x, y=NULL, type="ord", statistic="chi2")
maxsel.test(x=x, y=NULL, type="cat", statistic="gini")