pcSelect {pcalg} | R Documentation |
This function is intended for feature selection: given a response variable y and a data matrix dm, which columns of dm are "strongly influential" on y? The type of influence is the same as in the PC-Algorithm, i.e., y and x (a column of dm) are considered associated if they remain correlated when conditioning on any subset of the remaining columns of dm. Therefore, only very strong relations will be found, and the result is typically a subset of what other feature selection techniques select. Robust correlation estimates are also available (see corMethod), which make the method robust.
pcSelect(y, dm, alpha, corMethod = "standard", verbose = 0, directed = FALSE)
y |
Response Vector (length(y)=nrow(dm)) |
dm |
Data matrix (rows: samples, cols: nodes) |
alpha |
Significance level of individual partial correlation tests |
corMethod |
"standard" or "Qn" for standard or robust correlation estimation |
verbose |
0: no output; 1: brief output; 2: details (levels 1 and 2 make the function much slower) |
directed |
Boolean; should the output graph be directed? |
This function basically applies pcAlgo on the data matrix obtained by joining y and dm. Since the output is not concerned with the edges found within the columns of dm, the algorithm is adapted accordingly. As a result, it is typically much faster and can handle considerably larger datasets.
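The adaptation described above can be sketched in base R. This is only an illustration under stated assumptions, not the pcalg implementation: the helper names `z_stat` and `select_sketch` are hypothetical, plain Pearson correlation is used, and each test is assumed to be a two-sided Fisher z-test of a partial correlation. Only pairs (y, column of dm) are tested; edges among the columns of dm are never examined.

```r
# Hypothetical helper (not part of pcalg): Fisher z statistic for the partial
# correlation of variables i and j given the set S, from a correlation matrix C.
z_stat <- function(C, n, i, j, S = integer(0)) {
  idx <- c(i, j, S)
  P <- solve(C[idx, idx, drop = FALSE])        # precision matrix of the sub-block
  r <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])      # partial correlation of i and j given S
  r <- min(max(r, -0.9999999), 0.9999999)      # guard against |r| = 1
  sqrt(n - length(S) - 3) * 0.5 * log((1 + r) / (1 - r))
}

# Hypothetical sketch: keep a column of dm only if it stays significantly
# correlated with y given every tested subset of the other remaining columns.
select_sketch <- function(y, dm, alpha) {
  n <- nrow(dm); p <- ncol(dm)
  C <- cor(cbind(y, dm))                       # y is variable 1, dm columns are 2..p+1
  cutoff <- qnorm(1 - alpha / 2)               # assumed two-sided normal cutoff
  keep <- rep(TRUE, p)
  zMin <- rep(Inf, p)
  ord <- 0                                     # size of the conditioning sets
  while (sum(keep) - 1 >= ord) {
    for (x in which(keep)) {
      others <- setdiff(which(keep), x)
      if (length(others) < ord) next
      sets <- if (ord == 0) list(integer(0)) else
        combn(length(others), ord, FUN = function(k) others[k], simplify = FALSE)
      for (S in sets) {
        z <- abs(z_stat(C, n, 1, x + 1, S + 1))
        zMin[x] <- min(zMin[x], z)
        if (z <= cutoff) { keep[x] <- FALSE; break }  # independence not rejected: drop x
      }
    }
    ord <- ord + 1
  }
  names(keep) <- names(zMin) <- colnames(dm)
  list(G = keep, zMin = zMin)
}

# Simulated data: y depends directly on x1 only
set.seed(1)
n  <- 2000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)    # related to y only through x1
x3 <- rnorm(n)         # unrelated noise
y  <- 2 * x1 + rnorm(n)
res <- select_sketch(y, cbind(x1, x2, x3), alpha = 0.01)
res$G
res$zMin
```

In this simulation x1 should be kept, while x2 (marginally correlated with y, but not once x1 is conditioned on) and x3 will usually be dropped. Because only p pairs are tested instead of all pairs among p + 1 variables, far fewer tests are needed than for a full skeleton search.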
G |
A boolean vector indicating which columns of dm are associated with y |
zMin |
The minimal z-values obtained when testing partial correlations between y and each column of dm. The larger the value, the more consistent the edge is with the data. |
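As a rough guide to reading zMin, the values can be compared with a standard normal quantile. The snippet below is a minimal sketch assuming a two-sided test at level alpha; the zMin values are made up for illustration and are not output of pcSelect.

```r
alpha  <- 0.05
cutoff <- qnorm(1 - alpha / 2)           # two-sided critical value, about 1.96

# Made-up zMin values for three hypothetical columns of dm
zMin <- c(a = 0.42, b = 3.10, c = 21.32)

# Columns whose smallest z-value still exceeds the cutoff
strong <- zMin > cutoff
strong
```

Under this assumption, a column is only flagged when even its weakest partial correlation test exceeds the cutoff, which is why values far above the cutoff indicate an edge that is very evident from the data.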
Markus Kalisch (kalisch@stat.math.ethz.ch) and Martin Maechler.
P. Spirtes, C. Glymour and R. Scheines (2000) Causation, Prediction, and Search, 2nd edition, The MIT Press.
pcAlgo, which is the more general version of this function.
library(pcalg)

p <- 10
## generate and draw random DAG:
set.seed(101)
myDAG <- randomDAG(p, prob = 0.2)
plot(myDAG, main = "randomDAG(10, prob = 0.2)")

## generate 1000 samples of DAG using standard normal error distribution
n <- 1000
d.mat <- rmvDAG(n, myDAG, errDist = "normal")

## let's pretend that the 10th column is the response and the first 9
## columns are explanatory variables. Which of the first 9 variables
## "cause" the tenth variable?
y <- d.mat[, 10]
dm <- d.mat[, -10]
pcSelect(y, dm, alpha = 0.05)

## You can see that variables 4, 5 and 6 are considered important.
## By inspecting zMin you can also see that the influence of variable 6
## is very evident from the data (zMin is 21.32, so quite large; as a
## rule of thumb for judging what is large, you could use quantiles of
## the Standard Normal Distribution).

## The result should be the same when using pcAlgo
resU <- pcAlgo(d.mat, alpha = 0.05, corMethod = "standard", directed = TRUE)
resU
plot(resU, zvalue.lwd = TRUE)
## As can be seen, the pcAlgo function also finds 4, 5 and 6 as the
## important variables.
## Again, variable 6 seems to be very evident from the data.