decHeur {pcalg}    R Documentation
Simple Heuristic for deciding whether the robust PC-algorithm should be used.
Usage

decHeur(dat, gam = 0.05, sim.method = "t", est.method = "o", n.sim = 100,
        two.sided = FALSE, verbose = FALSE)
Arguments

dat          Data matrix (columns = variables, rows = samples)
gam          Significance level of the test
sim.method   Reference distribution: "n" for normal, "t" for normal with
             10% outliers from a $t_3$ distribution
est.method   Estimation method for the correlation matrix: "s" for the
             standard estimate, "o" for the robust OGK estimate using the
             Qn scale estimator
n.sim        Number of samples drawn from the reference distribution
two.sided    Should a two-sided test be used?
verbose      If TRUE, run in verbose mode
Details

Simulation studies show that the standard PC-algorithm is already rather insensitive to outliers, provided they are not too severe. The effect of very heavy outliers can be reduced dramatically by using the robust PC-algorithm; however, this increases the computational burden by roughly one order of magnitude.
We provide a simple method for deciding whether the data at hand contain worse outliers than a given reference distribution. Based on this, we suggest two heuristics for deciding whether to use the robust version of the PC-algorithm. On the one hand, one could use the normal distribution as reference distribution and apply the robust PC-algorithm to all data sets that seem to contain more outliers than an appropriate normal distribution (Heuristic A). On the other hand, inspired by the results of simulation studies, one might want to apply the robust method only when the contamination is worse than that of a normal distribution with 10% outliers from a $t_3$ distribution; in that case, a normal distribution with 10% outliers from a $t_3$ distribution serves as reference distribution (Heuristic B). The two heuristics are illustrated below.
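As a quick illustration, the two heuristics correspond to the two settings of the sim.method argument documented above (a sketch; dat stands for an arbitrary data matrix):

## Heuristic A: reference distribution is a plain normal distribution
decHeur(dat, sim.method = "n")
## Heuristic B: reference distribution is a normal with 10% t_3 outliers
decHeur(dat, sim.method = "t")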
In order to decide whether the data have worse outliers than a given reference distribution, we proceed as follows. We compute a robust estimate of the covariance matrix of the data (e.g., OGK with the Qn-estimator) and repeatedly simulate data from the reference distribution with this covariance matrix. For each dimension $i$ ($1 \le i \le p$), we compute the ratio of the standard deviation $\sigma_i$ to a robust scale estimate $s_i$ (e.g., the Qn-estimator) and average over all dimensions. (Since the main input of the PC-algorithm are correlation estimates, which can be expressed in terms of scale estimates, we base the test statistic on scale estimates.) Thus, we obtain the distribution of the averaged ratio $R = \frac{1}{p} \sum_{i=1}^{p} \sigma_i / s_i$ under the null hypothesis that the data can be explained by the reference distribution with the given covariance matrix. We can then test this null hypothesis at a given significance level by comparing the ratio computed on the data set at hand, $r = \frac{1}{p} \sum_{i=1}^{p} \hat{\sigma}_i / \hat{s}_i$, with the simulated null distribution.
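The following is a minimal sketch of this procedure for the one-sided test with a normal reference distribution (Heuristic A), assuming the robustbase package (for Qn, covOGK, and s_Qn) and MASS (for mvrnorm) are available; the helper names avgRatio and simNullRatios are illustrative, not the package's internal code:

library(robustbase)  # Qn scale estimator, OGK covariance
library(MASS)        # mvrnorm for sampling from the reference distribution

## average over dimensions of (classical sd) / (robust Qn scale)
avgRatio <- function(X) mean(apply(X, 2, sd) / apply(X, 2, Qn))

## null distribution of the averaged ratio under a normal reference
simNullRatios <- function(Sigma, n, n.sim = 100)
  replicate(n.sim,
            avgRatio(mvrnorm(n, mu = rep(0, ncol(Sigma)), Sigma = Sigma)))

## dat is a placeholder data matrix (rows = samples, columns = variables)
C.rob <- covOGK(dat, sigmamu = s_Qn)$cov   # robust covariance (OGK with Qn)
tvec  <- simNullRatios(C.rob, n = nrow(dat), n.sim = 100)
tval  <- avgRatio(dat)
## one-sided test at level gam = 0.05: large ratios indicate heavy tails
outlier <- tval > quantile(tvec, 1 - 0.05)

For Heuristic B, one would instead draw the simulated samples from a normal distribution contaminated with 10% observations from a $t_3$ distribution.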
Value

tvec      Simulated values of the test statistic
tval      Observed value of the test statistic
outlier   Logical: TRUE if the robust method is suggested
Author(s)

Markus Kalisch (kalisch@stat.math.ethz.ch)
See Also

pcAlgo, which can be used with both the standard and the robust correlation estimate.
Examples

set.seed(123)
## generate a data set for the example
p <- 5
myDAG <- randomDAG(p, prob = 0.6)
n <- 1000
## data without outliers
datN <- rmvDAG(n, myDAG, errDist = "normal")
## data with severe outliers (10% Cauchy)
datC <- rmvDAG(n, myDAG, errDist = "mix")
n.sim <- 20
gam <- 0.05
sim.method <- "t"
est.method <- "o"
decHeur(datN, gam, sim.method, est.method, n.sim = n.sim, two.sided = FALSE, verbose = TRUE)
decHeur(datC, gam, sim.method, est.method, n.sim = n.sim, two.sided = FALSE, verbose = TRUE)