EB.Anova {SharedHT2} | R Documentation |
EB.Anova calculates a per gene empirical Bayes ANOVA statistic for
testing a variety of hypotheses in replicated array experiments
involving group comparisons (Izmirlian & Xu; see the manuscript in
the ./doc directory). It is also appropriate for time course data.
The empirical Bayes statistic is calculated using the ordinary group
means and a shrinkage variance estimate formed as the posterior mean
of the variance. If Var.Struct is set to "general", this is the
posterior mean of the variance/covariance matrix under the
Wishart/Inverse Wishart Bayesian model. If Var.Struct is set to
"simple", this is the posterior mean of the normalized within group
sum of squares under a Chi-squared/Inverse Gamma Bayesian model. In
both cases, the parameters of the prior distribution are fit by
maximum likelihood applied to the per gene residual squared error
(a scalar or a matrix, corresponding to the two cases above).
EB.Anova(data, labels, H0 = "equal.means", Var.Struct = "general", verbose = TRUE, subset, theta0 = NULL, gradient = FALSE, fit.only = FALSE, na.action = na.pass)
data |
a data frame containing the logged (base 2 or
base 10) expression values for all arrays in the experiment.
By default the variables are taken from the environment from which
EB.Anova is called. Variable names should be
chosen to be internally consistent in some searchable way.
For example, if you have d = 2 experimental groups (say
treatment one versus control and treatment two versus control)
and n = 3 replicates in each group, you might choose names
like: log2.grp1.n1, log2.grp2.n1, log2.grp1.n2,
log2.grp2.n2, log2.grp1.n3, log2.grp2.n3.
Notice that the order in which the names occur is irrelevant. In time
course data the time point is the grouping variable. The rows
should be named using the gene identifiers. |
labels |
A character vector containing the group names, these
being fragments of the variable names in the data argument
supplied. In the example above, labels = c("log2.grp1", "log2.grp2") |
H0 |
either a character string giving the form of the null
hypothesis to be tested, or a contrasts matrix. When
Var.Struct="general": (i) if #reps > #groups then the
H0="zero.means" null may be tested; (ii) if #reps >
#groups - 1 then the H0="equal.means" null may be tested;
(iii) if #reps >= 2 then the H0="no.trend" null may be tested.
Alternatively, H0 may be a user specified contrasts matrix
having #groups columns and rank less than #replicates. When
Var.Struct="simple", any of the above may be tested
as long as n > 2.
|
Var.Struct |
set to either "general" or "simple".
The default, "general", fits the Wishart/Inverse Wishart model
and computes per gene Empirical Bayes Hotelling T-Squared tests.
The "simple" option assumes equal group variances, fits the
Chi-Squared/Inverse Gamma model and computes per gene Empirical
Bayes F-tests (or Univariate T-Squared). |
verbose |
set to TRUE to print a trace of the optimization procedure.
Set to TRUE by default. |
subset |
an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.) |
theta0 |
optional values for the starting parameters. Must be of
length d*(d+1)/2 + 1 |
gradient |
set to TRUE to optimize using methods requiring
analytic derivatives. Set to FALSE by default. The Nelder-Mead
method converges from any starting position and fits in less than
ten seconds on a Pentium 4. The likelihood surface is sharply peaked
because there is such an abundance of data on the error structure. |
fit.only |
set to TRUE if you only want the result of the
model fit and not the list of per gene statistics. Set to
FALSE by default. |
na.action |
set to na.pass if you want NAs to be
treated as missing at random. This works as long as all genes have
the minimum number of replicates required for the particular null
hypothesis specified. |
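The relationship between labels and the column names of data can be illustrated with base R alone. The sketch below is not the package's internal code: it assumes, hypothetically, that a column belongs to a group when its name contains the group's label as a literal substring. The toy data frame, its gene names and its values are made up for illustration.

```r
# Toy data: 4 genes, d = 2 groups, n = 3 replicates (arbitrary values)
set.seed(1)
cols <- c("log2.grp1.n1", "log2.grp2.n1", "log2.grp1.n2",
          "log2.grp2.n2", "log2.grp1.n3", "log2.grp2.n3")
toy <- as.data.frame(matrix(rnorm(4 * length(cols)), nrow = 4,
                            dimnames = list(paste0("gene", 1:4), cols)))

labels <- c("log2.grp1", "log2.grp2")

# Assumed matching rule (hypothetical): a column belongs to a group if
# its name contains the group's label as a literal substring.
groups <- lapply(labels, function(lab) {
  grep(lab, names(toy), fixed = TRUE, value = TRUE)
})
names(groups) <- labels

# Number of replicate columns found for each group
n.reps <- sapply(groups, length)

# Ordinary per gene group means (one row per gene, one column per group)
group.means <- sapply(groups, function(g) rowMeans(toy[, g, drop = FALSE]))

# Length that the theta0 argument must have, per its description
d <- length(labels)
theta0.len <- d * (d + 1) / 2 + 1
```

Because the columns are located by name, shuffling the columns of toy leaves group.means unchanged, which is the sense in which the order of the names is irrelevant.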
An object of class fit.n.data
containing two components:
data |
A data frame containing the per gene test statistics (both the empirical Bayes and the standard versions) together with corresponding p-values. |
EBfit |
An object of class EBfit containing the results
of the fitted model. Type ?EBfit for details. |
Under the model assumptions, the test statistic has an F distribution
with r numerator degrees of freedom and nu + n - 2*r - 2 denominator
degrees of freedom, where r = d for the "zero.means" test, r = d-1
for the "equal.means" test, and r = 1 for the "no.trend" test.
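As an illustration of the reference distribution stated above, an upper tail p-value can be computed with base R's pf. Everything numeric here is made up: stat is a hypothetical statistic already on the F scale, nu is given an assumed value in place of a fitted one, and n is taken to be the number of replicates per group, which is an assumption about the notation. None of it comes from a real fit.

```r
d  <- 2    # number of groups
n  <- 3    # replicates per group (assumed meaning of n)
nu <- 12   # assumed value standing in for the fitted parameter nu

# Numerator degrees of freedom for each built-in null hypothesis
r <- c(zero.means = d, equal.means = d - 1, no.trend = 1)

# Denominator degrees of freedom, as given in the text above
df2 <- nu + n - 2 * r - 2

# Hypothetical observed statistic and its upper tail F p-values,
# one per null hypothesis
stat <- 4.5
p.value <- pf(stat, df1 = r, df2 = df2, lower.tail = FALSE)
```

With these assumed inputs the denominator degrees of freedom are 9, 11 and 11 for the three hypotheses respectively.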
The test performs quite well: even the asymptotic p-values make
sense and are amenable to FDR control under a variety of departures
from the model. The computation is coded entirely in C. In an
experiment with N = 12625 genes and d = 2 groups with n = 3
replicates per group, the model was fit and the list of statistics
was computed in less than 10 seconds on a Pentium 4.
Grant Izmirlian izmirlian@nih.gov
Izmirlian, G and Xu, J.-L. (2002), The Shrinkage Variance Hotelling T-Squared Test for Genomic Profiling Studies, NCI technical report.
EB.Anova, EBfit, SimAffyDat, TopGenes, SimNorm.IG, SimMVN.IW,
SimMVN.mxIW, SimOneNorm.IG, SimOneMVN.IW, SimOneMVN.mxIW
# The included example dataset is a simulated Affymetrix oligonucleotide
# array experiment. Type ?SimAffyDat for details.
data(SimAffyDat)

## Not run:
# If the two bioconductor packages, "affy" and "hgu95av2" are
# installed, replace the above line with
data(SimAffyDat.ann)
SimAffyDat <- SimAffyDat.ann

# In general if you have a replicated microarray experiment in
# "MyMicroArrayData" and the corresponding bioconductor annotation
# package is "hguFOOBAR" then, after making sure that packages
# "affy" and "hguFOOBAR" are installed, the enhanced functionality
# is turned on by adding an attribute to your dataframe as follows:
attr(MyMicroArrayData, "annotation") <- "hguFOOBAR"
## End(Not run)

# Fit the Wishart/Inverse Wishart empirical Bayes model and derive per gene
# Shared Variance Hotelling T-Squared (ShHT2) statistics.
fit.SimAffyDat <- EB.Anova(data=SimAffyDat, labels=c("log2.grp" %,% (1:2)),
                           H0="zero.means", Var.Struct = "general")

# Top 20 genes (sorted by decreasing ShHT2 statistic) and model summary
fit.SimAffyDat

# Same screen output & opens html browser with genelist linked to GeneCards database.
# Type ?TopGenes for help
# Note: part of the 'enhanced functionality' is floating gene names
# over the links to the gene identifiers but there is more ....
# see the help under TopGenes (well ... more on that later...)
print(fit.SimAffyDat, browse = TRUE)

# Only the genes selected by the Benjamini-Hochberg procedure at FDR=0.05
print(fit.SimAffyDat, FDR=0.05, allsig=TRUE)

# Just the top 35 genes
print(fit.SimAffyDat, n.g = 35)

# In the previous two cases, supplying the argument 'browse'=TRUE produces
# the expected result. If just the genelist without the model summary is desired
# then use calls to 'TopGenes' instead of calls to 'print' in the above
# with exactly the same syntax otherwise.

# Try the update method with Var.Struct="simple":
fitSV.SimAffyDat <- update(fit.SimAffyDat, Var.Struct = "simple")

# If for some reason, you want the 'EBfit' component then use either
x <- fit.SimAffyDat$EBfit
# or
x <- EBfit(fit.SimAffyDat)

# The 'EBfit' print method supplies the simple summary table of coefficients,
# standard errors and Wald statistic p-values mentioned above.
x

# Notice that the actual structure is more detailed:
names(x)
x$log.likelihood
x$variance

# You can perform non-assignment operations directly on the data component
# of the object, which contains the statistics, unsorted, in the same order
# as the original dataset, using 'as.data.frame':
as.data.frame(fit.SimAffyDat)[1:100, ]