EB.Anova {SharedHT2}    R Documentation

Per Gene Group Comparison with Empirical Bayes Anova Tests

Description

EB.Anova calculates a per gene empirical Bayes Anova statistic for testing a variety of hypotheses in replicated array experiments involving group comparisons (Izmirlian & Xu; see the manuscript in the ./doc directory). It is also appropriate for time course data. The empirical Bayes statistic is calculated using the ordinary group means and a shrinkage variance estimate formed as the posterior mean of the variance. If Var.Struct is set to "general", this is the posterior mean of the variance/covariance matrix under the Wishart/Inverse Wishart Bayesian model. If Var.Struct is set to "simple", this is the posterior mean of the normalized within group sum of squares under a Chi-squared/Inverse Gamma Bayesian model. In both cases, the parameters of the prior distribution are fit by maximum likelihood applied to the per gene residual squared error (a scalar or a matrix, corresponding to the two cases above).
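
As a rough illustration of the "simple" case, the sketch below computes the posterior mean of a variance under a textbook Chi-squared/Inverse Gamma conjugate model. It is not the package's implementation: the function name shrink.var and the prior parameters a and b are hypothetical, the parameterization used inside SharedHT2 may differ, and in EB.Anova the prior parameters are fit by maximum likelihood rather than supplied by hand.

```r
# Posterior mean of sigma^2 under the (assumed) conjugate model
#   SS | sigma^2 ~ sigma^2 * chisq(df)     (within group sum of squares)
#   sigma^2      ~ Inverse-Gamma(shape = a, scale = b)
# The posterior is Inverse-Gamma(a + df/2, b + SS/2), whose mean is
#   (b + SS/2) / (a + df/2 - 1),  provided a + df/2 > 1.
shrink.var <- function(ss, df, a, b) {
  stopifnot(a + df / 2 > 1)
  (b + ss / 2) / (a + df / 2 - 1)
}

# Three hypothetical genes, each with df = 4 residual degrees of freedom.
ss <- c(0.4, 4, 40)
cbind(ordinary = ss / 4, shrunken = shrink.var(ss, df = 4, a = 3, b = 2))
```

With these made-up values the prior mean b / (a - 1) is 1, so the small ordinary variance is pulled up toward 1 and the large one is pulled down, which is the shrinkage behaviour described above.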

Usage


  EB.Anova(data, labels, H0 = "equal.means", Var.Struct = "general",
           verbose = TRUE, subset, theta0 = NULL, gradient = FALSE, 
           fit.only = FALSE, na.action = na.pass)

Arguments

data a data frame containing the logged (base 2 or base 10) expression values in all arrays from the experiment. By default the variables are taken from the environment from which EB.Anova is called. Variable names should be chosen to be internally consistent in some searchable way. For example, if you have d=2 experimental groups (say treatment one versus control and treatment two versus control), and n=3 replicates in each group, you might choose names like: log2.grp1.n1, log2.grp2.n1, log2.grp1.n2, log2.grp2.n2, log2.grp1.n3, log2.grp2.n3. Notice that the order in which the names occur is irrelevant. In time course data the time point is the grouping variable. The rows should be named using the gene identifiers.
labels a character vector containing the group names, which are fragments of the variable names in the data argument supplied. In the example above, labels = c("log2.grp1", "log2.grp2").
H0 either a character string giving the form of the null hypothesis that is to be tested, or a contrasts matrix. Specifically, if Var.Struct="general": (i) if #reps > #groups then the H0="zero.means" null may be tested; (ii) if #reps > #groups - 1 then the H0="equal.means" null may be tested; (iii) if #reps >= 2 then the H0="no.trend" null may be tested. Alternatively, H0 may be a user-specified contrasts matrix having #groups columns and rank less than #replicates. When Var.Struct="simple", any of the above may be tested as long as n > 2.
Var.Struct set to either "general" or "simple". The default, "general", fits the Wishart/Inverse Wishart model and computes per gene Empirical Bayes Hotelling T-Squared tests. The "simple" option assumes equal group variances, fits the Chi-Squared/Inverse Gamma model and computes per gene Empirical Bayes F-tests (or Univariate T-Squared tests).
verbose set to TRUE to print a trace of the optimization procedure. Set to TRUE by default.
subset an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
theta0 optional starting values for the parameters. Must be of length d*(d+1)/2 + 1.
gradient set to TRUE to optimize using methods requiring analytic derivatives. Set to FALSE by default. The Nelder-Mead method converges from any starting position and fits in less than ten seconds on a Pentium 4. The likelihood surface is a giant spike, since there is such an abundance of data on the error structure.
fit.only set to TRUE if you only want the result of the model fit and not the list of per gene statistics. Set to FALSE by default.
na.action set to na.pass if you want NA's to be treated as missing at random. This works as long as all genes have the minimum number of replicates required for the particular null hypothesis specified.
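
To make the naming convention for data and labels concrete, here is a minimal base R sketch. It only shows how label fragments can be matched against column names and how the stated theta0 length is counted. The toy data frame and the object names (toy, cols.by.group, theta0.len) are hypothetical, and the matching that EB.Anova performs internally may differ.

```r
# Toy data frame with d = 2 groups and n = 3 replicates (values are made up).
set.seed(1)
toy <- as.data.frame(matrix(rnorm(5 * 6), nrow = 5))
names(toy) <- c("log2.grp1.n1", "log2.grp2.n1", "log2.grp1.n2",
                "log2.grp2.n2", "log2.grp1.n3", "log2.grp2.n3")
rownames(toy) <- paste0("gene", 1:5)

labels <- c("log2.grp1", "log2.grp2")

# For each label, find the columns whose names contain that fragment.
cols.by.group <- lapply(labels, function(lb) grep(lb, names(toy), fixed = TRUE))
names(cols.by.group) <- labels

d <- length(labels)            # number of groups
n <- lengths(cols.by.group)    # replicates found per group

# Length of theta0 as stated for that argument: d*(d+1)/2 + 1
theta0.len <- d * (d + 1) / 2 + 1
```

Because the match is on fragments, a label such as "log2.grp1" would also pick up a column named "log2.grp10.n1"; choosing names that cannot collide in this way avoids the ambiguity.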

Value

An object of class fit.n.data containing two components:

data A data frame containing the per gene test statistics (both the empirical Bayes and the standard versions) together with corresponding p-values.
EBfit An object of class EBfit containing the results of the fitted model. Type ?EBfit for details.

Note

Under the model assumptions, the test statistic has an F distribution with r numerator degrees of freedom and nu + n - 2*r - 2 denominator degrees of freedom, where r = d for the "zero.means" test, r = d-1 for the "equal.means" test, and r = 1 for the "no.trend" test. The test performs quite well: even the asymptotic p-values make sense and are 'FDR'-able under a variety of departures from the model. It is entirely coded in C. In an experiment with N = 12625 genes and d=2 groups with n=3 replicates per group, the model was fit and the list of statistics was computed in less than 10 seconds on a Pentium 4.
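
The degrees of freedom given above can be written out as a small base R helper. This is only a sketch: the function names eb.f.df and eb.f.pvalue are hypothetical, nu is assumed to be a fitted parameter of the prior (this page does not define it), and the statistic passed in is assumed to be already scaled so that the stated F distribution applies.

```r
# Degrees of freedom for the reference F distribution, following the Note:
# numerator r, denominator nu + n - 2*r - 2.
eb.f.df <- function(nu, n, d, H0 = c("zero.means", "equal.means", "no.trend")) {
  H0 <- match.arg(H0)
  r <- switch(H0, zero.means = d, equal.means = d - 1, no.trend = 1)
  c(num = r, den = nu + n - 2 * r - 2)
}

# Upper-tail p-value of a statistic on that F scale.
eb.f.pvalue <- function(stat, nu, n, d, H0 = "equal.means") {
  df <- eb.f.df(nu, n, d, H0)
  pf(stat, df1 = df[["num"]], df2 = df[["den"]], lower.tail = FALSE)
}

# Made-up values: nu = 10, n = 3 replicates, d = 2 groups.
eb.f.df(nu = 10, n = 3, d = 2, H0 = "zero.means")
```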

Author(s)

Grant Izmirlian izmirlian@nih.gov

References

Izmirlian, G and Xu, J.-L. (2002), The Shrinkage Variance Hotelling T-Squared Test for Genomic Profiling Studies, NCI technical report.

See Also

EB.Anova, EBfit, SimAffyDat, TopGenes, SimNorm.IG, SimMVN.IW, SimMVN.mxIW, SimOneNorm.IG, SimOneMVN.IW, SimOneMVN.mxIW

Examples


# The included example dataset is a simulated Affymetrix oligonucleotide
# array experiment. Type ?SimAffyDat for details.

  data(SimAffyDat)

## Not run: 
# If the two bioconductor packages, "affy" and "hgu95av2" are
# installed, replace the above line with

  data(SimAffyDat.ann)
  SimAffyDat <- SimAffyDat.ann

# In general if you have a replicated microarray experiment in
# "MyMicroArrayData" and the corresponding bioconductor annotation
# package is "hguFOOBAR" then, after making sure that packages
# "affy" and "hguFOOBAR" are installed, the enhanced functionality
# is turned on by adding an attribute to your dataframe as follows:

  attr(MyMicroArrayData, "annotation") <- "hguFOOBAR"

## End(Not run)

# Fit the Wishart/Inverse Wishart empirical Bayes model and derive per gene
# Shared Variance Hotelling T-Squared (ShHT2) statistics.

  fit.SimAffyDat <- EB.Anova(data=SimAffyDat, labels=c("log2.grp" %,% (1:2)),
                             H0="zero.means", Var.Struct = "general")

# Top 20 genes (sorted by decreasing ShHT2 statistic) and model summary

  fit.SimAffyDat

# Same screen output, and also opens an html browser with the genelist linked
# to the GeneCards database.
# Type ?TopGenes for help

# Note:  part of the 'enhanced functionality' is floating gene names
# over the links to the gene identifiers but there is more ....
# see the help under TopGenes (well ... more on that later...)

  print(fit.SimAffyDat, browse = TRUE)

# Only the genes selected by the Benjamini-Hochberg procedure at FDR=0.05

  print(fit.SimAffyDat, FDR=0.05, allsig=TRUE)

# Just the top 35 genes

  print(fit.SimAffyDat, n.g = 35)

# In the previous two cases, supplying the argument 'browse'=TRUE produces
# the expected result. If just the genelist without the model summary is desired
# then use calls to 'TopGenes' instead of calls to 'print' in the above
# with exactly the same syntax otherwise.

# Try the update method with Var.Struct="simple":

  fitSV.SimAffyDat <- update(fit.SimAffyDat, Var.Struct = "simple")

# If for some reason, you want the 'EBfit' component then use either

  x <- fit.SimAffyDat$EBfit   # or
  x <- EBfit(fit.SimAffyDat)

# The 'EBfit' print method supplies the simple summary table of coefficients,
# standard errors and Wald statistic p-values mentioned above.

  x

# Notice that the actual structure is more detailed:

  names(x)
  x$log.likelihood
  x$variance
 
# You can perform non-assignment operations directly on the data component
# of the object using 'as.data.frame'. The data component contains the
# statistics, unsorted, in the same order as the original dataset:

  as.data.frame(fit.SimAffyDat)[1:100, ]
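
The Benjamini-Hochberg selection used in the print call with FDR=0.05 above can be mimicked on any vector of p-values with base R's p.adjust. The sketch below uses made-up p-values and hypothetical gene names; it does not reproduce how the package's print or TopGenes methods select genes internally.

```r
# Hypothetical p-values for six genes (made up for illustration).
p <- c(g1 = 0.001, g2 = 0.2, g3 = 0.012, g4 = 0.8, g5 = 0.04, g6 = 0.03)

# Benjamini-Hochberg adjusted p-values from the stats package.
p.bh <- p.adjust(p, method = "BH")

# Genes selected at FDR = 0.05, ordered by increasing p-value.
selected <- names(sort(p[p.bh <= 0.05]))
selected
```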


[Package SharedHT2 version 2.0 Index]