Abstract

The bioconductor project publishes software and data for the analysis of functional genomics experiments, using a variety of techniques such as microarrays or second-generation sequencing. We will explore the bioconductor tools for expression analysis and pathway analysis and apply them to a demo dataset.

Installing bioconductor

Bioconductor comes with it’s own installation procedures for software and data packages, the biocLite() function. You can use it to install any bioconductor package, along with all prerequisites. For example, to install the limma package, we can use code like this:

source("http://bioconductor.org/biocLite.R")
biocLite("limma")

Installing Data packages

Bioconductor includes many experimental data packages as well, that can be installed with biocLite() function as well:

biocLite("breastCancerMAINZ")
biocLite("hgu133a.db")

Loading packages

Once packages have been installed, you need to load the package into each session that will use them. R-Studio will keep the packages loaded into a workspace, even if you exit and restart the program, as long as you save your workspace.

library(Biobase)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
##     mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist, unsplit
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(limma)
## 
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
library(breastCancerMAINZ)
library(hgu133a.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
## 
## 
data(mainz)

Getting help.

bioconductor has extensive help available for almost all aspects.

?mainz
?ExpressionSet
?limma

Of particular note, are the package vignettes, pdf files with tutorial examples of almost all packages.

Examining the mainz breast cancer experiment

http://cancerres.aacrjournals.org/content/68/13/5405.abstract

dim(mainz)
## Features  Samples 
##    22283      200
colnames(pData(mainz))
##  [1] "samplename"    "dataset"       "series"        "id"           
##  [5] "filename"      "size"          "age"           "er"           
##  [9] "grade"         "pgr"           "her2"          "brca.mutation"
## [13] "e.dmfs"        "t.dmfs"        "node"          "t.rfs"        
## [17] "e.rfs"         "treatment"     "tissue"        "t.os"         
## [21] "e.os"
annotation(mainz)
## [1] "hgu133a"

Plot some data from mainz

I’ll pick a row from the 22,000 probes in the arrays and plot the expression in each array for that gene.

row = 21399
plot(exprs(mainz)[row,])

Let’s divide the arrays into two groups, patients that developed distant metastases within 5 years of diagnosis and those that did not. We can then examine the expression of the selected gene in these groups.

mainz.class <- (pData(mainz)$e.dmfs & pData(mainz)$t.dmfs < 5*365)
mainz.fac <- factor(mainz.class, labels= c("CLEAR","DM"))
stripchart(exprs(mainz)[row,] ~ mainz.fac, method="jitter")

It seems that there is some separation between the groups, but since there is some variance, we can’t tell if the group means are different or not.

You already learned a tool to test if two groups have the same mean or a different mean. Let’s use t.test to examine the data.

t.test(exprs(mainz)[row,] ~ mainz.fac)
## 
##  Welch Two Sample t-test
## 
## data:  exprs(mainz)[row, ] by mainz.fac
## t = -4.963, df = 34.472, p-value = 1.858e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4433663 -0.6050152
## sample estimates:
## mean in group CLEAR    mean in group DM 
##            7.951027            8.975217

Plot all the genes

We don’t have just one gene, we have 22,000 probes, and 200 samples.

Here is an MA plot of all 22,000+ probes in the first of the 200 experiments compared against the average expression in all the experiments.

plotMA(mainz)

Fitting a linear model

Suppose we want to find genes that change expression between the cases that have distant metastases in under 5 years from those that don’t. Instead of running t-tests on 22,000+ genes, we can fit a linear model to the data.

The design matrix we use uses the factor we constructed above to divide the arrays into two groups. We can then construct a contrast matrix to compare expression in the patients with distant metastases to those that did not develop metastases.

The limma User’s Guide contains many case studies with different types of microarray experiments, it is very helpful when designing your own analyses.

design <- model.matrix(~0 + mainz.fac)
colnames(design) <- c("CLEAR", "DM")
fit <- lmFit(exprs(mainz), design)
cont.matrix <- makeContrasts(DMvsCLEAR = DM - CLEAR, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)

The principal contribution of limma is a process for moderating the t-statistic, using information from all the genes to get a better estimate of the standard error and degrees of freedom of each gene.

fit.b <- eBayes(fit2)

Reporting the results

The resulting fit.b object has the results of the linear model fit, and we can produce plots and table summarizing the evidence for differential expression between the conditions.

The plot of the fit object shows the size of the predicted effect on each gene.

plotMA(fit.b)

The volcanoplot shows the log odds of differential expression vs the log ratio of expression.

volcanoplot(fit.b)

The topTable lists the genes ranked by the evidence of differential expression.

topTable(fit.b, adjust = "BH")
##                  logFC   AveExpr         t      P.Value   adj.P.Val
## 222039_at    1.0241907  8.094413  5.387081 1.988696e-07 0.002835900
## 218009_s_at  0.9425805  9.172245  5.300403 3.023342e-07 0.002835900
## 57703_at     0.4721307  7.167348  5.215717 4.531487e-07 0.002835900
## 221760_at   -0.9685860  9.995876 -5.149831 6.188802e-07 0.002835900
## 200830_at    0.3632375 11.382618  5.143923 6.363370e-07 0.002835900
## 204156_at   -0.8234357  7.522004 -5.072697 8.882609e-07 0.003298853
## 202412_s_at  0.5866401  8.633824  4.984981 1.333395e-06 0.003808274
## 204825_at    1.1068133  8.883956  4.974329 1.400342e-06 0.003808274
## 213007_at    0.6347787  8.397143  4.931885 1.700912e-06 0.003808274
## 203799_at   -0.6801316 10.016417 -4.907973 1.896838e-06 0.003808274
##                    B
## 222039_at   6.737055
## 218009_s_at 6.358004
## 57703_at    5.992056
## 221760_at   5.710387
## 200830_at   5.685258
## 204156_at   5.384055
## 202412_s_at 5.017503
## 204825_at   4.973321
## 213007_at   4.797999
## 203799_at   4.699734

Bibliography

citation()
## 
## To cite R in publications use:
## 
##   R Core Team (2015). R: A language and environment for
##   statistical computing. R Foundation for Statistical Computing,
##   Vienna, Austria. URL https://www.R-project.org/.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {R: A Language and Environment for Statistical Computing},
##     author = {{R Core Team}},
##     organization = {R Foundation for Statistical Computing},
##     address = {Vienna, Austria},
##     year = {2015},
##     url = {https://www.R-project.org/},
##   }
## 
## We have invested a lot of time and effort in creating R, please
## cite it when using it for data analysis. See also
## 'citation("pkgname")' for citing R packages.
citation("breastCancerMAINZ")
## 
## To cite package 'breastCancerMAINZ' in publications use:
## 
##   Markus Schroeder, Benjamin Haibe-Kains, Aedin Culhane, Christos
##   Sotiriou, Gianluca Bontempi and John Quackenbush (2011).
##   breastCancerMAINZ: Gene expression dataset published by Schmidt
##   et al. [2008] (MAINZ).. R package version 1.8.0.
##   http://compbio.dfci.harvard.edu/
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {breastCancerMAINZ: Gene expression dataset published by Schmidt et al. [2008]
## (MAINZ).},
##     author = {Markus Schroeder and Benjamin Haibe-Kains and Aedin Culhane and Christos Sotiriou and Gianluca Bontempi and John Quackenbush},
##     year = {2011},
##     note = {R package version 1.8.0},
##     url = {http://compbio.dfci.harvard.edu/},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("hgu133a.db")
## Warning in citation("hgu133a.db"): no date field in DESCRIPTION file of
## package 'hgu133a.db'
## 
## To cite package 'hgu133a.db' in publications use:
## 
##   Marc Carlson (). hgu133a.db: Affymetrix Human Genome U133 Set
##   annotation data (chip hgu133a). R package version 3.2.2.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {hgu133a.db: Affymetrix Human Genome U133 Set annotation data (chip hgu133a)},
##     author = {Marc Carlson},
##     note = {R package version 3.2.2},
##   }
## 
## ATTENTION: This citation information has been auto-generated from
## the package DESCRIPTION file and may need manual editing, see
## 'help("citation")'.
citation("Biobase")
## 
##   Orchestrating high-throughput genomic analysis with
##   Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan
##   Nature Methods, 2015:12, 115.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {{Huber} and {W.} and {Carey} and V. J. and {Gentleman} and {R.} and {Anders} and {S.} and {Carlson} and {M.} and {Carvalho} and B. S. and {Bravo} and H. C. and {Davis} and {S.} and {Gatto} and {L.} and {Girke} and {T.} and {Gottardo} and {R.} and {Hahne} and {F.} and {Hansen} and K. D. and {Irizarry} and R. A. and {Lawrence} and {M.} and {Love} and M. I. and {MacDonald} and {J.} and {Obenchain} and {V.} and {{Ole's}} and A. K. and {{Pag`es}} and {H.} and {Reyes} and {A.} and {Shannon} and {P.} and {Smyth} and G. K. and {Tenenbaum} and {D.} and {Waldron} and {L.} and {Morgan} and {M.}},
##     title = {{O}rchestrating high-throughput genomic analysis with {B}ioconductor},
##     journal = {Nature Methods},
##     year = {2015},
##     volume = {12},
##     number = {2},
##     pages = {115--121},
##     url = {http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html},
##   }
citation("limma")
## 
## Please cite the paper below for the limma software itself.  Please
## also try to cite the appropriate methodology articles that
## describe the statistical methods implemented in limma, depending
## on which limma functions you are using.  The methodology articles
## are listed in Section 2.1 of the limma User's Guide.
## 
##   Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W.,
##   and Smyth, G.K. (2015). limma powers differential expression
##   analyses for RNA-sequencing and microarray studies. Nucleic
##   Acids Research 43(7), e47.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Article{,
##     author = {Matthew E Ritchie and Belinda Phipson and Di Wu and Yifang Hu and Charity W Law and Wei Shi and Gordon K Smyth},
##     title = {{limma} powers differential expression analyses for {RNA}-sequencing and microarray studies},
##     journal = {Nucleic Acids Research},
##     year = {2015},
##     volume = {43},
##     number = {7},
##     pages = {e47},
##   }