November 14, 2014

Bioconductor Project

  • Bioconductor: http://bioconductor.org/
  • 900+ packages for biostatistics
  • Developed by top biostatistics researchers
  • Extensive documentation: emphasizes reproducible research

Installation

  • Install R (I also recommend R Studio)
  • To install or upgrade bioconductor run:

    source("http://bioconductor.org/biocLite.R")
    biocLite()

    you only need to run these commands once, not every session.

Installing Packages

For this workshop we'll need several packages, not all are included in the basic bioconductor installation:

biocLite("limma")
biocLite( "pathview")
biocLite(c("breastCancerMAINZ", "hgu133a.db"))

Loading Packages

In each R session where you will work with bioconductor, you need to load the packages. I'll load in some publicly available data on breast cancer.

library(breastCancerMAINZ)
data(mainz)
library(Biobase)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

Examining mainz Data

Experimental Conditions

Each array is labeled with phenotype data, in this case it includes many variables:

colnames(pData(mainz))
##  [1] "samplename"    "dataset"       "series"        "id"           
##  [5] "filename"      "size"          "age"           "er"           
##  [9] "grade"         "pgr"           "her2"          "brca.mutation"
## [13] "e.dmfs"        "t.dmfs"        "node"          "t.rfs"        
## [17] "e.rfs"         "treatment"     "tissue"        "t.os"         
## [21] "e.os"

Plotting Array Data

Here is the expression of one gene in all 200 arrays.

row = 21399
plot(exprs(mainz)[row,])

Selecting Experiments

Let's build a factor to separate the patients that developed distant metastases within 5 years.

mainz.class <- (pData(mainz)$e.dmfs & pData(mainz)$t.dmfs < 5*365)
mainz.fac <- factor(mainz.class, labels= c("CLEAR","DM"))
sum(mainz.class)
## [1] 28

Detecting Differential Expression

stripchart(exprs(mainz)[row,] ~ mainz.fac, method="jitter")

The limma Package

library(limma)
## 
## Attaching package: 'limma'
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA
?limma

More genes

Each array has 22,000+ genes.

plotMA(mainz)

Fitting a Linear Model

Instead of running 22,000+ t-tests, we can fit a linear model. The limma User's Guide has many more examples.

design <- model.matrix(~0 + mainz.fac)
colnames(design) <- c("CLEAR", "DM")
fit <- lmFit(exprs(mainz), design)
cont.matrix <- makeContrasts(DMvsCLEAR = DM - CLEAR, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit.b <- eBayes(fit2)

Plotting the Model

The plot of the fit object shows the size of the predicted effect on each gene.

plotMA(fit.b)

Volcano Plot

The volcanoplot shows the log odds of differential expression vs the log ratio of expression.

volcanoplot(fit.b)