Abstract

Org-mode is a text mode in the emacs editor. Recent versions of Org-mode include support for embedding code, data, and figures in a document. These features can be used to aid publication of research results. This document will show some simple examples of the available features.

Introduction

Emacs Org-mode is a set of tools for managing information in plain text files. Recent versions include Babel, a set of features for embedding active source code in a document. Babel draws inspiration from literate programming tools like noweb, but extends beyond this by providing support for reproducible research, like Sweave.

Org-mode’s reproducible research support works with multiple languages (even in the same document), so you can use a shell script to generate data, analize it in R, and plot it with python. Org-mode also supports export to many different formats, including HTML, LaTeX, and OpenOffice, so you can publish your research in the most convenient format.

This document will show some of the features of Org-mode by working through a simple sequence assembly problem.

Sequence Assembly

Sequence assembly is the task of reconstructing a complete sequence from a series of fragments or “reads”. I’ll use python to develop the example. You’ll need to set up python in Org-mode as described in the documentation for ob-doc-python. Let’s start by generating a random sequence to analyze.

Making a random sequence

from collections import deque
import random
random.seed(0)
myseq = [random.choice(['a', 'c', 'g', 't']) for i in range(1000)]
print myseq[:10]
['t', 't', 'c', 'c', 'g', 'c', 't', 'c', 'c', 'g']

Simulating reads

To simulate reads, pick random starting points, and slice out fixed length strings.

readlen = 25
starts = [random.randint(0, len(myseq)-readlen) for i in range(500)]
reads = [myseq[start:start+readlen] for start in  starts]
print reads[0]
>>> >>> ['t', 'g', 'g', 'c', 'a', 'g', 'c', 't', 'g', 'a', 't', 'a', 'a', 'c', 'a', 'a', 'c', 'g', 'c', 'c', 'c', 'c', 'c', 't', 'g']

Checking coverage

OK, so let’s have a look at our coverage.

results = "coverage.png"
coverage = [0] * len(myseq)
for start in starts:
  for i in range(start, start+readlen):
    coverage[i] += 1

import matplotlib, numpy
matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(4,2))
plt.plot(coverage)
fig.tight_layout()
plt.savefig(results)
results

Building a de Bruijn graph

If we traverse each read, we can construct a de Bruijn graph, vertices are fixed sized words of length \(k\), and two kmers \(a\) and \(b\) are connected by an edge if there is a read that contains \(a\) at position \(i\) and \(b\) at position \(i+1\).

k = 16
kmers = []
kmergraph = {}

for read in reads:
  read = "".join(read)
  for i in range(len(read) - k - 1):
    kmer_a = read[i:i+k]
    kmer_b = read[i+1:i+k+1]
    if kmer_a not in kmers:
       kmers.append(kmer_a)
    if kmer_b not in kmers:
       kmers.append(kmer_b)
    if kmer_a not in kmergraph.keys():
       kmergraph[kmer_a] = deque([kmer_b])
    elif kmer_b not in kmergraph[kmer_a]:
       kmergraph[kmer_a].append(kmer_b)
    if kmer_b not in kmergraph.keys():
       kmergraph[kmer_b] = deque([])

#print kmers[:10]

Assembly

Assembly now consists of constructing an Eulerian trail in the kmergraph, and reading off the sequence. To find an Eulerian trail I’m going to delete visited edges from the graph until there are no more edges.

def findstart(kmers, kmergraph, path):
  start = None
  for i in range(len(path)):
    if len(kmergraph[kmers[path[i]]]) > 0:
      start = path[i]
      break
  return start

def extendpath(kmers, kmergraph, path):
  start = findstart(kmers, kmergraph, path)
  splicepoint = start
  if start != None:
    newpath = []
    while len(kmergraph[kmers[start]]) > 0:
      next_kmer = kmergraph[kmers[start]].popleft()
      start = kmers.index(next_kmer)
      newpath.append(start)
    path[splicepoint:splicepoint+1] = newpath
  return start

If we start at vertex 0, we can extend the path until we can find no vertices that still have edges. We could start anywhere, and different start points may find different contigs. I checked the first couple hundred vertices, and starting at position 330 produced the longest assembly.

for i in range(330,331):
  start = i
  print start, 
  path = [start]
  next = extendpath(kmers, kmergraph, path)
  while (next != None):
    next = extendpath(kmers, kmergraph, path)
  print len(path)
  k = 16
  kmers = []
  kmergraph = {}

  for read in reads:
    read = "".join(read)
    for i in range(len(read) - k - 1):
      kmer_a = read[i:i+k]
      kmer_b = read[i+1:i+k+1]
      if kmer_a not in kmers:
	 kmers.append(kmer_a)
      if kmer_b not in kmers:
	 kmers.append(kmer_b)
      if kmer_a not in kmergraph.keys():
	 kmergraph[kmer_a] = deque([kmer_b])
      elif kmer_b not in kmergraph[kmer_a]:
	 kmergraph[kmer_a].append(kmer_b)
      if kmer_b not in kmergraph.keys():
	 kmergraph[kmer_b] = deque([])

  #print kmers[:10]

We can follow the trail, pulling the last character from each vertex to reconstruct the sequence.

assembly = kmers[path[0]]
for i in range(1, len(path)):
  assembly += kmers[path[i]][-1]

len(assembly)
253

We successfully recovered 253 bases in the 1000 base sequence.

def wrap(seq):
  start = 0
  for i in range(60, len(seq), 60):
    print seq[start: i]
    start = i
  print seq[start:]
print ">myseq"
wrap("".join(myseq))
print
print ">assembly"
wrap(assembly)
print
>myseq
ttccgctccgtgctgcttttcgtgcacgttctctgagctgactactagattcacgtaggg
tgtggcgcgcaaggcattttttgcgctttgtgcgtttagcgtagaatctaagagtggagg
gcctaataaattacacagcagaatacgttagtagtccgcaccggcctcgagcacatccct
gggccgaacagctgccgcgaccagcgttccctttatgagtcgcagatgaagtctatcacc
ttaggctcaaggtttaggggtgcgagaactgcgaatccgccaaagacccatcttccgccg
ttgctgaagaacgccggcctctccatgtcgacaataaacgattacgtctcccgagacttt
aacttggtaataaaatcaagtgtgggtttgtaggtctcgttgaagcattcctagactaac
ctggatcgcacaaggagcctctgcgcaggtattttgtgtatctctaatcttggaatttgc
cacgttcatgacgattgaacacattataagtaacagtaactatcccttgatagatttcat
ccaaaaacggacgacctaccgcaccttcggtcctcactacgcaaggactggagcgtatct
atacggcaccttgatccgataccggggcgcttccgagcgagccccgcgacgagactagat
gggaagttggatcaccactccggatctctcaatgacaaaagaggggctccggttttgcga
caaagatgtctccattcccaacagaaggctccgatacgattcaaatctcacttaacattc
cgcgaccggcaatagacggggtagagcgggataggagcatgcgaactgatttggagttcg
gtatgccaaggctcctcggttcaagtccggcgccggaagaaagaatctaccactgcccgg
gcatccaatggacttaatatgaatggcagctgataacaacgccccctgttgcgcgccaga
ggtcgagcttagcttgggcttcgatgtcgctgtaaaattc

>assembly
tttgcgacaaagatgtctccattcccaacagaaggctccgatacgattcaaatctcactt
aacattccgcgaccggcaatagacggggtagagcgggataggagcatgcgaactgatttg
gagttcggtatgccaaggctcctcggttcaagtccggcgccggaagaaagaatctaccac
tgcccgggcatccaatggacttaatatgaatggcagctgataacaacgccccctgttgcg
cgccagaggtcga