Sequence assembly with read errors¶
Humberto Ortiz-Zuazaga humberto.ortiz@upr.edu
Introduction¶
Example sequence assembly using the NetworkX python package, a pure python package for manipulating graphs.
I hold to the saying that you don't really understand something untill you can explain it to {insert non-specialist here}. A general-purpose computer is the ultimate non-specialist: It can't even think. I'm trying to explain Eulerian sequence assembly to my computer, just to make sure I really understand it. I started a couple of simple python programs (example 1, example 2) where I built everything myself.
This example will do much the same, but using the NetworkX python package, a pure python package for manipulating graphs. I think the example is clearer and simpler to understand because we can concentrate on the hard parts, instead of the easy parts.
import networkx as nx
DG=nx.DiGraph()
Let's just generate a random sequence of length 1000, and convert it into a string.
import random
random.seed(0)
myseq = "".join([random.choice(['a', 'c', 'g', 't']) for i in range(1000)])
print myseq[:60], "..."
ttccgctccgtgctgcttttcgtgcacgttctctgagctgactactagattcacgtaggg ...
Now we simulate a set of 500 reads of the given sequence. Just pick a starting point at random, and read off a fixed size chunk of sequence.
readlen = 25
starts = [random.randint(0, len(myseq)-readlen) for i in range(500)]
reads = [myseq[start:start+readlen] for start in starts]
print reads[0]
tggcagctgataacaacgccccctg
Generate some errors in the reads. I'll insert errors in the last 15 (3%) reads.
def makebad(read):
point = random.randint(0,len(read)-1)
bad = read[:point] + "x" + read[point+1:]
return bad
badreads = 15
for i in range(badreads):
reads[-i] = makebad(reads[-i])
reads[-20:]
['ttaacttggtaataaaatcaagtgt', 'aaatcaagtgtgggtttgtaggtct', 'aacagaaggctccgatacgattcaa', 'tgaagtctatcaccttaggctcaag', 'ttaatatgaatggcagctgataaca', 'gatgtctccattcccaacagaaggc', 'ttggaatttxccacgttcatgacga', 'ttcaagtccxgcgccggaagaaaga', 'ccxagactttaacttggtaataaaa', 'tcaagtgtgggtxtgtaggtctcgt', 'taaattacacagcagaatacgttxg', 'ttcaxatctcacttaacattccgcg', 'aggtctxgttgaagcattcctagac', 'tgacgattxaacacattataagtaa', 'atctctaatctxggaatttgccacg', 'gccccgcxacgagactagatgggaa', 'tggtaataaaaxcaagtgtgggttt', 'tgcccgggcatccaatggactxaat', 'tggatcgcacaaggagcctctgcgx', 'ccgagactttaacttggtaxtaaaa']
We should check if we've sampled the sequence enough to be able to reconstruct it. Let's check the coverage (how many reads overlap) for each base.
coverage = [0] * len(myseq)
for start in starts:
for i in range(start, start+readlen):
coverage[i] += 1
%matplotlib inline
import matplotlib, numpy
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(4,2))
plt.plot(coverage)
fig.tight_layout()
That doesn't look too bad, although the coverage around 200 looks a bit low. (PS, I love ipython, it embeds that plot right into the output).
Setting up the graph¶
k = 9
for read in reads:
for i in range(len(read) - k - 1):
kmer_a = read[i:i+k]
kmer_b = read[i+1:i+k+1]
DG.add_edge(kmer_a, kmer_b)
Assembling the sequence¶
So here's where stuff gets interesting. I'm basically sequencing single stranded DNA with no errors, and have decent coverage. Unfortunately, the k-mer graph breaks into three pieces. In sequencing parlance, these are three different contigs, as the sequence assembler can't join them. Raising the coverage (seqencing more reads) should fix it. (I checked, and if you generate 600 reads instead of 500 you get a single component).
components = list(nx.connected_components(DG.to_undirected()))
len(components)
3
map(len, components)
#components.close()
[833, 216, 33]
Since each read is basically unique, I can find the location of each read in the original sequence, and mark what component it came from. Component 0 is the longest piece, and runs from 200 to 900.
placement = [0] * len(myseq)
for i in range(len(components)):
for read in components[i]:
try:
place = myseq.index(read)
except ValueError:
continue
placement[place] = i
placefig=plt.figure(figsize=(4,2))
plt.plot(placement)
placefig.tight_layout()
We can take component 0, and choose the subgraph it induces on the directed k-mer graph.
best = nx.subgraph(DG, components[0])
The Eulerian path function is a hack. I just made the fewest modifications possible to eulerian_circuit. I don't think we correctly check the end of the path (should end up with indegree 0 and outdegree 0).
def eulerian_path(G, source=None):
"""Find an Eulerian path in a weakly connected component, stolen from eulerian_circuit"""
if not nx.is_weakly_connected(G):
raise nx.NetworkXError("G is not connected.")
g = G.__class__(G) # copy graph structure (not attributes)
# set starting node
if source is None:
for v in g.nodes():
indeg = g.in_degree(v)
outdeg = g.out_degree(v)
if outdeg == (indeg + 1):
break
else:
raise nx.NetworkXError("G doesn't have an Eulerian circuit.")
else:
v = source
while g.out_degree(v) > 0:
n = v
bridge = None
# sort nbrs here to provide stable ordering of alternate cycles
nbrs = sorted([v for u,v in g.edges(n)])
for v in nbrs:
g.remove_edge(n,v)
bridge = not nx.is_connected(g.to_undirected())
if bridge:
g.add_edge(n,v) # add this edge back and try another
else:
break # this edge is good, break the for loop
if bridge:
g.remove_edge(n,v)
g.remove_node(n)
yield (n,v)
path = eulerian_path(best)
nodepath = [u for u,v in path]
seq = nodepath[0] + "".join([node[-1] for node in nodepath[1:]])
seq in myseq
False
len(seq)
455
def wrap(seq):
start = 0
for i in range(60, len(seq), 60):
print seq[start: i]
start = i
print seq[start:]
wrap(seq)
ctctaatcttggaatttgccacgttcatgacgattgaacacattataagtaacagtaact atcccttgatagatttcatccaaaaacggacgacctaccgcaccttcggtcctcactacg caaggactggagcgtatctatacggcaccttgatccgataccggggcgcttccgagcgag ccccgcgacgagactagatgggaagttggatcaccactccggatctctcaatgacaaaag aggggctccggttttgcgacaaagatgtctccattcccaacagaaggctccgatacgatt caaatctcacttaacattccgcgaccggcaatagacggggtagagcgggataggagcatg cgaactgatttggagttcggtatgccaaggctcctcggttcaagtccxgcgccggaagaa agaatctaccactgcccgggcatccaatggactxa
That component was assembled with errors (the "x" are a giveaway. I'll export the graph into graphviz format and import it into gephi for visualization.
i = 0
for contig in components:
cg = nx.subgraph(DG, contig)
nx.write_dot(cg, "contig-" + str(i) + "-with-errors.dot")
i += 1
nx.write_dot(DG, "full-graph.dot")
The first contig, with bad kmers in red.
from IPython.display import SVG
SVG(filename='contig-0.svg')