Bioinformatics in Butterfly Madness

Humberto Ortiz-Zuazaga humberto.ortiz@upr.edu
Department of Computer Science
Rio Piedras Campus
University of Puerto Rico

Who am I?

  • Associate Professor - Department of Computer Science
  • Director - UPR High Performance Computing facility

What do I do?

  • Develop infrastructure
    • UPR High Performance Computing facility
      • 2500+ core linux cluster with 280 TB disk space
      • Internet2 for the UPR system
    • PR-NETS
      • 10 Gbps Internet backbone for Rio Piedras
      • Connects Facundo Bueso, Julio Garcia Diaz, CN Fase I and Fase II

What do I do?

What do I do?

  • Oh, yeah. Research too.
    • Gene expression
      • Velda J Gonzalez, Leorey N Saligan, Brooke L Fridley, Humberto Ortiz-Zuazaga, Lauren S Aaronson, "Gene Expression, and Fatigue in Puerto Rican Men during Radiotherapy for Prostate Cancer: an Exploratory Study", Puerto Rico health sciences journal, Vol 36, No 4 (2017). http://prhsj.rcm.upr.edu/index.php/prhsj/article/view/1534
    • Bioinformatics software
      • Van Belleghem, Steven; Papa, Riccardo; Ortiz-Zuazaga, Humberto; Hendrickx, Frederik; Jiggins, Chris; McMillan, W.; Counterman, Brian. patternize: An R package for quantifying color pattern variation. Methods in Ecology and Evolution. 3 August 2017. doi:10.1111/2041-210X.12853
      • Crusoe MR, Alameldin HF, Awad S et al. The khmer software package: enabling efficient nucleotide sequence analysis [version 1; referees: 2 approved, 1 approved with reservations] F1000Research 2015, 4:900 doi: 10.12688/f1000research.6924.1
    • Cybersecurity
      • J Ortiz-Ubarri, H Ortiz-Zuazaga, A Maldonado, E Santos, J Grullón. Toa: A Web Based Network Flow Data Monitoring System at Scale. 2015 IEEE International Congress on Big Data, 438-443. https://doi.org/10.1109/BigDataCongress.2015.71
    • More stuff

No, really, what do you do?

  • Model DNA as sequences of letters
In [1]:
import random
random.seed(0)
myseq = "".join([random.choice(['a', 'c', 'g', 't']) for i in range(2500)])
print (myseq[:60], "...")
ttagttgtgccgcagcgaagtagtgcttgaaatatgcgacccctaagtaggagcgtatgc ...
In [2]:
readlen = 36
starts = [random.randint(0, len(myseq)-readlen) for i in range(10000)]
reads = [myseq[start:start+readlen] for start in  starts]
print (reads[:10])
['actcttgactctatatgctatagtttgccgcgctaa', 'aatccccccactgggatgcggctgctgttgttacca', 'tggggctcactgtatcagccgtaccgctatctacct', 'tctctactgcaaacctcatggtccggtttgtaccac', 'gacctcgctaatattctgctgtagctcgttatattc', 'gacgcgtaaccaaaacatagaaaccatcaatagaca', 'tagtttaatgaaaggatacgaatgccccccgattac', 'ggcgtgcagcgccgaacgggggtttcacatcgatgc', 'gttgaccttgagctcggtacagcgtcggcgagacga', 'accagatgcctgtaaatcgtttcaacgggatggttt']
In [3]:
def makebad(read):
    bad = ""
    for base in read:
        if 3 > random.randint(0, 100):
            # random substitution in 3% of bases
            bad = bad + random.choice(['A', 'C', 'G', 'T'])
        else:
            bad = bad + base
    return bad
In [4]:
badreads = [makebad(read) for read in reads]
In [5]:
badreads[:10]
Out[5]:
['actcttgactctatatgctatCgtttgccgcgctaa',
 'aatccccccactgggatgcggctgctgttgttacca',
 'tggggctcactgtatcagccgtaccgctatcTacct',
 'tctctactgcaaacctcatggtccggtttgtaccac',
 'gacctcgctaatattctgctgtagctcgttatattc',
 'gacgcgtaaccaaaacatagaaaccatcaatagaca',
 'tagtttaatgaaaggatacgaatgccccccgCttTc',
 'ggcgtgcagcgccgaacgggggtttcacatcgatgc',
 'gttgaTcttgagctcggtacagcgtcggcgagacga',
 'accagatgcctgtaaatcgtttcaacgggatggttt']

Modeling DNA as graphs

In [6]:
import networkx as nx

from collections import defaultdict

def build_graph(reads):
    DG=nx.DiGraph()

    k = 24

    kmers = defaultdict(int)

    for read in reads:
      for i in range(len(read) - k - 1):
        kmer_a = read[i:i+k]
        kmer_b = read[i+1:i+k+1]
        kmers[kmer_a] += 1
        kmers[kmer_b] += 1
        DG.add_edge(kmer_a, kmer_b)
    return DG
In [7]:
goodgraph = build_graph(reads)
nx.nx_agraph.write_dot(goodgraph, "goodreads.dot")
In [8]:
badgraph = build_graph(badreads)
nx.nx_agraph.write_dot(badgraph, "badreads.dot")

"Good" reads

Graph of kmers without errors.

"Bad" reads

Graph of kmers from reads with 3% per base error

Pruning low abundance k-mers

In [9]:
DG = build_graph(badreads)
In [10]:
def prune(G, reads, k):
    min_count = 2

    kmers = defaultdict(int)

    for read in reads:
      for i in range(len(read) - k - 1):
        kmer_a = read[i:i+k]
        kmer_b = read[i+1:i+k+1]
        kmers[kmer_a] += 1
        kmers[kmer_b] += 1
        
    for kmer in G.nodes():
        if kmers[kmer] < min_count:
            G.remove_node(kmer)

    return G
In [11]:
better = prune(DG, badreads, 24)
In [12]:
nx.nx_agraph.write_dot(better, "better-reads.dot")

"Better" reads

Graph of kmers with at least one ocurrence in the graph

Real assembler

Trinity can assemble our bad reads with no trouble:

In [13]:
f = open("badreads.fasta", "w")

readnum = 0
for read in badreads:
    f.write(">read" + str(readnum) + "\n")
    f.write(read)
    f.write("\n")
    readnum += 1
f.close()

Trinity output

>TRINITY_DN0_c0_g1_i1 len=2500 path=[1:0-2499] [-1, 1, -2]
TAGTTACACTCTTGGGGGTGAGTGGACCATTTGATTCTACTGAAGGTGCAGGCAGGCTAG
TAAACGGCTCCTGCGCATGAGGTCATTCCACCTAAGGTGGAGTGCCTACAAGGTGTAGGG
TAATGGGTGATATTAGAGAACTGGAACTACACTGGCAGAGTTAGTACGCCCACTGCGACA
GGCTAGGATGGTGCTCGACTGGAGTGCGCACGGGGGGCCGGATCTTTGTCTTTTGCAACA
CGCTTTACAACCTAGGGTTCACGCACGATGCGGAATGCGGCGCGACGCGGCACGAAGACC
AGATGACAGATGCAGTAAACCGCTCTTTGACTATGTCCATGCACAACCCTTATGGTGCAG
GTAAAAACTTAACACGCTTCGCTCTGGGCTTTTAGCCCATGGGGATAACGATTAGTGATG
CTAAATGTTTTCTCCATCAGTTGAGAAACTGATCCAACCGCTGCCATGTATTGAGAGGCG
AGCCTGTTATCCAAGCGTTTCCGGGTCTGTTACGTACACCAGGCGAGGCAAACCCAGACA
CAGATGCCCCCGTCGAGATTTGGGTCACACAAGTTTTGTAAGAAGATGATCGGGGGAGGT
CGGCGGGGCGTAAACCAGTACTAGGTCTGTAAGTGCATAACCGCAATCGGTCCCTCTTTC
AGCGCATTAATACGCGTCTTTCCCTTCGCAGGCCGCAGAGGGTCGTAGTTCTATCGCCTA
GTCTGCCCTACCTGATTGGATTACAATACGTTCGCAAGCATCAGCTGTTCCAGTATTACA
GTGAATGATTCGAAGTGGTACAAACCGGACCATGAGGTTTGCAGTAGAGAGCGCTAGGAA
GAAGCTCGTTCGTGACCAGGAACGAGAATATAACGAGCTACAGCAGAATATTAGCGAGGT
CTTCTGACCCTCTAGATAAAGGGTATGATGTGTATTGTCGTGCGCTATTAGACATTCGAC
AGAAAATGACGGCGATCATCACCACCGACATGACCAGCGGACGACTTCCATGCCAGGTTC
CATAATTGGTGTCGCTGACTAAATTAGCGCGGCAAACTATAGCATATAGAGTCAAGAGTG
TGGACCATCACTTTTAGCTGTTGGTAACAACAGCAGCCGCATCCCAGTGGGGGGATTAAA
TGCTGGCGACATACTACCGCGTACGCCTCGAACGGAATAAATAGTTACTCGATCCCTAAG
AGAAGAATCAGTAGCGGCAATAGGTTTTCTTGGATGTAGCAAGGCGGCCGGTAATCGGGG
GGCATTCGTATCCTTTCATTAAACTAGTCTTTCTAGCATGAACCCATATCGTACTATGCT
...