I went to the 2014 Tech Summit Hackathon, but knew I had to leave early, so I did a solo hack.

One of the newly released data sets included some historical GPS data from the Autoridad Metropolitana de Autobuses. I decided to try to look at the bus service on Route 18, the one closest to my home. It was bad in the 80’s when I went to college, and in the 10’s I’d heard complaints too.

So far, the data seems to indicate service has improved, although there are hints of problems. I’ve put some ipython code up at nbviewer, check it out for yourselves, you can see the notebook, or get the source code.

Introduction

I was looking for a project to learn how to use the GPIO on the raspberry pi, and started with an adafruit tutorial on checking mail and turning on LEDs. Since I get too much email, I decided to change the application to check for sucessful and failed ssh attempts on the rpi.

I learned some nifty things about processing files in python using generators, or streams.

ssh logs successful and failed attempts to /var/log/auth.log on the pi.

$ grep sshd /var/log/auth.log
May  5 08:42:54 raspberrypi sshd[21479]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=humbertos-macbook.local  user=pi
May  5 08:42:56 raspberrypi sshd[21479]: Failed password for pi from 10.0.1.4 port 65157 ssh2
May  5 08:43:09 raspberrypi sshd[21479]: Failed password for pi from 10.0.1.4 port 65157 ssh2
May  5 08:43:33 raspberrypi sshd[21479]: Failed password for pi from 10.0.1.4 port 65157 ssh2
May  5 08:43:33 raspberrypi sshd[21479]: Connection closed by 10.0.1.4 [preauth]
May  5 08:43:33 raspberrypi sshd[21479]: PAM 2 more authentication failures; logname= uid=0 euid=0 tty=ssh ruser= rhost=humbertos-macbook.local  user=pi
May  5 08:44:10 raspberrypi sshd[21523]: Address 10.0.1.4 maps to humbertos-macbook.local, but this does not map back to the address - POSSIBLE BREAK-IN ATTEMPT!
May  5 08:44:17 raspberrypi sshd[21523]: Accepted password for pi from 10.0.1.4 port 65174 ssh2
May  5 08:44:17 raspberrypi sshd[21523]: pam_unix(sshd:session): session opened for user pi by (uid=0)
May  5 08:54:29 raspberrypi sshd[21568]: Received disconnect from 10.0.1.4: 11: disconnected by user
May  5 08:54:29 raspberrypi sshd[21523]: pam_unix(sshd:session): session closed for user pi
May  5 08:54:34 raspberrypi sshd[23682]: Accepted publickey for pi from 10.0.1.4 port 65320 ssh2
May  5 08:54:34 raspberrypi sshd[23682]: pam_unix(sshd:session): session opened for user pi by (uid=0)
May  5 08:54:48 raspberrypi sshd[24935]: Received signal 15; terminating.
May  5 08:54:48 raspberrypi sshd[23682]: pam_unix(sshd:session): session closed for user pi
May  5 08:55:07 raspberrypi sshd[2215]: Server listening on 0.0.0.0 port 22.
May  5 08:55:12 raspberrypi sshd[2215]: Received signal 15; terminating.
May  5 08:55:12 raspberrypi sshd[2292]: Server listening on 0.0.0.0 port 22.
May  5 08:56:16 raspberrypi sshd[2300]: Accepted publickey for pi from 10.0.1.4 port 65336 ssh2
May  5 08:56:16 raspberrypi sshd[2300]: pam_unix(sshd:session): session opened for user pi by (uid=0)
May  5 18:37:28 raspberrypi sshd[2300]: pam_unix(sshd:session): session closed for user pi
May  6 13:11:00 raspberrypi sshd[5331]: Accepted publickey for pi from 10.0.1.4 port 57060 ssh2
May  6 13:11:00 raspberrypi sshd[5331]: pam_unix(sshd:session): session opened for user pi by (uid=0)

It looks like no matter the method of authentication, we can just check for sshd logs that say “Failed” or “Accepted”.

Methods

Following the tutorial on blinking leds, I constructed a circuit with a red LED on pin 23 on the pi, and a green LED on pin 18.

We can set up code to initialize the leds to off, and to blink a specified led.

import RPi.GPIO as GPIO
import time

GPIO.setmode(GPIO.BCM)
GREEN_LED = 18
RED_LED = 23
GPIO.setup(GREEN_LED, GPIO.OUT)
GPIO.setup(RED_LED, GPIO.OUT)
GPIO.output(RED_LED, False)
GPIO.output(GREEN_LED, False)

def blink_led(led):
    GPIO.output(led, True)
    time.sleep(1.0)
    GPIO.output(led, False)

We’re going to use python generators to produce a stream, an infinite list of sshd log entries from /var/log/auth.log. If these lines match “Failed” we blink the red LED, if they match “Accepted”, we blink the green LED.

def follow(thefile):
    thefile.seek(0,2) # Go to the end of the file
    while True:
	line = thefile.readline()
	if not line:
	    time.sleep(0.1) # Sleep briefly
	    continue
	yield line

if __name__ == "__main__":
    log = open("/var/log/auth.log")
    lines = follow(log)
    lines = (line for line in lines if "sshd" in line)

    for line in lines:
	if "Failed" in line:
	    blink_led(RED_LED)

	if "Accepted" in line:
	    blink_led(GREEN_LED)

GPIO.cleanup()

Results

This program loops forever, waiting for ssh login attempts and blinking the appropriate LED.

Discussion

I like the program, and python generators are a lot cooler than I thought.

It’s been a long time since I go out and row, but I got in a good workout yesterday. I rowed 9 miles on the paseo lineal, from Santa Rosa to the beach and back. Decent time too, 1:23.

I’m a little sore today, but it was worth it. I hope I can get a little more exercise this summer.

Introduction

In a recent post, I showed how to build a sequence assembly using the Eulerian path algorithm. I noted that starting the Eulerian path algorithm on different graph vertices yielded paths of different lengths.

I’m going to try to see why, and find the best assembly I can.

Data structures

Last time, we just used a list and a dictionary to represnt the graph. Let’s try to build a python class.

from collections import deque

class KmerGraph:
    kmers = []
    neighbors = {}
    collisions = 0

    def readstograph(self, reads, k):
      for read in reads:
	  read = "".join(read)
	  for i in range(len(read) - k - 1):
	    kmer_a = read[i:i+k]
	    kmer_b = read[i+1:i+k+1]
	    if kmer_a not in self.kmers:
	      self.kmers.append(kmer_a)
	    else:
	      self.collisions += 1
	    if kmer_b not in self.kmers:
	      self.kmers.append(kmer_b)
	    else:
	      self.collisions +- 1
	    if not self.neighbors.has_key(kmer_a):
	      self.neighbors[kmer_a] = deque([kmer_b])
	    elif kmer_b not in self.neighbors[kmer_a]:
	      self.neighbors[kmer_a].append(kmer_b)
	    if kmer_b not in self.neighbors.keys():
	      self.neighbors[kmer_b] = deque([])
    def findstart(self, path):
      """Traverse the current path, looking for a node with available edges"""
      start = None
      for i in range(len(path)):
	if len(self.neighbors[self.kmers[path[i]]]) > 0:
	  start = path[i]
	  break
      return start

    def extendpath(self, path):
      start = self.findstart(path)
      splicepoint = start
      if start != None:
	newpath = []
	while len(self.neighbors[self.kmers[start]]) > 0:
	  next_kmer = self.neighbors[self.kmers[start]].popleft()
	  start = self.kmers.index(next_kmer)
	  newpath.append(start)
	path[splicepoint:splicepoint+1] = newpath
      return start

Make a random sequnce

import random
random.seed(0)
myseq = [random.choice(['a', 'c', 'g', 't']) for i in range(1000)]
myseq[:10]

Simulating reads

To simulate reads, pick random starting points, and slice out fixed length strings.

readlen = 25
starts = [random.randint(0, len(myseq)-readlen) for i in range(500)]
reads = [myseq[start:start+readlen] for start in  starts]
reads[0]
t g g c a g c t g a t a a c a a c g c c c c c t g

Build a graph

kg = KmerGraph()
kg.readstograph(reads,16)
kg.collisions
len(kg.kmers)

Assembly

Assembly now consists of constructing an Eulerian trail in the k-mer graph, and reading off the sequence. To find an Eulerian trail I’m going to delete visited edges from the graph until there are no more edges.

def findstart(self, path):
  """Traverse the current path, looking for a node with available edges"""
  start = None
  for i in range(len(path)):
    if len(self.neighbors[self.kmers[path[i]]]) > 0:
      start = path[i]
      break
  return start

def extendpath(self, path):
  start = self.findstart(path)
  splicepoint = start
  if start != None:
    newpath = []
    while len(self.neighbors[self.kmers[start]]) > 0:
      next_kmer = self.neighbors[self.kmers[start]].popleft()
      start = self.kmers.index(next_kmer)
      newpath.append(start)
    path[splicepoint:splicepoint+1] = newpath
  return start

Find the longest path

for k in range(9,19):
  kg = KmerGraph()
  kg.readstograph(reads, k)
  nodes = len(kg.kmers)
  maxi = 0
  maxl = -1
  for i in range(nodes):
    path = [i]
    next = kg.extendpath(path)
    while (next != None):
      next = kg.extendpath(path)
    if len(path) > maxl:
      maxi = i
      maxl = len(path)
    kg = KmerGraph()
    kg.readstograph(reads, k)

  print k, maxi, maxl
9 1930 739
10 2911 738
11 3888 737
12 4860 736
13 5357 517
14 6304 516
15 7245 273
16 920 241
17 8334 207
18 9243 146

I went for a quick row on the Parque paseo lineal del Rio Bayamón. I hit a pretty good pace, it was overcast and cool. It sprinkled a little when I started and when I finished, but the weather cooperated. My shoulders are sore, I hadn’t gone that fast in a long time.

Abstract

Org-mode is a text mode in the emacs editor. Recent versions of Org-mode include support for embedding code, data, and figures in a document. These features can be used to aid publication of research results. This document will show some simple examples of the available features.

Introduction

Emacs Org-mode is a set of tools for managing information in plain text files. Recent versions include Babel, a set of features for embedding active source code in a document. Babel draws inspiration from literate programming tools like noweb, but extends beyond this by providing support for reproducible research, like Sweave.

Org-mode’s reproducible research support works with multiple languages (even in the same document), so you can use a shell script to generate data, analize it in R, and plot it with python. Org-mode also supports export to many different formats, including HTML, LaTeX, and OpenOffice, so you can publish your research in the most convenient format.

This document will show some of the features of Org-mode by working through a simple sequence assembly problem.

Sequence Assembly

Sequence assembly is the task of reconstructing a complete sequence from a series of fragments or “reads”. I’ll use python to develop the example. You’ll need to set up python in Org-mode as described in the documentation for ob-doc-python. Let’s start by generating a random sequence to analyze.

Making a random sequence

from collections import deque
import random
random.seed(0)
myseq = [random.choice(['a', 'c', 'g', 't']) for i in range(1000)]
print myseq[:10]
['t', 't', 'c', 'c', 'g', 'c', 't', 'c', 'c', 'g']

Simulating reads

To simulate reads, pick random starting points, and slice out fixed length strings.

readlen = 25
starts = [random.randint(0, len(myseq)-readlen) for i in range(500)]
reads = [myseq[start:start+readlen] for start in  starts]
print reads[0]
>>> >>> ['t', 'g', 'g', 'c', 'a', 'g', 'c', 't', 'g', 'a', 't', 'a', 'a', 'c', 'a', 'a', 'c', 'g', 'c', 'c', 'c', 'c', 'c', 't', 'g']

Checking coverage

OK, so let’s have a look at our coverage.

results = "coverage.png"
coverage = [0] * len(myseq)
for start in starts:
  for i in range(start, start+readlen):
    coverage[i] += 1

import matplotlib, numpy
matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig=plt.figure(figsize=(4,2))
plt.plot(coverage)
fig.tight_layout()
plt.savefig(results)
results

Building a de Bruijn graph

If we traverse each read, we can construct a de Bruijn graph, vertices are fixed sized words of length \(k\), and two kmers \(a\) and \(b\) are connected by an edge if there is a read that contains \(a\) at position \(i\) and \(b\) at position \(i+1\).

k = 16
kmers = []
kmergraph = {}

for read in reads:
  read = "".join(read)
  for i in range(len(read) - k - 1):
    kmer_a = read[i:i+k]
    kmer_b = read[i+1:i+k+1]
    if kmer_a not in kmers:
       kmers.append(kmer_a)
    if kmer_b not in kmers:
       kmers.append(kmer_b)
    if kmer_a not in kmergraph.keys():
       kmergraph[kmer_a] = deque([kmer_b])
    elif kmer_b not in kmergraph[kmer_a]:
       kmergraph[kmer_a].append(kmer_b)
    if kmer_b not in kmergraph.keys():
       kmergraph[kmer_b] = deque([])

#print kmers[:10]

Assembly

Assembly now consists of constructing an Eulerian trail in the kmergraph, and reading off the sequence. To find an Eulerian trail I’m going to delete visited edges from the graph until there are no more edges.

def findstart(kmers, kmergraph, path):
  start = None
  for i in range(len(path)):
    if len(kmergraph[kmers[path[i]]]) > 0:
      start = path[i]
      break
  return start

def extendpath(kmers, kmergraph, path):
  start = findstart(kmers, kmergraph, path)
  splicepoint = start
  if start != None:
    newpath = []
    while len(kmergraph[kmers[start]]) > 0:
      next_kmer = kmergraph[kmers[start]].popleft()
      start = kmers.index(next_kmer)
      newpath.append(start)
    path[splicepoint:splicepoint+1] = newpath
  return start

If we start at vertex 0, we can extend the path until we can find no vertices that still have edges. We could start anywhere, and different start points may find different contigs. I checked the first couple hundred vertices, and starting at position 330 produced the longest assembly.

for i in range(330,331):
  start = i
  print start, 
  path = [start]
  next = extendpath(kmers, kmergraph, path)
  while (next != None):
    next = extendpath(kmers, kmergraph, path)
  print len(path)
  k = 16
  kmers = []
  kmergraph = {}

  for read in reads:
    read = "".join(read)
    for i in range(len(read) - k - 1):
      kmer_a = read[i:i+k]
      kmer_b = read[i+1:i+k+1]
      if kmer_a not in kmers:
	 kmers.append(kmer_a)
      if kmer_b not in kmers:
	 kmers.append(kmer_b)
      if kmer_a not in kmergraph.keys():
	 kmergraph[kmer_a] = deque([kmer_b])
      elif kmer_b not in kmergraph[kmer_a]:
	 kmergraph[kmer_a].append(kmer_b)
      if kmer_b not in kmergraph.keys():
	 kmergraph[kmer_b] = deque([])

  #print kmers[:10]

We can follow the trail, pulling the last character from each vertex to reconstruct the sequence.

assembly = kmers[path[0]]
for i in range(1, len(path)):
  assembly += kmers[path[i]][-1]

len(assembly)
253

We successfully recovered 253 bases in the 1000 base sequence.

def wrap(seq):
  start = 0
  for i in range(60, len(seq), 60):
    print seq[start: i]
    start = i
  print seq[start:]
print ">myseq"
wrap("".join(myseq))
print
print ">assembly"
wrap(assembly)
print
>myseq
ttccgctccgtgctgcttttcgtgcacgttctctgagctgactactagattcacgtaggg
tgtggcgcgcaaggcattttttgcgctttgtgcgtttagcgtagaatctaagagtggagg
gcctaataaattacacagcagaatacgttagtagtccgcaccggcctcgagcacatccct
gggccgaacagctgccgcgaccagcgttccctttatgagtcgcagatgaagtctatcacc
ttaggctcaaggtttaggggtgcgagaactgcgaatccgccaaagacccatcttccgccg
ttgctgaagaacgccggcctctccatgtcgacaataaacgattacgtctcccgagacttt
aacttggtaataaaatcaagtgtgggtttgtaggtctcgttgaagcattcctagactaac
ctggatcgcacaaggagcctctgcgcaggtattttgtgtatctctaatcttggaatttgc
cacgttcatgacgattgaacacattataagtaacagtaactatcccttgatagatttcat
ccaaaaacggacgacctaccgcaccttcggtcctcactacgcaaggactggagcgtatct
atacggcaccttgatccgataccggggcgcttccgagcgagccccgcgacgagactagat
gggaagttggatcaccactccggatctctcaatgacaaaagaggggctccggttttgcga
caaagatgtctccattcccaacagaaggctccgatacgattcaaatctcacttaacattc
cgcgaccggcaatagacggggtagagcgggataggagcatgcgaactgatttggagttcg
gtatgccaaggctcctcggttcaagtccggcgccggaagaaagaatctaccactgcccgg
gcatccaatggacttaatatgaatggcagctgataacaacgccccctgttgcgcgccaga
ggtcgagcttagcttgggcttcgatgtcgctgtaaaattc

>assembly
tttgcgacaaagatgtctccattcccaacagaaggctccgatacgattcaaatctcactt
aacattccgcgaccggcaatagacggggtagagcgggataggagcatgcgaactgatttg
gagttcggtatgccaaggctcctcggttcaagtccggcgccggaagaaagaatctaccac
tgcccgggcatccaatggacttaatatgaatggcagctgataacaacgccccctgttgcg
cgccagaggtcga

Abstract

The Seattle distributed system is a framework for running python programs across many computers on the Internet.

Introduction

I’ve been interested in distributed scripting since I was a graduate student in The University of Texas at San Antonio. I was pleasantly surprised to see that there is a new python implementation of distributed scripting.

Seattle is geared towards creation of distributed network programs, but uses a restricted python interpreter, you can write arbitrary programs as well.

Implementation

I haven’t seen the source of entire system, but there are three main components. Users run a seattle server and donate compute resources to the Seattle community. The Seattle clearinghouse arbitrates access to the donated resources, and manages reservations on one or more nodes (called “vessels” in Seattle). The devkit (available for download on your clearinghouse profile page) provides tools to develop Seattle programs and run on the vessels reserved for you.

Users are identified by public/private keypairs, and have access to a restricted subset of python called Repy. Several example programs are included in the devkit, and more are available on the Seattle website.

Repy limits the features available to programmers, but you can develop a program in another language and call repy to perform distributed computing, or use Repy or SeattleLib features to implement your program. Developer specific information is available online.