Date

Review: Locality sensitive hashing for the edit distance

I'm reviewing some of the findings from the preprint:

Locality sensitive hashing for the edit distance Guillaume Marcais, Dan DeBlasio, Prashant Pandey, Carl Kingsford bioRxiv 534446; doi: https://doi.org/10.1101/534446

Problem

Determine if two sequences overlap.

Existing solutions

Locality sensitive hashing:

  • LSH Jaccard similarity
  • LSH Hamming distance

Proposal

Develop a LSH for edit distance.

Hamming distance

S1    GTACACTA
S2    TTATACTG
      --------
      x  x   x

Hd(S1, S2) = 3 / 8

Similarity

S = 1 - D

Jaccard similarity

S1 {GTA TAC ACA CAC ACT CTA}
S2 {TTA TAT ATA TAC ACT CTG}

{S1 n S2} = {ACT}
{S1 U S2} = {GTA TAC ACA CAC ACT CTA TTA TAT ATA TAC CTG}}

J(S1, S2) = |S1 n S2|/|S1 U S2| = 1/11

Edit distance

S1 GTACACTA
S2 TTAACTA

Ed(S1,S2) = 2/8

Locality Sensitive Hash

Gapped LSH

gapped LSH

LSH

LSH

Order MinHash

A concrete example of Figure 1, with m = GTA and k = 3.

GTACACGTATGTAC  {(GTA 0) (GTA 1) (GTA 2)}
---   --- ---
TCGTATGTAC {(GTA 0) (GTA 1)}
  --- ---