Comparing DNA sequences as strings

In the beginning was the string

One common way to represent a nucleic acid sequence such as DNA is simply as a succession of letters, like ‘TACGCTGTTATCCCTAAAG’, where each letter stands for a nucleobase. So, in essence, all we are talking about here is comparing two strings. Of course, there are other methods to verify sequence fingerprints and data traceability, but here I am just focusing on the sequence letters themselves (a use case is a sequence similarity search that identifies which sequence entries are identical or similar to each other).

Image source: https://en.wikipedia.org/wiki/DNA
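
To make "identical or similar" concrete, here is a minimal Python sketch of a position-by-position comparison. The function name percent_identity and the mutated second sequence are made up for illustration, and the sketch assumes both sequences have the same length; real similarity searches rely on alignment tools, which this does not attempt.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (no alignment is done)")
    if not a:
        raise ValueError("sequences must not be empty")
    # Compare case-insensitively, one position at a time.
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return 100.0 * matches / len(a)


reference = "TACGCTGTTATCCCTAAAG"   # the example string from the text
mutated = "TACGCTGTTATCCCTAAAC"     # hypothetical variant: last base changed

print(reference == mutated)                    # exact identity check
print(percent_identity(reference, mutated))    # similarity as a percentage
```

Plain equality answers the "identical" question; the percentage gives a rough idea of "similar" when the two strings differ in only a few positions.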

Keep it simple

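First, without using Python or any other tool, we can rely on simple shell functionality: run a diff on the two strings, or compare their MD5 checksums. The md5 command used here is the macOS one; on Linux the equivalent is md5sum. Keep in mind that a checksum only matches when the two strings are identical byte for byte, including letter case and whitespace.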

$ echo -n "tacgctgtta tccctaaagt"|md5
d279afa8c74582037ab7cd72b77c0664
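
The same check can be sketched in Python with the standard hashlib module. The helper name sequence_md5 and the normalization step (dropping whitespace and upper-casing) are assumptions added for illustration; the shell command above hashes the string exactly as typed.

```python
import hashlib


def sequence_md5(sequence: str) -> str:
    """MD5 hex digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


a = "tacgctgtta tccctaaagt"   # lower case, with a space, as in the shell example
b = "TACGCTGTTATCCCTAAAGT"    # same bases, different formatting

# After normalization the two strings hash to the same value.
print(sequence_md5(a) == sequence_md5(b))

# Without normalization the raw strings differ, and so do their digests.
raw_a = hashlib.md5(a.encode("ascii")).hexdigest()
raw_b = hashlib.md5(b.encode("ascii")).hexdigest()
print(raw_a == raw_b)
```

Normalizing first is a design choice: it makes the comparison about the bases rather than about how a particular file happened to be formatted.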

Data sources

For this example, I am comparing data stored in the European Nucleotide Archive (ENA) and at the National Center for Biotechnology Information (NCBI), two of the leading providers of sequence data. Because of the International Nucleotide Sequence Database Collaboration (INSDC), they share some common formats and interoperability features; for example, both use the same accession numbers. I am grabbing data for an organism called Stylodactylus serratus (a kind of shrimp, in plain English).

$ wget 'https://www.ebi.ac.uk/ena/data/view/AM076944&display=fasta&download=fasta&filename=AM076944.fasta' -O ena.fasta
$ cat ena.fasta
>ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTA
TTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAAC
AACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTT
ATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAA
TTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG

$ wget 'https://www.ncbi.nlm.nih.gov/search/api/sequence/AM076944/?report=fasta' -O ncbi.fasta
$ cat ncbi.fasta
>AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTATTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAACAACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTTATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAATTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG
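
Note that the two downloads are not identical as files: the header lines differ, and the sequence may be wrapped differently, so a diff or checksum on the raw files would report a mismatch even when the bases agree. Below is a minimal sketch of comparing only the sequence in plain Python. The function name read_fasta_sequence is hypothetical, it assumes a single-record FASTA file, and the short strings (first 40 bases, with an arbitrary line wrap) stand in for the downloaded files.

```python
def read_fasta_sequence(text: str) -> str:
    """Return the sequence of a single-record FASTA string, without header or line breaks."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Everything after the header is sequence; join the wrapped lines.
    return "".join(lines[1:]).upper()


# Shortened stand-ins for ena.fasta and ncbi.fasta (first 40 bases only).
ena_text = """>ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGT
AACTTATACTTTTAATCCTT
"""
ncbi_text = """>AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTT
"""

print(ena_text == ncbi_text)   # raw text differs (header and wrapping)
print(read_fasta_sequence(ena_text) == read_fasta_sequence(ncbi_text))   # bases agree
```

With the real files, the same function could be fed the file contents, for example read_fasta_sequence(open("ena.fasta").read()).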

Biopython

>>> from Bio import SeqIO
>>> enarecord = SeqIO.read("ena.fasta", "fasta")
>>> ncbirecord = SeqIO.read("ncbi.fasta", "fasta")
>>> enarecord.seq
Seq('TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAAT...GAG', SingleLetterAlphabet())
>>> enarecord.seq==ncbirecord.seq
True
>>> hash(str(enarecord.seq))
4043515072338089544
>>> hash(str(ncbirecord.seq))
4043515072338089544
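
One caveat: for strings, Python's built-in hash() is salted per interpreter process by default, so the numbers above are comparable only within one session and will generally differ on the next run. For a fingerprint that can be stored or shared, a digest from hashlib is a safer choice. A minimal sketch, where sequence_fingerprint is a hypothetical helper and the plain strings stand in for str(enarecord.seq) and str(ncbirecord.seq):

```python
import hashlib


def sequence_fingerprint(sequence: str) -> str:
    """SHA-256 hex digest of a sequence; the same input gives the same digest on every run."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


ena_seq = "TACGCTGTTATCCCTAAAG"    # stand-in for str(enarecord.seq)
ncbi_seq = "TACGCTGTTATCCCTAAAG"   # stand-in for str(ncbirecord.seq)

# Equal sequences give equal fingerprints, in this session or any other.
print(sequence_fingerprint(ena_seq) == sequence_fingerprint(ncbi_seq))
```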

Data Architect@Distributed System of Scientific Collections (https://dissco.eu). PhD in Sociology. Bachelor's in Math and CS from the University of Illinois.

Sharif Islam
