Comparing DNA sequences as strings
In this post, I describe how to use Python to compare two DNA sequences from two different databases. This is not about any specific scientific use case. I am primarily interested in the details of different APIs and data structures that deal with sequences. I am also interested in data interoperability and FAIR. This example will break down things with a very specific string comparison exercise which can be a small building block for a complex distributed data infrastructure.
In the beginning, was the string
One of the common ways to represent Nucleobases like DNA is simply with a succession of letters like ‘TACGCTGTTATCCCTAAAG’. So, in essence, all we are talking about here is comparing two sets of strings. Of course, there are other methods to verify sequence fingerprints and data traceability but here I am just focusing on the sequence alphabets (a use case here is to perform sequence similarity searches that identify which sequence entries are identical or similar to each other).
Keep it simple
First, without using Python or any other tool, we can just rely on simple bash (Linux shell) functionalities to either run a diff or MD5 checksum to compare the strings.
$ echo -n "tacgctgtta tccctaaagt"|md5
d279afa8c74582037ab7cd72b77c0664
But if we can do this programmatically that is, of course, better as you will see in the example below.
Data sources
For this example, I am comparing data stored in the European Nucleotide Archive (ENA) and National Center for Biochemical Information (NCBI) — these are the two leading providers of sequence data. Because of the International Nucleotide Sequence Database Collaboration (INSDC) there are some common formats and interoperability features. For example, they both use the same accession number. I am grabbing data for an organism called Stylodactylus serratus (kind of shrimp in plain English).
First, get the files via wget:
From ENA
wget 'https://www.ebi.ac.uk/ena/data/view/AM076944&display=fasta&download=fasta&filename=AM076944.fasta' -O ena.fasta$cat ena.fast
>ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTA
TTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAAC
AACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTT
ATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAA
TTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG
From NCBI
wget 'https://www.ncbi.nlm.nih.gov/search/api/sequence/AM076944/?report=fasta' -O ncbi.fasta>AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA geneTACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTATTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAACAACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTTATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAATTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG
AM0769944 is the accession number (basically an identifier that is shared across these two databases which makes mapping easier; FASTA format is a common way to store the sequence data in a text format which makes parsing and comparing possible. But as you can see there are different ways of storing and indenting the strings. That is why a simple string comparison will fail so we need programmable methods using Python.
Biopython
Read the files
>>> from Bio import SeqIO
>>> enarecord = SeqIO.read("ena.fasta", "fasta")
>>> enarecord.seq
Seq('TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAAT...GAG', SingleLetterAlphabet())
In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length or comparing the strings:
>>> enarecord.seq==ncbirecord.seq
True
We can also calculate the hash
>>> hash(str(enarecord.seq))
4043515072338089544>>> hash(str(ncbirecord.seq))
4043515072338089544
So with a very simple Python script, one can easily compare sequence data from different sources.