Comparing DNA sequences as strings

In this post, I describe how to use Python to compare two DNA sequences from two different databases. This is not about any specific scientific use case. I am primarily interested in the details of different APIs and data structures that deal with sequences. I am also interested in data interoperability and FAIR. This example will break down things with a very specific string comparison exercise which can be a small building block for a complex distributed data infrastructure.

In the beginning, was the string

One of the common ways to represent Nucleobases like DNA is simply with a succession of letters like ‘TACGCTGTTATCCCTAAAG’. So, in essence, all we are talking about here is comparing two sets of strings. Of course, there are other methods to verify sequence fingerprints and data traceability but here I am just focusing on the sequence alphabets (a use case here is to perform sequence similarity searches that identify which sequence entries are identical or similar to each other).

Image source:
Image source:

Keep it simple

First, without using Python or any other tool, we can just rely on simple bash (Linux shell) functionalities to either run a diff or MD5 checksum to compare the strings.

$ echo -n "tacgctgtta tccctaaagt"|md5

But if we can do this programmatically that is, of course, better as you will see in the example below.

Data sources

For this example, I am comparing data stored in the European Nucleotide Archive (ENA) and National Center for Biochemical Information (NCBI) — these are the two leading providers of sequence data. Because of the International Nucleotide Sequence Database Collaboration (INSDC) there are some common formats and interoperability features. For example, they both use the same accession number. I am grabbing data for an organism called Stylodactylus serratus (kind of shrimp in plain English).

First, get the files via wget:
From ENA

wget '' -O ena.fasta$cat 
>ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene



AM0769944 is the accession number (basically an identifier that is shared across these two databases which makes mapping easier; FASTA format is a common way to store the sequence data in a text format which makes parsing and comparing possible. But as you can see there are different ways of storing and indenting the strings. That is why a simple string comparison will fail so we need programmable methods using Python.



Read the files

>>> from Bio import SeqIO
>>> enarecord ="ena.fasta", "fasta")
>>> enarecord.seq

In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length or comparing the strings:

>>> enarecord.seq==ncbirecord.seq

We can also calculate the hash

>>> hash(str(enarecord.seq))
>>> hash(str(ncbirecord.seq))

So with a very simple Python script, one can easily compare sequence data from different sources.

Data Architect@Distributed System of Scientific Collections ( PhD in Sociology. Bachelor's in Math and CS from the University of Illinois.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store