Comparing DNA sequences as strings

In the beginning was the string

Image source: https://en.wikipedia.org/wiki/DNA

Keep it simple

$ echo -n "tacgctgtta tccctaaagt"|md5
d279afa8c74582037ab7cd72b77c0664
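The digest above is computed over the exact bytes handed to md5, so the space inside the quoted string and the letter case both influence the result. A minimal sketch of the same idea in Python, assuming whitespace and case should be ignored before hashing; the helper names `normalize` and `sequence_md5` are made up for this illustration and are not part of any library:

```python
import hashlib


def normalize(sequence: str) -> str:
    """Drop all whitespace and upper-case the bases (an assumed convention)."""
    return "".join(sequence.split()).upper()


def sequence_md5(sequence: str) -> str:
    """Return the MD5 hex digest of the normalized sequence."""
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


raw = "tacgctgtta tccctaaagt"          # same text as in the shell example
wrapped = "TACGCTGTTA\nTCCCTAAAGT"      # same bases, different case and wrapping

# Hashing the raw bytes mirrors `echo -n ... | md5`: the space is part of the input.
print(hashlib.md5(raw.encode("ascii")).hexdigest())

# After normalization, formatting differences no longer change the digest.
print(sequence_md5(raw) == sequence_md5(wrapped))
```

Whether to normalize at all is a design choice: hashing the raw text is simpler, but it ties the digest to one particular formatting of the sequence.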

Data sources

$ wget 'https://www.ebi.ac.uk/ena/data/view/AM076944&display=fasta&download=fasta&filename=AM076944.fasta' -O ena.fasta
$ cat ena.fasta
>ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTA
TTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAAC
AACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTT
ATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAA
TTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG
$ wget 'https://www.ncbi.nlm.nih.gov/search/api/sequence/AM076944/?report=fasta' -O ncbi.fasta
$ cat ncbi.fasta
>AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene
TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAATAATCTATTTATAAATATTTAATTTATAAAACAGTTAAAAATTTTATTGGGGCCGCCCCAGCCAAACAACTTGTATTTAATCCCATTTAATATAAATTTAAAAACTAAAATTCACTTGTAAAGTTTTATAGGGTCTTATCGTCCCTTCAGTTTATTTAAGCCTTTTCACTTAAAAGTAAAGTTTAAATTACACCAGTAAGACAGCTTCCCTTTTGTTCAACCATTCATTCCAGCCTCCAATTAAGAG
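The two downloads carry different header text and different line wrapping, so a byte-for-byte comparison of the files would report a mismatch even when the sequence is identical. A rough sketch of comparing only the sequence part, assuming each file holds a single record; `read_single_fasta` is a hypothetical helper, and the short inline strings stand in for the contents of ena.fasta and ncbi.fasta (first 20 bases only):

```python
from typing import Tuple


def read_single_fasta(text: str) -> Tuple[str, str]:
    """Split single-record FASTA text into (header, sequence).

    Assumes exactly one record: the first non-empty line starts with '>'
    and every following non-empty line is sequence data.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    header = lines[0][1:]
    sequence = "".join(lines[1:]).upper()
    return header, sequence


# Shortened stand-ins for the two downloaded files.
ena_text = (
    ">ENA|AM076944|AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene\n"
    "TACGCTGTTA\n"
    "TCCCTAAAGT\n"
)
ncbi_text = (
    ">AM076944.1 Stylodactylus serratus mitochondrial partial 16S rRNA gene\n"
    "TACGCTGTTATCCCTAAAGT\n"
)

ena_header, ena_seq = read_single_fasta(ena_text)
ncbi_header, ncbi_seq = read_single_fasta(ncbi_text)

print(ena_header == ncbi_header)  # headers are formatted differently by each source
print(ena_seq == ncbi_seq)        # the sequences themselves can still be compared
```

To run this against the real downloads, the file contents could be passed in with something like `read_single_fasta(open("ena.fasta").read())`.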

Biopython

>>> from Bio import SeqIO
>>> enarecord = SeqIO.read("ena.fasta", "fasta")
>>> enarecord.seq
Seq('TACGCTGTTATCCCTAAAGTAACTTATACTTTTAATCCTTAAAAAAGGATCAAT...GAG', SingleLetterAlphabet())
>>> ncbirecord = SeqIO.read("ncbi.fasta", "fasta")
>>> enarecord.seq == ncbirecord.seq
True
>>> hash(str(enarecord.seq))
4043515072338089544
>>> hash(str(ncbirecord.seq))
4043515072338089544
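One caveat about the built-in hash(): since Python 3.3, string hashes are salted per interpreter process unless PYTHONHASHSEED is fixed, so the integers shown above will generally differ on another run or another machine. The sketch below illustrates this by asking child interpreters, started with different seeds, for the hash of the same string. The seed values and the helper name `hash_in_child` are arbitrary choices for the demonstration.

```python
import os
import subprocess
import sys


def hash_in_child(text: str, seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


sequence = "TACGCTGTTATCCCTAAAGT"

# Two processes with the same seed agree on the hash value.
same_seed_equal = hash_in_child(sequence, "1") == hash_in_child(sequence, "1")

# Two processes with different seeds are expected to disagree.
different_seed_equal = hash_in_child(sequence, "1") == hash_in_child(sequence, "2")

print(same_seed_equal)
print(different_seed_equal)
```

Within a single session, comparing hash() values as above works fine. For a value meant to be stored or shared, a digest from hashlib over `str(record.seq)` is the more portable choice.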

Data Architect@Distributed System of Scientific Collections (https://dissco.eu). PhD in Sociology. Bachelor's in Math and CS from the University of Illinois.

Sharif Islam
