Quick background - the evasin family

Evasins are a family of secreted proteins found in ticks.

They bind to cytokines and modulate the host immune response.

The two well studied evasins, Evasin-1 and Evasin-4 from Rhipicephalus sanguineus are residues 114 and 127 residues long respectively.

The family contains a conserved patterns of four disulpide bonds.

The three-dimensional structure of Evasin-1 from Rhipicephalus sanguineus is known (PDB ID: 3FPR).

Just for fun, let's take a look.

>>> 
# If we start a line with `!`, Jupyter runs a Unix shell command
# This is a special Jupyter feature, not regular Python
!pip install nglview
Requirement already satisfied: nglview in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages
Requirement already satisfied: numpy in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nglview)
Requirement already satisfied: ipywidgets==7 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nglview)
Requirement already satisfied: ipython>=4.0.0; python_version >= "3.3" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipywidgets==7->nglview)
Requirement already satisfied: widgetsnbextension~=3.0.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipywidgets==7->nglview)
Requirement already satisfied: ipykernel>=4.5.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipywidgets==7->nglview)
Requirement already satisfied: traitlets>=4.3.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipywidgets==7->nglview)
Requirement already satisfied: nbformat>=4.2.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipywidgets==7->nglview)
Requirement already satisfied: setuptools>=18.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: pygments in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: decorator in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: pexpect; sys_platform != "win32" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: simplegeneric>0.8 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: jedi>=0.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: appnope; sys_platform == "darwin" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: pickleshare in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: notebook>=4.4.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: jupyter-client in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipykernel>=4.5.1->ipywidgets==7->nglview)
Requirement already satisfied: tornado>=4.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from ipykernel>=4.5.1->ipywidgets==7->nglview)
Requirement already satisfied: six in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from traitlets>=4.3.1->ipywidgets==7->nglview)
Requirement already satisfied: ipython-genutils in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from traitlets>=4.3.1->ipywidgets==7->nglview)
Requirement already satisfied: jupyter-core in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets==7->nglview)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2.0->ipywidgets==7->nglview)
Requirement already satisfied: ptyprocess>=0.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pexpect; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: parso==0.1.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from jedi>=0.10->ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: wcwidth in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0; python_version >= "3.3"->ipywidgets==7->nglview)
Requirement already satisfied: terminado>=0.3.3; sys_platform != "win32" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: nbconvert in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: jinja2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: pyzmq>=13 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets==7->nglview)
Requirement already satisfied: python-dateutil>=2.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from jupyter-client->ipykernel>=4.5.1->ipywidgets==7->nglview)
Requirement already satisfied: mistune>=0.7.4 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: pandocfilters>=1.4.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: bleach in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: entrypoints>=0.2.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: testpath in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: MarkupSafe>=0.23 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from jinja2->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
Requirement already satisfied: webencodings in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from html5lib!=1.0b1,!=1.0b2,!=1.0b3,!=1.0b4,!=1.0b5,!=1.0b6,!=1.0b7,!=1.0b8,>=0.99999999pre->bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.0.0->ipywidgets==7->nglview)
>>> 
import urllib
import nglview as nv

pdb_id = '3FPR'

# We can download the coordinates from the PDB 
# and save them to a local file like this
pdb_fn = "%s.pdb" % pdb_id
urllib.request.urlretrieve("https://files.rcsb.org/download/%s" % pdb_fn, filename=pdb_fn)

view = nv.show_file('3FPR.pdb')

view.representations = []

# Show the bonds in the cysteines as thick cylinders
view.add_licorice('CYS')
# Show the backbone as a cartoon, coloured by secondary structure type
view.add_cartoon(color='sstruc')

# Hint: You can rotate by holding the left mouse button and zoom with the scroll wheel.
view.display()

Failed to display Jupyter Widget of type NGLWidget.

If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean that the widgets JavaScript is still loading. If this message persists, it likely means that the widgets JavaScript library is either not installed or not enabled. See the Jupyter Widgets Documentation for setup instructions.

If you're reading this message in another frontend (for example, a static rendering on GitHub or NBViewer), it may mean that your frontend doesn't currently support widgets.

Let's run a BLAST search against the locally installed Uniprot database to find protein sequences similar to Evasin-1 and Evasin-4.

evasins_canonical.fasta contains the sequences for Evasin-1 and Evasin-4. We know from prior work that they are paralogous, so we will BLAST search them both and combine all the hits.

>>> 
import subprocess
database = '/mnt/references/uniprot/2017_10/uniprot_2017_10'

# Since this takes ~5 min or more to run, we've prepared the "evasins_canonical_uniprot.blast7" output file earlier.
# Please don't run it, if everyone does you'll probably bring down the Jupyter server !

# process = subprocess.run('blastp -query evasins_canonical.fasta -out evasins_canonical_uniprot.blast7 -outfmt 7 -evalue 10 -db %s' % database, 
#                          shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# print(process)
>>> 
# Instead, we will just download the results prepared earlier
from urllib.request import urlretrieve
github_base = "https://raw.githubusercontent.com/MonashBioinformaticsPlatform/intro-to-python-for-bioinformatics/master/notebooks"
urlretrieve("%s/evasins_canonical.fasta" % github_base, filename="evasins_canonical.fasta")
urlretrieve("%s/evasins_canonical_uniprot.blast7" % github_base, filename="evasins_canonical_uniprot.blast7")
# urllib.request.urlretrieve("%s/evasins_uniprot_hits.fasta.complete" % github_base, filename="evasins_uniprot_hits.fasta")
Out[45]:
('evasins_uniprot_hits.fasta', <http.client.HTTPMessage at 0x113926d30>)

BLAST+ format 7 (-outfmt 7) looks like this:

>>> 
!head evasins_canonical_uniprot.blast7
# BLASTP 2.2.31+
# Query: sp|P0C8E7|EVA1_RHISA Evasin-1 OS=Rhipicephalus sanguineus PE=1 SV=1
# Database: /mnt/references/uniprot/2017_10/uniprot_2017_10
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 123 hits found
sp|P0C8E7|EVA1_RHISA	sp|P0C8E7|EVA1_RHISA	100.00	114	0	0	1	114	1	114	9e-78	237
sp|P0C8E7|EVA1_RHISA	tr|C9W1N8|C9W1N8_RHISA	99.12	114	1	0	1	114	1	114	3e-77	235
sp|P0C8E7|EVA1_RHISA	tr|A0A131Z694|A0A131Z694_RHIAP	57.01	107	41	2	13	114	19	125	5e-34	125
sp|P0C8E7|EVA1_RHISA	tr|A0A131ZAV1|A0A131ZAV1_RHIAP	64.44	90	31	1	23	111	18	107	6e-34	125
sp|P0C8E7|EVA1_RHISA	tr|A0A131X8U1|A0A131X8U1_9ACAR	56.18	89	38	1	25	113	1	88	3e-30	115

The first few lines (#) contain metadata about the search that we can skip for now. Each BLAST hit is a single line with tab-seperated fields. The second field is the sequence ID we want. We will grab all the hits irrespective of e-value.

First we will parse the evasins_canonical_uniprot.blast7 file to create a list of hits.

Once we have the IDs of each hit, we can use blastdbcmd to extract the full length sequences from the BLAST database.

>>> 
blast_hits = []
with open('evasins_canonical_uniprot.blast7', 'r') as fh:
    for line in fh:
        if line[0] != '#':
            blast_hits.append(line.split('\t')[1])

blast_hits[0:5]
Out[47]:
['sp|P0C8E7|EVA1_RHISA',
 'tr|C9W1N8|C9W1N8_RHISA',
 'tr|A0A131Z694|A0A131Z694_RHIAP',
 'tr|A0A131ZAV1|A0A131ZAV1_RHIAP',
 'tr|A0A131X8U1|A0A131X8U1_9ACAR']

Challenge

Rather than extracting the ID of the hits, can you also instead parse the E-value field and create a list of E-values ?

Protip: The scikit-bio package has a BLAST parser that captures all the fields into a Pandas dataframe. The BioPython package can also parse BLAST XML output. If I wasn't trying to demonstrate basic file parsing, I'd probably use one of those packages instead.

http://scikit-bio.org/docs/latest/generated/skbio.io.format.blast7.html#module-skbio.io.format.blast7

We want to ensure there are no duplicate IDs in our list (since the blastp command was supplied with two query sequences, Evasin-1 and Evasin-4, the evasins_canonical_uniprot.blast7 output will contain the results of two BLAST searches).

If we turn the list into a set, Python will automatically discard duplicate IDs.

>>> 
print("Total number of BLAST hits:", len(blast_hits))
Total number of BLAST hits: 149
>>> 
blast_hits = set(blast_hits)
>>> 
print("Number of hits with duplicates removed:", len(blast_hits))
Number of hits with duplicates removed: 132

Now we use blastdbcmd with out list of unique IDs to extract the full FASTA format sequence from the BLAST database.

>>> 
# We redirect to a file with '>' using the Unix shell command, rather than 
# capturing stdout in process.stdout
cmd = 'blastdbcmd -db %s -entry "%s" >evasins_uniprot_hits.fasta' % (database, ','.join(blast_hits))
cmd
Out[52]:
'blastdbcmd -db /mnt/references/uniprot/2017_10/uniprot_2017_10 -entry "tr|G3MIM7|G3MIM7_9ACAR,tr|L7MC72|L7MC72_9ACAR,tr|A0A131X8U1|A0A131X8U1_9ACAR,tr|A0A224YCD7|A0A224YCD7_9ACAR,tr|A0A023FG08|A0A023FG08_9ACAR,tr|A0A224YCA7|A0A224YCA7_9ACAR,tr|A0A023G506|A0A023G506_9ACAR,tr|A0A131YUQ8|A0A131YUQ8_RHIAP,tr|A0A023FDQ9|A0A023FDQ9_9ACAR,tr|A0A023FQ56|A0A023FQ56_9ACAR,tr|A0A023G065|A0A023G065_9ACAR,sp|P0C8E9|EVA4_RHISA,tr|A0A023G9R1|A0A023G9R1_9ACAR,tr|L7M8Z8|L7M8Z8_9ACAR,tr|A0A023FG84|A0A023FG84_9ACAR,tr|A0A1E1X165|A0A1E1X165_9ACAR,tr|A0A131YQC0|A0A131YQC0_RHIAP,tr|A0A023FQ59|A0A023FQ59_9ACAR,tr|A0A131YNV9|A0A131YNV9_RHIAP,tr|A0A023FD15|A0A023FD15_9ACAR,tr|A0A023FC49|A0A023FC49_9ACAR,tr|A0A1E1XT85|A0A1E1XT85_9ACAR,tr|A0A0C9S4J0|A0A0C9S4J0_AMBAM,tr|A0A131YRX0|A0A131YRX0_RHIAP,tr|A0A023G8N3|A0A023G8N3_9ACAR,tr|L7MC74|L7MC74_9ACAR,tr|A0A131YT06|A0A131YT06_RHIAP,tr|A0A1E1WXB0|A0A1E1WXB0_9ACAR,tr|A0A023FBW7|A0A023FBW7_9ACAR,tr|A0A023FSY8|A0A023FSY8_9ACAR,tr|A0A023FF05|A0A023FF05_9ACAR,tr|A0A023GCB8|A0A023GCB8_9ACAR,tr|G3MSY7|G3MSY7_9ACAR,tr|A0A131XLS7|A0A131XLS7_9ACAR,tr|A0A023G6B6|A0A023G6B6_9ACAR,tr|G3MSY6|G3MSY6_9ACAR,tr|A0A023FCJ0|A0A023FCJ0_9ACAR,tr|A0A023FEC6|A0A023FEC6_9ACAR,tr|A0A0C9SAC5|A0A0C9SAC5_AMBAM,tr|A0A023FF01|A0A023FF01_9ACAR,tr|A0A224Y3C9|A0A224Y3C9_9ACAR,tr|C9W1Q9|C9W1Q9_RHISA,tr|A0A0F3KQC8|A0A0F3KQC8_9NEIS,tr|A0A0C9S461|A0A0C9S461_AMBAM,tr|A0A023G2I5|A0A023G2I5_9ACAR,tr|C9W1N8|C9W1N8_RHISA,tr|A0A1E1WWB2|A0A1E1WWB2_9ACAR,tr|G3MIX6|G3MIX6_9ACAR,tr|A0A023FSX6|A0A023FSX6_9ACAR,tr|A0A023FCU0|A0A023FCU0_9ACAR,tr|A0A0C9RX43|A0A0C9RX43_AMBAM,tr|A0A224Y913|A0A224Y913_9ACAR,tr|A0A023FZW1|A0A023FZW1_9ACAR,tr|L7MC83|L7MC83_9ACAR,tr|A0A023FQ58|A0A023FQ58_9ACAR,tr|A0A0C9R5A0|A0A0C9R5A0_AMBAM,tr|A0A023G9R6|A0A023G9R6_9ACAR,tr|A0A131YHW9|A0A131YHW9_RHIAP,tr|C9W1Q7|C9W1Q7_RHISA,tr|A0A131Z694|A0A131Z694_RHIAP,tr|A0A023FFD0|A0A023FFD0_9ACAR,tr|A0A023FG15|A0A023FG15_9ACAR,tr|A0A131XJ87|A0A131XJ87_9ACAR,tr|A0A023FZ70|A0A023FZ70_9ACAR,tr|A0A023FC58|A0A023FC58_9ACAR,tr|A0A023G5I5|A0A023G5I5_9ACAR,tr|A0A023G4E4|A0A023G4E4_9ACAR,tr|A0A023FC51|A0A023FC51_9ACAR,tr|A0A1E1X0E4|A0A1E1X0E4_9ACAR,tr|A0A0C9RVT5|A0A0C9RVT5_AMBAM,tr|A0A131Z2K5|A0A131Z2K5_RHIAP,tr|A0A0C9R689|A0A0C9R689_AMBAM,tr|A0A023FQC1|A0A023FQC1_9ACAR,tr|A0A023GD11|A0A023GD11_9ACAR,tr|A0A223FZ22|A0A223FZ22_RHIMP,tr|A0A023FC02|A0A023FC02_9ACAR,tr|A0A023G9N2|A0A023G9N2_9ACAR,tr|A0A131Z5M0|A0A131Z5M0_RHIAP,tr|A0A131Z6L7|A0A131Z6L7_RHIAP,tr|A0A0C9S3J8|A0A0C9S3J8_AMBAM,tr|A0A131YFL3|A0A131YFL3_RHIAP,tr|A0A023FFB5|A0A023FFB5_9ACAR,tr|A0A131Z586|A0A131Z586_RHIAP,tr|A0A023FDY8|A0A023FDY8_9ACAR,tr|A0A1E1X1C4|A0A1E1X1C4_9ACAR,tr|A0A023FRB3|A0A023FRB3_9ACAR,tr|A0A223FZ27|A0A223FZ27_RHIMP,tr|A0A224YE88|A0A224YE88_9ACAR,tr|A0A1E5SLG8|A0A1E5SLG8_9BACT,tr|A0A023G9N9|A0A023G9N9_9ACAR,tr|A0A023FC52|A0A023FC52_9ACAR,tr|A0A023FED6|A0A023FED6_9ACAR,tr|A0A023FCX2|A0A023FCX2_9ACAR,tr|A0A023FE33|A0A023FE33_9ACAR,tr|A0A023G2G7|A0A023G2G7_9ACAR,tr|A0A023GCC3|A0A023GCC3_9ACAR,tr|A0A023FQ65|A0A023FQ65_9ACAR,tr|A0A023FT53|A0A023FT53_9ACAR,tr|W6KVX4|W6KVX4_9TRYP,tr|G3MJ83|G3MJ83_9ACAR,tr|A0A023FRC3|A0A023FRC3_9ACAR,tr|A0A023FQ55|A0A023FQ55_9ACAR,tr|A0A023G4G4|A0A023G4G4_9ACAR,tr|A0A224YE69|A0A224YE69_9ACAR,tr|A0A131Z5R5|A0A131Z5R5_RHIAP,tr|A0A131Z6N2|A0A131Z6N2_RHIAP,tr|A0A023G2M2|A0A023G2M2_9ACAR,tr|A0A0C9R5W8|A0A0C9R5W8_AMBAM,tr|A0A023G9N5|A0A023G9N5_9ACAR,tr|A0A131YS46|A0A131YS46_RHIAP,sp|P0C8E7|EVA1_RHISA,tr|A0A023G2D8|A0A023G2D8_9ACAR,tr|G3MSY5|G3MSY5_9ACAR,tr|A0A224Y3D0|A0A224Y3D0_9ACAR,tr|A0A224YCT0|A0A224YCT0_9ACAR,tr|A0A224Y3D8|A0A224Y3D8_9ACAR,tr|A0A023GD07|A0A023GD07_9ACAR,tr|A0A224YE79|A0A224YE79_9ACAR,tr|L7MAB0|L7MAB0_9ACAR,tr|C9W1C3|C9W1C3_RHISA,tr|A0A023FDV6|A0A023FDV6_9ACAR,tr|L7LT68|L7LT68_9ACAR,tr|L7M8Y5|L7M8Y5_9ACAR,tr|A0A023FSY4|A0A023FSY4_9ACAR,tr|A0A0C9SDE2|A0A0C9SDE2_AMBAM,tr|A0A023G9Q7|A0A023G9Q7_9ACAR,tr|A0A131ZAV1|A0A131ZAV1_RHIAP,tr|A0A023G159|A0A023G159_9ACAR,tr|A0A131X968|A0A131X968_9ACAR,tr|A0A023G109|A0A023G109_9ACAR,tr|A0A023FTK5|A0A023FTK5_9ACAR,tr|L7MA80|L7MA80_9ACAR" >evasins_uniprot_hits.fasta'
>>> 
import subprocess

process = subprocess.run(cmd, shell=True, stderr=subprocess.PIPE)

# If you'd like to skip this step (eg if you don't have blastdbcmd or a local Uniprot BLAST database available),
# you can run:
#
# !cp evasins_uniprot_hits.fasta.complete evasins_uniprot_hits.fasta
#
# to create a copy of the expected output sequences.
>>> 
# Take a peek at the output of our `blastdbcmd`, in the file evasin1_uniprot.fasta
!head evasins_uniprot_hits.fasta
>tr|A0A023G8N3|A0A023G8N3_9ACAR Putative evasin 1 OS=Amblyomma triste PE=2 SV=1
MTTTTSATLIFLLYVSQLLVGFSRNEASDPPQTDEDCEYYDPAVDNITCTIQSLNTTGDPIPVGCLATCENSTRYLPNGT
ECLGISQHVANRMQGNVNYTCPVGLCYRGVCQRNGLGIDCWHDTPPPNSTNVTTKAPTTLTSGRDL
>tr|A0A023G9R1|A0A023G9R1_9ACAR Putative evasin 1 OS=Amblyomma triste PE=2 SV=1
MTTTSAALIFLLYISQLLVVFSGNEVSDPPLIDEDCEYYDPSEDNITCTIRSLNTTGRPIPVGCLAMCENSTRRLHNGTE
CLGISDKVANRMQGNVTYTCPVGLCYRGVCDRNGLGIDCWHNTPPPNSTNVTTNASTTSLPTSSRDL
>tr|A0A023G4G4|A0A023G4G4_9ACAR Putative secreted protein OS=Amblyomma triste PE=2 SV=1
MSSQSMFQLFLFVCGVAASVSANTDQSLLSCTSHKVLVMTNLGPFQVGCTQPCSNMKGMAAISEDGEECIDITRDGARKM
PRRLEHRCPTGQCIKGTCKPDDLQVLCWYTGKDDATTPPRGT
>tr|A0A023FCX2|A0A023FCX2_9ACAR Putative secreted mucin OS=Amblyomma cajennense PE=2 SV=1

Using the Python ecosystem: FASTA parsing with skbio.io

http://scikit-bio.org/docs/latest/io.html

Since you know how to parse a file now, you could go ahead and write a FASTA format parser to read in our new set of putative evasin sequences. However, for most common file formats a parser already exists to make your life easier. We are going to use the one scikit-bio provides (Bio.SeqIO from Biopython would be another reasonable option).

If you are curious, you'll find a quick-n-dirty example of a hand written FASTA parser in the DIY_FASTA.ipynb notebook.

>>> 
!pip install scikit-bio==0.5.1
Requirement already satisfied: scikit-bio==0.5.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages
Requirement already satisfied: numpy>=1.9.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: pandas>=0.18.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: natsort>=4.0.3 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: IPython>=3.2.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: scipy>=0.15.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: CacheControl>=0.11.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: nose>=1.3.7 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: decorator>=3.4.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: matplotlib>=1.4.3 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: lockfile>=0.10.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from scikit-bio==0.5.1)
Requirement already satisfied: pytz>=2011k in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas>=0.18.0->scikit-bio==0.5.1)
Requirement already satisfied: python-dateutil>=2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pandas>=0.18.0->scikit-bio==0.5.1)
Requirement already satisfied: prompt-toolkit<2.0.0,>=1.0.4 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: pygments in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: pickleshare in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: traitlets>=4.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: jedi>=0.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: setuptools>=18.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: simplegeneric>0.8 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: pexpect; sys_platform != "win32" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: appnope; sys_platform == "darwin" in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: msgpack-python in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from CacheControl>=0.11.5->scikit-bio==0.5.1)
Requirement already satisfied: requests in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from CacheControl>=0.11.5->scikit-bio==0.5.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib>=1.4.3->scikit-bio==0.5.1)
Requirement already satisfied: cycler>=0.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib>=1.4.3->scikit-bio==0.5.1)
Requirement already satisfied: six>=1.10 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from matplotlib>=1.4.3->scikit-bio==0.5.1)
Requirement already satisfied: wcwidth in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from prompt-toolkit<2.0.0,>=1.0.4->IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: ipython-genutils in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from traitlets>=4.2->IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: parso==0.1.0 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from jedi>=0.10->IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: ptyprocess>=0.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from pexpect; sys_platform != "win32"->IPython>=3.2.0->scikit-bio==0.5.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->CacheControl>=0.11.5->scikit-bio==0.5.1)
Requirement already satisfied: idna<2.7,>=2.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->CacheControl>=0.11.5->scikit-bio==0.5.1)
Requirement already satisfied: certifi>=2017.4.17 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->CacheControl>=0.11.5->scikit-bio==0.5.1)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->CacheControl>=0.11.5->scikit-bio==0.5.1)
>>> 
import skbio.io
>>> 
seqs = []
with open('evasins_uniprot_hits.fasta') as fh:
    for seq in skbio.io.read(fh, format='fasta'):
        seqs.append(seq)
>>> 
len(seqs)
Out[58]:
132
>>> 
print(repr(seqs[0]))
Sequence
---------------------------------------------------------------------
Metadata:
    'description': 'Putative evasin 1 OS=Amblyomma triste PE=2 SV=1'
    'id': 'tr|A0A023G8N3|A0A023G8N3_9ACAR'
Stats:
    length: 146
---------------------------------------------------------------------
0   MTTTTSATLI FLLYVSQLLV GFSRNEASDP PQTDEDCEYY DPAVDNITCT IQSLNTTGDP
60  IPVGCLATCE NSTRYLPNGT ECLGISQHVA NRMQGNVNYT CPVGLCYRGV CQRNGLGIDC
120 WHDTPPPNST NVTTKAPTTL TSGRDL
>>> 
type(seqs[0])
Out[60]:
skbio.sequence._sequence.Sequence
>>> 
help(seqs[0])
Help on Sequence in module skbio.sequence._sequence object:

class Sequence(skbio.metadata._mixin.MetadataMixin, skbio.metadata._mixin.PositionalMetadataMixin, skbio.metadata._mixin.IntervalMetadataMixin, collections.abc.Sequence, skbio._base.SkbioObject)
 |  Store generic sequence data and optional associated metadata.
 |  
 |  ``Sequence`` objects do not enforce an alphabet or grammar and are thus the
 |  most generic objects for storing sequence data. ``Sequence`` objects do not
 |  necessarily represent biological sequences. For example, ``Sequence`` can
 |  be used to represent a position in a multiple sequence alignment.
 |  Subclasses ``DNA``, ``RNA``, and ``Protein`` enforce the IUPAC character
 |  set [1]_ for, and provide operations specific to, each respective molecule
 |  type.
 |  
 |  ``Sequence`` objects consist of the underlying sequence data, as well
 |  as optional metadata and positional metadata. The underlying sequence
 |  is immutable, while the metdata and positional metadata are mutable.
 |  
 |  Parameters
 |  ----------
 |  sequence : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |      Characters representing the sequence itself.
 |  metadata : dict, optional
 |      Arbitrary metadata which applies to the entire sequence. A shallow copy
 |      of the ``dict`` will be made (see Examples section below for details).
 |  positional_metadata : pd.DataFrame consumable, optional
 |      Arbitrary per-character metadata (e.g., sequence read quality
 |      scores). Must be able to be passed directly to ``pd.DataFrame``
 |      constructor. Each column of metadata must be the same length as
 |      `sequence`. A shallow copy of the positional metadata will be made if
 |      necessary (see Examples section below for details).
 |  interval_metadata : IntervalMetadata
 |      Arbitrary metadata which applies to intervals within a sequence to
 |      store interval features (such as genes, ncRNA on the sequence).
 |  lowercase : bool or str, optional
 |      If ``True``, lowercase sequence characters will be converted to
 |      uppercase characters. If ``False``, no characters will be converted.
 |      If a str, it will be treated as a key into the positional metadata of
 |      the object. All lowercase characters will be converted to uppercase,
 |      and a ``True`` value will be stored in a boolean array in the
 |      positional metadata under the key.
 |  
 |  Attributes
 |  ----------
 |  values
 |  metadata
 |  positional_metadata
 |  interval_metadata
 |  observed_chars
 |  
 |  See Also
 |  --------
 |  DNA
 |  RNA
 |  Protein
 |  
 |  References
 |  ----------
 |  .. [1] Nomenclature for incompletely specified bases in nucleic acid
 |     sequences: recommendations 1984.
 |     Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.
 |     A Cornish-Bowden
 |  
 |  Examples
 |  --------
 |  >>> from pprint import pprint
 |  >>> from skbio import Sequence
 |  >>> from skbio.metadata import IntervalMetadata
 |  
 |  **Creating sequences:**
 |  
 |  Create a sequence without any metadata:
 |  
 |  >>> seq = Sequence('GGUCGUGAAGGA')
 |  >>> seq
 |  Sequence
 |  ---------------
 |  Stats:
 |      length: 12
 |  ---------------
 |  0 GGUCGUGAAG GA
 |  
 |  Create a sequence with metadata and positional metadata:
 |  
 |  >>> metadata = {'id':'seq-id', 'desc':'seq desc', 'authors': ['Alice']}
 |  >>> positional_metadata = {'quality': [3, 3, 4, 10],
 |  ...                        'exons': [True, True, False, True]}
 |  >>> interval_metadata = IntervalMetadata(4)
 |  >>> interval = interval_metadata.add([(1, 3)], metadata={'gene': 'sagA'})
 |  >>> seq = Sequence('ACGT', metadata=metadata,
 |  ...                positional_metadata=positional_metadata,
 |  ...                interval_metadata=interval_metadata)
 |  >>> seq
 |  Sequence
 |  -----------------------------
 |  Metadata:
 |      'authors': <class 'list'>
 |      'desc': 'seq desc'
 |      'id': 'seq-id'
 |  Positional metadata:
 |      'exons': <dtype: bool>
 |      'quality': <dtype: int64>
 |  Interval metadata:
 |      1 interval feature
 |  Stats:
 |      length: 4
 |  -----------------------------
 |  0 ACGT
 |  
 |  **Retrieving underlying sequence data:**
 |  
 |  Retrieve underlying sequence:
 |  
 |  >>> seq.values # doctest: +NORMALIZE_WHITESPACE
 |  array([b'A', b'C', b'G', b'T'],
 |        dtype='|S1')
 |  
 |  Underlying sequence immutable:
 |  
 |  >>> seq.values = np.array([b'T', b'C', b'G', b'A'], dtype='|S1')
 |  Traceback (most recent call last):
 |      ...
 |  AttributeError: can't set attribute
 |  
 |  >>> seq.values[0] = b'T'
 |  Traceback (most recent call last):
 |      ...
 |  ValueError: assignment destination is read-only
 |  
 |  **Retrieving sequence metadata:**
 |  
 |  Retrieve metadata:
 |  
 |  >>> pprint(seq.metadata) # using pprint to display dict in sorted order
 |  {'authors': ['Alice'], 'desc': 'seq desc', 'id': 'seq-id'}
 |  
 |  Retrieve positional metadata:
 |  
 |  >>> seq.positional_metadata
 |     exons  quality
 |  0   True        3
 |  1   True        3
 |  2  False        4
 |  3   True       10
 |  
 |  Retrieve interval metadata:
 |  
 |  >>> seq.interval_metadata   # doctest: +ELLIPSIS
 |  1 interval feature
 |  ------------------
 |  Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'sagA'})
 |  
 |  **Updating sequence metadata:**
 |  
 |  .. warning:: Be aware that a shallow copy of ``metadata`` and
 |     ``positional_metadata`` is made for performance. Since a deep copy is
 |     not made, changes made to mutable Python objects stored as metadata may
 |     affect the metadata of other ``Sequence`` objects or anything else that
 |     shares a reference to the object. The following examples illustrate this
 |     behavior.
 |  
 |  First, let's create a sequence and update its metadata:
 |  
 |  >>> metadata = {'id':'seq-id', 'desc':'seq desc', 'authors': ['Alice']}
 |  >>> seq = Sequence('ACGT', metadata=metadata)
 |  >>> seq.metadata['id'] = 'new-id'
 |  >>> seq.metadata['pubmed'] = 12345
 |  >>> pprint(seq.metadata)
 |  {'authors': ['Alice'], 'desc': 'seq desc', 'id': 'new-id', 'pubmed': 12345}
 |  
 |  Note that the original metadata dictionary (stored in variable
 |  ``metadata``) hasn't changed because a shallow copy was made:
 |  
 |  >>> pprint(metadata)
 |  {'authors': ['Alice'], 'desc': 'seq desc', 'id': 'seq-id'}
 |  >>> seq.metadata == metadata
 |  False
 |  
 |  Note however that since only a *shallow* copy was made, updates to mutable
 |  objects will also change the original metadata dictionary:
 |  
 |  >>> seq.metadata['authors'].append('Bob')
 |  >>> seq.metadata['authors']
 |  ['Alice', 'Bob']
 |  >>> metadata['authors']
 |  ['Alice', 'Bob']
 |  
 |  This behavior can also occur when manipulating a sequence that has been
 |  derived from another sequence:
 |  
 |  >>> subseq = seq[1:3]
 |  >>> subseq
 |  Sequence
 |  -----------------------------
 |  Metadata:
 |      'authors': <class 'list'>
 |      'desc': 'seq desc'
 |      'id': 'new-id'
 |      'pubmed': 12345
 |  Stats:
 |      length: 2
 |  -----------------------------
 |  0 CG
 |  >>> pprint(subseq.metadata)
 |  {'authors': ['Alice', 'Bob'],
 |   'desc': 'seq desc',
 |   'id': 'new-id',
 |   'pubmed': 12345}
 |  
 |  The subsequence has inherited the metadata of its parent sequence. If we
 |  update the subsequence's author list, we see the changes propagated in the
 |  parent sequence and original metadata dictionary:
 |  
 |  >>> subseq.metadata['authors'].append('Carol')
 |  >>> subseq.metadata['authors']
 |  ['Alice', 'Bob', 'Carol']
 |  >>> seq.metadata['authors']
 |  ['Alice', 'Bob', 'Carol']
 |  >>> metadata['authors']
 |  ['Alice', 'Bob', 'Carol']
 |  
 |  The behavior for updating positional metadata is similar. Let's create a
 |  new sequence with positional metadata that is already stored in a
 |  ``pd.DataFrame``:
 |  
 |  >>> positional_metadata = pd.DataFrame(
 |  ...     {'quality': [3, 3, 4, 10], 'list': [[], [], [], []]})
 |  >>> seq = Sequence('ACGT', positional_metadata=positional_metadata)
 |  >>> seq
 |  Sequence
 |  -----------------------------
 |  Positional metadata:
 |      'list': <dtype: object>
 |      'quality': <dtype: int64>
 |  Stats:
 |      length: 4
 |  -----------------------------
 |  0 ACGT
 |  >>> seq.positional_metadata
 |    list  quality
 |  0   []        3
 |  1   []        3
 |  2   []        4
 |  3   []       10
 |  
 |  Now let's update the sequence's positional metadata by adding a new column
 |  and changing a value in another column:
 |  
 |  >>> seq.positional_metadata['gaps'] = [False, False, False, False]
 |  >>> seq.positional_metadata.loc[0, 'quality'] = 999
 |  >>> seq.positional_metadata
 |    list  quality   gaps
 |  0   []      999  False
 |  1   []        3  False
 |  2   []        4  False
 |  3   []       10  False
 |  
 |  Note that the original positional metadata (stored in variable
 |  ``positional_metadata``) hasn't changed because a shallow copy was made:
 |  
 |  >>> positional_metadata
 |    list  quality
 |  0   []        3
 |  1   []        3
 |  2   []        4
 |  3   []       10
 |  >>> seq.positional_metadata.equals(positional_metadata)
 |  False
 |  
 |  Next let's create a sequence that has been derived from another sequence:
 |  
 |  >>> subseq = seq[1:3]
 |  >>> subseq
 |  Sequence
 |  -----------------------------
 |  Positional metadata:
 |      'list': <dtype: object>
 |      'quality': <dtype: int64>
 |      'gaps': <dtype: bool>
 |  Stats:
 |      length: 2
 |  -----------------------------
 |  0 CG
 |  >>> subseq.positional_metadata
 |    list  quality   gaps
 |  0   []        3  False
 |  1   []        4  False
 |  
 |  As described above for metadata, since only a *shallow* copy was made of
 |  the positional metadata, updates to mutable objects will also change the
 |  parent sequence's positional metadata and the original positional metadata
 |  ``pd.DataFrame``:
 |  
 |  >>> subseq.positional_metadata.loc[0, 'list'].append('item')
 |  >>> subseq.positional_metadata
 |       list  quality   gaps
 |  0  [item]        3  False
 |  1      []        4  False
 |  >>> seq.positional_metadata
 |       list  quality   gaps
 |  0      []      999  False
 |  1  [item]        3  False
 |  2      []        4  False
 |  3      []       10  False
 |  >>> positional_metadata
 |       list  quality
 |  0      []        3
 |  1  [item]        3
 |  2      []        4
 |  3      []       10
 |  
 |  You can also update the interval metadata. Let's re-create a
 |  ``Sequence`` object with interval metadata at first:
 |  
 |  >>> seq = Sequence('ACGT')
 |  >>> interval = seq.interval_metadata.add(
 |  ...     [(1, 3)], metadata={'gene': 'foo'})
 |  
 |  You can update directly on the ``Interval`` object:
 |  
 |  >>> interval  # doctest: +ELLIPSIS
 |  Interval(interval_metadata=<...>, bounds=[(1, 3)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
 |  >>> interval.bounds = [(0, 2)]
 |  >>> interval  # doctest: +ELLIPSIS
 |  Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(False, False)], metadata={'gene': 'foo'})
 |  
 |  You can also query and obtain the interval features you are
 |  interested and then modify them:
 |  
 |  >>> intervals = list(seq.interval_metadata.query(metadata={'gene': 'foo'}))
 |  >>> intervals[0].fuzzy = [(True, False)]
 |  >>> print(intervals[0])  # doctest: +ELLIPSIS
 |  Interval(interval_metadata=<...>, bounds=[(0, 2)], fuzzy=[(True, False)], metadata={'gene': 'foo'})
 |  
 |  Method resolution order:
 |      Sequence
 |      skbio.metadata._mixin.MetadataMixin
 |      skbio.metadata._mixin.PositionalMetadataMixin
 |      skbio.metadata._mixin.IntervalMetadataMixin
 |      collections.abc.Sequence
 |      collections.abc.Reversible
 |      collections.abc.Collection
 |      collections.abc.Sized
 |      collections.abc.Iterable
 |      collections.abc.Container
 |      skbio._base.SkbioObject
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns truth value (truthiness) of sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Returns
 |      -------
 |      bool
 |          True if length of sequence is greater than 0, else False.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> bool(Sequence(''))
 |      False
 |      >>> bool(Sequence('ACGT'))
 |      True
 |  
 |  __contains__(self, subsequence)
 |      Determine if a subsequence is contained in this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      subsequence : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          The putative subsequence.
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether `subsequence` is contained in this sequence.
 |      
 |      Raises
 |      ------
 |      TypeError
 |          If `subsequence` is a ``Sequence`` object with a different type
 |          than this sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUCGUGAAGGA')
 |      >>> 'GGU' in s
 |      True
 |      >>> 'CCC' in s
 |      False
 |  
 |  __copy__(self)
 |      Return a shallow copy of this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      See Also
 |      --------
 |      copy
 |      
 |      Notes
 |      -----
 |      This method is equivalent to ``seq.copy(deep=False)``.
 |  
 |  __deepcopy__(self, memo)
 |      Return a deep copy of this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      See Also
 |      --------
 |      copy
 |      
 |      Notes
 |      -----
 |      This method is equivalent to ``seq.copy(deep=True)``.
 |  
 |  __eq__(self, other)
 |      Determine if this sequence is equal to another.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Sequences are equal if they are *exactly* the same type and their
 |      sequence characters, metadata, and positional metadata are the same.
 |      
 |      Parameters
 |      ----------
 |      other : Sequence
 |          Sequence to test for equality against.
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether this sequence is equal to `other`.
 |      
 |      Examples
 |      --------
 |      Define two ``Sequence`` objects that have the same underlying sequence
 |      of characters:
 |      
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACGT')
 |      >>> t = Sequence('ACGT')
 |      
 |      The two sequences are considered equal because they are the same type,
 |      their underlying sequence of characters are the same, and their
 |      optional metadata attributes (``metadata`` and ``positional_metadata``)
 |      were not provided:
 |      
 |      >>> s == t
 |      True
 |      >>> t == s
 |      True
 |      
 |      Define another sequence object with a different sequence of characters
 |      than the previous two sequence objects:
 |      
 |      >>> u = Sequence('ACGA')
 |      >>> u == t
 |      False
 |      
 |      Define a sequence with the same sequence of characters as ``u`` but
 |      with different metadata, positional metadata, and interval metadata:
 |      
 |      >>> v = Sequence('ACGA', metadata={'id': 'abc'},
 |      ...              positional_metadata={'quality':[1, 5, 3, 3]})
 |      >>> _ = v.interval_metadata.add([(0, 1)])
 |      
 |      The two sequences are not considered equal because their metadata,
 |      positional metadata, and interval metadata do not match:
 |      
 |      >>> u == v
 |      False
 |  
 |  __getitem__(self, indexable)
 |      Slice this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      This drops the ``self.interval_metadata`` from the returned
 |      new ``Sequence`` object.
 |      
 |      Parameters
 |      ----------
 |      indexable : int, slice, iterable (int and slice), 1D array_like (bool)
 |          The position(s) to return from this sequence. If `indexable` is an
 |          iterable of integers, these are assumed to be indices in the
 |          sequence to keep. If `indexable` is a 1D ``array_like`` of
 |          booleans, these are assumed to be the positions in the sequence to
 |          keep.
 |      
 |      Returns
 |      -------
 |      Sequence
 |          New sequence containing the position(s) specified by `indexable` in
 |          this sequence. Positional metadata will be sliced in the same
 |          manner and included in the returned sequence. `metadata` is
 |          included in the returned sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUCGUGAAGGA')
 |      
 |      Obtain a single character from the sequence:
 |      
 |      >>> s[1]
 |      Sequence
 |      -------------
 |      Stats:
 |          length: 1
 |      -------------
 |      0 G
 |      
 |      Obtain a slice:
 |      
 |      >>> s[7:]
 |      Sequence
 |      -------------
 |      Stats:
 |          length: 5
 |      -------------
 |      0 AAGGA
 |      
 |      Obtain characters at the following indices:
 |      
 |      >>> s[[3, 4, 7, 0, 3]]
 |      Sequence
 |      -------------
 |      Stats:
 |          length: 5
 |      -------------
 |      0 CGAGC
 |      
 |      Obtain characters at positions evaluating to `True`:
 |      
 |      >>> s = Sequence('GGUCG')
 |      >>> index = [True, False, True, 'a' is 'a', False]
 |      >>> s[index]
 |      Sequence
 |      -------------
 |      Stats:
 |          length: 3
 |      -------------
 |      0 GUC
 |  
 |  __init__(self, sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False)
 |      State: Stable as of 0.4.0.
 |  
 |  __iter__(self)
 |      Iterate over positions in this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Yields
 |      ------
 |      Sequence
 |          Single character subsequence, one for each position in the
 |          sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> for c in s:
 |      ...     str(c)
 |      'G'
 |      'G'
 |      'U'
 |      'C'
 |  
 |  __len__(self)
 |      Return the number of characters in this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Returns
 |      -------
 |      int
 |          The length of this sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> len(s)
 |      4
 |  
 |  __ne__(self, other)
 |      Determine if this sequence is not equal to another.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Sequences are not equal if they are not *exactly* the same type, or
 |      their sequence characters, metadata, or positional metadata differ.
 |      
 |      Parameters
 |      ----------
 |      other : Sequence
 |          Sequence to test for inequality against.
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether this sequence is not equal to `other`.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACGT')
 |      >>> t = Sequence('ACGT')
 |      >>> s != t
 |      False
 |      >>> u = Sequence('ACGA')
 |      >>> u != t
 |      True
 |      >>> v = Sequence('ACGA', metadata={'id': 'v'})
 |      >>> u != v
 |      True
 |  
 |  __repr__(self)
 |      Return a string representation of this sequence object.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Representation includes:
 |      
 |      * sequence type
 |      * metadata keys and values: will display key/value if it is an
 |        understood type, otherwise just the type will be displayed. If it is
 |        an understood type whose representation is too long, just the type
 |        will be displayed
 |      * positional metadata: column names and column dtypes will be displayed
 |        in the order they appear in the positional metadata ``pd.DataFrame``.
 |        Column names (i.e., keys) follow the same display rules as metadata
 |        keys
 |      * interval metadata: the number of interval features will be displayed.
 |      * sequence stats (e.g., length)
 |      * up to five lines of chunked sequence data. Each line of chunked
 |        sequence data displays the current position in the sequence
 |      
 |      Returns
 |      -------
 |      str
 |          String representation of this sequence object.
 |      
 |      Notes
 |      -----
 |      Subclasses can override Sequence._repr_stats to provide custom
 |      statistics.
 |      
 |      Examples
 |      --------
 |      Short sequence without metadata:
 |      
 |      >>> from skbio import Sequence
 |      >>> from skbio.metadata._interval import IntervalMetadata
 |      >>> Sequence('ACGTAATGGATACGTAATGCA')
 |      Sequence
 |      -------------------------
 |      Stats:
 |          length: 21
 |      -------------------------
 |      0 ACGTAATGGA TACGTAATGC A
 |      
 |      Longer sequence displays first two lines and last two lines:
 |      
 |      >>> Sequence('ACGT' * 100)
 |      Sequence
 |      ---------------------------------------------------------------------
 |      Stats:
 |          length: 400
 |      ---------------------------------------------------------------------
 |      0   ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
 |      60  ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
 |      ...
 |      300 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
 |      360 ACGTACGTAC GTACGTACGT ACGTACGTAC GTACGTACGT
 |      
 |      Sequence with metadata, positional metadata, and interval metadata:
 |      
 |      >>> metadata = {
 |      ...     'id': 'seq-id',
 |      ...     'description': 'description of the sequence, wrapping across '
 |      ...     'lines if it\'s too long',
 |      ...     'authors': ['Alice', 'Bob', 'Carol'],
 |      ...     'year': 2015,
 |      ...     'published': True
 |      ... }
 |      >>> positional_metadata = {
 |      ...     'quality': [3, 10, 11, 10],
 |      ...     'exons': [True, True, False, True]
 |      ... }
 |      >>> seq = Sequence('ACGT', metadata=metadata,
 |      ...          positional_metadata=positional_metadata)
 |      >>> _ = seq.interval_metadata.add([(0, 2)], metadata={'gene': 'sagA'})
 |      >>> seq
 |      Sequence
 |      ----------------------------------------------------------------------
 |      Metadata:
 |          'authors': <class 'list'>
 |          'description': "description of the sequence, wrapping across lines
 |                          if it's too long"
 |          'id': 'seq-id'
 |          'published': True
 |          'year': 2015
 |      Positional metadata:
 |          'exons': <dtype: bool>
 |          'quality': <dtype: int64>
 |      Interval metadata:
 |          1 interval feature
 |      Stats:
 |          length: 4
 |      ----------------------------------------------------------------------
 |      0 ACGT
 |  
 |  __reversed__(self)
 |      Iterate over positions in this sequence in reverse order.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Yields
 |      ------
 |      Sequence
 |          Single character subsequence, one for each position in the
 |          sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> for c in reversed(s):
 |      ...     str(c)
 |      'C'
 |      'U'
 |      'G'
 |      'G'
 |  
 |  __str__(self)
 |      Return sequence characters as a string.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Returns
 |      -------
 |      str
 |          Sequence characters as a string. No metadata or positional
 |          metadata will be included.
 |      
 |      See Also
 |      --------
 |      sequence
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUCGUAAAGGA', metadata={'id':'hello'})
 |      >>> str(s)
 |      'GGUCGUAAAGGA'
 |  
 |  count(self, subsequence, start=None, end=None)
 |      Count occurrences of a subsequence in this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      subsequence : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Subsequence to count occurrences of.
 |      start : int, optional
 |          The position at which to start counting (inclusive).
 |      end : int, optional
 |          The position at which to stop counting (exclusive).
 |      
 |      Returns
 |      -------
 |      int
 |          Number of occurrences of `subsequence` in this sequence.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If `subsequence` is of length 0.
 |      TypeError
 |          If `subsequence` is a ``Sequence`` object with a different type
 |          than this sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUCG')
 |      >>> s.count('G')
 |      3
 |      >>> s.count('GG')
 |      1
 |      >>> s.count('T')
 |      0
 |      >>> s.count('G', 2, 5)
 |      1
 |  
 |  distance(self, other, metric=None)
 |      Compute the distance to another sequence.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Sequence to compute the distance to. If `other` is a ``Sequence``
 |          object, it must be the same type as this sequence. Other input
 |          types will be converted into a ``Sequence`` object of the same type
 |          as this sequence.
 |      metric : function, optional
 |          Function used to compute the distance between this sequence and
 |          `other`. If ``None`` (the default), Hamming distance will be used
 |          (:func:`skbio.sequence.distance.hamming`). `metric` should take two
 |          ``skbio.Sequence`` objects and return a ``float``. The sequence
 |          objects passed to `metric` will be the same type as this sequence.
 |          See :mod:`skbio.sequence.distance` for other predefined metrics
 |          that can be supplied via `metric`.
 |      
 |      Returns
 |      -------
 |      float
 |          Distance between this sequence and `other` as defined by `metric`.
 |      
 |      Raises
 |      ------
 |      TypeError
 |          If `other` is a ``Sequence`` object with a different type than this
 |          sequence.
 |      
 |      See Also
 |      --------
 |      skbio.sequence.distance
 |      fraction_diff
 |      fraction_same
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> t = Sequence('AGUC')
 |      
 |      Compute Hamming distance (the default metric):
 |      
 |      >>> s.distance(t)
 |      0.25
 |      
 |      Use a custom metric:
 |      
 |      >>> def custom_metric(s1, s2): return 0.42
 |      >>> s.distance(t, custom_metric)
 |      0.42
 |  
 |  find_with_regex(self, regex, ignore=None)
 |      Generate slices for patterns matched by a regular expression.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      regex : str or regular expression object
 |          String to be compiled into a regular expression, or a pre-
 |          compiled regular expression object (e.g., from calling
 |          ``re.compile``).
 |      ignore : 1D array_like (bool) or iterable (slices or ints), optional
 |          Indicate the positions to ignore when matching.
 |      
 |      Yields
 |      ------
 |      slice
 |          Location where the regular expression matched.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('AATATACCGGTTATAA')
 |      >>> for match in s.find_with_regex('(TATA+)'):
 |      ...     match
 |      ...     str(s[match])
 |      slice(2, 6, None)
 |      'TATA'
 |      slice(11, 16, None)
 |      'TATAA'
 |  
 |  frequencies(self, chars=None, relative=False)
 |      Compute frequencies of characters in the sequence.
 |      
 |      State: Experimental as of 0.4.1.
 |      
 |      Parameters
 |      ----------
 |      chars : str or set of str, optional
 |          Characters to compute the frequencies of. May be a ``str``
 |          containing a single character or a ``set`` of single-character
 |          strings. If ``None``, frequencies will be computed for all
 |          characters present in the sequence.
 |      relative : bool, optional
 |          If ``True``, return the relative frequency of each character
 |          instead of its count. If `chars` is provided, relative frequencies
 |          will be computed with respect to the number of characters in the
 |          sequence, **not** the total count of characters observed in
 |          `chars`. Thus, the relative frequencies will not necessarily sum to
 |          1.0 if `chars` is provided.
 |      
 |      Returns
 |      -------
 |      dict
 |          Frequencies of characters in the sequence.
 |      
 |      Raises
 |      ------
 |      TypeError
 |          If `chars` is not a ``str`` or ``set`` of ``str``.
 |      ValueError
 |          If `chars` is not a single-character ``str`` or a ``set`` of
 |          single-character strings.
 |      ValueError
 |          If `chars` contains characters outside the allowable range of
 |          characters in a ``Sequence`` object.
 |      
 |      See Also
 |      --------
 |      kmer_frequencies
 |      iter_kmers
 |      
 |      Notes
 |      -----
 |      If the sequence is empty (i.e., length zero), ``relative=True``,
 |      **and** `chars` is provided, the relative frequency of each specified
 |      character will be ``np.nan``.
 |      
 |      If `chars` is not provided, this method is equivalent to, but faster
 |      than, ``seq.kmer_frequencies(k=1)``.
 |      
 |      If `chars` is not provided, it is equivalent to, but faster than,
 |      passing ``chars=seq.observed_chars``.
 |      
 |      Examples
 |      --------
 |      Compute character frequencies of a sequence:
 |      
 |      >>> from pprint import pprint
 |      >>> from skbio import Sequence
 |      >>> seq = Sequence('AGAAGACC')
 |      >>> freqs = seq.frequencies()
 |      >>> pprint(freqs) # using pprint to display dict in sorted order
 |      {'A': 4, 'C': 2, 'G': 2}
 |      
 |      Compute relative character frequencies:
 |      
 |      >>> freqs = seq.frequencies(relative=True)
 |      >>> pprint(freqs)
 |      {'A': 0.5, 'C': 0.25, 'G': 0.25}
 |      
 |      Compute relative frequencies of characters A, C, and T:
 |      
 |      >>> freqs = seq.frequencies(chars={'A', 'C', 'T'}, relative=True)
 |      >>> pprint(freqs)
 |      {'A': 0.5, 'C': 0.25, 'T': 0.0}
 |      
 |      Note that since character T is not in the sequence we receive a
 |      relative frequency of 0.0. The relative frequencies of A and C are
 |      relative to the number of characters in the sequence (8), **not** the
 |      number of A and C characters (4 + 2 = 6).
 |  
 |  index(self, subsequence, start=None, end=None)
 |      Find position where subsequence first occurs in the sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      subsequence : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Subsequence to search for in this sequence.
 |      start : int, optional
 |          The position at which to start searching (inclusive).
 |      end : int, optional
 |          The position at which to stop searching (exclusive).
 |      
 |      Returns
 |      -------
 |      int
 |          Position where `subsequence` first occurs in this sequence.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If `subsequence` is not present in this sequence.
 |      TypeError
 |          If `subsequence` is a ``Sequence`` object with a different type
 |          than this sequence.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACACGACGTT-')
 |      >>> s.index('ACG')
 |      2
 |  
 |  iter_contiguous(self, included, min_length=1, invert=False)
 |      Yield contiguous subsequences based on `included`.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      included : 1D array_like (bool) or iterable (slices or ints)
 |          `included` is transformed into a flat boolean vector where each
 |          position will either be included or skipped. All contiguous
 |          included positions will be yielded as a single region.
 |      min_length : int, optional
 |          The minimum length of a subsequence for it to be yielded.
 |          Default is 1.
 |      invert : bool, optional
 |          Whether to invert `included` such that it describes what should be
 |          skipped instead of included. Default is False.
 |      
 |      Yields
 |      ------
 |      Sequence
 |          Contiguous subsequence as indicated by `included`.
 |      
 |      Notes
 |      -----
 |      If slices provide adjacent ranges, then they will be considered the
 |      same contiguous subsequence.
 |      
 |      Examples
 |      --------
 |      Here we use `iter_contiguous` to find all of the contiguous ungapped
 |      sequences using a boolean vector derived from our DNA sequence.
 |      
 |      >>> from skbio import DNA
 |      >>> s = DNA('AAA--TT-CCCC-G-')
 |      >>> no_gaps = ~s.gaps()
 |      >>> for ungapped_subsequence in s.iter_contiguous(no_gaps,
 |      ...                                               min_length=2):
 |      ...     print(ungapped_subsequence)
 |      AAA
 |      TT
 |      CCCC
 |      
 |      Note how the last potential subsequence was skipped because it would
 |      have been smaller than our `min_length` which was set to 2.
 |      
 |      We can also use `iter_contiguous` on a generator of slices as is
 |      produced by `find_motifs` (and `find_with_regex`).
 |      
 |      >>> from skbio import Protein
 |      >>> s = Protein('ACDFNASANFTACGNPNRTESL')
 |      >>> for subseq in s.iter_contiguous(s.find_motifs('N-glycosylation')):
 |      ...     print(subseq)
 |      NASANFTA
 |      NRTE
 |      
 |      Note how the first subsequence contains two N-glycosylation sites. This
 |      happened because they were contiguous.
 |  
 |  iter_kmers(self, k, overlap=True)
 |      Generate kmers of length `k` from this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      k : int
 |          The kmer length.
 |      overlap : bool, optional
 |          Defines whether the kmers should be overlapping or not.
 |      
 |      Yields
 |      ------
 |      Sequence
 |          kmer of length `k` contained in this sequence.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If `k` is less than 1.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACACGACGTT')
 |      >>> for kmer in s.iter_kmers(4, overlap=False):
 |      ...     str(kmer)
 |      'ACAC'
 |      'GACG'
 |      >>> for kmer in s.iter_kmers(3, overlap=True):
 |      ...     str(kmer)
 |      'ACA'
 |      'CAC'
 |      'ACG'
 |      'CGA'
 |      'GAC'
 |      'ACG'
 |      'CGT'
 |      'GTT'
 |  
 |  kmer_frequencies(self, k, overlap=True, relative=False)
 |      Return counts of words of length `k` from this sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      k : int
 |          The word length.
 |      overlap : bool, optional
 |          Defines whether the kmers should be overlapping or not.
 |      relative : bool, optional
 |          If ``True``, return the relative frequency of each kmer instead of
 |          its count.
 |      
 |      Returns
 |      -------
 |      dict
 |          Frequencies of words of length `k` contained in this sequence.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If `k` is less than 1.
 |      
 |      Examples
 |      --------
 |      >>> from pprint import pprint
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACACATTTATTA')
 |      >>> freqs = s.kmer_frequencies(3, overlap=False)
 |      >>> pprint(freqs) # using pprint to display dict in sorted order
 |      {'ACA': 1, 'CAT': 1, 'TTA': 2}
 |      >>> freqs = s.kmer_frequencies(3, relative=True, overlap=False)
 |      >>> pprint(freqs)
 |      {'ACA': 0.25, 'CAT': 0.25, 'TTA': 0.5}
 |  
 |  lowercase(self, lowercase)
 |      Return a case-sensitive string representation of the sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      lowercase: str or boolean vector
 |          If lowercase is a boolean vector, it is used to set sequence
 |          characters to lowercase in the output string. True values in the
 |          boolean vector correspond to lowercase characters. If lowercase
 |          is a str, it is treated like a key into the positional metadata,
 |          pointing to a column which must be a boolean vector.
 |          That boolean vector is then used as described previously.
 |      
 |      Returns
 |      -------
 |      str
 |          String representation of sequence with specified characters set to
 |          lowercase.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('ACGT')
 |      >>> s.lowercase([True, True, False, False])
 |      'acGT'
 |      >>> s = Sequence('ACGT',
 |      ...              positional_metadata={
 |      ...                 'exons': [True, False, False, True]})
 |      >>> s.lowercase('exons')
 |      'aCGt'
 |      
 |      Constructor automatically populates a column in positional metadata
 |      when the ``lowercase`` keyword argument is provided with a column name:
 |      
 |      >>> s = Sequence('ACgt', lowercase='introns')
 |      >>> s.lowercase('introns')
 |      'ACgt'
 |      >>> s = Sequence('ACGT', lowercase='introns')
 |      >>> s.lowercase('introns')
 |      'ACGT'
 |  
 |  match_frequency(self, other, relative=False)
 |      Return count of positions that are the same between two sequences.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Sequence to compare to.
 |      relative : bool, optional
 |          If ``True``, return the relative frequency of matches instead of
 |          the count.
 |      
 |      Returns
 |      -------
 |      int or float
 |          Number of positions that are the same between the sequences. This
 |          will be an ``int`` if `relative` is ``False`` and a ``float``
 |          if `relative` is ``True``.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If the sequences are not the same length.
 |      TypeError
 |          If `other` is a ``Sequence`` object with a different type than this
 |          sequence.
 |      
 |      See Also
 |      --------
 |      mismatch_frequency
 |      matches
 |      mismatches
 |      distance
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> t = Sequence('AGUC')
 |      >>> s.match_frequency(t)
 |      3
 |      >>> s.match_frequency(t, relative=True)
 |      0.75
 |  
 |  matches(self, other)
 |      Find positions that match with another sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Sequence to compare to.
 |      
 |      Returns
 |      -------
 |      1D np.ndarray (bool)
 |          Boolean vector where ``True`` at position ``i`` indicates a match
 |          between the sequences at their positions ``i``.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If the sequences are not the same length.
 |      TypeError
 |          If `other` is a ``Sequence`` object with a different type than this
 |          sequence.
 |      
 |      See Also
 |      --------
 |      mismatches
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> t = Sequence('GAUU')
 |      >>> s.matches(t)
 |      array([ True, False,  True, False], dtype=bool)
 |  
 |  mismatch_frequency(self, other, relative=False)
 |      Return count of positions that differ between two sequences.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Sequence to compare to.
 |      relative : bool, optional
 |          If ``True``, return the relative frequency of mismatches instead of
 |          the count.
 |      
 |      Returns
 |      -------
 |      int or float
 |          Number of positions that differ between the sequences. This will be
 |          an ``int`` if `relative` is ``False`` and a ``float``
 |          if `relative` is ``True``.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If the sequences are not the same length.
 |      TypeError
 |          If `other` is a ``Sequence`` object with a different type than this
 |          sequence.
 |      
 |      See Also
 |      --------
 |      match_frequency
 |      matches
 |      mismatches
 |      distance
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> t = Sequence('AGUC')
 |      >>> s.mismatch_frequency(t)
 |      1
 |      >>> s.mismatch_frequency(t, relative=True)
 |      0.25
 |  
 |  mismatches(self, other)
 |      Find positions that do not match with another sequence.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : str, Sequence, or 1D np.ndarray (np.uint8 or '\|S1')
 |          Sequence to compare to.
 |      
 |      Returns
 |      -------
 |      1D np.ndarray (bool)
 |          Boolean vector where ``True`` at position ``i`` indicates a
 |          mismatch between the sequences at their positions ``i``.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If the sequences are not the same length.
 |      TypeError
 |          If `other` is a ``Sequence`` object with a different type than this
 |          sequence.
 |      
 |      See Also
 |      --------
 |      matches
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('GGUC')
 |      >>> t = Sequence('GAUU')
 |      >>> s.mismatches(t)
 |      array([False,  True, False,  True], dtype=bool)
 |  
 |  replace(self, where, character)
 |      Replace values in this sequence with a different character.
 |      
 |      State: Experimental as of 0.5.0.
 |      
 |      Parameters
 |      ----------
 |      where : 1D array_like (bool) or iterable (slices or ints) or str
 |          Indicates positions in the sequence to replace with `character`.
 |          Can be a boolean vector, an iterable of indices/slices, or a
 |          string that is a key in `positional_metadata` pointing to a
 |          boolean vector.
 |      character : str or bytes
 |          Character that will replace chosen items in this sequence.
 |      
 |      Returns
 |      -------
 |      Sequence
 |          Copy of this sequence, with chosen items replaced with chosen
 |          character. All metadata is retained.
 |      
 |      Examples
 |      --------
 |      Let's create and display a Sequence:
 |      
 |      >>> from skbio import Sequence
 |      >>> sequence = Sequence('GGTACCAACG')
 |      >>> str(sequence)
 |      'GGTACCAACG'
 |      
 |      Let's call ``replace`` on the Sequence using a boolean vector for
 |      ``where`` and assign it to a new variable:
 |      
 |      >>> seq = sequence.replace([False, False, False, True, False, False,
 |      ...                         True, True, False, False], '-')
 |      
 |      Let's take a look at the new Sequence:
 |      
 |      >>> str(seq)
 |      'GGT-CC--CG'
 |      
 |      Other types of input are accepted by the ``where`` parameter. Let's
 |      pass in a list of indices and slices that is equivalent to the boolean
 |      vector we used previously:
 |      
 |      >>> str(seq) == str(sequence.replace([3, slice(6, 8)], '-'))
 |      True
 |      
 |      ``where`` also accepts a boolean vector contained in
 |      ``Sequence.positional_metadata``:
 |      
 |      >>> sequence.positional_metadata = {'where':
 |      ...                                 [False, False, False, True, False,
 |      ...                                  False, True, True, False, False]}
 |      
 |      Let's pass in the key ``'where'`` and compare to ``seq``:
 |      
 |      >>> str(seq) == str(sequence.replace('where', '-'))
 |      True
 |  
 |  write(self, file, format='fasta', **kwargs)
 |      Write an instance of ``Sequence`` to a file.
 |      
 |      This is a convenience method for :func:`skbio.io.registry.write`.
 |      For more information about the I/O system in scikit-bio, please
 |      see :mod:`skbio.io`.
 |      
 |      Supported file formats include:
 |      
 |      - ``'fasta'`` (:mod:`skbio.io.format.fasta`)
 |      - ``'fastq'`` (:mod:`skbio.io.format.fastq`)
 |      - ``'genbank'`` (:mod:`skbio.io.format.genbank`)
 |      
 |      Parameters
 |      ----------
 |      file : openable (filepath, URL, filehandle, etc.)
 |          The location to write the given `format` into.  Something
 |          that is understood by :func:`skbio.io.util.open`. Filehandles
 |          are not automatically closed, it is the responsibility of the
 |          caller.
 |      format : str
 |          The format must be a registered format name with a writer for
 |          ``Sequence``.
 |          Default is `'fasta'`.
 |      kwargs : dict, optional
 |          Keyword arguments passed to :func:`skbio.io.registry.write`
 |          and the file format writer.
 |      
 |      See Also
 |      --------
 |      read
 |      skbio.io.registry.write
 |      skbio.io.util.open
 |      skbio.io.format.fasta
 |      skbio.io.format.fastq
 |      skbio.io.format.genbank
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  concat(sequences, how='strict') from abc.ABCMeta
 |      Concatenate an iterable of ``Sequence`` objects.
 |      
 |      State: Experimental as of 0.4.1.
 |      
 |      Parameters
 |      ----------
 |      sequences : iterable (Sequence)
 |          An iterable of ``Sequence`` objects or appropriate subclasses.
 |      how : {'strict', 'inner', 'outer'}, optional
 |          How to intersect the `positional_metadata` of the sequences.
 |          If 'strict': the `positional_metadata` must have the exact same
 |          columns; 'inner': an inner-join of the columns (only the shared set
 |          of columns are used); 'outer': an outer-join of the columns
 |          (all columns are used: missing values will be padded with NaN).
 |      
 |      Returns
 |      -------
 |      Sequence
 |          The returned sequence will be an instance of the class which
 |          called this class-method.
 |      
 |      Raises
 |      ------
 |      ValueError
 |          If `how` is not one of: 'strict', 'inner', or 'outer'.
 |      ValueError
 |          If `how` is 'strict' and the `positional_metadata` of each sequence
 |          does not have the same columns.
 |      TypeError
 |          If the sequences cannot be cast as the calling class.
 |      
 |      Notes
 |      -----
 |      The sequence-wide metadata (``Sequence.metadata``) is not retained
 |      during concatenation.
 |      
 |      Sequence objects can be cast to a different type only when the new
 |      type is an ancestor or child of the original type. Casting between
 |      sibling types is not allowed, e.g. ``DNA`` -> ``RNA`` is not
 |      allowed, but ``DNA`` -> ``Sequence`` or ``Sequence`` -> ``DNA``
 |      would be.
 |      
 |      Examples
 |      --------
 |      Concatenate two DNA sequences into a new DNA object:
 |      
 |      >>> from skbio import DNA, Sequence
 |      >>> s1 = DNA("ACGT")
 |      >>> s2 = DNA("GGAA")
 |      >>> DNA.concat([s1, s2])
 |      DNA
 |      --------------------------
 |      Stats:
 |          length: 8
 |          has gaps: False
 |          has degenerates: False
 |          has definites: True
 |          GC-content: 50.00%
 |      --------------------------
 |      0 ACGTGGAA
 |      
 |      Concatenate DNA sequences into a Sequence object (type coercion):
 |      
 |      >>> Sequence.concat([s1, s2])
 |      Sequence
 |      -------------
 |      Stats:
 |          length: 8
 |      -------------
 |      0 ACGTGGAA
 |      
 |      Positional metadata is conserved:
 |      
 |      >>> s1 = DNA('AcgT', lowercase='one')
 |      >>> s2 = DNA('GGaA', lowercase='one',
 |      ...          positional_metadata={'two': [1, 2, 3, 4]})
 |      >>> result = DNA.concat([s1, s2], how='outer')
 |      >>> result
 |      DNA
 |      ---------------------------
 |      Positional metadata:
 |          'one': <dtype: bool>
 |          'two': <dtype: float64>
 |      Stats:
 |          length: 8
 |          has gaps: False
 |          has degenerates: False
 |          has definites: True
 |          GC-content: 50.00%
 |      ---------------------------
 |      0 ACGTGGAA
 |      >>> result.positional_metadata
 |           one  two
 |      0  False  NaN
 |      1   True  NaN
 |      2   True  NaN
 |      3  False  NaN
 |      4  False  1.0
 |      5  False  2.0
 |      6   True  3.0
 |      7  False  4.0
 |  
 |  read(file, format=None, **kwargs) from abc.ABCMeta
 |      Create a new ``Sequence`` instance from a file.
 |      
 |      This is a convenience method for :func:`skbio.io.registry.read`. For
 |      more information about the I/O system in scikit-bio, please see
 |      :mod:`skbio.io`.
 |      
 |      Supported file formats include:
 |      
 |      - ``'fasta'`` (:mod:`skbio.io.format.fasta`)
 |      - ``'fastq'`` (:mod:`skbio.io.format.fastq`)
 |      - ``'qseq'`` (:mod:`skbio.io.format.qseq`)
 |      - ``'genbank'`` (:mod:`skbio.io.format.genbank`)
 |      
 |      Parameters
 |      ----------
 |      file : openable (filepath, URL, filehandle, etc.)
 |          The location to read the given `format`. Something that is
 |          understood by :func:`skbio.io.util.open`. Filehandles are not
 |          automatically closed, it is the responsibility of the caller.
 |      format : str, optional
 |          The format must be a format name with a reader for ``Sequence``.
 |          If a `format` is not provided or is None, it will attempt to
 |          guess the format.
 |      kwargs : dict, optional
 |          Keyword arguments passed to :func:`skbio.io.registry.read` and
 |          the file format reader for ``Sequence``.
 |      
 |      Returns
 |      -------
 |      Sequence
 |          A new instance.
 |      
 |      See Also
 |      --------
 |      write
 |      skbio.io.registry.read
 |      skbio.io.util.open
 |      skbio.io.format.fasta
 |      skbio.io.format.fastq
 |      skbio.io.format.qseq
 |      skbio.io.format.genbank
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __array_interface__
 |      Array interface for compatibility with numpy.
 |      
 |      This property allows a ``Sequence`` object to share its underlying data
 |      buffer (``Sequence.values``) with numpy. See [1]_ for more details.
 |      
 |      References
 |      ----------
 |      .. [1] http://docs.scipy.org/doc/numpy/reference/arrays.interface.html
 |      
 |      Examples
 |      --------
 |      >>> import numpy as np
 |      >>> from skbio import Sequence
 |      >>> seq = Sequence('ABC123')
 |      >>> np.asarray(seq) # doctest: +NORMALIZE_WHITESPACE
 |      array([b'A', b'B', b'C', b'1', b'2', b'3'],
 |            dtype='|S1')
 |  
 |  observed_chars
 |      Set of observed characters in the sequence.
 |      
 |      State: Experimental as of 0.4.1.
 |      
 |      Notes
 |      -----
 |      This property is not writeable.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('AACGAC')
 |      >>> s.observed_chars == {'G', 'A', 'C'}
 |      True
 |  
 |  values
 |      Array containing underlying sequence characters.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      This property is not writeable.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import Sequence
 |      >>> s = Sequence('AACGA')
 |      >>> s.values # doctest: +NORMALIZE_WHITESPACE
 |      array([b'A', b'A', b'C', b'G', b'A'],
 |            dtype='|S1')
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  __hash__ = None
 |  
 |  default_write_format = 'fasta'
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from skbio.metadata._mixin.MetadataMixin:
 |  
 |  has_metadata(self)
 |      Determine if the object has metadata.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      An object has metadata if its ``metadata`` dictionary is not empty
 |      (i.e., has at least one key-value pair).
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether the object has metadata.
 |      
 |      Examples
 |      --------
 |      .. note:: scikit-bio objects with metadata share a common interface for
 |         accessing and manipulating their metadata. The following examples
 |         use scikit-bio's ``Sequence`` class to demonstrate metadata
 |         behavior. These examples apply to all other scikit-bio objects
 |         storing metadata.
 |      
 |      >>> from skbio import Sequence
 |      >>> seq = Sequence('ACGT')
 |      >>> seq.has_metadata()
 |      False
 |      >>> seq = Sequence('ACGT', metadata={})
 |      >>> seq.has_metadata()
 |      False
 |      >>> seq = Sequence('ACGT', metadata={'id': 'seq-id'})
 |      >>> seq.has_metadata()
 |      True
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from skbio.metadata._mixin.MetadataMixin:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  metadata
 |      ``dict`` containing metadata which applies to the entire object.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      This property can be set and deleted. When setting new metadata a
 |      shallow copy of the dictionary is made.
 |      
 |      Examples
 |      --------
 |      .. note:: scikit-bio objects with metadata share a common interface for
 |         accessing and manipulating their metadata. The following examples
 |         use scikit-bio's ``Sequence`` class to demonstrate metadata
 |         behavior. These examples apply to all other scikit-bio objects
 |         storing metadata.
 |      
 |      Create a sequence with metadata:
 |      
 |      >>> from pprint import pprint
 |      >>> from skbio import Sequence
 |      >>> seq = Sequence('ACGT', metadata={'id': 'seq-id',
 |      ...                                  'description': 'seq description'})
 |      
 |      Retrieve metadata:
 |      
 |      >>> pprint(seq.metadata) # using pprint to display dict in sorted order
 |      {'description': 'seq description', 'id': 'seq-id'}
 |      
 |      Update metadata:
 |      
 |      >>> seq.metadata['id'] = 'new-id'
 |      >>> seq.metadata['pubmed'] = 12345
 |      >>> pprint(seq.metadata)
 |      {'description': 'seq description', 'id': 'new-id', 'pubmed': 12345}
 |      
 |      Set metadata:
 |      
 |      >>> seq.metadata = {'abc': 123}
 |      >>> seq.metadata
 |      {'abc': 123}
 |      
 |      Delete metadata:
 |      
 |      >>> seq.has_metadata()
 |      True
 |      >>> del seq.metadata
 |      >>> seq.metadata
 |      {}
 |      >>> seq.has_metadata()
 |      False
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from skbio.metadata._mixin.PositionalMetadataMixin:
 |  
 |  has_positional_metadata(self)
 |      Determine if the object has positional metadata.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      An object has positional metadata if its ``positional_metadata``
 |      ``pd.DataFrame`` has at least one column.
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether the object has positional metadata.
 |      
 |      Examples
 |      --------
 |      .. note:: scikit-bio objects with positional metadata share a common
 |         interface for accessing and manipulating their positional metadata.
 |         The following examples use scikit-bio's ``DNA`` class to demonstrate
 |         positional metadata behavior. These examples apply to all other
 |         scikit-bio objects storing positional metadata.
 |      
 |      >>> import pandas as pd
 |      >>> from skbio import DNA
 |      >>> seq = DNA('ACGT')
 |      >>> seq.has_positional_metadata()
 |      False
 |      >>> seq = DNA('ACGT', positional_metadata=pd.DataFrame(index=range(4)))
 |      >>> seq.has_positional_metadata()
 |      False
 |      >>> seq = DNA('ACGT', positional_metadata={'quality': range(4)})
 |      >>> seq.has_positional_metadata()
 |      True
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from skbio.metadata._mixin.PositionalMetadataMixin:
 |  
 |  positional_metadata
 |      ``pd.DataFrame`` containing metadata along an axis.
 |      
 |      State: Stable as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      This property can be set and deleted. When setting new positional
 |      metadata, a shallow copy is made and the ``pd.DataFrame`` index is set
 |      to ``pd.RangeIndex(start=0, stop=axis_len, step=1)``.
 |      
 |      Examples
 |      --------
 |      .. note:: scikit-bio objects with positional metadata share a common
 |         interface for accessing and manipulating their positional metadata.
 |         The following examples use scikit-bio's ``DNA`` class to demonstrate
 |         positional metadata behavior. These examples apply to all other
 |         scikit-bio objects storing positional metadata.
 |      
 |      Create a DNA sequence with positional metadata:
 |      
 |      >>> from skbio import DNA
 |      >>> seq = DNA(
 |      ...     'ACGT',
 |      ...     positional_metadata={'quality': [3, 3, 20, 11],
 |      ...                          'exons': [True, True, False, True]})
 |      >>> seq
 |      DNA
 |      -----------------------------
 |      Positional metadata:
 |          'exons': <dtype: bool>
 |          'quality': <dtype: int64>
 |      Stats:
 |          length: 4
 |          has gaps: False
 |          has degenerates: False
 |          has definites: True
 |          GC-content: 50.00%
 |      -----------------------------
 |      0 ACGT
 |      
 |      Retrieve positional metadata:
 |      
 |      >>> seq.positional_metadata
 |         exons  quality
 |      0   True        3
 |      1   True        3
 |      2  False       20
 |      3   True       11
 |      
 |      Update positional metadata:
 |      
 |      >>> seq.positional_metadata['gaps'] = seq.gaps()
 |      >>> seq.positional_metadata
 |         exons  quality   gaps
 |      0   True        3  False
 |      1   True        3  False
 |      2  False       20  False
 |      3   True       11  False
 |      
 |      Set positional metadata:
 |      
 |      >>> seq.positional_metadata = {'degenerates': seq.degenerates()}
 |      >>> seq.positional_metadata
 |        degenerates
 |      0       False
 |      1       False
 |      2       False
 |      3       False
 |      
 |      Delete positional metadata:
 |      
 |      >>> seq.has_positional_metadata()
 |      True
 |      >>> del seq.positional_metadata
 |      >>> seq.positional_metadata
 |      Empty DataFrame
 |      Columns: []
 |      Index: [0, 1, 2, 3]
 |      >>> seq.has_positional_metadata()
 |      False
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from skbio.metadata._mixin.IntervalMetadataMixin:
 |  
 |  has_interval_metadata(self)
 |      Determine if the object has interval metadata.
 |      
 |      State: Experimental as of 0.5.1.
 |      
 |      An object has interval metadata if its ``interval_metadata``
 |      has at least one ```Interval`` objects.
 |      
 |      Returns
 |      -------
 |      bool
 |          Indicates whether the object has interval metadata.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from skbio.metadata._mixin.IntervalMetadataMixin:
 |  
 |  interval_metadata
 |      ``IntervalMetadata`` object containing info about interval features.
 |      
 |      State: Experimental as of 0.5.1.
 |      
 |      Notes
 |      -----
 |      This property can be set and deleted. When setting new
 |      interval metadata, a shallow copy of the ``IntervalMetadata``
 |      object is made.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods inherited from collections.abc.Reversible:
 |  
 |  __subclasshook__(C) from abc.ABCMeta
 |      Abstract classes can override this to customize issubclass().
 |      
 |      This is invoked early on by abc.ABCMeta.__subclasscheck__().
 |      It should return True, False or NotImplemented.  If it returns
 |      NotImplemented, the normal algorithm is used.  Otherwise, it
 |      overrides the normal algorithm (and the outcome is cached).

>>> 
seqs[0].metadata
Out[62]:
{'description': 'Putative evasin 1 OS=Amblyomma triste PE=2 SV=1',
 'id': 'tr|A0A023G8N3|A0A023G8N3_9ACAR'}
>>> 
# A numpy array of each character in the sequence
seqs[0].values
Out[63]:
array([b'M', b'T', b'T', b'T', b'T', b'S', b'A', b'T', b'L', b'I', b'F',
       b'L', b'L', b'Y', b'V', b'S', b'Q', b'L', b'L', b'V', b'G', b'F',
       b'S', b'R', b'N', b'E', b'A', b'S', b'D', b'P', b'P', b'Q', b'T',
       b'D', b'E', b'D', b'C', b'E', b'Y', b'Y', b'D', b'P', b'A', b'V',
       b'D', b'N', b'I', b'T', b'C', b'T', b'I', b'Q', b'S', b'L', b'N',
       b'T', b'T', b'G', b'D', b'P', b'I', b'P', b'V', b'G', b'C', b'L',
       b'A', b'T', b'C', b'E', b'N', b'S', b'T', b'R', b'Y', b'L', b'P',
       b'N', b'G', b'T', b'E', b'C', b'L', b'G', b'I', b'S', b'Q', b'H',
       b'V', b'A', b'N', b'R', b'M', b'Q', b'G', b'N', b'V', b'N', b'Y',
       b'T', b'C', b'P', b'V', b'G', b'L', b'C', b'Y', b'R', b'G', b'V',
       b'C', b'Q', b'R', b'N', b'G', b'L', b'G', b'I', b'D', b'C', b'W',
       b'H', b'D', b'T', b'P', b'P', b'P', b'N', b'S', b'T', b'N', b'V',
       b'T', b'T', b'K', b'A', b'P', b'T', b'T', b'L', b'T', b'S', b'G',
       b'R', b'D', b'L'],
      dtype='|S1')
>>> 
# Sequence implements a magic __str__ method that allows it to be converted to a string representation
str(seqs[0])
Out[64]:
'MTTTTSATLIFLLYVSQLLVGFSRNEASDPPQTDEDCEYYDPAVDNITCTIQSLNTTGDPIPVGCLATCENSTRYLPNGTECLGISQHVANRMQGNVNYTCPVGLCYRGVCQRNGLGIDCWHDTPPPNSTNVTTKAPTTLTSGRDL'

It's not uncommon for FASTA sequences of protein to include 'X' residues when a residue type is undetermined.

Let's see if there are any 'X' residues in sequences of our set.

>>> 
x_seqs = []
for seq in seqs:
    if seq.count('X') >= 1:
        x_seqs.append(seq)
        
print(', '.join([s.metadata['id'] for s in x_seqs]))

print("Total sequences:", len(seqs))
print("Sequences with X residues", len(x_seqs))
tr|C9W1Q7|C9W1Q7_RHISA, tr|L7LT68|L7LT68_9ACAR
Total sequences: 132
Sequences with X residues 2

Challenge

Write a function enough_cys that will return a list of sequences with eight or more cysteine residues. (Hint: A cyteine residue is represented by a 'C' character).

Once the function is written, we should be able to type:

cys_seqs = enough_cys(seqs, 8)

and get a list of sequences with 8+ cysteine residues.

We can also write a set of sequences back to a file

>>> 
with open('evasins_8cys.fasta', 'w') as outfile:
    for seq in cys_seqs:
        # The Sequence object from scikit-bio implements a 'write' method
        # that writes the sequence in FASTA format to an open file handle
        seq.write(outfile)

# Let's reassign the variable 'seqs' to point to our set with 8+ cysteines
seqs = cys_seqs

Plotting with Plotly

>>> 
!pip install plotly==2.2.1
Collecting plotly==2.2.1
  Downloading plotly-2.2.1.tar.gz (1.1MB)
    100% |████████████████████████████████| 1.1MB 946kB/s ta 0:00:01
Requirement already satisfied: decorator>=4.0.6 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from plotly==2.2.1)
Requirement already satisfied: nbformat>=4.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from plotly==2.2.1)
Requirement already satisfied: pytz in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from plotly==2.2.1)
Requirement already satisfied: requests in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from plotly==2.2.1)
Requirement already satisfied: six in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from plotly==2.2.1)
Requirement already satisfied: traitlets>=4.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2->plotly==2.2.1)
Requirement already satisfied: jupyter-core in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2->plotly==2.2.1)
Requirement already satisfied: ipython-genutils in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2->plotly==2.2.1)
Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from nbformat>=4.2->plotly==2.2.1)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->plotly==2.2.1)
Requirement already satisfied: idna<2.7,>=2.5 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->plotly==2.2.1)
Requirement already satisfied: certifi>=2017.4.17 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->plotly==2.2.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /Users/perry/.virtualenvs/jupyter/lib/python3.6/site-packages (from requests->plotly==2.2.1)
Building wheels for collected packages: plotly
  Running setup.py bdist_wheel for plotly ... done
  Stored in directory: /Users/perry/Library/Caches/pip/wheels/cc/87/3f/6a282eb21da5d8223472bed40ee023cdcf2e9691b117969a4c
Successfully built plotly
Installing collected packages: plotly
Successfully installed plotly-2.2.1
>>> 
# By importing Plotly this way we can avoid 'logging in' to use it
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
>>> 
# We need to initialise offline plotly for the notebook so our charts display in-line
init_notebook_mode(connected=True)

A histogram of sequence lengths

Plotly histograms: https://plot.ly/python/histograms/

We'd like to be able to visualize the distribution of sequence lengths in our evasin sequence set.

Challenge

Write a function to return a list of lengths, given a list of sequences.

eg, we want to be able to write:

lengths = get_lengths(sequences)
>>> 
lengths = get_lengths(seqs)
lengths[0:5]
Out[72]:
[146, 147, 122, 205, 146]
>>> 
bins = dict(size=1)
# bins = dict(start=0.5, end=300.5, size=1)
data = [go.Histogram(x=lengths, xbins=bins)]

iplot(data, filename='Evasin sequence length distribution')

The canonical evasins are ~120 - 140 residues long. Our set contains some sequences past 200 residues in length. This might warrant further investigation (but not today !).

Let's remove all the sequences longer than 300 residues.

>>> 
long_seqs = [seq.metadata['description'] for seq in seqs if len(seq) > 300]
print(long_seqs)
['Uncharacterized protein OS=Phytomonas sp. isolate EM1 GN=GSEM1_T00005925001 PE=4 SV=1']
>>> 
sane_seqs = [seq for seq in seqs if len(seq) <= 300]
len(sane_seqs)
Out[75]:
128
>>> 
bins = dict(size=1)
data = [go.Histogram(x=get_lengths(sane_seqs), xbins=bins)]

iplot(data, filename='Evasin sequence length distribution')