The Protein Structure Series: Post – V

By | December 14, 2019

From last time

In the last post, we moved from the theory of proteins, especially their 3D structures, to actually handling those structures with Python in a Jupyter notebook. If you are new here, please have a look at the preceding posts to catch up; I will assume you have read post “IV” of this series. This post builds on where we left things last time. The goal for this post is to:

  • write a function in python and use it to write out the sequence of a protein in a recognisable format

So let’s get started.

Before we start, in case you still haven’t read the previous post: we used Biopython within a Jupyter notebook to load the structure of a haemoglobin molecule from the file 1hv4.pdb. To do that you will need to run the following code:

from Bio.PDB import PDBParser
p = PDBParser()
structure = p.get_structure('A', '1hv4.pdb')
amino_acid_list = []
model = structure[0]
chain = model['A']
for residue in chain:
    amino_acid_list.append(residue.resname)
print(amino_acid_list)
print(len(amino_acid_list))

The output from the code above is a list of the three letter codes of the residues in the chain, followed by the number of entries in that list (142).

The goal is to convert the content of “amino_acid_list” to single letter amino acid codes. For instance, Glycine, whose three letter code is “GLY”, should be represented by the one letter code “G”, and likewise for all other amino acids. To this end, we need to make a “look up” which takes the three letter code and converts it to a single letter code.

Functions in Python

Before we write our first function, some basics:

  • A function is a piece of code which contains some instructions which have to be repeated again and again. So instead of writing the same statements we call the function.
  • The scope of the variables in the function is limited to the function itself.
  • We need to use the “def” keyword to write a function.
  • To return something calculated within the function we need to extract it from the function using “return”.
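As a small illustration of the last three points, here is a sketch using a made-up function called “add_one” (the name and the numbers are only for demonstration). The variable “result” exists only inside the function, and the caller receives its value through “return”.

```python
def add_one(number):
    # 'result' is local: it only exists while the function is running
    result = number + 1
    # hand the calculated value back to whoever called the function
    return result

answer = add_one(41)
print(answer)

# Note: 'result' is not visible out here. Uncommenting the next line
# would stop the program with a NameError.
# print(result)
```

The value is only available outside the function because it was returned and then stored in “answer”.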

My first function in Python


# Here is an example of a function which takes
# the temperature in degree C (Celsius) and
# converts it into degree F (Fahrenheit)

def TempC2F(value):
    F = ((9/5) * value) + 32
    return F

The above is a very simple example of a function which converts a temperature from one unit to another. The “def” keyword is followed by the name of the function, “TempC2F”, which takes one argument, “value”. Inside the body of the function a very simple mathematical relationship converts the temperature from Celsius to Fahrenheit. The return statement then provides the output.

To use the function, we simply call it:


C2F = TempC2F(100)
print(C2F)

This should print 212.0, which is 100 degrees Celsius expressed in degrees Fahrenheit. (In Python 3 the division 9/5 gives a floating point number, which is why the result is shown with a decimal point.)

This example has only one line of calculation and therefore may not appear to be very useful as a function. However, functions are often much longer, and instead of repeating many lines every time they are needed, putting them together as a function is far more elegant (and compute friendly, but more on that some other day).
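To make the point about repetition concrete, here is a small sketch that reuses the same function for several temperatures. The list “readings_C” and its values are made up for this illustration.

```python
def TempC2F(value):
    # same conversion as above: degree Celsius to degree Fahrenheit
    F = ((9/5) * value) + 32
    return F

# made-up example temperatures in degree Celsius
readings_C = [-40, 0, 100]

readings_F = []
for c in readings_C:
    # the conversion is written once and called as often as needed
    readings_F.append(TempC2F(c))

print(readings_F)
```

If the formula ever needs to change, it only has to be changed in one place.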

So now that we know what functions are and how to write one, let’s go back to the problem that we started from.

We have a list of three letter amino acid codes. We want to convert those three letter codes to single letter codes. So let’s break this problem down. We know which three letter code corresponds to which one letter code. So this becomes a look up problem.

Alanine Ala A
Cysteine Cys C
Aspartic acid Asp D
Glutamic acid Glu E
Phenylalanine Phe F
Glycine Gly G
Histidine His H
Isoleucine Ile I
Lysine Lys K
Leucine Leu L
Methionine Met M
Asparagine Asn N
Proline Pro P
Glutamine Gln Q
Arginine Arg R
Serine Ser S
Threonine Thr T
Valine Val V
Tryptophan Trp W
Tyrosine Tyr Y

We will make use of two lists. One containing three letter codes (2nd column) and the other containing single letter codes (3rd column). The lists will look like this:


threeLC = ['ALA','CYS','ASP','GLU','PHE','GLY','HIS','ILE','LYS','LEU','MET','ASN','PRO','GLN','ARG','SER','THR','VAL','TRP','TYR']
oneLC = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']

Note that the amino acids are listed in the same order in both lists. We will use this to our advantage. Our function will have the following steps:

  1. Use a “for” loop to iterate over each amino acid in the sequence
  2. For each amino acid, look up its position in “threeLC” and save the index
  3. Use the index saved in step 2 to pick up the one letter code from “oneLC”

Let’s start by translating steps 2 and 3 into a function.

def three2one(aminoacid):
    index = threeLC.index(aminoacid)
    sL = oneLC[index]
    return sL

Looks simple enough, right? It is. The first line uses the “def” keyword to declare the function, which we name “three2one” and which takes a three letter amino acid code, “aminoacid”. “threeLC” is a list, and lists have a method called “index” which returns the position of the item you search for. The same index is then used with “oneLC” to get the single letter code. Remember that we put the amino acids in the same order in both lists, so the index is conserved across both. Finally, the single letter code is returned. To use this function one simply does:

OneLetter = three2one('GLU')
print(OneLetter)

This displays “E”. OK, so this solves the conversion of one amino acid, but our list has 142 entries. So now comes the for loop part, which will use the function to generate the single letter code for every entry in the list.

This would now look like:

for i in amino_acid_list:
    OneLetter = three2one(i)
    print(OneLetter)

This code iterates over the list and passes each three letter code to the function three2one, printing one single letter code per line. However, it stops with an error (a ValueError) before it finishes. The error occurs because “HEM”, the haem group, is not in our list of 20 standard amino acids, so the “index” method cannot find it. Our simple function lives in an ideal world where it is only ever asked to convert one of the 20 standard amino acids. Since this is not the case, we need to modify our function slightly to make it more robust. To do this we will make use of the “try” and “except” statements, which let us decide what should happen when the lookup fails. We can use the following code:

Error handling in Python

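def three2one(aminoacid):
    try:
        # look up the position of the three letter code
        index = threeLC.index(aminoacid)
        sL = oneLC[index]
    except ValueError:
        # not one of the 20 standard amino acids
        sL = "X"
    return sL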

The code inside the try block runs first. If it finishes without an error, the except block is skipped; if the lookup raises an error, the except block runs instead. For now this can be thought of as being similar to an if/else statement. So if something is requested which isn’t in the list, the function returns an “X” in its place, which in certain cases can be thought of as “unidentified”.
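To see what try/except is guarding against, this sketch triggers the error on purpose. It uses shortened, two-entry versions of the lookup lists (shortened only for this illustration). A list’s “index” method raises a ValueError when the item is not found, so that is the error being caught here.

```python
# shortened versions of the lookup lists, for illustration only
threeLC = ['ALA', 'GLY']
oneLC = ['A', 'G']

try:
    index = threeLC.index('HEM')   # 'HEM' is not in the list
    sL = oneLC[index]              # this line is never reached
except ValueError as err:
    # 'err' holds the message Python would otherwise have stopped with
    print("Python raised a ValueError:", err)
    sL = "X"

print(sL)
```

Catching the specific error type, rather than every possible error, means that unrelated mistakes such as a misspelt variable name are still reported.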

The code now prints out everything, including an X at the end. However, this is still not what we wanted. We want a sequence of letters. To do this we will create a new list and each result of the function will be appended to that list. The code to do that after updating your function with try/except looks like this:

sequence = []
for i in amino_acid_list:
    OneLetter = three2one(i)
    sequence.append(OneLetter)
print(sequence)

The answer is now a list of single letter amino acid codes, similar to the earlier list where we saw three letter codes. Finally, we will use the “join” method of strings, with an empty string as the separator, to get our final answer. The code will look like this:

sequence = []
for i in amino_acid_list:
    OneLetter = three2one(i)
    sequence.append(OneLetter)
print("".join(sequence))

This now produces a result in the form in which we see sequence data in protein sequence databases. So in this exercise we went from handling a protein structure to writing functions and using one to get the sequence of amino acids making up the protein structure.
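The same lookup can also be written with a Python dictionary instead of two parallel lists. This is an alternative sketch, not the approach used in this post. The names “three_to_one_map” and “residues_to_sequence” are made up here, and the short list “example_residues” stands in for “amino_acid_list” so that the example runs without a PDB file.

```python
threeLC = ['ALA','CYS','ASP','GLU','PHE','GLY','HIS','ILE','LYS','LEU',
           'MET','ASN','PRO','GLN','ARG','SER','THR','VAL','TRP','TYR']
oneLC = ['A','C','D','E','F','G','H','I','K','L',
         'M','N','P','Q','R','S','T','V','W','Y']

# pair up the two lists into a dictionary: {'ALA': 'A', 'CYS': 'C', ...}
three_to_one_map = dict(zip(threeLC, oneLC))

def residues_to_sequence(residue_names):
    # .get() returns "X" for anything which is not in the dictionary
    letters = [three_to_one_map.get(name, "X") for name in residue_names]
    return "".join(letters)

# stand-in for amino_acid_list, for illustration only
example_residues = ['VAL', 'LEU', 'SER', 'HEM']
print(residues_to_sequence(example_residues))   # prints VLSX
```

With a dictionary the order of the entries no longer matters, and the “X” fallback is handled by “get” without a try/except.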

Final Code

The final code looks like this:

from Bio.PDB import PDBParser
p = PDBParser()
structure = p.get_structure('A', '1hv4.pdb')
amino_acid_list = []
model = structure[0]
chain = model['A'] 
for residue in chain:
    amino_acid_list.append(residue.resname)
threeLC = ['ALA','CYS','ASP','GLU','PHE','GLY','HIS','ILE','LYS','LEU','MET','ASN','PRO','GLN','ARG','SER','THR','VAL','TRP','TYR']
oneLC = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y']
def three2one(aminoacid):
    try:
        index = threeLC.index(aminoacid)
        sL = oneLC[index]
    except ValueError:
        sL = "X"
    return sL
sequence = []
for i in amino_acid_list:
    OneLetter = three2one(i)
    sequence.append(OneLetter)
finalSequence = "".join(sequence)
print(finalSequence)
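If the sequence should end up in a file that other tools can read, one common choice is the FASTA format: a header line starting with “>”, followed by the sequence split over lines of a fixed width. Below is a minimal sketch. The helper “to_fasta”, the header text “1hv4_A” and the line width of 60 are assumptions made for this example, and the placeholder string simply repeats the 20 one letter codes in place of the real “finalSequence”.

```python
def to_fasta(header, sequence, width=60):
    # cut the sequence into chunks of at most 'width' letters
    lines = []
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    # first line is the header, then one chunk per line
    return ">" + header + "\n" + "\n".join(lines) + "\n"

# placeholder sequence for illustration only (not the real haemoglobin chain)
placeholder_sequence = "ACDEFGHIKLMNPQRSTVWY" * 4

record = to_fasta("1hv4_A", placeholder_sequence)
print(record)
```

In the notebook, “finalSequence” would be passed in place of the placeholder, and the returned text could then be written to a file.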

Concluding remarks:

In this post we learned about

  • writing functions
  • error handling using try/except statements
  • extracting a sequence from a protein structure
  • using a function to convert that sequence into a format which is recognised by databases

At this point you may still be wondering whether we have moved very far from where we started. The sequence is already known, so why can’t we just copy it? The answer is that most of the time you will be handling more than just one structure, and you will need to automate your workflow. Moreover, this series is about learning how to handle data using programming, in a more or less automated way. Now that we have the sequence from a protein structure, we can look at doing other things with our protein. In the next post we will look into some of those exciting things.

 
