In this series, so far, we have looked at protein structure data and introduced a bit of python code to analyse it. The code below loads a protein structure and allows the user the extract the protein’s amino acid sequence.
from Bio.PDB import * p = PDBParser() structure = p.get_structure('A', '1hv4.pdb') amino_acid_list =  model = structure chain = model['A'] for residue in chain: amino_acid_list.append(residue.resname) threeLC = ['ALA','CYS','ASP','GLU','PHE','GLY','HIS','ILE','LYS','LEU','MET','ASN','PRO','GLN','ARG','SER','THR','VAL','TRP','TYR'] oneLC = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'] def three2one(aminoacid): try: index = threeLC.index(aminoacid) sL = oneLC[index] except: sL = "X" return sL sequence =  for i in amino_acid_list: OneLetter = three2one(i) sequence.append(OneLetter) finalSequence = "".join(sequence) print(finalSequence)
In this post we will try to extract more information from the sequence that we have already obtained from the structure. But before we can do that, it is important to have some background on the subject matter we will cover today, which relates to “motifs”.
Protein Sequence Motifs
In the context of this work, a motif is a pattern of amino acids which appears in a sequence and hence a structure of a protein. For instance if we take the example of a sequence:
an example of a motif can be
which appears in the sequence (text in bold).
A motif does not always have to comprise a continuous stretch of amino acids. For instance a motif can be two stretches of two amino acids with any number of amino acids in between. For instance an example of such a motif can be “AA” followed by any number of amino acids and ending with “EE”. This hypothetical motif also occurs in the above sequence:
where the “AA” and “EE” amino acids are shown in bold with in-between amino acids represented as italic. In short a motif can be any pattern of any length with any number of amino acids in it.
Importance of motifs
A logical question at this point would be that why should we be interested in finding this arcane pattern in the first place. What value do a few letters hold? The answer to that question is simple. If you recall one of the earlier posts in this series which talked about the primary, secondary and tertiary levels of protein structures, I noted that a sequence of amino acids folds up in three-dimensional space where amino acids present at different points in the sequence can come together in space as the structure folds. Some of these amino acids can come together to line potent pockets on the surface of the protein. Catalytic reactions can take place within these pockets. Other amino acids can come together to form interactions which might be necessary for the structural stability of the protein. Either way, detecting these “key” amino acids in the sequence of the protein can play a crucial role in helping us identify the function of protein if it is experimentally not known.
In today’s exercise we will construct a program which, given a sequence of amino acids and a motif, can detect if the motif is present in the amino acid sequence and where. Yes, some of you may think why we need to write a program for this when you can just use the Ctrl + F key combination and search for our string? The answer to that is this program will save you from pressing Ctrl + F on a file containing 10s or 100s or even 1000s of sequences and manually detecting and enumerating motifs. In short, this program will help you automate your search.
Making a sequence motif finder
In the last line of the code provided above, the variable
now holds a string of letters, the amino acid sequence acquired from the protein’s structures. [Please note that while the code is self-sustaining it does require input files. To ensure that the code runs correctly please see earlier posts on how to set up the environment correctly]. It is within this sequence that we have to look for our patterns, i.e. motifs. In essence what we have to do is to develop a search function, or rather use the existing search methods with our data. So let’s start there.
We can think of a hypothetical motif “VVV” and we will write a few lines of code that will search for this motif in the sequence we have just found.
The simplest way to search for the presence of a (continuous) pattern in a string is to use the find() command. In our case we can use it like this:
This returns a number 106. This number represents at what position does the pattern ‘VVV’ start in our string. So what we can then do is pass this number as an index back to the string like this:
This prints the letter ‘V’, which is the first letter of our pattern. To get the remaining two letters we need to tell python to give us the region between “a starting” index and “an ending” index. If the starting index (position of first ‘V’ in the sequence that matches our pattern) is 106, then the ending should be at 108. Right? So we need to extract the slice of the string that starts at 106 and ends at 108 (as our pattern is just three letters). To do this we can do the following:
finalSequence[finalSequence.find('VVV'):finalSequence.find('VVV') + 2]
but this just returns “VV” and not “VVV”. The reason for this is that python looks up-till the last index but does not include the last index (indexing in python start from “0”). Therefore to get the full three letter pattern we need to do this:
finalSequence[finalSequence.find('VVV'):finalSequence.find('VVV') + 3]
So is that it? That was easy right? Yes. But this will not work in every case. Yes, this is “a” solution for searching motifs when they are continuous, but what about the hypothetical motif that we made earlier. “AA” and “EE” with any number of amino acids in between. This is where things get interesting. You can still make use of the find() function, look for all “AA”s and all “EE”s and then look for them when pairs of “AA”s and “EE”s appear in the correct order. This however is not called automated and appears to be a very tiring process. This is where “regular expression” comes in. It is outside the scope of this work to discuss regular expressions but I am including a cheat sheet which should help you people get started.
While there may exist several ways to construct and use regular expression, we will use the regular expression library in python. So let’s get started on that.
Regular Expression in Python
The first thing to do here is to load the relevant library which we do using:
The first thing that I would do at this stage is to reproduce what we did with find() to see if minimum capabilities (just looking for three continuous letters) match. To do this we will have to add the following code:
import re re.search('VVV',finalSequence).span()
which will return (106, 109) – which if you recall are the starting and ending indices we used in the find() example to recover our pattern. This demonstrates that we can now use “re” for matching other kinds of patterns. So now we will work with another hypothetical pattern which would be “VV” followed by any number of amino acids until it comes to the next “V”. Any number of amino acids can also be “0”. For beginners it can be a bit difficult to construct a regular expression of any kind, but for this one instance I can demonstrate in some detail. So quickly looking at the cheat sheet we see that the expression “.*” can be used. Let’s see how this behaves. Oh and we also have to use a different command, findall instead of search.
import re re.findall(r'VV.*V',finalSequence)
The above returns a string which is not accurate, well at least not in the sense that we want it to be. The reason for this is that the algorithm is “greedy”, a term that I will explore in another post on a different day. For now, we need to follow the pattern “.*” with a question mark, like this “.*?”, which will ensure that we come up with the results that we want. Using the following code:
import re re.findall(r'VV.*?V',finalSequence)
which is what we want to see. Notice that when we said “.*” we are actually saying that between “VV” and the last “V” we can have any number of amino acids and it can also be “0”. This is why we see “VVV” in the result. A manual search will demonstrate this as the only two solutions which start with a “VV” and end with the first available “V”. You may now have noticed that while we do find the pattern, we do not know where it occurs in the sequence. This can easily be addressed by reverse lookup. What that means is that now you know exactly what the pattern contains, e.g. when you started with “VV” and “V” you didn’t know the amino acids which were in the middle, but after the above code, you know that it will be “VVAALV”, you can use the search function. We will also have to use the for loop to automate this lookup as you don’t want a manual step and also because there are multiple results returned by the findall command. So here is what the code will look like:
import re results = re.findall(r'VV.*?V',finalSequence) for i in results: print(i + ' :: ',re.search(i,finalSequence).span())
The above code will print out
VVAALV :: (61, 67) VVV :: (106, 109)
The final code (from top to bottom) would look something like this:
from Bio.PDB import * import re p = PDBParser() structure = p.get_structure('A', '1hv4.pdb') amino_acid_list =  model = structure chain = model['A'] for residue in chain: amino_acid_list.append(residue.resname) threeLC = ['ALA','CYS','ASP','GLU','PHE','GLY','HIS','ILE','LYS','LEU','MET','ASN','PRO','GLN','ARG','SER','THR','VAL','TRP','TYR'] oneLC = ['A','C','D','E','F','G','H','I','K','L','M','N','P','Q','R','S','T','V','W','Y'] def three2one(aminoacid): try: index = threeLC.index(aminoacid) sL = oneLC[index] except: sL = "X" return sL sequence =  for i in amino_acid_list: OneLetter = three2one(i) sequence.append(OneLetter) finalSequence ="".join(sequence) print(finalSequence) results = re.findall(r'VV.*?V',finalSequence) for i in results: print(i + ' :: ',re.search(i,finalSequence).span())
Now we are at a point where using the cheat sheet and the content of this post you can design an expression to search for your pattern of interest.
In this post, we built on the sequence we extracted from the structure in our previous post. We were able to add a few lines of code to our existing program to identify patterns (motifs) in the sequence. The patterns can lead to important predictions about the stability and function of the protein. In order to search for motifs we looked at regular expressions and how to use them in python and make use of it in a way that any motif can be searched for. While this post is not supposed to give insight into building regular expressions, the cheat sheet should help and plenty of other resources exist on the internet where this information can become available to you. If you are still stuck, feel free to contact IDRACK and I will help out.
In the future posts we will continue to build on our understanding of protein structures and analyse data in other ways which can lead to useful insights. So keep following this series.