Just to refresh for myself and all of you, what we have been looking at so far in the Protein Structure Series is understanding protein structures. While many text books already cover a lot of the basic stuff, these posts are especially structured for aspiring researchers who want to start working with protein structures but usually get lost in all the jargon. Having set some ground work earlier, I will now move towards handling protein structure data using computer programs and in doing so find interesting things about protein structures.
To handle protein structures in a programming environment you have to get a few things installed on your machines. Please follow instructions from here and here on installing Jupyter Notebook with Python and on installing Biopython.
At the end of this post you will be able to:
- Load a structure for processing (using the Jupyter Notebook)
- Import relevant libraries (modules) and be able to do some basic processing of your protein structures
- Understand some basic things which are wrong with protein structures and require careful handling
- Run a very simple script which will give you some basic information about your structure
A lot of people who I have talked to are interested in more advanced stuff. I just wanted to make a note here, that I understand that for some users this might be very basic stuff but I think it is very important to get the basics rights. More advanced stuff will come along soon.
Step 0: Getting the required data
So let’s start with taking this one step at a time.
The first thing you will need is the structure of your protein. This can sometime prove quite challenging. For instance, it is not always the case that you can directly look up the structure for a protein. You can have any number of things known about your protein. For instance you may only know the name of the gene that makes the protein that you are interested in, or you may know the function of the protein (but no gene/sequence/structure identifiers) or you know the sequence of the protein or any combination of factors.
In my works, its rare that I don’t have any of the above listed information. There are different resources available for example UniProt can be thought about as a one stop shop. It has a lot of data collected into one place, from gene info to sequence to structure. So if the gene info/function is your starting point head to UniProt. However as is the case with life, your data may not be present in UniProt or you may have a protein sequence. In this case the second stop should be Pfam. Pfam is a collection of protein sequences assigned into families and clans. In the case of both UniProt and Pfam you will be able to get hold of the protein structure ID. There are a couple of protein structure repositories around the world, the most famous (because I use it) of which is the RCSB PDB which uses a 4-character identifier for each of the structures in it. Increasingly RCSB has become more and more integrated, i.e. it is now almost a one a stop shop as well. You can search it by gene, function, structure ID, ligands (these are the things I know about). You can also use their tools to identify other proteins which are sequence-wise or structure-wise similar to your protein.
In this post I will assume you know the PDB Structure ID and will use that as a starting point. We will look at the protein structure of hemoglobin from the bar-headed goose (Anser indicus) available here. On this page toward the top-right is a button “Download Files” which gives you access to the resources available for this structure. You are only interested in the “Structure” so in the “Download Files” drop-down click “PDB Format” to download a PDB structure file called “1hv4.pdb” to your computer. I have earlier discussed the PDB format to some extent in a post here. For more details on what is included in the file, please see here. You should specifically look at the coordinate section. Please note that although the authors of protein structure data are encouraged to fill all the data fields, it is rarely practised. So it will often be the case that you will find data in fields missing. But we will come to that later.
At this point I would like to take a step back and give a disclaimer. Remember that this series contains information presented in such a way that you can go about using it in your research. It is also aimed at helping readers generate and solve research problems. So please note that the databases I refer to are not perfect. All resources are prone to error and these are no different. You will find incomplete data, missing data, in-correct data etc. Verifying your data is something that you must do on your end. Given the huge number of things that can go wrong, I cannot cover all angles. So I will cover those when we come across them. So I just wanted to say that if you spend ages working on something only to find that the input data was wrong, I am not responsible for that. Verifying the integrity of the data acquired from anywhere is part of the scientific process and should definitely be part of the science you do.
Step 0.5: Choice of program
So now you have the data (1hv4.pdb) downloaded onto your computer. There are two ways to process this data for more information. One is to use a visualiser like pymol or VMD which work more on the graphics but have supported command structure or programming languages which you can use to not just see but manipulate and extract data from the protein structures. The other way is to use a programming based language only with little to no support of visualising. This is not the most user friendly way, but we will be using this way, i.e. through Biopython. The reason for my use of Biopython is because 1) I want students using this resource to familiarise themselves with python 2) With pymol and VMD, you will not have as much freedom. They are good for some very popular things but you will very soon feel limited to their capacities and what they are designed to support. Working within the Python framework will allow you a lot more room and you will not be restricted to the capabilities of these visualisers. (Please note that my information on Pymol is not up to date. VMD I have been using quiet a lot – so my information on pymol is limited. You are welcome to explore that area and provide feedback here if you think I have incorrectly passed judgement).
Step 1: Find out more about the library you are using
OK, so before getting started, lets do a little recap. In the last post I talked about making use of libraries (modules) and methods contained within them to some end. A lot of time while teaching I usually get the question “How do we know which library to use” or “which function to use” or that “which function belongs to what module”. This is the wrong question to ask. The “Right” question here is to first break down your problem. Once you break down your problem, 90% of the times it will solve itself. Let’s consider the example of “processing a protein’s structure to get some information from it”. In this case, the first thing (after many other things) is to access the protein structure through python (assuming that is your mode of choice). Having decided that you go to google and type your search. My search would be “parse protein structure with python” and the first hit I get is Biopython. You may get other hits as well. But the point I am trying to make is this is how you track packages (libraries / modules). Having found the module, the next step is to see all methods available inside it to know what function to use. No one is perfect and no one (except the authors) know everything by heart, it is only with experience that you manage to retain all the useful functions.
Having clarified how to identify the right module to use, you have to now go through the User guide or Cookbook as its referred to in the case of Biopython. I will spare you the trouble for now by telling you all the functions when we start needing them. In other cases where you do not have a guide, the reference manual and stackoverflow will be your friends.
Step 2: Making sure the paths are correct
Once you have downloaded a protein structure onto your computer as discussed in sufficient detail in Step 0, you will have to place it some where. Record that location. It is a useful practice to start a new project in a new directory which is isolated from other data. There is a very high chance that you will make a mistake somewhere and end up deleting data. I have seen too many times, people deleting the wrong data. Being a separate directory helps reduce the risk of unnecessary trouble.
So make a directory somewhere on your computer and save the protein structure (PDB: 1hv4.pdb) file in it. Now start the notebook and type in the following commands.
import os os.getcwd()
the above command should tell you where you are. Make use of the command
and replace the *new_path* with the location of the directory holding your PDB file. These steps are listed with screenshots in the previous post.
It is important to get this step correct because otherwise python will not be able to see the protein structure file you want to process. (An alternate would be to paste the PDB file at the location returned by “os.getcwd()” but this is not good because that location may have other data which will have a higher chance of getting deleted – so better to make a new directory and work from there).
Step 3: Importing the right libraries
Having set up the path correctly for python to find your data, it is now time to import the required libraries. This post has already become too long so for now we will just work with some very basic things. For which you will need just two modules (libraries)
“os” and “Bio”. Where the “os” library you have imported in Step 2 which allowed you to use two methods “getcwd()” and “chdir(*new_path*)” and “Bio” is the library which you will use for processing your protein of interest.
Step 4: Accessing your protein structure for processing
Now we are ready to bring our structure into the memory of your computer as a variable. My last sentence may appear a bit confusing to those who have no or little programming experience. So let me briefly explain this. A variable is an entity used in the programming environments which holds a certain value. This value is stored in the computer’s memory for as long as the variable is not overwritten or destroyed. Memory management is something important in programming, because if you end up doing something weird, chances are very high that you will run out of your computer’s memory and your operating system will crash. I will explain things not to do when we get to them. If you follow instructions laid down here, you will remain within the boundaries of your computer’s RAM. Remember you can either run one instruction per jupyter-notebook cell or multiple. Previously we were using only one per cell. Now I will use multiple per cell but please feel free to separate them into different cells.
(Caution: Not all types of commands can be separated across cells – for example some commands are grouped and are recognised as groups – so you cannot write them in different cells [I think!!] e.g. loops and conditional statements – but don’t worry about those at the moment – just follow the instructions for now). Here is what you need:
from Bio.PDB import * p = PDBParser() structure = p.get_structure('A', '1hv4.pdb')
If you run the above 3 lines of code (see Figure 1), you will have imported your protein structure. I will try to explain each line explicitly.
Notice that I didn’t use “import Bio”, instead I used “from Bio.PDB import *”. This line is similar to what I have explained before.
Biopython contains many methods. We only want to use the ones related to structure which are present in the PDB part of it (i.e. Bio.PDB).
Writing “import *” brings only methods in “Bio.PDB”. This is followed by “PDBParser()” which creates an object “p” which can then hold a protein structure.
In the last command “PDBParser” has a member “get_structure” which we will use to explicitly read a structure.
At the end of these three commands the structure 1hv4.pdb will be stored in the variable “structure”.
However, you will get a lot of “Warnings”.
As I mentioned earlier, a lot of things can happen in the PDB. This “Warning” display may or may not be useful. It depends on the type of data that you have used. Let’s try to detect the source of these errors. But to do that some new stuff has to be introduced. So let’s do that first.
Step 5: Terminology
In a previous post where I was talking about protein structures and the format of the PDB, I explained some stuff in detail. If you haven’t read that please have a look at it now. If you are lazy and don’t want to – I will briefly talk about it here. I showed the following table.
COLUMNS DATA TYPE FIELD DEFINITION ------------------------------------------------------------------------------------- 1 - 6 Record name "ATOM " 7 - 11 Integer serial Atom serial number. 13 - 16 Atom name Atom name. 17 Character altLoc Alternate location indicator. 18 - 20 Residue name resName Residue name. 22 Character chainID Chain identifier. 23 - 26 Integer resSeq Residue sequence number. 27 AChar iCode Code for insertion of residues. 31 - 38 Real(8.3) x Orthogonal coordinates for X in Angstroms. 39 - 46 Real(8.3) y Orthogonal coordinates for Y in Angstroms. 47 - 54 Real(8.3) z Orthogonal coordinates for Z in Angstroms. 55 - 60 Real(6.2) occupancy Occupancy. 61 - 66 Real(6.2) tempFactor Temperature factor. 77 - 78 LString(2) element Element symbol, right-justified. 79 - 80 LString(2) charge Charge on the atom.
In this table note that information in column 13 – 16 is e.g. Chain identifier whereas 55 – 60 is the Occupancy. What I want to show here is that the coordinate section (and other sections) of the PDB contain different information.The PDBParser() that we created and used on our structure 1hv4.pdb extract information like in the above table and mapped it to the structure object.
The exact way in which the structure object will hold the information is best shown in Figure 2 (Obtained from the Biopython Cookbook).
- A structure consists of models
- A model consists of chains
- A chain consists of residues
- A residue consists of atoms
Brief notes on each. Models are alternative representations of the protein’s structures. For in case of X-ray Crystallography you will have a single model in your protein structure, whereas in case of NMR you will have multiple models. A protein structure may have “1” or more chain. For example some proteins are monomeric and therefore will have a single chain, whereas other proteins may be multimeric like hemoglobin which is tetrameric and therefore has “4” chains. Each chain then is a string of covalently attached amino acids commonly referred to as residues. Residues, being amino acids, comprise of atoms.
So having parsed the structure and moving it into the memory as an object, each of its individual components become accessible. Now let’s revisit the warnings that were showing up. The first thing to note about Errors/Warnings in python is to see which line of the code is generating the error. Python will point out two types of line, 1) in your code 2) in the back-end code which your code is triggering. So in this case we don’t have a sufficient amount of code (just a couple of lines), however, also in this case we do not get error (1). If you notice closely you will see a pattern in the warnings. You 1-warning (3-lines) repeat across the screen. How do we know its the same warning because as mentioned in (2) the notebook displays line number on the code at the back-end which is breaking down. You will see “StructureBuilder.py:90” which is repeating every fourth line. The warning actually says that there is a discontinuity in chain A (warning line 2), chain B (warning line 5) and so on. You will also see that the warning message lists “line 9659”, “line 9702” and so an as possible sources of the warning raised.
I know a lot of students who ignore warnings. This is NOT a good idea, you should always be on top of your system, one step ahead. I cannot emphasise enough the importance of understanding your system. So in the spirit of that, let’s open up the 1hv4.pdb file in a text editor (please use notepad++). Once the file is opened up, scroll to line 9659 and 9702. These are shown in the figures below.
So let’s take a minute to understand what is happening. The PDBConstructionWarning is commenting that these lines in their respective chains are discontinuous. Is it true? How are these chains looked at? Well, as I explained earlier, chains are amino acids connected together. So if they are connected together they should have consecutive numbers. So if you go back to the PDB 1hv4.pdb file and look at line 1791, you will see a line starting with “TER” and having a residue number 141. Chain A ends there but on line 9659 restarts again. But when it restarts at residue number 151 (Not exactly restart because now its HETATM record). So why the gap? As is commonly the case, most proteins are bound to something. In this case the protein chain A is bound to the HEME co-factor. Since “HEME” isn’t exactly an amino acid, its recorded as HETATM and since its not covalently bound to the protein it is considered a ligand and hence the numbering is disconnected from the main chain, but the chain code remains the same so that it can be indicated that this “HEME” is bound to chain A.
So while in this case the warning is not useful it is important to understand what is causing it. There will be many cases where amino acids residues in a chain will be missing, in which case you will be returned the same error. So that requires caution.
Step 6: The “Hello World” of Protein structures
OK, so now that you have understood all that has happened till now let’s move on to the ACTUAL processing bit. Given the size of this post and my concern that people will not read it, I will keep this section short.
In this exercise we will try to do two things.
- Get the sequence of the protein (from the structure)
- Get the total number of residues in the protein
Before I start on that, there is one more thing that needs to be pointed out. Hemoglobin is tetrameric, i.e. it will have 4-chains. The structure 1hv4.pdb is showing 8-chains A through H. So why is that the case? This post is probably not the right place to address this, because this answer is better suited for a post on X-ray crystallography and protein crystallisation. At this point just think that the author of the PDB provided two copies of hemoglobin, hence 8-chains instead of 4.
So first program now.
for model in structure: print(model) for chain in model: print(chain) for residue in chain: print(residue)
The above program needs to be understood in the light of Figure 2. You create a “structure” object which contains the structure of your protein from 1hv4.pdb. As mentioned earlier a structure can have 1 or more models (1 in this case since 1hv4.pdb is a structure solved through X-ray Crystallography), and each model has 1 or more chains (8 in this case A through H) and each chain then has amino acids (also called residues in this case).
What we have done in the above situation is to use a for loop (i.e. باری باری سب کے لۓ چلے گا ). What this means is that you inside the structure object there are 1 or more models and inside the models can be 1 or more chains. A for loop can allow you to access those objects.
When you run the above code, it will display a set of output lines. Let’s look at a few of them. The first 4-lines are:
<Model id=0> <Chain id=A> <Residue VAL het= resseq=1 icode= > <Residue LEU het= resseq=2 icode= >
The first line brings up the model ID. In our case since we only have 1 model, this line will appear only once. If we had 20 models (usual for NMR structures), we would have had this line appear 20 times. The first instance would have been “<Model id=0>” and the last instance would have been “<Model id=19>”. (In python indexing starts from 0 instead of 1).
We have 8-chains. So the second line will appear 8 times, each time having a different label (A – H). The third line shows the first residue “resseq=1” is a valine (VAL) and since this was written with a starting word of ATOM instead of HETATM, the “het=” is left blank. (Ignore icode for now, this post is already very long). Similarly, the 4th-line shows the following residue in a similar format till the last residue in the chain A after which the HETATM “HEM” is listed followed by chain B.
What we see here is that the original object “structure” now holds our entire structure. We can now start looking at different things. The first thing I listed was to extract the sequence from our protein. Since we have 8-chains we will get 8 sequences.
Step 7: Simplifying problems
A very useful thing to do when designing a program is to break it down to simpler bits and then start working your way up to solving the whole problem. In this case we have 8 chains. So instead of trying to get all sequences at the same time, lets try it chain by chain. So how do we access the structure 1-chain at a time?
Easy. For loop ran over something which had many things in it. We will extract from there using:
model = structure chain = model['A'] for residue in chain: print(residue)
The output from the above code extract only chain “A” and lists the residues only from it, from residues 1 through 141. But we don’t want to see the sequence listed like this. We want it listed out without the extra stuff. So how to we get that? put a “.” after residue in print(residue) so that it looks like print(residue.) and with the cursor right after the “.” press “tab key”. This will open up a drop down list from which you can select different things. The “thing” you want to look at is “residue.resname”. You don’t have to use the drop down, after a while you will start remembering these and can write them directly. So select that, so that your code looks like:
model = structure chain = model['A'] for residue in chain: print(residue.resname)
Running the code now will bring up only the three letter code for amino acids. Hmm, this still doesn’t look nice enough. So let’s try listing these in one line. Easiest way to do that is to add all these amino acid codes to a list and then just display the list. To do that we need to add three more things.
- Create a blank list using 
- Fill that list (using append) with the name instead of print it to screen
- Display the list on screen using print.
So let’s modify the code a bit.
amino_acid_list =  model = structure chain = model['A'] for residue in chain: amino_acid_list.append(residue.resname) print(amino_acid_list)
Now you have all the names listed in a single list, which you can see in one go without having to scroll. While this isn’t exactly the sequence it does give you an idea. I will talk about functions in python next time and we will try to make this list into a protein sequence as seen on sequence databases. For now, I will leave it to this. Lastly I talk about the size of the protein. This means “number” of amino acids in the protein. This part is each now. After the “print(amino_acid_list)” statement add “print(len(amino_acid_list))” so that you code looks like this:
amino_acid_list =  model = structure chain = model['A'] for residue in chain: amino_acid_list.append(residue.resname) print(amino_acid_list) print(len(amino_acid_list))
Now the output will be a list followed by the number 142. This is the number of entries in the list. If you recall that the number of amino acids listed in the protein PDB run from 1 to 141, but why is this number 142? Look closely at the last entry in the list. Its listed as a “HEM” which is the entry for the co-factor. Remember that Biopython doesn’t know the different between a protein’s amino acid and its co-factor so it will record all of them in the same list because we used the chain ID and the HETATM is also listed as having the same chain ID “A”. To be able to list only the amino acids, you will have to make the distinction based on the “het=” value. So subtracting the “HEM” count from 142 gives us 141 which is the “true” length of the amino acid sequence.
In this post we set out to write our first program to load a protein structure into memory and process it in some way. Before I got to that stage I had to explain a lot of things, going through each of which took a while and hence the long post. I set out to meet the following objectives in this post.
- Load a structure for processing (using the Jupyter Notebook) [Done]
- Import relevant libraries (modules) and be able to do some basic processing of your protein structures [Done]
- Understand some basic things which are wrong with protein structures and require careful handling [Done]
- Run a very simple script which will give you some basic information about your structure [Done]
In the next post I will introduce function in python and use it to convert this sequence of this protein into the sequence we see in sequence databases. Hopefully you will find this post simple (I have tried to explain everything in minute detail) and easy to follow (if you have the stamina to read it all the way through :D). The next post will come soon, the exact date and time of which I have no idea about. Meanwhile if you have any questions feel free to write to me or ask on Facebook. Please feel free to let me know if you find any errors in this post.