The previous post ended with:
“The next post in this series will start by introducing the format used by RCSB in its structure files (PDBs) along with the first use of programming to interact with that data (apart from simply eyeballing it).”
In the last post I explained why we cannot “see” protein structures directly, and why we therefore determine them using X-ray and NMR methods. The structure of a protein is then visualised in 3D coordinate space, where each atom in the structure is given three coordinates (x, y and z). It is worth pointing out here that PDB (the file format in which protein structures are stored on RCSB) gives coordinates in Angstroms (1 Å = 1e-10 m, an order of magnitude smaller than a nanometre). Remember this, because when we start calculating distances between atoms (and other quantities based on the distance unit), the results will be in Angstroms. Many students I have come across confuse this detail or skip it altogether. Going back to PDB structures, we visualise them using molecular visualisation software (e.g. VMD, PyMOL, Avogadro). We will talk about these in a short while; first, let’s introduce the structure format.
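To make the unit point concrete, here is a minimal sketch of a distance calculation between two atoms. The coordinates are hypothetical values chosen for illustration (they are not taken from any real structure), and the function name `distance` is my own.

```python
import math

ANGSTROM_IN_NM = 0.1  # 1 Angstrom = 1e-10 m = 0.1 nm


def distance(a, b):
    """Euclidean distance between two (x, y, z) points, in the unit of the input."""
    return math.dist(a, b)


# Hypothetical coordinates, in Angstroms as a PDB file would give them.
atom_1 = (0.0, 0.0, 0.0)
atom_2 = (1.0, 2.0, 2.0)

d_angstrom = distance(atom_1, atom_2)   # result is in Angstroms
d_nm = d_angstrom * ANGSTROM_IN_NM      # explicit conversion to nanometres

print(f"{d_angstrom:.3f} A = {d_nm:.4f} nm")
```

The conversion is written out explicitly because mixing Angstroms and nanometres is an easy way to end up a factor of ten off.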
The RCSB database provides a resource detailing the file and its necessary sections (find it here: RCSB PDB format). To avoid being redundant, I will not repeat all of that here. What I would like to draw attention to is the section of the PDB file dedicated to the structure of the molecule (read details here). Each line in this section starts with the word “ATOM” or “HETATM” and has several columns. In order from left to right these are:
COLUMNS    DATA TYPE      FIELD        DEFINITION
-------------------------------------------------------------------------------------
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        serial       Atom serial number.
13 - 16    Atom           name         Atom name.
17         Character      altLoc       Alternate location indicator.
18 - 20    Residue name   resName      Residue name.
22         Character      chainID      Chain identifier.
23 - 26    Integer        resSeq       Residue sequence number.
27         AChar          iCode        Code for insertion of residues.
31 - 38    Real(8.3)      x            Orthogonal coordinates for X in Angstroms.
39 - 46    Real(8.3)      y            Orthogonal coordinates for Y in Angstroms.
47 - 54    Real(8.3)      z            Orthogonal coordinates for Z in Angstroms.
55 - 60    Real(6.2)      occupancy    Occupancy.
61 - 66    Real(6.2)      tempFactor   Temperature factor.
77 - 78    LString(2)     element      Element symbol, right-justified.
79 - 80    LString(2)     charge       Charge on the atom.
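Because every field lives in fixed columns, an ATOM line is read by slicing, not by splitting on whitespace. The following is a minimal sketch of that idea. The function name `parse_atom_line` and the example record are my own, assembled piece by piece for illustration, and the sketch assumes that the numeric fields are not blank.

```python
def parse_atom_line(line):
    """Slice one ATOM/HETATM record by its fixed columns.

    PDB columns are 1-based and inclusive, so columns a-b become line[a-1:b].
    Assumption: numeric fields are present (blank fields would raise ValueError).
    """
    line = line.ljust(80)  # tolerate lines whose trailing spaces were stripped
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "altLoc": line[16].strip(),
        "resName": line[17:20].strip(),
        "chainID": line[21],
        "resSeq": int(line[22:26]),
        "iCode": line[26].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "tempFactor": float(line[60:66]),
        "element": line[76:78].strip(),
        "charge": line[78:80].strip(),
    }


# Hypothetical record, built field by field so the column widths are visible.
EXAMPLE_LINE = "".join([
    "ATOM  ",     # 1-6   record name
    "    1",      # 7-11  serial
    " ",          # 12    unused
    " N  ",       # 13-16 atom name
    " ",          # 17    altLoc
    "MET",        # 18-20 resName
    " ",          # 21    unused
    "A",          # 22    chainID
    "   1",       # 23-26 resSeq
    " ",          # 27    iCode
    " " * 3,      # 28-30 unused
    "  27.340",   # 31-38 x
    "  24.430",   # 39-46 y
    "   2.614",   # 47-54 z
    "  1.00",     # 55-60 occupancy
    "  9.67",     # 61-66 tempFactor
    " " * 10,     # 67-76 unused here
    " N",         # 77-78 element
    "  ",         # 79-80 charge
])

atom = parse_atom_line(EXAMPLE_LINE)
print(atom["name"], atom["resName"], atom["chainID"], atom["resSeq"])
print(atom["x"], atom["y"], atom["z"])
```

Slicing is deliberately used here instead of `line.split()`: neighbouring fields are allowed to touch each other with no space in between, in which case splitting on whitespace would silently merge them.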
The reason for explicitly stating these entries is that a structure can be manipulated based on them. For example, each protein comes from a gene, and is therefore a stretch of covalently linked amino acids. Each set of covalently linked amino acids is referred to as a “chain” and is labelled with a one-character identifier (or, in some large structures, two characters). For instance it can be A or d or 0 or 9. The important thing to remember here is that column 22 holds the one-character chain identifier of each atom.
It is also important to note that in recent times structures of complex assemblies with many chains have emerged. Since a single character can be one of A – Z, a – z or 0 – 9, it can label at most 26 + 26 + 10 = 62 chains. Structures of viral capsids routinely have more than 62 chains. These structures therefore use two characters (e.g. 9h or Ag: a combination of A-Z, a-z and 0-9) to denote their chains. While such a structure can still be written in a PDB-like layout, it will NOT be a valid PDB file, as the chain label now spans two columns. Many programs written to read properly formatted PDB files will therefore break.
At this point I would also like to mention other limitations, all of which arise because PDB allots a fixed number of columns to every field. For instance, notice that the residue sequence number occupies columns 23 – 26, which means the number can be at most 4 characters long. So what do you think happens when you try to save the 10,000th residue? These are all important things to consider when you start processing structures; keeping them in mind will help you troubleshoot and avoid unexpected surprises. (Remember that while PROTEINS rarely reach 10,000 amino acids, a structure file can hold more than just a single protein; more on this when we come across it.)
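The two limits discussed above can be checked with a few lines of code. This is a minimal sketch: the helper names `res_seq_field` and `fits_pdb_columns` are hypothetical, and the formatting shown is what a naive writer would produce, not what any particular program actually does.

```python
import string

# One-character chain labels: A-Z, a-z and 0-9.
CHAIN_LABELS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def res_seq_field(res_seq):
    """Format a residue number as a naive writer would: right-justified, minimum width 4."""
    return f"{res_seq:>4d}"


def fits_pdb_columns(res_seq):
    """True if the formatted number stays inside columns 23-26 (4 characters)."""
    return len(res_seq_field(res_seq)) == 4


print(len(CHAIN_LABELS))       # 62 one-character labels
print(len(CHAIN_LABELS) ** 2)  # 3844 two-character labels

for number in (1, 9999, 10000):
    print(number, repr(res_seq_field(number)), fits_pdb_columns(number))
```

The residue number 10000 formats to five characters, so every field after it on the line is pushed one column to the right, which is exactly the kind of shift that breaks a column-based reader.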
While I cannot possibly cover all things that can go wrong, from the preceding text you should be able to get a feel for what sort of things CAN go wrong. I will point more things out in the following posts as we run across them.
A note on molecular visualisation libraries:
Now let’s have a look at the software used to plot the x, y, z coordinates in 3D space and visualise the structure. While there are many such programs out there, I understand them to work in one of two ways. Notice the atom name entry in the PDB file. In the first approach, atoms are rendered with bonds between them depending on their distances, e.g. a single covalent bond is displayed between two atoms if they satisfy the distance criterion for a covalent bond. This is a distance-based mechanism of guessing bonds and bond orders. It may not always be correct, for reasons that I will cover later when I introduce more advanced concepts. For now let’s just look at how visualisers work. In the second approach, predefined libraries tell the visualiser when to display bonds and when not to. This can be further complemented by the CONECT records near the end of the PDB file, which explicitly state which atoms to connect.
The reason for bringing this up is to inform you that while in most cases the connectivity information is “guessed” accurately, there may be cases where a bond is displayed where there shouldn’t be one. While for most rendering purposes this is of no significance, it becomes very important when you have to guess bond orders and balance charges. A bond that is missing when it should be there, or present when it shouldn’t be, can mess up everything. While mostly not relevant to proteins, it is super-important for small molecules.
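The distance-based approach can be sketched in a few lines. This is only an illustration of the general idea, not how VMD, PyMOL or Avogadro actually implement it. The covalent radii are approximate values, the tolerance is an assumed number, and the water coordinates are hypothetical. Note that the sketch only decides whether two atoms are bonded; it does not guess bond orders.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in Angstroms (assumed values).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.45  # assumed slack in Angstroms added to the sum of radii


def guess_bonds(atoms):
    """Return index pairs (i, j) of atoms close enough to count as bonded.

    `atoms` is a list of (element, (x, y, z)) tuples with coordinates in Angstroms.
    """
    bonds = []
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(enumerate(atoms), 2):
        cutoff = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Hypothetical water molecule: one oxygen and two hydrogens.
water = [
    ("O", (0.000, 0.000, 0.000)),
    ("H", (0.957, 0.000, 0.000)),
    ("H", (-0.240, 0.927, 0.000)),
]

print(guess_bonds(water))  # the two O-H pairs, but not the H-H pair
```

If two atoms that are not bonded happen to sit closer than the cutoff (for example in a poorly resolved or distorted structure), this method will draw a bond that does not exist, which is the failure described above.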
Programming around protein structures:
At the end of the previous post, I said that I would introduce programming exercises in this post. I have pushed that to the next post. In this post, however, I will briefly highlight what can be done with protein structures and why one would do it. I have previously mentioned the importance of structures, so anything you can think of that relates to protein structures matters. Here are a few things we can do:
- Predict the final shape of a protein from its sequence. In this case no structure is available to start from, and the protein folding problem has no general solution to date. I introduced this in an earlier post. Stating it here puts into context how programming can be used to design and run simulations that predict the final shape of a protein, given only its sequence as a starting point.
- Another approach to obtaining the structure of a protein from its sequence is homology modelling. This exercise looks at the similarity between protein sequences; given enough similarity, it can be argued that the two proteins will share a similar final shape. While this argument is more or less correct, the biology literature has seen this method used in areas where it shouldn’t be. I will cover homology modelling in detail in a later post.
- While the above two points discuss how to obtain a structure from a sequence, the goal of this post is to understand what can be done with a structure once it is solved. One thing that can be done is to estimate the total charge of a protein (i.e. the difference between the numbers of basic and acidic amino acids).
- The radius of gyration of a protein can be computed. It gives a rough measure of how compact (globular) the protein is. When a protein is simulated, the radius of gyration can be used to monitor its stability; for instance, a change in the radius could indicate a change in the compactness of the protein.
- SASA (solvent accessible surface area) is a measure of how much of the protein’s surface is exposed to solvent. This is particularly useful, since binding pockets, which house drugs of importance, are often hydrophobic. If they are hydrophilic, then the ligand has to compete with water for a position in the binding site. Also, the hydrophobic nature of the residues constituting the core of the protein lends weight to the “hydrophobic collapse” theory, which states that the protein initially collapses because of these residues and then folds around them to take up its final shape.
- Bond lengths, angles, dihedrals and impropers, amongst other things, can be computed from protein structures. These terms are coupled with specific force fields (more on this later) to quantify the stability of the protein.
- Two or more protein structures can be pairwise compared to see how similar they are. This helps inform phylogenetic analysis.
- Interactions between different drugs and protein structures can be quantified to identify lead compounds which may have potential for in vitro and in vivo studies.
- Similar to the drug-protein interactions above, protein-protein and protein-DNA interactions can be studied.
- Similar to the pairwise comparison above, two conformations of the same protein can be compared by computing the RMSD, which provides a measure of distance between two structures. This is commonly calculated when a protein is simulated, between states observed during the simulation and a reference state of the protein. This helps inform on the flexibility of the protein.
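Three of the quantities in the list above are simple enough to sketch right away. The block below is a rough illustration under stated assumptions, not a definitive implementation. The net charge ignores histidine, the chain termini and pH. The radius of gyration assumes equal masses unless masses are supplied. The RMSD assumes the two coordinate sets list the same atoms in the same order and are already superimposed (no fitting is done). All function names and the toy inputs are my own.

```python
import math


def net_charge(sequence):
    """Crude net charge: number of basic (K, R) minus acidic (D, E) residues."""
    basic = sum(sequence.count(res) for res in "KR")
    acidic = sum(sequence.count(res) for res in "DE")
    return basic - acidic


def radius_of_gyration(coords, masses=None):
    """Mass-weighted root-mean-square distance of atoms from their centre of mass."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: all atoms weigh the same
    total = sum(masses)
    centre = tuple(
        sum(m * c[k] for m, c in zip(masses, coords)) / total for k in range(3)
    )
    weighted_sq = sum(m * math.dist(c, centre) ** 2 for m, c in zip(masses, coords))
    return math.sqrt(weighted_sq / total)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between matching atoms, without superposition."""
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures must have the same number of atoms")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Toy inputs, chosen so the results are easy to check by hand.
toy_sequence = "MKKDER"  # 3 basic (K, K, R) and 2 acidic (D, E) residues
square = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
shifted = [(x, y, z + 2.0) for x, y, z in square]  # same shape, moved 2 A along z

print(net_charge(toy_sequence))
print(radius_of_gyration(square))
print(rmsd(square, shifted))
```

In practice an RMSD is usually reported after optimally superimposing the two structures; without that step, a rigid shift like the one in the toy example shows up as a non-zero RMSD even though the shape has not changed.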
I have noted down some points off the top of my head. They are by no means exhaustive; a lot more can be done with protein structures. We will do some of these in the following posts. They simply illustrate instances where programming can be coupled with protein structures.
In the previous section I introduced some of what we can do with protein structures; now it’s time to do some of those things. Before we do that, however, I think it’s important to introduce some of the ways in which we can achieve this. Some ways I can think of are:
- Write your own code from scratch.
- Use libraries available for handling protein structure (e.g. Biopython)
- Use a program that provides a scripting interface (e.g. VMD)
For me all three options are equally valid. To choose between them, I usually ask the following questions:
- Speed. A program written in C++ will usually run quicker than one written in Python. But by how much? Is the faster run time worth the time invested in writing code in C++ when something is already available in Python? Or VMD?
- Stability. If I release a program for public use that relies on something made by someone else, will that dependency continue to be supported? Which of its features does my program use? Is there a chance it will break when a new version of some related component is released?
- Operability. If I am releasing something to the public, can it run on Windows, Mac and Linux? It is not nice to force users onto one platform just because they need to use your program. So is it worth programming for a bigger audience, or is my program going to be used by the Linux community only?
These are just some of the questions I think about when choosing the platform for my programs. For the purpose of these posts, I will try to solve at least one problem in each of these ways, i.e. code from scratch, Biopython and VMD. However, given that this post has already become too long and many readers will probably not make it to this point, I will push that to the next post.
In the next post, I will use Biopython to import a structure and do some fairly introductory calculations.