Finally managed to find some time to continue with this series.
The last post ended with:
“The next post in this series will be a continuation of this work and will address our ability to determine protein structures experimentally and lead into public protein structure databases which we will be frequently using to analyse and get meaningful results from the analysis.”
So I will pick up from here.
A significant amount of literature can be found on the methods surrounding experimental determination of protein structures. This post does not get into those. What it will do, however, is provide just enough information for a novice to appreciate what a protein is and why determining its structure matters. But before we talk about the methods, we need to discuss something more fundamental. What is a molecular structure? In the previous, rather rudimentary, post I introduced the higher-order secondary structure elements of proteins, which contribute to a protein’s final shape and allow it to undertake a function. In this post, I will go to the lower end of the structure, i.e. to the level of atoms held together by fundamental forces, bringing about the molecular structure.
Can we see a molecular structure?
Figure 1 in the previous post showed the four levels of protein structure. The primary structure occupies the most fundamental level. I mentioned that it comprises amino acids. What I didn’t mention is that each of those amino acids is itself a molecule comprising atoms. We cannot really see atoms. So how is it that we can see protein structures? One simple way to explain “seeing” is to describe it as a function of the wavelength of radiation in the electromagnetic spectrum [for details see this]. Light (or, I should rather say, the visible region of the electromagnetic spectrum) has wavelengths of roughly 400–700 nm, thousands of times larger than an atom (about 0.1 nm across), so it cannot resolve individual atoms. Things made up of atoms only become visible when they reach a sufficient size. So while we may be able to see aggregates of proteins, single protein molecules are not “see-able”.
Why represent structural components as solid spheres?
Another aspect of structures is that when one refers to a structure, one is actually talking about a set of spheres, which represent atoms, each located by an x, y and z coordinate. This is an approximation. Atoms are not rigid spheres; they are rather … [Quantum explanations] … I think I should not explain this here. Perhaps in another series/post we can talk about atoms and their orbitals. For now it is enough to say that from the point of view of classical mechanics, which is how we analyse these structures, we can treat atoms as solid spheres, each with a center that is represented in 3D Cartesian space by an x, y and z coordinate. The types of analysis include comparing protein structures. Comparison can be done at the atomic level, the secondary structure level or the level of the overall globular shape. Another type of analysis is modelling a structure, or simulating it. For any of these applications (to a certain degree) the crude approximation of representing atoms as spheres is sufficient.
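To make the "atoms as points in Cartesian space" idea concrete, here is a minimal sketch in Python. The `Atom` class, the two helper functions and the coordinates are all hypothetical, invented only for illustration; they are not taken from any real structure or library.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Atom:
    """An atom reduced to a named point in 3D Cartesian space."""
    name: str
    x: float
    y: float
    z: float


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atom centers (same units as the input)."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def centroid(atoms: list[Atom]) -> tuple[float, float, float]:
    """Geometric center of a set of atoms (unweighted mean of the coordinates)."""
    n = len(atoms)
    return (
        sum(a.x for a in atoms) / n,
        sum(a.y for a in atoms) / n,
        sum(a.z for a in atoms) / n,
    )


# Hypothetical coordinates, chosen only to make the arithmetic easy to follow.
atoms = [
    Atom("N", 0.0, 0.0, 0.0),
    Atom("CA", 3.0, 4.0, 0.0),
    Atom("C", 3.0, 4.0, 12.0),
]

print(distance(atoms[0], atoms[1]))  # 5.0
print(centroid(atoms))
```

Once atoms are just coordinates, comparisons between structures boil down to arithmetic of this kind on many such points.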
OK, so, in summary, atoms come together to make amino acids, which come together to make proteins, which fold and fold to take a shape that carries out functions to support life. So the next question is why do we need to know their structure? This has been discussed in some detail in the previous post. In summary, the structure controls the function. Therefore, to understand how a protein carries out its function it is important to look at its structural underpinnings. This is where experimental methods come in. While it may be open to debate, “seeing” is perhaps the most important of the basic senses. Therefore, instead of just understanding protein structures as abstract concepts, we give their atoms a rigid, sphere-like representation and observe them in Cartesian space to develop an intuitive feel for them. This then permeates into the analysis … I think I am digressing into philosophy now … so back to business.
The question now is how to see something that we cannot see using light. The answer has just been discussed: by representing atoms as spheres and visualizing them in 3D. The next question is how to find out where the atoms are relative to one another. This is where we enter the methods part of this post.
The two most popular structure determination techniques are “X-ray Crystallography” and “Nuclear Magnetic Resonance (NMR)”. RCSB (https://www.rcsb.org) is a repository of experimentally determined protein structures and we will be making extensive use of it in this series. As of today, 13th February 2019, of the 148969 entries in total, 133224 were determined through X-ray and 12526 by NMR. Between them, these two methods cover ~98% of the PDB database (we will talk about PDB in a while); see Figures 1 and 2.
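The ~98% figure can be checked with a few lines of Python. The counts are the ones quoted above; the variable names are my own.

```python
# Entry counts quoted in the text (RCSB, 13th February 2019).
total_entries = 148969
xray_entries = 133224
nmr_entries = 12526

combined = xray_entries + nmr_entries
coverage = 100 * combined / total_entries

print(combined)            # 145750
print(f"{coverage:.1f}%")  # 97.8%
```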
This post does not cover the nitty-gritty details of these methods; however, some points worth remembering for the following posts in this series are listed below.
For X-ray crystallography, proteins have to be crystallized. This process may take them away from their native environment, so the shape in the crystal may not be the same as the shape of the protein in its native environment. Therefore the data obtained, and hence all subsequent analysis, may not reflect the native structure. (Apologies for sounding ominous.)
- Structures solved through the X-ray method have an associated resolution. Smaller numbers (< 3 Å) are better.
- X-ray diffraction generally detects only the heavier (non-hydrogen) atoms, and therefore the positions of hydrogen atoms on main and side chains are usually not resolved.
Solution-state NMR has an advantage over X-ray crystallography in that it does not need the crystallization step. Given that many copies of the protein are present in the solution, in different conformations, the NMR data can be solved to generate multiple conformations.
- This method generates structures in which the positions of light atoms (such as hydrogen) can be determined.
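As a rough sketch of how the resolution rule of thumb might be applied later in the series, the snippet below filters a list of entries with the 3 Å cutoff mentioned above. The entries, labels and the `is_usable` helper are hypothetical. Treating NMR entries as having no resolution value, and keeping them regardless, is an assumption made for this example.

```python
from typing import Optional

RESOLUTION_CUTOFF = 3.0  # Å, the rule of thumb quoted above


def is_usable(method: str, resolution: Optional[float],
              cutoff: float = RESOLUTION_CUTOFF) -> bool:
    """Keep X-ray entries finer than the cutoff; keep other methods as they are."""
    if method == "X-ray":
        return resolution is not None and resolution < cutoff
    return True


# Hypothetical entries, invented for illustration: (label, method, resolution in Å).
entries = [
    ("entry_a", "X-ray", 1.8),
    ("entry_b", "X-ray", 3.5),
    ("entry_c", "NMR", None),  # assumption: no resolution value for NMR entries
]

kept = [label for label, method, res in entries if is_usable(method, res)]
print(kept)  # ['entry_a', 'entry_c']
```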
Public protein structure databases:
So now, we have rationalized the need to determine the structure, made use of X-ray (which can detect atoms, something which visible light cannot) or NMR (uses a completely different approach, might come back to this in a future post) to generate coordinates, so that we can look at a protein’s structure.
While we will not be doing any of these steps ourselves, we will use data generated by other research groups. The best places to look for that data are public databases (like RCSB). But some more background first before I explain the data in these public databases. Proteins come from genes. The gene sequences that encode proteins are made up of nucleotides, i.e. A, T, C and G, and known sequences far outnumber the corresponding protein structures. There are many databases spread across the internet, each with its specialty, a collection of which can be found under the National Center for Biotechnology Information (NCBI). This post however does not cover those. The reason for mentioning them here is that several routes can be taken when searching for data of interest. One way to search for your protein is to take its sequence and find its counterpart in UniProtKB. Another would be to use the sequence directly with RCSB. Since most of the work we will be undertaking revolves around protein structures, we will look only at RCSB. One more thing to note here is that when searching for your data in RCSB you will come across four-letter codes (strictly four characters, since they can include digits), e.g. 1bg4. These codes are unique identifiers for every entry in the database. Another thing to remember is that the data in the RCSB database comes in many formats, one of which is called “PDB” and holds, amongst other things, the structure of the protein. So in this series I will occasionally be referring to protein structures as PDBs, and where we have to look at a specific protein it will be accompanied by a four-letter code.
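As a small illustration of working with these codes, the sketch below checks that a string looks like a PDB code and builds the address of the matching structure file. Two things here are assumptions rather than anything stated in this post: that a classic code is a digit from 1 to 9 followed by three letters or digits, and that RCSB serves files under `https://files.rcsb.org/download/`. The function names are my own.

```python
import re

# Assumption: a classic PDB code is four characters long, a digit 1-9
# followed by three letters or digits (e.g. 1bg4).
PDB_ID_PATTERN = re.compile(r"[1-9][A-Za-z0-9]{3}")


def normalize_pdb_id(code: str) -> str:
    """Strip whitespace, validate the shape of the code and return it in upper case."""
    code = code.strip()
    if not PDB_ID_PATTERN.fullmatch(code):
        raise ValueError(f"not a four-character PDB code: {code!r}")
    return code.upper()


def structure_file_url(code: str) -> str:
    """Build the download address for a PDB-format file (assumed URL layout)."""
    return f"https://files.rcsb.org/download/{normalize_pdb_id(code)}.pdb"


print(normalize_pdb_id(" 1bg4 "))  # 1BG4
print(structure_file_url("1bg4"))  # https://files.rcsb.org/download/1BG4.pdb
```

Nothing is downloaded here; the snippet only builds the string, so it can be run without network access.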
Although a detailed manual of the RCSB database is available on their website, I would like to mention a couple of tricks. The easiest is to search by the four-letter PDB code itself. Other methods include searching by sequence (no spaces), molecule name (e.g. myoglobin) and even species name (Homo sapiens). In the following posts, I will be making use of explicit four-letter codes directing you to the exact protein.
Also note that the RCSB has additional resources linked to it – so once you have located your protein of interest you can look at accompanying data which can tell you things like:
- Experimental method, resolution, organism the protein belongs to
- Mapping to the exact gene
- Other proteins similar to your protein of interest by sequence
- Other proteins similar to your protein of interest by structure
- Ligands if any bound to your protein
- Size of your protein
- Additional annotations from other databases like SCOP and CATH.
These are all I can remember off the top of my head, so this should not be considered an exhaustive list.
I will end this post now. The next post in this series will start by introducing the format used by RCSB in its structure files (PDBs) along with the first use of programming to interact with that data (apart from simply eyeballing it).