Proteins are biomolecules that carry out most of the functions necessary to support life. They are synthesised using the information contained within DNA (deoxyribonucleic acid). DNA is made up of four nucleotides, distinguished by their bases: cytosine [C], guanine [G], adenine [A] and thymine [T]. These four bases are repeated along the length of the DNA. Portions of these repeats make up genes, and it is these genes which are read, transcribed to mRNA (messenger ribonucleic acid) and then translated to a protein (Figure 1). While I have condensed all this into a handful of lines, a world of processes is involved at each stage. However, as the intended purpose of this article is to introduce the reader to the world of protein bioinformatics, the preceding introduction should suffice.
The synthesis of a nascent protein occurs at the ribosome. Here, three-letter sections of the mRNA (codons) are read by the ribosome, which pairs each codon with its cognate tRNA (transfer RNA). The amino acid carried by that tRNA is released from it and joined to the growing polypeptide chain, so that translation of the mRNA yields a protein.
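To make the codon-by-codon reading concrete, here is a minimal sketch in Python. It assumes the standard genetic code and an mRNA string that is already in the correct reading frame. The names `CODON_TABLE` and `translate`, and the example mRNA, are my own illustrations and not part of any library. The example encodes the Trp, Phe, Gly, Ser segment from Figure 1, with a start codon in front and a stop codon at the end.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# U, C, A, G order and paired with one-letter amino acid codes ("*" = stop).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an in-frame mRNA string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the chain is released
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical mRNA: AUG (Met) UGG (Trp) UUU (Phe) GGU (Gly) AGC (Ser) UAA (stop)
print(translate("AUGUGGUUUGGUAGCUAA"))
```

The real ribosome does far more than a dictionary lookup, of course; the point here is only the three-letters-at-a-time reading.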
One of the questions fundamental to understanding how life evolved on Earth concerns the so-called “genetic code”. All known proteins are built from a standard set of 20 amino acids (Figure 2). While there are four letters in DNA, they are read in groups of three. Why? If they are read in groups of three, then simple combinatorics suggests that there can be a total of 4 x 4 x 4 = 64 unique codons, yet they code for only 20 amino acids, i.e. in certain cases multiple codons code for the same amino acid. For instance, arginine (R) is encoded by six different mRNA codons (AGG, AGA, CGA, CGC, CGG, CGU; in mRNA, uracil [U] takes the place of thymine). Why? A huge number of amino acids exist, yet only 20 are used in the formation of proteins. While many mysteries surround the origin of the genetic code and are still actively researched, they are not the subject of this article. The background provided in this article is intended to serve as a foundation for some of the research questions I am currently interested in regarding proteins.
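The arithmetic above is easy to check with a few lines of Python. This is a sketch assuming the standard genetic code written in mRNA letters. The variable names (`CODON_TABLE`, `codons_for` and so on) are mine, chosen for illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code: codons enumerated in U, C, A, G order,
# paired with one-letter amino acid codes ("*" marks a stop codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

n_codons = len(CODON_TABLE)                                  # 4 x 4 x 4
n_amino_acids = len([aa for aa in codons_for if aa != "*"])  # stops excluded
arginine_codons = sorted(codons_for["R"])

print(n_codons, "codons for", n_amino_acids, "amino acids")
print("Arginine (R):", arginine_codons)
print("Stop codons:", sorted(codons_for["*"]))
```

Three of the 64 codons are stop signals, so 61 codons are shared among the 20 amino acids.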
The sequence of a protein refers to the amino acids that are covalently linked together into a polypeptide chain. The bottom of Figure 1 illustrates a protein, the sequence of which would be:
– Trp – Phe – Gly – Ser –
This sequence shows a small segment, with each amino acid represented by its three-letter code. The more usual practice is to use single-letter identifiers for each amino acid. An example of a haemoglobin protein sequence from a bar-headed goose (Anser indicus) is shown below:
The sequence of a protein is very informative. Apart from telling us the sequence of amino acids (duh!!!) it is frequently used to determine similarity.
For example, when a new protein is discovered, its sequence is determined and compared to the sequences of other proteins. The degree of similarity found in this comparison allows an initial characterisation of the newly discovered protein. Another use of protein sequences is to determine evolutionary relationships. A group of sequences from proteins carrying out similar functions in different organisms can be compared to infer molecular phylogenies, which can then inform organismal phylogenies (Figure 3). Yet another use of protein sequence data is simply to compare a mutated protein with a normal one to identify the mutation resulting in an impaired phenotype. An example would be a comparison between normal haemoglobin and the variant that results in sickle cell disease. Comparing the proteins in these two cases allows the mutation and its site to be identified (Figure 4).
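The sickle cell comparison can be sketched in a few lines of Python. Some assumptions: the two strings are only the first ten residues of the mature haemoglobin beta chain as commonly reported (numbered without the initiator methionine), so check them against a database record before relying on them. The function names are my own. The two sequences must also have the same length, because this position-by-position comparison does not handle insertions or deletions; real comparisons use an alignment step first.

```python
def find_substitutions(reference, variant):
    """Return (position, reference_residue, variant_residue) for each mismatch.

    Positions are 1-based. Both sequences must have the same length;
    gaps are not handled in this sketch.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have the same length")
    return [
        (pos, ref, var)
        for pos, (ref, var) in enumerate(zip(reference, variant), start=1)
        if ref != var
    ]


def percent_identity(seq_a, seq_b):
    """Percentage of positions at which the two sequences agree."""
    mismatches = len(find_substitutions(seq_a, seq_b))
    return 100.0 * (len(seq_a) - mismatches) / len(seq_a)


# Assumed fragments: first ten residues of the mature beta-globin chain.
normal_fragment = "VHLTPEEKSA"
sickle_fragment = "VHLTPVEKSA"

print(find_substitutions(normal_fragment, sickle_fragment))
print(percent_identity(normal_fragment, sickle_fragment))
```

The single mismatch reported is the glutamic acid (E) to valine (V) change at position 6, and the same identity score is the kind of number used for a first-pass similarity estimate.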
Information obtained from protein sequence characterisation can be used in protein engineering. Proteins can be used for intervention purposes, in which case protein engineering (amongst other fields) designs protein sequences that can carry out those interventive roles.
Future protein bioinformatics articles are going to build on this. Data and code (where needed) will be provided for handling protein data and extracting different types of information from it. This article served the purpose of setting up some background. Although some information was provided regarding protein sequences here, future posts will include some analysis.
If your research is centred around protein sequences and you need some help with programming or generating certain protein sequence data, feel free to get in touch. I cannot say I will be able to help you out, but there is a 63.87% chance that I might. Use the information on the contact us page to get in touch. Don’t forget to like and share this post with your friends.