Inferring Protein Sequences from Nucleic Acid Sequences
The acquisition of protein sequence information by chemical methodology reached its peak in the late 1970s. However, even with many improvements in this technology, it remained a relatively labor-intensive and time-consuming endeavor.
Around that time, two developments arrived together: cloning techniques that allowed the preparation of DNA copies of mRNA (called complementary DNA, or cDNA) and reliable DNA sequencing techniques that were faster and cheaper. These spawned the methodology of recombinant DNA (Berg, 1993) and soon displaced chemical methodology as a source of protein sequence information.
Two different methods for sequencing DNA were independently developed by Sanger (1993) and Gilbert (1993), for which they shared the Nobel Prize in Chemistry for 1980. Sanger's procedure was eventually adopted as the method of choice. He thus established the methodology for determining the sequences of both proteins and nucleic acids, an achievement of inestimable importance to further studies in cell and molecular biology.
Initially, knowledge of at least a portion of the amino acid sequence of a target protein was necessary in order to synthesize DNA probes with which to screen appropriate libraries for the corresponding cDNA sequence, and the techniques of protein sequencing were highly valuable in achieving this. Eventually, however, DNA manipulations in general, and DNA sequencing in particular, became sufficiently efficient that this information was no longer necessary. This progress finally culminated in the rapid determination of whole genomes, including that of Homo sapiens.
The accumulation of genome sequence data on a massive scale continues unabated. As a result, the overwhelming majority of protein sequence information that has been collected and is available today, spanning the entire range of living (and some extinct) organisms, is based not on direct measurements but on translated nucleic acid sequences. Indeed, between 10 and 20% of the predicted human proteome has still not been directly identified in an appropriate sample by any technique (Kim et al., 2014; Wilhelm et al., 2014).
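To make concrete what "translated nucleic acid sequences" means, the following is a minimal Python sketch of conceptual translation using the standard genetic code. The function name translate_cdna and the example sequence are hypothetical and chosen only for illustration. The sketch assumes a single forward reading frame that starts at the first ATG, which real gene prediction does not take for granted.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cdna(cdna: str) -> str:
    """Translate a cDNA (or mRNA) string from its first ATG to the first stop codon.

    Illustrative assumption: only the forward strand and the frame set by
    the first ATG are considered.
    """
    seq = cdna.upper().replace("U", "T")  # accept RNA input as well
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":  # stop codon ends the open reading frame
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequence, not taken from any real gene.
print(translate_cdna("ccATGGCCAAGTGGTAAgg"))  # MAKW
```

The result is only an inferred sequence: it says nothing about which exons are actually used or whether the product is later modified.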
Although nucleic acid sequencing has provided a plethora of information (that would have required decades to obtain by direct protein sequencing, even with the advances in mass spectrometry), it has also presented new challenges, the most important of which are the detection of splice variants (alternatively expressed forms of a given gene) and of post-translational modifications (PTMs).
While the number of genes predicted to make up the human genome plummeted during the process of elucidating it (from somewhere around 150 000 to the eventually determined number of about 20 000), the number of splice variants and particularly the extent of PTM in the actual expressed proteome rose dramatically.
In the end it became clear that the concept of 'one gene, one protein' (Beadle and Tatum, 1941; Berg and Singer, 2003) was a vast oversimplification. Rather than depend on a huge multiplicity of unique genes to drive biological complexity, nature found ways to modify a more restricted group of chemically distinct proteins through splicing and chemical modification, and thus to increase the number of functions they could perform or participate in. Unfortunately, simple DNA sequences are not reliable indicators of which splicing events will occur and under what conditions, and consensus sequences for PTMs do not guarantee that a site will actually be modified.
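The limits of consensus sequences can be illustrated with a short Python sketch. It scans an inferred protein sequence for the widely used N-glycosylation sequon N-X-S/T, where X is any residue except proline. This motif is not named in the text above and is used here only as an example. The function name and test sequence are hypothetical.

```python
import re

# N-X-[S/T] sequon with X != P; the lookahead lets overlapping motifs be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def candidate_n_glyc_sites(protein: str) -> list[int]:
    """Return 1-based positions of asparagines that fit the N-X-S/T motif.

    These are candidates only: a motif match does not show that the
    site is actually glycosylated in the expressed protein.
    """
    return [m.start() + 1 for m in SEQUON.finditer(protein.upper())]


# Hypothetical sequence: NGT and NKS match, NPS is excluded by the proline.
print(candidate_n_glyc_sites("MNGTANPSNKSA"))  # [2, 9]
```

Each reported position is a hypothesis to be tested experimentally, which is why direct measurement of the expressed protein remains necessary.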
Furthermore, the majority of PTMs are governed by only very loose rules regarding site characteristics, and in many cases there are no discernible rules at all. Thus the objectives of protein sequence analyses have shifted from determining the order of amino acids to determining exon usage and downstream covalent modifications. Mass spectrometry, in its various manifestations, is overwhelmingly the method of choice for such analyses.
Date added: 2024-06-13