Bioinformatics, Computational Biology and Proteomics
With the sequencing of the human genome, intensive effort has been devoted to its analysis to determine the number and transcriptional regulation of the encoded genes. Much has been learned from comparative genomics, as genomes from mice, rats, chimpanzees, and a variety of prokaryotes are compared in an effort to understand the nature of genes and their transcriptional regulation. The vast amount of genomic data that has to be "mined" has required the development of computational methods and computer programs to enable the analysis. Two relatively new fields have subsequently arisen: bioinformatics and computational biology. (On a personal note, the words computational biology seem somewhat restrictive, since the field of computational chemistry, which has a longer history, has significant overlap with "computational biology". I prefer computational biochemistry.) These fields have significant overlap (as do physical chemistry/chemical physics and biochemistry/molecular biology/chemical biology), so I defer to others to define them.
The NIH Biomedical Information Science and Technology Initiative Consortium: "This consortium has agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems."
The National Center for Biotechnology Information (NCBI 2001) offers this definition of bioinformatics:
bioinformatics: "Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information."
What comes after the sequencing of the genome? The transcriptome, the complete set of transcribed RNA sequences and their biological functions, and the immensely complex proteome, the complete set of translated proteins, are obvious candidates. Here are some definitions of proteomics:
Proteomics: "the qualitative and quantitative comparison of proteomes (PROTEin complement to a genOME) under different conditions to further unravel biological processes" from ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics
Proteomics (Pasteur Institute): "Proteomics aims at quantifying the expression levels of the complete protein complement (the proteome) in a cell at any given time. While proteomics research was initially focused on two-dimensional gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes the function of large sets of proteins. It is thus often used as a synonym for functional genomics."
Richard Burgess, UW Madison, includes the following activities in proteomics, which will revolutionize our understanding of normal and disease processes in cells (C&E News, July 31, 2000, pg 31):
- High-throughput expression and purification of proteins
- Protein profiling, using 2D gel electrophoresis and mass spectrometry to study proteins expressed in a cell
- Protein-protein interaction studies to see which proteins function together using the yeast two hybrid method
- Pathway analysis to understand signal transduction and other complex cell processes
- Large scale protein folding and 3D structure studies
- Bioinformatics analysis of proteomic data
This web book has been developed as a first semester biochemistry text. I have made a conscious choice to limit the scope of the material to exclude content covered in detail in a molecular biology/genetics class. Hence, this text will not discuss in significant detail the genome and transcriptome, and mechanisms of replication, transcription, or translation. However, with its emphasis on protein structure and function, proteomics is a logical candidate for inclusion.
In the last several years, computational biology/chemistry and web-based programs have become available for the systematic analysis of individual proteins, and for the comparative analysis of many proteins, based on either their DNA or amino acid sequence. Clearly the ultimate goal in the description of a protein would be to determine, from the amino acid or nucleotide sequence, the three dimensional structure of a protein and its biological function, including all its binding partners. Here is a list of typical properties of a protein that can be determined by input of an appropriate sequence (for a protein with known or unknown 3D structure) into web-based programs:
- protein sequence from a DNA sequence, and the reverse
- isoelectric point
- Ramachandran plot
- glycosylation/phosphorylation sites
- secondary structure prediction
- hydrophobicity prediction
- 3D structure based on structures of homologous proteins (homology modeling)
- determination of evolutionary relationships among organisms.
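The first item in the list above, obtaining a protein sequence from a DNA sequence, can be sketched in a few lines. This is a minimal illustration, not the method used by any particular web tool: it assumes the input is the coding strand, reads only the first reading frame, uses the standard genetic code, and stops at the first stop codon. The function name `translate` is hypothetical.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering
# of the first, second, and third codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA sequence in reading frame 1.

    ASSUMPTION: only A, C, G, T are present; any other character raises
    KeyError. Translation stops at the first stop codon ('*').
    """
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

The reverse direction (protein to DNA) is not unique, since most amino acids are encoded by more than one codon, so a reverse tool has to report degenerate codons or pick one by some convention.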
Here is a list of proteome web resources and tutorials
- Proteomics Portal
- Primers in Bioinformatics | Bioinformatics - Beginning Tutorial
- Bioinformatics Tutorial: NCBI
- ExPASy Proteomics tools
- Human Proteomics Initiative
- SWISS-PROT Protein Knowledgebase User Manual (see 3.1.1 for entry names)
- Protein Atlas: Track the location of proteins in cells
- Mouse Proteome: Location of proteins
- Proteomics in Genomeland: Science, Vol 291, Feb 16, 2001
- Orbigen: Enabling the Proteomics Revolution
- Animations: Proteins and Proteomics
- Protein Matchmaking - Protein Data Base Search Engine: allows superposition of similar protein structures
- Protein Structure Bioinformatics Resources
Computational biochemistry programs (such as Insight II, MOE, SwissPdb Viewer, VMD, NAMD, Autodock) are available to calculate surface electrostatic potentials, minimize energy, dock ligand molecules, and perform molecular dynamics simulations.
- Protein identification and characterization
- DNA -> Protein
- Similarity searches
- Pattern and profile searches
- Post-translational modification prediction
- Topology prediction
- Primary structure analysis
- Secondary structure prediction
- Tertiary structure
- Sequence alignment
- Biological text analysis
Voluminous databases of biomolecule sequence and structural data, as well as analysis software packages, are available at a variety of web sites, including:
- BioGrid: General Repository for Interaction (protein, NA) Datasets
- GenBank: DNA sequence database (over 100 billion bases as of 9/05), from the National Center for Biotechnology Information - NCBI
- Swiss-Prot: protein sequence database with annotation (description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), from the Swiss Institute of Bioinformatics
- ProSite: database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. From the Swiss Institute of Bioinformatics
- Swiss-2D Gel Database: from the Swiss Institute of Bioinformatics
- RCSB Protein Data Bank: Protein and nucleic acid 3D structures from x-ray crystallography and NMR spectroscopy (about 33,000 as of 9/15/05)
- SWISS-MODEL Repository: 3D comparative protein structure models (675,000) generated by the fully automated homology-modeling pipeline SWISS-MODEL. (again from Swiss Institute of Bioinformatics)
- Web Links to SBI sites
The NCBI has an extensive array of available tools (free), including:
- literature databases: including word searches in many books
- molecular databases: including nucleotide, protein, structure, genome, chemical
- Entrez: the life science search engine
- Blast Quick Start: easy way to start a BLAST search
The table below shows some of the incredible information available on the proteome and genome of each human chromosome.
Table: Human proteome and genome from the Human Proteome Initiative, Swiss Institute of Bioinformatics
This chapter will focus on predictions of secondary and tertiary structures of proteins based on computational biochemistry and bioinformatics. Specific exercises (for those enrolled in the class) using web-based bioinformatics programs will be found in Laboratory and Problem Sets.
As we have seen previously, amino acids vary in their propensity to be found in alpha helices, beta strands, or reverse turns (beta bends, beta turns). These differences can be rationalized from the structure of each amino acid, as described before.
Figure: Amino Acid Structure and propensity for secondary structure
From the databases, propensities can be calculated to determine the likelihood that a given amino acid will be in one of those structures. Glycine, for example, would have a high propensity to be in reverse turns, while Pro, a helix breaker, would have a low propensity to be in an alpha helix. A number is assigned to each amino acid for each category of secondary structure. A high number means the amino acid is more likely to be found in that structure. One of the earliest propensity scales was from Chou-Fasman, where H indicates high propensity for secondary structure, h intermediate propensity, i is indifferent, b is an intermediate breaker, and B is a significant breaker of secondary structure.
Chou-Fasman Amino Acid Propensities
Next a stretch of about 7 amino acids is taken, starting from the N-terminus of the protein. First the average alpha helical propensity for amino acids 1-7 is determined and assigned, let's say, to the middle (4th) amino acid in that sequence. Then the alpha helical propensities for amino acids 2-8 are averaged and assigned to the middle (5th) amino acid in that range. This continues until all but the first and last few amino acids have an average value assigned to them. If a contiguous stretch of amino acids has a high average propensity, those residues are probably in an alpha helix in the native protein. This process is repeated using beta strand and reverse turn propensities. The final assignments of most probable secondary structure are then made. Of course this system was tested against proteins whose tertiary structure was known. See the results for secondary structure prediction for one protein. In this example, the average propensity for four contiguous amino acids is calculated (starting with amino acids 1-4, then amino acids 5-8, etc., and continuing to the end of the polypeptide). Next this process is repeated for contiguous stretches 2-5, 6-9, etc., continuing to the end. The original Chou-Fasman propensities have been updated using known protein structures to give better predictions.
- Chou Fasman Online Secondary Structure Predictor
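The sliding-window averaging described above can be sketched as follows. This is only the averaging step, not the full Chou-Fasman procedure, which adds nucleation and extension rules. The helix propensity values are the commonly reproduced 1978 Chou-Fasman numbers as best recalled here and should be treated as an assumption to check against whichever table you use. The names `HELIX_PROPENSITY` and `windowed_average` are hypothetical.

```python
# ASSUMPTION: alpha-helix propensities as commonly tabulated from the 1978
# Chou-Fasman parameter set; verify against the table you are working from.
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16, "F": 1.13,
    "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06, "D": 1.01, "H": 1.00,
    "R": 0.98, "T": 0.83, "S": 0.77, "C": 0.70, "Y": 0.69, "N": 0.67,
    "P": 0.57, "G": 0.57,
}


def windowed_average(sequence, scale, window=7):
    """Average a per-residue scale over every full window of the sequence.

    Returns a list of (center_residue_number, mean) pairs, with residues
    numbered from 1. For window=7 the first pair is centered on residue 4,
    the next on residue 5, and so on, so the first and last few residues
    receive no value.
    """
    values = [scale[residue] for residue in sequence.upper()]
    half = window // 2
    result = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        result.append((start + half + 1, mean))
    return result
```

Calling `windowed_average(seq, HELIX_PROPENSITY)` gives the helix profile; passing a beta strand or reverse turn table instead gives the other two profiles, and the structure with the highest average over a stretch is the tentative assignment.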
Additional information about putative helices can be obtained by determining if they are amphiphilic (one side of the helix containing mostly hydrophobic side chains, with the opposite side containing polar or charged side chains). A helical wheel projection can be made. In this, a circle is drawn representing a cross-sectional view looking down the helix axis.
Figure: Helical wheel projection
The side chains are placed on the outside of the circle, staggered in a fashion determined by the fact that there are 3.6 amino acids per turn of the helix. If one side of the wheel contains predominantly nonpolar side chains while the other side has polar side chains, the helix is amphiphilic. Imagine how such helices might be packed in a protein.
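The placement of side chains on the wheel follows directly from 3.6 residues per turn: each residue is rotated 360/3.6 = 100 degrees from the previous one. The sketch below computes those angles and a simple score for how tightly the nonpolar residues cluster on one face. The score (length of the mean unit vector of the nonpolar positions) and the choice of which residues count as nonpolar are illustrative assumptions, not something defined in this text; the function names are hypothetical.

```python
import math

# 360 degrees per turn / 3.6 residues per turn = 100 degrees per residue.
DEGREES_PER_RESIDUE = 100.0

# ASSUMPTION: a simple nonpolar set chosen for illustration only.
NONPOLAR = set("AVLIMFWC")


def wheel_angles(sequence):
    """Return (residue, angle_in_degrees) for each position on the wheel."""
    return [(residue, (i * DEGREES_PER_RESIDUE) % 360.0)
            for i, residue in enumerate(sequence.upper())]


def nonpolar_clustering(sequence):
    """Length (0 to 1) of the mean unit vector of the nonpolar residues.

    Values near 1 mean the nonpolar side chains point in similar directions
    (one face of the wheel); values near 0 mean they are spread around it.
    """
    angles = [math.radians(angle)
              for residue, angle in wheel_angles(sequence)
              if residue in NONPOLAR]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles) / len(angles)
    y = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(x, y)
```

A sequence with nonpolar residues at positions i, i+3, i+4, i+7 scores high, because those positions fall on the same face, while a strictly alternating polar/nonpolar pattern spreads the nonpolar residues around the wheel and scores near zero.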
In a completely analogous fashion, a hydrophobic propensity or hydropathy can be calculated. In this system, empirical measures of the hydrophobic nature of the side chains are used to assign a number to a given amino acid. Many hydropathy scales are used. Several are based on the Δμ° of transfer of the side chains from water to a nonpolar solvent. Two commonly used scales are the Kyte-Doolittle hydropathy scale and the Hopp-Woods scale (used more like a hydrophilicity index to predict surface or water-accessible structures that might be recognized by the immune system).
Hydrophobicity Indices for Amino Acids
For a water-soluble protein, a continuous stretch of amino acids found to have a high average hydropathy is probably buried in the interior of the protein. Consider the example of bovine α-chymotrypsinogen, a 245 amino acid protein, whose sequence is shown below in single letter code.
A hydropathy plot for chymotrypsinogen (sum of hydropathies of seven consecutive residues) shows many stretches that are presumably buried in the interior of the protein.
Figure: hydropathy plot for chymotrypsinogen
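A hydropathy profile of this kind (sum over seven consecutive residues) can be sketched with the Kyte-Doolittle scale. The function name `hydropathy_profile` is hypothetical, and the sketch assumes the sequence contains only the 20 standard one-letter codes.

```python
# Kyte-Doolittle hydropathy index; more positive means more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Sum the hydropathies of each run of `window` consecutive residues.

    Returns a list of (first_residue_number, window_sum) pairs, with
    residues numbered from 1. Nonstandard letters raise KeyError.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [(start + 1, sum(values[start:start + window]))
            for start in range(len(values) - window + 1)]
```

Plotting the window sums against residue number gives the hydropathy plot; stretches with strongly positive sums are the candidates for buried regions in a water-soluble protein.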
So far we have discussed predominantly globular proteins that are soluble in water. Proteins are also found associated with membranes. Two major classes of membrane proteins are found in nature.
- peripheral membrane proteins: water soluble proteins bound reversibly and non-covalently to the membrane through electrostatic attractions between the charged head groups of the phospholipids and polar/charged groups on the protein surface. Because the binding is electrostatic, these proteins can often be released from the membrane by addition of high salt.
- integral membrane proteins: actually insert into the bilayer. These can be released from the membrane and effectively solubilized by the addition of single chain amphiphiles (detergents) which form a mixed micelle with the integral membrane protein. Nonionic detergents (Triton X-100, octylglucoside, etc.) are often used in the purification of membrane proteins. Ionic detergents (like SDS) not only solubilize the integral membrane proteins, but also denature them.
Figure: Types of membrane proteins
In some of these integral membrane proteins, large extracellular and intracellular domains of the protein are present, connected by the intramembrane regions. The intramembrane spanning region often consists of either a single alpha helix, or 7 different helical regions which zig-zag through the membrane. These transmembrane sequences can readily be determined through hydropathy calculations. For example, consider the integral membrane bovine protein rhodopsin. Its 348 amino acid sequence (in single letter code) is shown below:
Rhodopsin hydropathy plot calculations show that it contains seven transmembrane helices which wind through the membrane in a serpentine fashion.
Figure: Rhodopsin hydropathy plot
Figure: seven transmembrane helices
Rhodopsin Hydropathy Results
| No. | N terminal | transmembrane region | C terminal | type | length |
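Picking out candidate transmembrane regions from a hydropathy profile amounts to finding runs of positions whose windowed value stays above a cutoff. The sketch below works on any list of per-position values. The default cutoff of 1.6 reflects a commonly quoted rule of thumb for Kyte-Doolittle averages over a 19-residue window and is treated here as an assumption, not as the setting used to produce the results above. The function name `find_segments` is hypothetical.

```python
def find_segments(profile, cutoff=1.6, min_length=1):
    """Find runs of consecutive positions whose value exceeds `cutoff`.

    `profile` is a list of windowed hydropathy values, one per position.
    Returns (start, end) index pairs, 0-based and inclusive, for every run
    at least `min_length` positions long.
    ASSUMPTION: the default cutoff is a rule of thumb and should be tuned
    to the scale and window size actually used.
    """
    segments = []
    run_start = None
    for i, value in enumerate(profile):
        if value > cutoff and run_start is None:
            run_start = i                      # a new run begins here
        elif value <= cutoff and run_start is not None:
            if i - run_start >= min_length:    # run ended at i - 1
                segments.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the end of the profile.
    if run_start is not None and len(profile) - run_start >= min_length:
        segments.append((run_start, len(profile) - 1))
    return segments
```

For a protein like rhodopsin one would expect seven such runs; whether a given cutoff and window actually recover all seven depends on the scale chosen, so the parameters are worth varying.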
Membrane proteins can be solubilized by addition of single chain amphiphiles (detergents). The nonpolar tails of the detergents interact with the hydrophobic transmembrane domain of the membrane protein, forming a "mixed" micelle-like structure. Nonionic detergents like Triton X-100 and octyl-glucoside are often used to solubilize membrane proteins in their near native state. In contrast, ionic detergents like sodium dodecyl sulfate (with a negatively charged head group) denature proteins during the solubilization process. To study membrane proteins in a more native-like environment, proteins solubilized by nonionic detergent can be reconstituted into bilayer liposome structures using methods similar to those from Lab 1 in which you prepared dye-encapsulated large unilamellar vesicles (LUVs). However, it can be difficult to study the intra- and extracellular domains of membrane proteins in liposomes, given that one of those domains is hidden inside the liposome. A novel technique that removes this barrier was recently developed by Sligar. He created an amphiphilic protein disc with an opening in the center. The inner opening is lined with nonpolar residues, while the outer surface of the disc is polar. When the discs were added to phospholipids, small bilayers formed inside the disc. Membrane proteins like the β-2 adrenergic receptor could be reconstituted in the nanodisc bilayers, allowing solvent exposure of both the intracellular and extracellular domains of the receptor protein.
Figure: Nanodisc with membrane protein
- Experimentally Determined Hydropathy Scales
- Input amino acid sequence to determine hydropathies
- Protein Sequence Structural Features
- Membrane Protein Resources
- Membrane Proteins of Known 3D Structure
- 57 Different Amino Acid Scale Predictors from ExPASy
Protein Tertiary Structure
We are getting closer to predicting the tertiary structure of a protein, but as we have seen from molecular mechanics and dynamics calculations, it is a huge computational task. There are two basic approaches which are often combined.
- calculations using energy minimization and statistical mechanics: These "semi-empirical" techniques don't assume any given secondary structure propensities or hydrophobicities. Such methods have produced limited success with small proteins whose actual structure is known.
- homology modeling based on proteins of known structure: The structures of about 64,000 (as of 2/10) different biological macromolecules are known. This can serve as an empirical database of possible conformations. Instead of an infinite number of prototypical structures, it is becoming clear that there may be a reasonably low number (in the hundreds) of basic structural motifs that are used over and over in nature. By aligning the amino acid sequences of different proteins, and comparing their properties (such as secondary structure propensities, hydrophobicities, etc.), probable low energy structures of the new protein can be determined. This initial structure can be run through multiple minimization and dynamics simulations to produce a tentative "lowest" energy structure. The structure should be compact (checked through calculation of packing density) and experimental techniques (such as spectroscopic methods) should be employed to validate the structure.
Many mechanisms of the actual folding process have been postulated, most of which have some experimental support. In one, a hydrophobic collapse of the protein produces a seed structure upon which secondary structure forms and final tertiary collapse occurs. Alternatively, initial formation of an alpha helix might serve as the seed structure. A combination of the two is likely. In one scenario, two small amphiphilic helices might form which interact through their nonpolar faces to produce the initial seed structure.
Many studies have been done on a domain of the protein villin. A project at Stanford University (Folding@home) actually allows you to process protein folding data on your own computer when you're not using it (an example of distributed computing). The example below shows one simulation of length greater than 1 μs. In the simulation, the protein collapses to a near native-like state, then unfolds again as it iteratively probes conformational space while it "seeks" the global energy minimum.
Zhou and Karplus recently simulated the folding of residues 10-55 of Staphylococcus aureus protein A which form a 3-helix bundle structure.
Figure: 3-helix bundle
Using molecular dynamics, they carried out 100 folding simulations. Two types of folding trajectories were noted.
- In the first type, helices form early (70% within 10 ns), but the fraction of native interhelical contacts (indicating proper packing of the helices together) and the overall packing density are not similar to the native state. Then the helices diffuse and collide with each other (in the rate-limiting step) until the native state is reached at about 19 μs. In this model, non-obligatory intermediates can occur (due to collapse to non-native interhelical packing in the rate-limiting step) which could slow down folding.
Figure: helices form early
- In another type, there is a simultaneous and quick partial helix formation and collapse (90% at 200 ns), to a state which is similar to the molten globule. At this point, only about 20% of the native contacts are present. The final tertiary structure is achieved after a slow process of forming native contacts within the compact state, which takes about 500 μs.
Figure: simultaneous and quick partial helix formation and collapse
The Fersht lab has been combining experimental and theoretical approaches to the folding/unfolding of another three helix bundle protein, Engrailed homeodomain.
Figure: Engrailed homeodomain
This protein is among the fastest folding and unfolding proteins known (μs time scale). This time frame is also amenable to study through molecular dynamics simulations. Both sets of data support a folding pathway in which the unfolded state (U) collapses in a microsecond to an intermediate state (I) that is characterized by significant native secondary structure and mobile side chains, and that is less compact than the native state (N). The I state hence resembles the molten globule state. To more clearly understand the unfolded state, they generated a mutant (Leu16Ala) which was only marginally stable at room temperature (2.5 kcal/mol). Spectroscopic measurements (CD, NMR) showed this state to resemble the intermediate (I) state, with much native secondary structure and a 33% greater radius of gyration than the N state. In effect they could study the transient intermediate of the wild type protein more easily by making that state more stable through mutagenesis. These studies showed that the intermediate is on the folding pathway and not inhibitory to the process. Using molecular dynamics simulations, the intermediate to native state transition was shown to proceed via a transition state (TS) in which the native secondary structure is almost all present and the helices are engaged in the final packing process.
Figure: Complete Folding Pathway of Engrailed Homeodomain by Experiment and Simulation
Bradley et al. (2005) have taken another step forward in prediction of tertiary structure for small proteins (< 85 amino acids). They describe the two biggest stumbling blocks to such predictions as the huge number of conformations which must be explored (i.e. all of conformational space) and accurate determination of the energy of the solvated structures. Searching conformational space is difficult because the energy landscape around the global energy minimum can be very steep and sharp: modest side chain displacements arising from subtle main chain movements cause significant side chain packing and energy changes. The narrowness of the energy well makes it difficult to find the global minimum in stochastic conformational search processes. Energy calculations also require better (more realistic) energy functions (force fields) which show the native state to be clearly differentiated as the global minimum from the denatured (non-native) states. They conducted energy calculations on many different small proteins and produced for each protein a low resolution model. To reach this low resolution model for a given protein, they found many sequence homologs of the given target protein. These homologs were naturally occurring sequence variants found by a relatively conservative BLAST sequence search, with sequence identities of 30-60 percent. They also contained insertions and deletions compared to the target sequence, which probably are involved in surface loop structures. The target and homolog sequences were folded, generating a more diverse population of low-resolution models as starting points for all-atom refinement of the structure.
Then, using a new force field that stressed short range interactions (van der Waals, H-bonding), which would be expected to be more important for final folding of the low resolution models than long range electrostatic forces, they were able to refine the models and condense to a final low energy structure that was very close in main and side chain packing to the experimental crystal structure (resolution < 1. angstroms).
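The "sequence identity" figure quoted for the homologs is computed from an alignment of two sequences. A minimal sketch is given below, assuming the two sequences have already been aligned (for example by BLAST) and are passed in as equal-length strings with "-" marking gaps. Conventions for the denominator differ between tools; counting only columns without a gap, as done here, is one choice and is an assumption of this sketch. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap.

    Both arguments must be the rows of one pairwise alignment, so they must
    have the same length. Returns 0.0 if no column can be compared.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue          # skip insertions and deletions
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

With this convention, insertions and deletions such as those in surface loops do not lower the reported identity; dividing by the full alignment length or by the shorter sequence length instead would give smaller values.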
The holy grail in protein folding research has always been to predict the tertiary structure of a protein given its primary sequence. A similar but conceptually easier problem is to design a protein which will fold to a given structure with predicted secondary structure. Many possible sequences could be designed to fold to the desired structure, which makes this problem easier compared to the folding of a given sequence to just one native state. Kuhlman et al. have recently accomplished such a feat for a synthetic protein of 93 amino acids which they designed to fold to a unique topology not yet observed in nature. This represents a significant advance over earlier attempts in which mimics of known proteins were made. Such structures would be expected to fold in analogous fashions to the parent protein because of the necessary constraints placed by the need to fold to a compact state.
Several web sites exist that allow users to download protein folding software onto their own PC. By distributing folding calculations to many home PCs, their untapped computational power can be linked to provide the vast computational time needed to perform these calculations.
- Dill, K. and Chan H. From Levinthal Pathways to Funnels: The "New View" of Protein Folding Kinetics. Nature Structural Biology. 4, pg 10 (1997).
- K.A. Dill, S. Banu Ozkan, T.R. Weikl, J.D. Chodera and V.A. Voelz. The protein folding problem: when will it be solved? Current Opinion in Structural Biology 17: 342-346 (2007).
- Rosetta tackles the extreme Origami of protein folding
- Bradley, P. et al. Toward high-resolution de novo structure prediction for small proteins. Science. 309, 1868 (2005)
- Boyle J. A. Bioinformatics in Undergraduate Education. Biochemistry and Molecular Biology Education. 32, 236 (2004)
- Feig, A. L., & Jabri, E. Incorporation of Bioinformatics Exercises into the Undergraduate Biochemistry. Biochemistry and Molecular Biology Education. 30, 224 (2002)
- Mayor et al. The complete folding pathway of a protein from nanoseconds to microseconds. Nature 421, pg 863 (2003)
- Zhou and Karplus. Interpreting the folding kinetics of helical proteins. Nature 401, pg 400 (1999)