G1. Introduction to Bioinformatics, Computational Biology and Proteomics
With the sequencing of the human genome complete, intensive effort has been devoted to analyzing it to determine the number and transcriptional regulation of the encoded genes. Much has been learned from comparative genomics, in which genomes from mice, rats, chimpanzees, and a variety of prokaryotes are compared to help understand the nature of genes and their transcriptional regulation. The vast amount of genomic data that has to be "mined" has required the development of computational methods and computer programs to enable the analysis. Two relatively new fields have subsequently arisen: bioinformatics and computational biology. (On a personal note, the term computational biology seems somewhat restrictive, since the field of computational chemistry, which has a longer history, overlaps significantly with it. I prefer computational biochemistry.) These fields have significant overlap (as do physical chemistry/chemical physics and biochemistry/molecular biology/chemical biology), so I defer to others to define them.
The NIH Biomedical Information Science and Technology Initiative Consortium: "This consortium has agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems."
This web book has been developed as a first-semester biochemistry text, and choices have been made to limit the scope of the material to exclude content covered in detail in a molecular biology/genetics class. Hence, this text will not discuss in significant detail the genome and transcriptome, or the mechanisms of replication, transcription, or translation. However, given the text's emphasis on protein structure and function, proteomics (the characterization of the structure and function of all proteins within a cell) is a logical candidate for inclusion.
In the last several years, computational biology/chemistry tools and web-based programs have become available for the systematic analysis of individual proteins, and for the comparative analysis of many proteins, based on either their DNA or amino acid sequence. Clearly the ultimate goal in the description of a protein would be to determine, from the amino acid or nucleotide sequence, the three-dimensional structure of the protein and its biological function, including all its binding partners.
Here is a list of proteome web resources and tutorials:
- Bioinformatics and Homology Modeling: A Student-Tested Tutorial for Beginners
- ExPASy Proteomics Portal
- Animations: Proteins and Proteomics
- Protein Matchmaking - Protein Data Base Search Engine: allows superposition of similar protein structures
- SIB: Swiss Institute of Bioinformatics
- Protein Structure and Proteome Analysis
Voluminous databases of biomolecule sequence and structural data, as well as analysis software packages, are available at a variety of web sites, including:
- BioGrid: General Repository for Interaction (protein, NA) Datasets
- GenBank: DNA sequence database (over 100 billion bases as of 9/05), from the NCBI
- BLAST finds regions of similarity between biological sequences
- UniProtKB/Swiss-Prot: manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB)
- ProSite: database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. From the Swiss Institute of Bioinformatics
- Swiss-2D Gel Database: from the Swiss Institute of Bioinformatics
- RCSB Protein Data Bank: Protein and nucleic acid 3D structures from X-ray crystallography and NMR spectroscopy (about 33,000 as of 9/15/05)
- SWISS-MODEL Repository: 3D comparative protein structure models (675,000) generated by the fully automated homology-modeling pipeline SWISS-MODEL (again from the Swiss Institute of Bioinformatics)
- ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics
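As a rough illustration of how a ProSite-style pattern can be used to locate a site in a sequence, the sketch below converts a pattern into a Python regular expression and scans a sequence with it. The function names `prosite_to_regex` and `find_sites` and the short test sequence are hypothetical, made up for this example. The pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern (ProSite entry PS00001). Only the basic pattern elements are handled here; this is not the scanning software that ProSite itself provides.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple ProSite-style pattern to a Python regex string.

    Handles only the basic elements (an assumption of this sketch):
      A        a single residue
      [ST]     any one of the listed residues
      {P}      any residue except those listed
      x        any residue
      (n) or (n,m) after an element: repeat count
    Terminal anchors ('<', '>') are not handled.
    """
    tokens = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(2,4)" into its core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            token = "."
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"
        else:
            # Single residues and [..] groups are already valid regex.
            token = core
        if repeat:
            token += "{" + repeat + "}"
        tokens.append(token)
    return "".join(tokens)


def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched text) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical sequence, used only to exercise the functions.
    print(find_sites("N-{P}-[ST]-{P}", "MNGTAANPSA"))
```

In the made-up sequence, the asparagine at position 2 is reported because it is followed by G, T, A, whereas the asparagine at position 7 is skipped because the next residue is proline.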
The NCBI has an extensive array of available tools (free), including:
- literature databases: including word searches in many books
- All resources: including nucleotide, protein, structure, genome, chemical
- Entrez: the life science search engine
- Blast Quick Start: easy way to start a BLAST search
- complete human proteome from UniProtKB/Swiss-Prot
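The NCBI databases listed above can also be queried programmatically through the Entrez E-utilities. The sketch below only builds an `esearch` query URL and does not contact the server. The helper name `build_esearch_url` and the example search term are assumptions made for illustration. NCBI's E-utilities documentation should be consulted for usage limits and the full parameter list before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build (but do not send) an esearch query URL.

    database    : Entrez database name, e.g. "protein" or "nucleotide"
    term        : search expression in Entrez query syntax
    max_results : maximum number of record identifiers to return
    """
    params = {
        "db": database,
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical example query: human hemoglobin entries in the Protein database.
    print(build_esearch_url("protein", "hemoglobin AND Homo sapiens[Organism]", 5))
```

The resulting URL can be pasted into a browser or fetched with any HTTP client. The response lists record identifiers that can then be retrieved with the companion `efetch` service.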
A summary of three important sites:
• NCBI-Protein: The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.
• UniProt: The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added.
• GeneCards: GeneCards is a searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. It automatically integrates gene-centric data from ~125 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.
The table below (taken directly from Wikipedia) shows some of the incredible information available on the proteome and genome of each human chromosome.
Table: Human proteome and genome from Wikipedia
(Data source: Ensembl genome browser release 68, July 2012)
Chromosome | Length (mm) | BP | Variations | Confirmed Proteins | Putative Proteins | Pseudogenes | miRNA | rRNA | snRNA | snoRNA | misc ncRNA | Links |
1 | 85 | 249,250,621 | 4,401,091 | 2,012 | 31 | 1,130 | 134 | 66 | 221 | 145 | 106 | EBI |
2 | 83 | 243,199,373 | 4,607,702 | 1,203 | 50 | 948 | 115 | 40 | 161 | 117 | 93 | EBI |
3 | 67 | 198,022,430 | 3,894,345 | 1,040 | 25 | 719 | 99 | 29 | 138 | 87 | 77 | EBI |
4 | 65 | 191,154,276 | 3,673,892 | 718 | 39 | 698 | 92 | 24 | 120 | 56 | 71 | EBI |
5 | 62 | 180,915,260 | 3,436,667 | 849 | 24 | 676 | 83 | 25 | 106 | 61 | 68 | EBI |
6 | 58 | 171,115,067 | 3,360,890 | 1,002 | 39 | 731 | 81 | 26 | 111 | 73 | 67 | EBI |
7 | 54 | 159,138,663 | 3,045,992 | 866 | 34 | 803 | 90 | 24 | 90 | 76 | 70 | EBI |
8 | 50 | 146,364,022 | 2,890,692 | 659 | 39 | 568 | 80 | 28 | 86 | 52 | 42 | EBI |
9 | 48 | 141,213,431 | 2,581,827 | 785 | 15 | 714 | 69 | 19 | 66 | 51 | 55 | EBI |
10 | 46 | 135,534,747 | 2,609,802 | 745 | 18 | 500 | 64 | 32 | 87 | 56 | 56 | EBI |
11 | 46 | 135,006,516 | 2,607,254 | 1,258 | 48 | 775 | 63 | 24 | 74 | 76 | 53 | EBI |
12 | 45 | 133,851,895 | 2,482,194 | 1,003 | 47 | 582 | 72 | 27 | 106 | 62 | 69 | EBI |
13 | 39 | 115,169,878 | 1,814,242 | 318 | 8 | 323 | 42 | 16 | 45 | 34 | 36 | EBI |
14 | 36 | 107,349,540 | 1,712,799 | 601 | 50 | 472 | 92 | 10 | 65 | 97 | 46 | EBI |
15 | 35 | 102,531,392 | 1,577,346 | 562 | 43 | 473 | 78 | 13 | 63 | 136 | 39 | EBI |
16 | 31 | 90,354,753 | 1,747,136 | 805 | 65 | 429 | 52 | 32 | 53 | 58 | 34 | EBI |
17 | 28 | 81,195,210 | 1,491,841 | 1,158 | 44 | 300 | 61 | 15 | 80 | 71 | 46 | EBI |
18 | 27 | 78,077,248 | 1,448,602 | 268 | 20 | 59 | 32 | 13 | 51 | 36 | 25 | EBI |
19 | 20 | 59,128,983 | 1,171,356 | 1,399 | 26 | 181 | 110 | 13 | 29 | 31 | 15 | EBI |
20 | 21 | 63,025,520 | 1,206,753 | 533 | 13 | 213 | 57 | 15 | 46 | 37 | 34 | EBI |
21 | 16 | 48,129,895 | 787,784 | 225 | 8 | 150 | 16 | 5 | 21 | 19 | 8 | EBI |
22 | 17 | 51,304,566 | 745,778 | 431 | 21 | 308 | 31 | 5 | 23 | 23 | 23 | EBI |
X | 53 | 155,270,560 | 2,174,952 | 815 | 23 | 780 | 128 | 22 | 85 | 64 | 52 | EBI |
Y | 20 | 59,373,566 | 286,812 | 45 | 8 | 327 | 15 | 7 | 17 | 3 | 2 | EBI |
mtDNA | 0.0054 | 16,569 | 929 | 13 | 0 | 0 | 0 | 2 | 0 | 0 | 22 | EBI |
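One simple way to "mine" a table like this is to compare how densely protein-coding genes are packed on different chromosomes. The sketch below copies the BP and Confirmed Proteins values for four chromosomes from the table above and computes confirmed proteins per million base pairs. The choice of these four rows and the name `proteins_per_mbp` are illustrative assumptions.

```python
# (base pairs, confirmed proteins) copied from the table above
# for four chromosomes chosen for illustration.
CHROMOSOME_DATA = {
    "1": (249_250_621, 2_012),
    "13": (115_169_878, 318),
    "19": (59_128_983, 1_399),
    "Y": (59_373_566, 45),
}


def proteins_per_mbp(base_pairs: int, confirmed_proteins: int) -> float:
    """Confirmed proteins per million base pairs (Mbp)."""
    return confirmed_proteins / (base_pairs / 1_000_000)


# Density for each chromosome in the sample.
density = {
    name: proteins_per_mbp(bp, proteins)
    for name, (bp, proteins) in CHROMOSOME_DATA.items()
}

if __name__ == "__main__":
    # List chromosomes from most to least protein-dense.
    for name, value in sorted(density.items(), key=lambda item: item[1], reverse=True):
        print(f"chromosome {name}: {value:.1f} confirmed proteins per Mbp")
```

With these values, chromosome 19 comes out far denser in confirmed proteins than chromosome 1, even though chromosome 1 is more than four times longer, and the Y chromosome is the sparsest of the four.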
This chapter will describe programs that allow predictions of secondary and tertiary structures of proteins. Specific exercises using web-based bioinformatics programs can be found at the end.