Shortly after their press conferences, the two groups that had been striving for several years to map the human genome published their findings:
- The International Human Genome Sequencing Consortium (IHGSC) in the 15 February 2001 issue of Nature
- Celera Genomics, a company in Rockville, Maryland, in the 16 February issue of Science
These achievements were monumental, but before we examine them, let us be clear as to what they were not.
What was not found
- Neither group had determined the complete sequence of the human genome. Each of our chromosomes is a single molecule of DNA. Some day the sequence of base pairs in each will be known from one end to the other. But in 2001, thousands of gaps remained to be filled. What they had done was present a series of draft sequences that represented about 90% (probably the most interesting 90%) of the genome.
- Even taken together, the results did not provide an accurate count of the number of protein-encoding genes in our genome (in contrast to such genomes as those of mitochondrial DNA, the Epstein-Barr virus, and many bacteria).
One reason: the large number and large size of the introns that split these genes make it difficult to recognize the open reading frames (ORFs) that encode proteins.
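To make the difficulty concrete, here is a minimal sketch of a naive ORF scan. It is an illustration only, not the method either group used: the function name `find_orfs`, the toy exon and intron sequences, and the `min_codons` threshold are all assumptions made up for this example. The scan reads forward-strand codons from an ATG to the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) spans of ATG...stop ORFs in the three
    forward reading frames. The codon count includes the stop codon."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)

# Hypothetical toy gene: two short exons separated by one intron.
exon1 = "ATGGCTGAA"
intron = "GTATGACTTCAG"  # invented intron; happens to carry an in-frame TGA
exon2 = "CCTGGATAA"

spliced = exon1 + exon2           # what the mature mRNA would encode
genomic = exon1 + intron + exon2  # what a genome scan actually sees

print(find_orfs(spliced))  # ORF spanning both exons
print(find_orfs(genomic))  # ORF cut short by the stop codon inside the intron
```

In this toy case the spliced sequence yields one ORF covering both exons, while the genomic sequence yields a shorter ORF that ends inside the intron. Real gene prediction has to model splice sites rather than rely on a scan like this, and human introns are far longer than the one invented here.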
The number of genes was much smaller than predicted
The two groups came up with slightly different estimates of the number of protein-encoding genes, but both in the range of 30 to 38 thousand:
- barely twice the number found in
- Drosophila (~17,000 genes)
- C. elegans (<22,000 genes)
- and representing only 1–2% of the total DNA in the cell;
- and a third of the 100,000 genes that many had predicted would be found.
- (By 2011, the number had been reduced to some 21,000.)
Are the tiny roundworm and fruit fly almost as complex as we are?
Probably not, although we share many homologous genes (called "orthologs") with both these animals. But many of our protein-encoding genes produce more than one protein product (e.g., by alternative splicing of the primary transcript of the gene). On average, each of our ORFs produces 2 to 3 different proteins. So the human "proteome" (our total number of proteins) may be 10 or more times larger than that of the fruit fly and roundworm.
A larger proportion of our genome encodes transcription factors and is dedicated to control elements (e.g., enhancers) to which these transcription factors bind. The combinatorial use of these elements probably provides much greater flexibility of gene expression than is found in Drosophila and C. elegans.
Gene diversity and density
Some human genes are giants, such as dystrophin, with its 79 exons spread over 2.4 million base pairs of DNA, and titin, whose 363 exons can encode a single protein of up to about 38,000 amino acids. The average human gene contains 4 exons totaling 1,350 base pairs and thus encodes an average protein of 450 amino acids. The density of genes on the different chromosomes varies from 23 genes per million base pairs on chromosome 19 (for a total of 1,400 genes) to only 5 genes per million base pairs on chromosome 13.
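The arithmetic behind these figures can be checked with a short sketch. The function names below are hypothetical, and the calculation ignores the stop codon, so it is a rough check of the numbers quoted above rather than a precise model.

```python
BASES_PER_CODON = 3  # each amino acid is specified by three base pairs

def protein_length(coding_bp):
    """Approximate number of amino acids encoded by coding_bp base pairs
    (the stop codon is ignored for simplicity)."""
    return coding_bp // BASES_PER_CODON

def implied_span_mb(total_genes, genes_per_mb):
    """Length of DNA, in millions of base pairs, implied by a gene count
    and an average gene density."""
    return total_genes / genes_per_mb

# Average gene: 1,350 coding base pairs.
average_protein = protein_length(1350)

# Chromosome 19: 1,400 genes at 23 genes per million base pairs.
chr19_mb = implied_span_mb(1400, 23)

# Dystrophin: 79 exons spread over 2.4 million base pairs.
dystrophin_bp_per_exon = 2_400_000 / 79

print(average_protein)                # amino acids in the average protein
print(round(chr19_mb, 1))             # implied size of chromosome 19 in Mb
print(round(dystrophin_bp_per_exon))  # average genomic span per exon
```

The first result reproduces the 450 amino acids given in the text. The second suggests that the chromosome 19 figures correspond to roughly 61 million base pairs of sequence. The third shows that dystrophin averages about 30,000 base pairs of genomic DNA per exon, most of it intron.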
Humans have many genes not found in invertebrates
Humans, and presumably most vertebrates, have genes not found in invertebrate animals like Drosophila and C. elegans. These include genes encoding:
- antibodies and T cell receptors for antigen (TCRs)
- the transplantation antigens of the major histocompatibility complex (MHC) (HLA, the MHC of humans)
- cell-signaling molecules including the many types of cytokines
- the molecules that participate in blood clotting
- mediators of apoptosis. Although these proteins occur in Drosophila and C. elegans, we have a much richer assortment of them.
Both groups added to the list of human genes that have arisen by repeated duplication (e.g., by unequal crossing over) from a single precursor gene; for example, the several hundred genes for olfactory receptors and the various globin genes.
Both groups verified the presence of large amounts of repetitive DNA. In fact, this DNA, with similar sequences occurring over and over, is one of the main obstacles to assembling the DNA sequences in proper order. It includes:
- LINEs (long interspersed elements)
- SINEs (short interspersed elements), including Alu elements
- DNA transposons
All told, repetitive DNA probably accounts for over 50% of our total genome.
What remains to be done?
- Keep looking for genes.
As of March 2010, 19,956 protein-encoding genes had been positively identified, but there probably are a thousand or more still to be found.
- Determine the human proteome; that is, the total complement of proteins we synthesize.
- Analyze how clusters of genes are coordinately expressed
- in various types of cells
- at different times in the life of a cell.
- Determine the genomes of other vertebrates.
This will not only help us recognize more human genes but will give us insight into what makes us unique.
Already we know that large sections of our genome have closely related homologs in the mouse. Examples:
- The collection of genes — and even their order — on human chromosome 17 matches closely those of mouse chromosome 11. The same is true of human chromosome 20 and mouse chromosome 2.
- Humans and mice (also rats) share several hundred absolutely identical stretches of DNA extending for 200–800 base pairs.
- Some are present in the exons of genes, especially genes involved in RNA processing.
- Some are found in or near the introns of genes, especially genes encoding proteins involved in DNA transcription.
- Some are found between genes — especially those, like Pax6, essential to embryonic development — and may serve as enhancers.