4.4: Genome Analysis by Large Scale Sequencing

Whole genomes can be sequenced both by random shot-gun sequencing and by a directed approach using mapped clones.

Figure 4.11. Directed sequencing of BAC contigs.

The results of the Celera and public collaboration on the fly sequence were published in early 2000, and descriptions of the human genome sequence were published separately by Celera and the IHGSC in 2001. Neither genome was completely sequenced as of 2001, but both are extensively sequenced and are stimulating a major revolution in the life sciences.

Which approach is wiser is still a matter of debate, and the answer depends to some extent on how thoroughly one needs to sequence a complex genome. For instance, a publicly accessible sequence of the mouse genome at 3X coverage was recently generated by the shotgun approach. Other genomes will likely be “lightly sequenced” at a similar coverage. But a full, high-quality sequence of mouse will likely use aspects of the more directed approach. Also, the Celera assembly (primarily shotgun sequence) used the public data on the human genome sequence as well. Thus current efforts use both rapid sequencing by shotgun methods and sequencing of mapped clones.

Survey of sequenced genomes

Genome sequences are now available for many species, covering an impressive phylogenetic range. These include more than 28 eubacteria, at least 6 archaea, a fungus (the yeast Saccharomyces cerevisiae), a protozoan (Plasmodium falciparum), a worm (the nematode Caenorhabditis elegans), an insect (the fruit fly Drosophila melanogaster), two plants (Arabidopsis and, soon, rice), and two mammals (human, Homo sapiens, and mouse, Mus domesticus). Some information about these is listed in Table 4.4.

Table 4.4. Sequenced genomes. This table is derived from the listing of “Complete Genomes Mapped on the KEGG Pathways (Kyoto Encyclopedia of Genes and Genomes)” at

www.genome.ad.jp/kegg/java/org_list.html

Additional genomes have been added, but only samples of the bacterial sequences are listed.

Species	Genome Size (bp)	Genes encoding Protein	Genes encoding RNA	Total Enzymes	Category
Eubacteria
Escherichia coli	4,639,221	4,289	108	1,254	gram negative
Haemophilus influenzae	1,830,135	1,717	74	571	gram negative
Helicobacter pylori	1,667,867	1,566	43	394	gram negative
Bacillus subtilis	4,214,814	4,100	121	819	gram positive
Mycoplasma genitalium	580,073	467	36	202	gram positive
Mycoplasma pneumoniae	816,394	677	33	226	gram positive
Mycobacterium tuberculosis	4,411,529	3,918	48	-	gram positive
Aquifex aeolicus	1,551,335	1,522	50	-	hyperthermophilic bacterium
Borrelia burgdorferi	1,230,663	1,256	23	176	Lyme disease spirochete
Synechocystis sp.	3,573,470	3,166	49	702	cyanobacterium
Archaebacteria
Archaeoglobus fulgidus	2,178,400	2,407	49	439	S-metabolizing archaea
Methanococcus jannaschii	1,739,934	1,735	43	441	archaea
Methanobacterium thermoautotrophicum	1,751,377	1,871	47	558	archaea
Eukaryotes
Saccharomyces cerevisiae	12,069,313	6,064	262	861	fungi
Caenorhabditis elegans	97,000,000	18,424		-	nematode
Drosophila melanogaster	180,000,000	13,601			insect, fly, 120 Mb sequenced
Arabidopsis thaliana	115,500,000	25,706			plant, complete
Homo sapiens	3,200,000,000	30,000-40,000			human, draft + finished
Mus domesticus	3,000,000,000				mouse, draft

Genome size

Bacterial genomes range in size from 0.58 to almost 5 million bp (0.58 to almost 5 Mb). E. coli and B. subtilis, two of the most intensively studied bacteria, have among the largest genomes and the largest numbers of genes. The genome of the yeast Saccharomyces cerevisiae is only 2.6 times as large as that of E. coli. The genome of humans is almost 700 times larger than that of E. coli. However, genome size is not a direct measure of genetic content over long phylogenetic distances. One needs to examine the fraction of the genome that codes for protein or contains other important information. Let’s look at sizes and numbers of genes in different genomes.
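The size ratios quoted above can be checked directly from the genome sizes in Table 4.4. The following is a minimal sketch; the dictionary and function names are illustrative and not part of the original text.

```python
# Genome sizes in base pairs, copied from Table 4.4.
GENOME_SIZE_BP = {
    "Escherichia coli": 4_639_221,
    "Saccharomyces cerevisiae": 12_069_313,
    "Homo sapiens": 3_200_000_000,
}


def size_ratio(species, reference="Escherichia coli"):
    """Return how many times larger the genome of `species` is than `reference`."""
    return GENOME_SIZE_BP[species] / GENOME_SIZE_BP[reference]


for name in ("Saccharomyces cerevisiae", "Homo sapiens"):
    print(f"{name}: {size_ratio(name):.1f} times the E. coli genome")
```

With the table values, the yeast ratio comes out near 2.6 and the human ratio just under 700, in line with the text.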

Gene size and number

The average gene size is similar among bacteria, at around 1100 bp. Very little DNA separates most bacterial genes; in E. coli there is an average of only 118 bp between genes. Since the gene size varies little, the number of genes varies over as wide a range as the genome size, from 467 genes in M. genitalium to 4289 in E. coli. Thus within bacteria, which have little noncoding DNA, the number of genes is proportional to the genome size.

Saccharomyces cerevisiae has one gene every 1900 bp on average, which could reflect both an increase in gene size and a somewhat greater distance between genes. Both bacteria and yeast show a much denser packing of genes than is seen in more complex genomes.
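The gene densities discussed above can be approximated by dividing genome size by gene count. The sketch below uses values from Table 4.4 and assumes that both protein-coding and RNA genes are counted; the text does not say which count its figures are based on, so treat the exact numbers as indicative only. The names used are illustrative.

```python
# (genome size in bp, protein-coding genes, RNA genes), copied from Table 4.4.
GENOMES = {
    "Mycoplasma genitalium": (580_073, 467, 36),
    "Bacillus subtilis": (4_214_814, 4_100, 121),
    "Escherichia coli": (4_639_221, 4_289, 108),
    "Saccharomyces cerevisiae": (12_069_313, 6_064, 262),
}


def bp_per_gene(species):
    """Average number of base pairs of genome per gene (protein + RNA genes)."""
    size_bp, protein_genes, rna_genes = GENOMES[species]
    return size_bp / (protein_genes + rna_genes)


for name in GENOMES:
    print(f"{name}: about {bp_per_gene(name):.0f} bp per gene")
```

Under this assumption the three bacteria fall between roughly 1000 and 1150 bp per gene despite an eight-fold difference in genome size, while yeast comes out near 1900 bp per gene.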

Data on a large sample of human genes show that they are much larger than bacterial genes, with the median being about 14 times larger than the 1 kb bacterial genes. This is not because most human proteins are substantially larger; bacterial proteins average about 350 amino acids in length, which is similar to the median size of human proteins. The major difference is the large amount of intronic sequence in human genes.

Table 4.5. Average size of human genes and parts of genes. This is based on information in the IHGSC paper in Nature, and derived from analysis of 1804 human genes.

	Median	Mean
Internal exon	122 bp	145 bp
Number of exons	7	8.8
Length of each intron	1023 bp	3365 bp
3’ UTR	400 bp	770 bp
5’ UTR	240 bp	300 bp
Coding sequence	1100 bp	1340 bp
Length of protein encoded	367 amino acids	447 amino acids
Genomic extent	14,000 bp	27,000 bp
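As a rough illustration of how little of a typical human gene is coding, the coding sequence length in Table 4.5 can be divided by the genomic extent. This is only a sketch: a ratio of medians (or of means) is not the same as the median (or mean) of per-gene ratios, so the result is an approximation. The names used are illustrative.

```python
# (coding sequence in bp, genomic extent in bp), copied from Table 4.5.
HUMAN_GENE = {
    "median": (1_100, 14_000),
    "mean": (1_340, 27_000),
}


def coding_fraction(statistic):
    """Approximate fraction of a gene's genomic extent that is coding sequence."""
    coding_bp, extent_bp = HUMAN_GENE[statistic]
    return coding_bp / extent_bp


for statistic in HUMAN_GENE:
    print(f"{statistic}: about {coding_fraction(statistic):.0%} coding")
```

Either way, well under a tenth of the genomic extent is coding sequence, consistent with most of the gene being intronic.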

Figure 4.14. Genome size and number of genes in species ranging from bacteria to humans.

Alternative splicing is common in human genes

An earlier, lower estimate was that alternative splicing occurs in 35% of human genes. However, recent data show that this fraction is larger.

For Chromosome 22:

642 transcripts cover 245 genes, 2.6 transcripts per gene
2 or more transcripts for 145 (59%) of genes

For Chromosome 19:

1859 transcripts cover 544 genes, 3.2 transcripts per gene

This contrasts with the situation in worm, in which alternative splicing occurs in 22% of genes. The increased genetic diversity from alternative splicing may contribute considerably to the greater complexity of humans, not just the increase in the number of genes.

Estimates of number of human genes

The estimated number of human genes has varied greatly over recent years. Some of these numbers have been widely quoted, and it may be useful to list some of the sources of these estimates.

mRNA complexity (association kinetics): 40,000 genes
Average gene size of 30,000 bp: 100,000 genes
Number of CpG islands: 70,000 to 80,000
Unigene clusters of ESTs: 35,000 to 125,000
More rigorous EST clustering: 35,000 genes
Comparison to pufferfish: 30,000 genes
Extrapolate from gene counts on chromosomes 21 and 22 (which are finished): 30,000 to 35,500 genes

Using the draft human sequence from July 2000, the IHGSC constructed an Initial Gene Index for human. They used the Ensembl system at the Sanger Centre. They started with ab initio predictions by Genscan, which were then confirmed by similarity to proteins, mRNAs, ESTs, and protein motifs (Pfam database) from any organism. This led to an initial set of 35,500 genes and 44,860 transcripts in the Ensembl database. After reducing fragmentation, merging with known genes, and removing contaminating bacterial sequences, they were left with 31,778 genes. After taking into account residual fragmentation, and the rate at which true genes are found by a similar analysis, the estimate remains about 32,000 genes. However, it is an estimate and is subject to change as more annotation is completed.

Starting with this estimate that the human genome contains about 32,000 genes, one can calculate how much of the genome is coding and how much is transcribed. If the average coding length is 1400 bp, then 1.5% of the human genome consists of coding sequence. If the average genomic extent per gene is 30 kb, then 33% of the human genome is “transcribed”.
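The arithmetic behind these percentages can be sketched as follows. The text does not state the genome size used in the calculation, so the genome size is left as a parameter here: an assumed size of 2.9 Gb reproduces the quoted 1.5% and 33%, while the 3.2 Gb listed in Table 4.4 gives slightly lower values. The function and constant names are illustrative.

```python
def genome_fraction(n_genes, bp_per_gene, genome_size_bp):
    """Fraction of the genome covered by n_genes segments of bp_per_gene each."""
    return n_genes * bp_per_gene / genome_size_bp


N_GENES = 32_000      # estimated number of human genes (from the text)
CODING_BP = 1_400     # average coding length per gene (from the text)
EXTENT_BP = 30_000    # average genomic extent per gene (from the text)

# 2.9e9 is an assumed genome size; 3.2e9 is the value listed in Table 4.4.
for genome_size_bp in (2.9e9, 3.2e9):
    coding = genome_fraction(N_GENES, CODING_BP, genome_size_bp)
    transcribed = genome_fraction(N_GENES, EXTENT_BP, genome_size_bp)
    print(f"{genome_size_bp / 1e9:.1f} Gb: "
          f"{coding:.1%} coding, {transcribed:.0%} transcribed")
```

The conclusion is not sensitive to the exact genome size: only a percent or two of the genome is coding, while roughly a third lies within transcribed regions.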

Summary of number of genes in eukaryotic species:

Human: 32,000 (still uncertain)
Fly: 13,338
Worm: 18,266
Yeast: 6,144
Mustard weed: 25,706
Human: 2x number of genes in fly and worm
Human: more alternative splicing, perhaps 5x number of proteins as in fly or worm

Assignment of functions to genes

Genes encoding proteins and RNAs can be detected with considerable accuracy using computational tools. Note that even for an extensively studied organism like E. coli, the number of genes found by sequence analysis (4289 encoding proteins) is far greater than the number that can be assigned as encoding a particular enzyme (1254). The discrepancy between genes found in the sequence and those with known function (i.e. assigned as encoding an enzyme) is greater for some poorly characterized organisms such as the Lyme disease-causing spirochete Borrelia burgdorferi.
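The size of this discrepancy can be made concrete with the counts in Table 4.4. The sketch below treats the “Total Enzymes” column as the number of protein-coding genes assigned as encoding an enzyme, as the paragraph above does; the names used are illustrative.

```python
# (protein-coding genes, genes assigned as encoding an enzyme), from Table 4.4.
ANNOTATION = {
    "Escherichia coli": (4_289, 1_254),
    "Borrelia burgdorferi": (1_256, 176),
}


def enzyme_fraction(species):
    """Fraction of protein-coding genes assigned as encoding an enzyme."""
    protein_genes, enzymes = ANNOTATION[species]
    return enzymes / protein_genes


for name in ANNOTATION:
    print(f"{name}: {enzyme_fraction(name):.0%} of protein-coding genes "
          "assigned as enzymes")
```

On these numbers, a little under 30% of E. coli protein-coding genes have an enzyme assignment, compared with about 14% for B. burgdorferi.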

The many genes with unassigned function present an exciting challenge both in bioinformatics and in biochemistry, cell biology, and genetics. Large collaborations have been initiated for a comprehensive genetic and expression analysis of some organisms. For instance, projects are underway to make mutations in all detected genes in Saccharomyces cerevisiae and to quantify the level of stable RNA from each gene in a variety of growth conditions, through the cell cycle, and in other conditions. Databases are already established that record the changes in RNA levels for all yeast genes when the organism is shifted from glucose to galactose as a carbon source. These large-scale expression analyses use high-density microchip arrays that contain characteristic sequences for all 6064 yeast genes. These gene arrays are then hybridized with fluorescently labeled RNA or cDNA from cells grown under the two different conditions. The hybridization signals are quantitated, compared, and analyzed automatically. The plan is to store the results in public databases. Useful websites include:

SGD
MIPS: a database for genomes and protein sequences