Biology LibreTexts

4.4: Genome Analysis by Large Scale Sequencing

    Whole genomes can be sequenced both by random shotgun sequencing and by a directed approach using mapped clones.

    A seminal advance from J. Craig Venter and his colleagues at The Institute for Genomic Research in 1995 heralded a new era in genome analysis. They reported the complete sequence of the genome of the bacterium Haemophilus influenzae, all 1,830,137 bp (Fleischmann et al., Science, vol. 269, pp. 496-512, 1995). In their method, genomic DNA is randomly sheared into small fragments about 1000 bp in size, the fragments are cloned into plasmids, and the sequence is determined from the ends of randomly picked clones (Figure 4.10). This process is repeated until each nucleotide in the genome has, on average, been sequenced multiple times. If the genome is 3 million base pairs, then determining 9 million base pairs of sequence from random clones gives 3X coverage of the genome. This is sufficient data to assemble an almost complete sequence of a bacterial genome by linking overlapping sequences with computational tools. Some gaps remain, and these are filled with directed sequencing. Larger genomes can be sequenced (or at least a major portion of them) by going to higher coverage, e.g. 8X to 10X. This approach requires no prior knowledge of the genes or their positions on the bacterial chromosome. Several bacterial genomes have been sequenced this way, and Dr. Venter and colleagues have used the same approach to sequence almost all of the genomes of Drosophila melanogaster (in a collaboration between his company Celera and a publicly funded effort) and Homo sapiens (in a competition with the publicly funded effort). Variations on this theme improve effectiveness, such as cloning and sequencing both small (1 kb) and large (10 kb) inserts in plasmids, and then using the sequences from the ends of the longer inserts to help assemble the overall sequence. A similar idea uses the sequence from the ends of BAC inserts, which are about 100 kb in size, for large-scale assembly.

    Figure 4.10. Shotgun sequencing and assembly.
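
    The coverage arithmetic above can be sketched in a few lines of Python. The coverage calculation follows the text directly. The estimate of how much of the genome remains unsequenced is an added assumption: it treats reads as landing uniformly at random, so the chance that a base is missed is approximated by a Poisson term, exp(-coverage). The function names are illustrative only.

```python
import math

def coverage(total_bases_sequenced: float, genome_size: float) -> float:
    """Average number of times each nucleotide in the genome is sequenced."""
    return total_bases_sequenced / genome_size

def expected_unsequenced_fraction(cov: float) -> float:
    """Assumption (not from the text): with uniformly random reads, the
    probability that a given base is never sequenced is about exp(-coverage)."""
    return math.exp(-cov)

genome = 3_000_000  # the 3 million bp example genome from the text
for sequenced in (9_000_000, 24_000_000, 30_000_000):
    c = coverage(sequenced, genome)
    missing = expected_unsequenced_fraction(c)
    print(f"{c:.0f}X coverage: about {missing:.4%} of bases expected unsequenced "
          f"({missing * genome:,.0f} bp)")
```

    Under this idealized model, 3X coverage leaves roughly 5% of the bases unread, which fits the statement that some gaps remain, while 8X to 10X coverage leaves far less.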

    Other major genome sequencing projects, such as those that generated the Saccharomyces cerevisiae and E. coli sequences, started with a large set of mapped clones, which were then sequenced in a directed manner. This works well, and one has a high resolution genetic and physical map for years before the genome sequence is complete. It is slower than the random approach, but it may achieve a greater extent of completeness for large, complex genomes. This is essentially the approach that the publicly funded, international collaboration, referred to as the International Human Genome Sequencing Consortium (IHGSC), followed.

    The most recent phase of this project made extensive use of BAC clones, with an average insert size of about 100 kb (Figure 4.11). Libraries of BAC clones containing human DNA inserts were ordered by a high throughput mapping effort. Restriction digests of each clone in the library were analyzed, and overlapping clones were identified by finding fragments in common. The BAC clones were then organized into contiguous overlapping arrays, or contigs. A minimal tiling path needed to determine the sequence of each chromosome was established, and the ends of the BAC clones on that path were sequenced to provide a dense array of markers through the chromosome. BAC clones in the contigs were then sequenced, at this point using shotgun sequencing of the BAC insert (100 kb), not the whole genome (3.2 million kb). Sequences of BAC clones at about 3X coverage are called draft sequences, and those at higher coverage with gaps filled by directed sequencing are considered finished sequences. A combination of draft and finished sequence data is being assembled using the BAC end sequences and other information. The assembly is publicly available at the Human Genome Browser at the University of California at Santa Cruz and the Ensembl site at the Sanger Centre.

    Figure 4.11. Directed sequencing of BAC contigs.
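
    The idea of a minimal tiling path can be illustrated with a small sketch. It assumes each clone already has a known start and end position on the chromosome map, which is a simplification: in the project described above, overlaps were inferred from shared restriction fragments. The clone names and coordinates are hypothetical, and the greedy interval-cover strategy shown is one standard way to pick a small covering set, not necessarily the procedure the consortium used.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start_kb, end_kb); hypothetical map positions

def minimal_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy interval cover: at each step, among clones that begin inside the
    region covered so far, choose the one that extends furthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan every not-yet-examined clone that starts at or before the covered edge.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone map at {covered_to} kb")
        path.append(best[0])
        covered_to = best[2]
    return path

# Six hypothetical overlapping BAC clones, each about 100 kb long.
clones = [
    ("bacA", 0, 100), ("bacB", 40, 150), ("bacC", 90, 200),
    ("bacD", 140, 240), ("bacE", 180, 300), ("bacF", 230, 310),
]
print(minimal_tiling_path(clones, 0, 300))
```

    In this toy map, three of the six clones are enough to span the 300 kb region, so only those three would need to be shotgun sequenced.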

    The results of the Celera and public collaboration on the fly sequence were published in early 2000, and descriptions of the human genome sequence were published separately by Celera and IHGSC in 2001. Neither genome is completely sequenced (as of 2001), but both are largely sequenced and are stimulating a major revolution in the life sciences.

    Which approach is wiser is still a matter of debate, and depends to some extent on how thoroughly one needs to sequence a complex genome. For instance, a publicly accessible sequence of the mouse genome at 3X coverage was recently generated by the shotgun approach. Other genomes will likely be “lightly sequenced” at a similar coverage. But a full, high quality sequence of mouse will likely use aspects of the more directed approach. Also, the Celera assembly (primarily shotgun sequence) used the public data on the human genome sequence as well. Thus current efforts use both rapid sequencing by shotgun methods and sequencing of mapped clones.

    Survey of sequenced genomes

    Genome sequences are now available for many species, covering an impressive phylogenetic range. These include more than 28 eubacteria, at least 6 archaea, a fungus (the yeast Saccharomyces cerevisiae), a protozoan (Plasmodium falciparum), a worm (the nematode Caenorhabditis elegans), an insect (the fruitfly Drosophila melanogaster), two plants (Arabidopsis and rice (soon)), and two mammals (human, Homo sapiens, and mouse, Mus domesticus). Some information about these is listed in Table 4.4.

    Table 4.4. Sequenced genomes. This table is derived from the listing of “Complete Genomes Mapped on the KEGG Pathways (Kyoto Encyclopedia of Genes and Genomes)”.

    Additional genomes have been added, but only samples of the bacterial sequences are listed.

    Species and description (values for the genome size and genes encoding columns are not shown):

    • Escherichia coli: gram negative
    • Haemophilus influenzae: gram negative
    • Helicobacter pylori: gram negative
    • Bacillus subtilis: gram positive
    • Mycoplasma genitalium: gram positive
    • Mycoplasma pneumoniae: gram positive
    • Mycobacterium tuberculosis: gram positive
    • Aquifex aeolicus: hyperthermophilic bacterium
    • Borrelia burgdorferi: Lyme disease spirochete
    • Synechocystis sp.
    • Archaeoglobus fulgidus: S-metabolizing archaea
    • Methanococcus jannaschii
    • Methanobacterium thermoautotrophicum
    • Saccharomyces cerevisiae
    • Caenorhabditis elegans
    • Drosophila melanogaster: insect, fly, 120 Mb sequenced
    • Arabidopsis thaliana: plant, complete
    • Homo sapiens: human, draft + finished
    • Mus domesticus: mouse, draft
    Genome size

    Bacterial genomes range in size from 0.58 to almost 5 million bp (Mb). E. coli and B. subtilis, two of the most intensively studied bacteria, have the largest genomes and largest numbers of genes. The genome of the yeast Saccharomyces cerevisiae is only 2.6 times as large as that of E. coli. The genome of humans is almost 700 times larger than that of E. coli. However, genome size is not a direct measure of genetic content over long phylogenetic distances. One needs to examine the fraction of the genome that codes for protein or contains other important information. Let’s look at sizes and numbers of genes in different genomes.

    Gene size and number

    The average gene size is similar among bacteria, at around 1100 bp. Very little DNA separates most bacterial genes; in E. coli there is an average of only 118 bp between genes. Since the gene size varies little, the number of genes varies over as wide a range as the genome size, from 467 genes in M. genitalium to 4289 in E. coli. Thus within bacteria, which have little noncoding DNA, the number of genes is proportional to the genome size.
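
    As a rough check on this proportionality, the sketch below multiplies each gene count by the average gene size plus the average spacing between genes. Applying the E. coli spacing of 118 bp to other bacteria is an assumption made only for illustration, and the function name is hypothetical.

```python
AVG_GENE_BP = 1100    # average bacterial gene size quoted in the text
AVG_SPACER_BP = 118   # average distance between E. coli genes quoted in the text;
                      # using it for other bacteria is an assumption

def predicted_genome_size(n_genes: int) -> int:
    """Crude genome size estimate: genes packed end to end with average spacers."""
    return n_genes * (AVG_GENE_BP + AVG_SPACER_BP)

for species, n_genes in (("M. genitalium", 467), ("E. coli", 4289)):
    size_mb = predicted_genome_size(n_genes) / 1e6
    print(f"{species}: {n_genes} genes -> about {size_mb:.2f} Mb")
```

    The estimate for M. genitalium lands close to the 0.58 Mb lower bound quoted for bacterial genomes. The estimate for E. coli comes out slightly above 5 Mb, a modest overshoot that is unsurprising for so simple a model.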

    Saccharomyces cerevisiae has one gene every 1900 bp on average, which could reflect both an increase in gene size and a somewhat greater distance between genes. Both bacteria and yeast show a much denser packing of genes than is seen in more complex genomes.

    Data on a large sample of human genes show that they are much larger than bacterial genes, with the median being about 14 times larger than the 1 kb bacterial genes. This is not because most human proteins are substantially larger; bacterial proteins average about 350 amino acids in length, which is similar to the median size of human proteins. The major difference is the large amount of intronic sequence in human genes.

    Table 4.5. Average size of human genes and parts of genes. This is based on information in the IHGSC paper in Nature, and derived from analysis of 1804 human genes. Each entry gives the median followed by the mean.

    • Internal exon: median 122 bp; mean 145 bp
    • Number of exons: values not shown
    • Length of each intron: median 1023 bp; mean 3365 bp
    • 3’ UTR: median 400 bp; mean 770 bp
    • 5’ UTR: median 240 bp; mean 300 bp
    • Coding sequence: median 1100 bp; mean 1340 bp
    • Length of protein encoded: median 367 amino acids; mean 447 amino acids
    • Genomic extent: median 14,000 bp; mean 27,000 bp

    Summary of average gene size:

    Bacteria: 1100 bp

    Yeast: ~1200 bp

    Worm: ~5000 bp

    Human: ~27,000 bp

    A comparison of the distribution of sizes of introns and exons shows considerable overlap for worms, flies and humans. However, humans have a smaller fraction of long exons and a larger fraction of long introns (Figure 4.12).

    Figure 4.12. Distribution of exon and intron length in worms, flies and humans. From the IHGSC paper on the initial analysis of the human genome.

    Distance between genes

    Summary of distance between genes:

    Bacteria: 118 bp

    Yeast: ~700 bp

    Human: may be about 10,000 bp

    The distance between genes differs greatly between larger and smaller genomes. Genes are very close together in bacteria (about 100 bp), and much of that intergenic DNA appears to be involved in regulation. In yeast, the genes are 6 times further apart. In mammals, an enormous expansion in the amount of DNA between genes is seen. Precise numbers await more complete annotation of the human sequence, but many examples are known of adjacent genes that are separated by 10 to 50 kb of nongenic DNA. In all these species, some DNA sequences regulating expression of genes are found in these intergenic spaces, but it is unlikely that all of this is required for regulation in mammals. Distinguishing the important from the expendable sequences in intergenic regions is a major current challenge. This applies to noncoding DNA in general.

    The number of genes per length of the chromosome is a reflection of the size of the genes and the distances between them. This gene density varies little in bacteria and yeast, but it changes over a wide range in various regions of the human genome. A higher gene density correlates with higher G+C content of a region (Figure 4.13).

    Figure 4.13. Higher G+C content correlates with higher gene density and shorter introns.

    Genome size increases exponentially, but not number of genes

    Table 4.4 documents a 5500-fold increase in genome size from the smallest bacterial genome to that of human. However, this is accompanied by only a roughly 65-fold increase in the number of genes. This trend is seen over the known range of genomic sequences. The genome size increases exponentially as one examines species covering the range of complexity from bacteria to humans (Figure 4.14). However, the number of genes increases linearly. The plot in Figure 4.14 was based on earlier, higher estimates for the number of genes in humans. The effect is even more pronounced if one uses 30,000 as the number of human genes.

    Figure 4.14. Genome size and number of genes in species ranging from bacteria to humans.
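
    The fold increases quoted above can be reproduced from numbers given elsewhere in this section: 0.58 Mb and 467 genes for the smallest bacterial genome (M. genitalium), and 3.2 million kb for the human genome with an estimated 30,000 to 32,000 genes. The short sketch below is only an arithmetic check, and the function name is illustrative.

```python
def fold_change(larger: float, smaller: float) -> float:
    """How many times larger one quantity is than another."""
    return larger / smaller

smallest_bacterial_genome_bp = 0.58e6   # lower bound of bacterial genome sizes (text)
human_genome_bp = 3.2e9                 # 3.2 million kb (text)
smallest_bacterial_gene_count = 467     # M. genitalium (text)

print(f"Genome size: {fold_change(human_genome_bp, smallest_bacterial_genome_bp):,.0f}-fold")
for human_genes in (30_000, 32_000):    # estimates of human gene number (text)
    ratio = fold_change(human_genes, smallest_bacterial_gene_count)
    print(f"Gene number with {human_genes:,} human genes: {ratio:.0f}-fold")
```

    The genome size ratio comes out near 5500, while the gene number ratio stays between about 60 and 70, in line with the contrast drawn in the text.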

    Alternative splicing is common in human genes

    An earlier, lower estimate was that alternative splicing occurs in 35% of human genes. However, recent data show that this fraction is larger.

    For Chromosome 22:

    • 642 transcripts cover 245 genes, 2.6 transcripts per gene
    • 2 or more transcripts for 145 (59%) of genes

    For Chromosome 19:

    • 1859 transcripts cover 544 genes, 3.2 transcripts per gene

    This contrasts with the situation in worm, in which alternative splicing occurs in 22% of genes. The increased genetic diversity from alternative splicing may contribute considerably to the greater complexity of humans, not just the increase in the number of genes.

    Estimates of number of human genes

    The estimated number of human genes has varied greatly over recent years. Some of these numbers have been widely quoted, and it may be useful to list some of the sources of these estimates.

    • mRNA complexity (association kinetics): 40,000 genes
    • Average gene size of 30,000 bp: 100,000 genes
    • Number of CpG islands: 70,000 to 80,000
    • Unigene clusters of ESTs: 35,000 to 125,000
    • More rigorous EST clustering: 35,000 genes
    • Comparison to pufferfish: 30,000 genes
    • Extrapolate from gene counts on chromosomes 21 and 22 (which are finished): 30,000 to 35,500 genes

    Using the draft human sequence from July 2000, the IHGSC constructed an Initial Gene Index for human with the Ensembl system at the Sanger Centre. They started with ab initio predictions by Genscan, then confirmed them by similarity to proteins, mRNAs, ESTs, and protein motifs (Pfam database) from any organism. This led to an initial set of 35,500 genes and 44,860 transcripts in the Ensembl database. After reducing fragmentation, merging with known genes, and removing contaminating bacterial sequences, they were left with 31,778 genes. After taking into account residual fragmentation, and the rate at which true genes are found by a similar analysis, the estimate remains about 32,000 genes. However, it is an estimate and is subject to change as more annotation is completed.

    Starting with this estimate that the human genome contains about 32,000 genes, one can calculate how much of the genome is coding and how much is transcribed. If the average coding length is 1400 bp, then 1.5% of the human genome consists of coding sequence. If the average genomic extent per gene is 30 kb, then 33% of the human genome is “transcribed”.
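
    This calculation can be written out explicitly. The passage does not state which genome size was used as the denominator; the sketch below assumes about 2.9 billion bp, roughly the euchromatic portion of the human genome, because that value reproduces the percentages quoted. Using the full 3.2 billion bp mentioned earlier gives slightly lower figures. The function name is illustrative.

```python
GENOME_BP = 2.9e9  # assumption: approximate euchromatic human genome size,
                   # chosen because it reproduces the percentages in the text

def genome_fraction(n_genes: int, bp_per_gene: float, genome_bp: float = GENOME_BP) -> float:
    """Fraction of the genome occupied by n_genes segments of bp_per_gene each."""
    return n_genes * bp_per_gene / genome_bp

n_genes = 32_000
coding = genome_fraction(n_genes, 1_400)        # average coding length per gene
transcribed = genome_fraction(n_genes, 30_000)  # average genomic extent per gene

print(f"Coding sequence: {coding:.1%} of the genome")
print(f"Transcribed: {transcribed:.0%} of the genome")

# The same calculation with the full 3.2 billion bp genome size.
print(f"With 3.2e9 bp: {genome_fraction(n_genes, 1_400, 3.2e9):.1%} coding, "
      f"{genome_fraction(n_genes, 30_000, 3.2e9):.0%} transcribed")
```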

    Summary of number of genes in eukaryotic species:

    • Human: 32,000 (still uncertain)
    • Fly: 13,338
    • Worm: 18,266
    • Yeast: 6,144
    • Mustard weed: 25,706
    • Human: 2x number of genes in fly and worm
    • Human: more alternative splicing, perhaps 5x number of proteins as in fly or worm

    Assignment of functions to genes

    Genes encoding proteins and RNAs can be detected with considerable accuracy using computational tools. Note that even for an extensively studied organism like E. coli, the number of genes found by sequence analysis (4289 encoding proteins) is far greater than the number that can be assigned as encoding a particular enzyme (1254). The discrepancy between genes found in the sequence versus those with known function (i.e. assigned as encoding an enzyme) is greater for some poorly characterized organisms such as the Lyme disease-causing spirochete Borrelia burgdorferi.

    The many genes with unassigned function present an exciting challenge both in bioinformatics and in biochemistry/cell biology/genetics. Large collaborations have been initiated for a comprehensive genetic and expression analysis of some organisms. For instance, projects are underway to make mutations in all detected genes in Saccharomyces cerevisiae and to quantify the level of stable RNA from each gene in a variety of growth conditions, through the cell cycle and in other conditions. Databases are already established that record the changes in RNA levels for all yeast genes when the organism is shifted from glucose to galactose as a carbon source. These large-scale expression analyses use high-density microchip arrays that contain characteristic sequences for all 6064 yeast genes. These gene arrays are then hybridized with fluorescently labeled RNA or cDNA from cells grown under the two different conditions. The hybridization signals are quantitated, compared, and analyzed automatically. The plan is to store the results in public databases. Useful websites include:

    • SGD
    • MIPS: a database for genomes and protein sequences
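
    To make the comparison of hybridization signals concrete, the sketch below computes a log2 ratio per gene between two growth conditions. This is one common way to express such comparisons and is not necessarily the analysis used by the projects described above. The gene names are real yeast genes, but the signal values, the pseudocount, and the function name are made up for illustration.

```python
import math
from typing import Dict

def log2_ratios(signal_a: Dict[str, float], signal_b: Dict[str, float],
                pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(signal in condition B / signal in condition A) for each shared gene.
    The pseudocount (an assumption) avoids division by zero for silent genes."""
    return {
        gene: math.log2((signal_b[gene] + pseudocount) / (signal_a[gene] + pseudocount))
        for gene in signal_a
        if gene in signal_b
    }

# Hypothetical fluorescence signals; these numbers are not measured data.
glucose = {"GAL1": 15.0, "GAL10": 31.0, "ACT1": 1023.0}
galactose = {"GAL1": 1023.0, "GAL10": 1023.0, "ACT1": 1023.0}

for gene, ratio in log2_ratios(glucose, galactose).items():
    print(f"{gene}: log2(galactose/glucose) = {ratio:+.1f}")
```

    A positive value indicates a higher signal in galactose, a negative value a higher signal in glucose, and a value near zero no change.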
