11.2: Whole Genome Sequencing

The need for assembly

A single sequencing read is only about 45 bp to 700 bp long, so we face a problem when determining the sequence of much longer molecules, such as the chromosomes that make up the human genome (about 3 × 10⁹ bp). The genome must be broken into smaller fragments, and the sequences of those fragments must then be put back together. There are two different strategies for doing this:

clone-by-clone sequencing, which relies on the creation of a physical map first then sequencing, and
whole genome shotgun sequencing, which sequences first and does not require a physical map.
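
To get a feel for the scale implied by the read lengths above, the short sketch below estimates how many reads are needed just to total one genome's worth of bases. The function name `min_reads` and the `coverage` parameter are illustrative assumptions, not part of the original text, and real projects sequence each position many times over, so actual read counts are considerably higher.

```python
import math

GENOME_BP = 3_000_000_000  # approximate human genome size quoted in the text


def min_reads(genome_bp: int, read_bp: int, coverage: float = 1.0) -> int:
    """Reads of length read_bp needed so that total bases = coverage * genome_bp.

    This is a lower bound: it assumes reads tile the genome perfectly,
    with no overlap and no gaps (an idealization).
    """
    return math.ceil(genome_bp * coverage / read_bp)


if __name__ == "__main__":
    for read_bp in (45, 700):
        print(f"{read_bp} bp reads: at least {min_reads(GENOME_BP, read_bp):,} reads")
```

Even under this idealized assumption, millions of reads are required, which is why the way fragments are ordered and reassembled matters so much.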

Physical mapping

A physical map is a representation of a genome composed of cloned fragments of DNA. The map is therefore made from physical entities (pieces of DNA) rather than abstract concepts such as the linkage frequencies and genes that make up a genetic map (Figure \(\PageIndex{1}\)). It is usually possible to correlate genetic and physical maps, for example by identifying the clone that contains a particular molecular marker. The connection between physical and genetic maps allows the genes underlying particular mutations to be identified through a process called map-based cloning.

Figure \(\PageIndex{1}\): A portion of the physical map for human chromosome 4. The entire chromosome is shown at left. The physical map is made up of small blue lines, each of which represents a cloned piece of DNA approximately 100 kb in length. (NCBI-unknown-PD)

To create a physical map, large fragments of the genome are cloned into plasmid vectors, or into larger vectors called bacterial artificial chromosomes (BACs). BACs can contain fragments of approximately 100 kb. The set of BACs produced in a cloning reaction will be redundant, meaning that different clones will contain DNA from the same part of the genome. Because of this redundancy, it is useful to select the minimum set of clones that represents the entire genome, and to order these clones relative to the sequence of the original chromosome. Note that this must all be done without knowing the complete sequence of each BAC. Making a physical map may therefore rely on techniques related to Southern blotting: DNA from the ends of one BAC is used as a probe to find clones that contain the same sequence. These clones are then assumed to overlap each other. A set of overlapping clones is called a contig.
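
The idea of choosing a minimum set of clones from a redundant library can be sketched as an interval-covering problem. The example below assumes, purely for illustration, that each clone has already been placed at approximate start and end coordinates on the chromosome (in practice this placement is exactly what probe hybridization has to work out). The function name `minimal_tiling_path` and the clone names are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Choose a smallest set of clones that covers [region_start, region_end).

    clones: iterable of (name, start, end) tuples with assumed coordinates.
    Returns clone names in chromosome order; raises ValueError if there is a gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting at or before the covered point,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    # Hypothetical BAC library, coordinates in kb.
    library = [
        ("BAC1", 0, 100),
        ("BAC2", 40, 150),
        ("BAC3", 90, 210),
        ("BAC4", 140, 260),
        ("BAC5", 200, 300),
    ]
    print(minimal_tiling_path(library, 0, 300))
```

With this hypothetical library, three of the five clones are enough to span the region; the other two are redundant and would not need to be carried forward.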

Clone-by-clone sequencing

Physical mapping of cloned sequences was once considered a prerequisite for genome sequencing. The process would begin by breaking the genome into BAC-sized pieces, arranging these BACs into a map, then breaking each BAC up into a series of smaller clones, which were usually then also mapped. Eventually, a minimum set of smaller clones would be identified, each of which was small enough to be sequenced (Figure \(\PageIndex{2}\)). Because the order of clones relative to the complete chromosome was known prior to sequencing, the resulting sequence information could be easily assembled into one complete chromosome at the end of the project. Clone-by-clone sequencing therefore minimizes the number of sequencing reactions that must be performed, and makes sequence assembly straightforward and reliable. However, a drawback of this strategy is the tedious process of building a physical map prior to any sequencing.

Whole genome shotgun sequencing

This strategy breaks the genome into fragments that are small enough to be sequenced, then reassembles them simply by looking for overlaps in the sequence of each fragment. It avoids the laborious process of making a physical map (Figure \(\PageIndex{2}\)). However, it requires many more sequencing reactions than the clone-by-clone method, because, in the shotgun approach, there is no way to avoid sequencing redundant fragments. There is also a question of the feasibility of assembling complete chromosomes based simply on the sequence overlaps of many small fragments. This is particularly a problem when the size of the fragments is smaller than the length of a repetitive region of DNA. Nevertheless, this method has now been successfully demonstrated in the nearly complete sequencing of many large genomes (rice, human, and many others). It is the current standard methodology.
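
Reassembly "by looking for overlaps" can be illustrated with a toy greedy assembler: repeatedly merge the two reads whose end and start share the longest identical stretch. This is a minimal sketch under strong assumptions (error-free reads, all from the same strand, no read contained inside another); the function names and the `min_len` threshold are illustrative, and production assemblers use far more sophisticated methods.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest match is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by largest suffix-prefix overlap until none remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three overlapping reads from a made-up 15 bp sequence.
    print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
```

The pairwise comparison of every read against every other read also hints at why the approach is computationally demanding, and why a repeat longer than the reads is a problem: several reads then share the same overlap, and the assembler cannot tell which merge is correct.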

However, shotgun assemblies are rarely able to complete entire genomes. The human genome project, for example, relied on a combination of shotgun sequencing and physical mapping to produce contiguous sequence for the length of each arm of each chromosome. Note that because of the highly repetitive nature of centromeric and telomeric DNA, sequencing projects rarely include these heterochromatic, gene-poor regions.

Figure \(\PageIndex{2}\): Genome sequencing strategies. A clone-by-clone strategy (left), in which the genome is divided into progressively smaller units (clones) before sequencing. A whole genome shotgun strategy (right), in which the sequence is assembled from all the smaller reads. (Original-Deyholos-CC:AN)

Genome analysis

An assembled genome is a string of millions of A's, C's, G's, and T's. Which of these represent nucleotides that encode proteins, and which represent other features of genes and their regulatory elements? The process of genome annotation relies on computers to define features such as start and stop codons, introns, exons, and splice sites. However, few of the predictions made by these programs are entirely accurate, and most must be verified experimentally for any gene of particular importance or interest.
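
As a small illustration of how a program can locate start and stop codons, the sketch below scans the three forward reading frames of a sequence for stretches that begin with ATG and end at the first in-frame stop codon. It is a deliberately naive example: it ignores the reverse strand, introns, and splice sites, and the function name `find_orfs` and the `min_codons` cutoff are assumptions made for illustration rather than any real annotation tool's interface.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Find ATG...stop stretches on the forward strand.

    Returns a list of (start, end, frame) tuples using 0-based coordinates,
    where start is inclusive and end is exclusive (the stop codon is included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next candidate in this frame
    return orfs


if __name__ == "__main__":
    example = "CCATGAAATTTGGGTAACC"  # made-up sequence
    for start, end, frame in find_orfs(example):
        print(frame, start, end, example[start:end])
```

A scan like this produces many candidates that are not real genes and misses genes interrupted by introns, which is one reason computational predictions need experimental verification.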