Skip to main content
Biology LibreTexts

5.2: Genome Assembly I- Overlap-Layout-Consensus Approach

  • Page ID
    40936
  • \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

    Many areas of research in computational biology rely on the availability of complete whole-genome sequence data. Yet the process to sequence a whole genome is itself non-trivial and an area of active research. The problem lies in the fact that current genome-sequencing technologies cannot continuously read from one end of a long genome sequence to the other; they can only accurately sequence small sections of base pairs (ranging from 100 to a few thousand, depending on the method), called reads. Therefore, in order to construct a sequence of millions or billions of base pairs (such as the human genome), computational biologists must find ways to combine smaller reads into larger, continuous DNA sequences. First, we will examine aspects of the experimental setup for the overlap-layout-consensus approach, and then we will move forward to learning about how to combine reads and learn information from them

    Setting up the experiment

    The first challenge that must be tackled when setting up this experiment is that we need to start with many copies of each chromosome in order to use this approach. This number is on the order of 105. It is important to note that the way we obtain these copies is very important and will affect our outcomes later on as it many of the comparisons we make will depend on consistent data. The first way that we may think to get this much data is to amplify a given genome. However, amplification does damage which will throw off our algorithms in later steps and cause worse results. Another possible method would be to inbreed the genome to get many copies of each chromosome. If you are looking to get rid of polymorphism, this may be a good technique, but we also lose valuable data from the polymorphic sites when we inbreed. A suggested method for obtaining this data is to use one individual, though the organism would need to be rather large. We could also use techniques such as progeny of one or progeny of two to get as few versions of each chromosome as possible. This will get high sequencing depth on each chromosome, which is the reason we want all chromosomes to be as similar as possible.

    Next, let’s look as how we could decide on our read lengths given current technology. Looking at (Figure 5.2), we can see that a cost-benefit analysis must be done to decide which platform to use on a given project. With current technology, we commonly use HiSeq2500 with a read length of about 250, though this is rapidly changing.

    page112image9231472.png
    Figure 5.2: Here is a quick look at a few platforms that can be used to read genomes.

    Finally, let’s look at a few sequences that cause trouble when using platforms with short reads. Sequences with high GC content (e.g. GGCGGCGATC), low GC content (e.g. AAATAATCAA), or low complexity (e.g. ATATATATA) can cause trouble with short reads. This is still an active area of research, but some possible explanations include Polymerase slippage and DNA denaturing too easily or not easily enough.

    This section will examine one of the most successful early methods for computationally assembling a genome from a set of DNA reads, called shotgun sequencing (Figure 5.3). Shotgun sequencing involves randomly shearing multiple copies of the same genome into many small fragments, as if the DNA were shot with a shotgun. Typically, the DNA is actually fragmented using either sonication (brief bursts from an ultrasound) or a targeted enzyme designed to cleave the genome at specific sequence motifs. Both of these methods can be tuned to create fragments of varying sizes.

    After the DNA has been amplified and fragmented, the technique developed by Frederick Sanger in 1977 called chain-termination sequencing (also called Sanger sequencing) is used to sequence the fragments. In brief, fragments are extended by DNA polymerase until a dideoxynucleotriphosphate is incorporated; these special nucleotides cause the termination of a fragment’s extension. The length of the fragment therefore becomes a proxy for where a given ddNTP was added in the sequence. One can run four separate reactions, each with a different ddNTP (A, G, C, T) and then run out the results on a gel in order to determine the relative ordering of bases. The result is many sequences of bases with corresponding per-base quality scores, indicating the probability that each base was called correctly. The shorter fragments can be fully sequenced, but the longer fragments can only be sequenced at each of their ends since the quality diminishes significantly

    page113image9304704.png
    © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 5.3: Shotgun sequencing involves randomly shearing a genome into small fragments so they can be sequenced, and then computationally reassembling them into a continuous sequence.

    after about 500-900 base pairs. These paired-end reads are called mate pairs. In the rest of this section, we discuss how to use the reads to construct much longer sequences, up to the size of entire chromosomes.

    Finding overlapping reads

    To combine the DNA fragments into larger segments, we must find places where two or more reads over- lap, i.e. where the beginning sequence of one fragment matches the end sequence of another fragment. For example, given two fragments such as ACGTTGACCGCATTCGCCATA and GACCGCATTCGCCATACG- GCATT, we can construct a larger sequence based on the overlap: ACGTTGACCGCATTCGCCATACGGCATT (Figure 5.4).

    page113image8578192.png
    Figure 5.4: Constructing a sequence from read overlap

    One method for finding matching sequences is the Needleman-Wunsch dynamic programming algorithm, which was discussed in chapter 2. The Needleman-Wunsch method is impractical for genome assembly, however, since we would need to perform millions of pairwise-alignments, each taking O(n2) time, in order to construct an entire genome from the DNA fragments.

    A better approach is to use the BLAST algorithm (discussed in chapter 3) to hash all the k-mers (unique sequences of length k) in the reads and find all the locations where two or more reads have one of the k-mers in common. This allows us to achieve O(kn) efficiency rather than O(n2) pairwise comparisons. k can be any number smaller than the size of the reads, but varies depending on the desired sensitivity and specificity. By adjusting the read length to span the repetitive regions of the genome, we can correctly resolve these regions and come very close to the ideal of a complete, continuous genome. One popular overlap-layout-consensus assembler called Arachne uses k = 24 [2].

    Given the matching k-mers, we can align each of the corresponding reads and discard any matches that are less than 97% similar. We do not require that the reads be identical since we allow for the possibility of sequencing errors and heterozygosity (i.e., a diploid organism like a human may have two different variants at a polymorphic site).

    Merging reads into contigs

    Using the techniques described above to find overlaps between DNA fragments, we can piece together larger segments of continuous sequences called contigs. One way to visualize this process is to create a graph in which all the nodes represent reads, and the edges represent overlaps between the reads (Figure 5.5). Our graph will have transitive overlap; that is, some edges will connect disparate nodes that are already connected by intermediate nodes. By removing the transitively inferable overlaps, we can create a chain of reads that have been ordered to form a larger contig. These graph transformations are discussed in greater depth in section 5.3.1 below. In order to get a better understanding of the size of contigs, we calculate something known as N50. Because measures of contig length tend to be highly sensitive to the smallest contig cutoff, N50 is calculated as the length-weighted median. For a human, N50 is usually close to 125 kb.

    page114image8572784.png
    Figure 5.5: We can visualize the process of merging fragments into contigs by letting the nodes in a graph represent reads and edges represent overlaps. By removing the transitively inferable edges (the pink edges in this image), we are left with chains of reads ordered to form contigs.

    In theory, we should be able to use the above approach to create large contigs from our reads as long as we have adequate coverage of the given region. In practice, we often encounter large sections of the genome that are extremely repetitive and as a result are difficult to assemble. For example, it is unclear exactly how to align the following two sequences: ATATATAT and ATATATATAT. Due to the extremely low information content in the sequence pattern, they could overlap in any number of ways. Furthermore, these repetitive regions may appear in multiple locations in the genome, and it is difficult to determine which reads come from which locations. Contigs made up of these ambiguous, repetitive reads are called overcollapsed contigs.

    In order to determine which sections are overcollapsed, it is often possible to quantify the depth of coverage of fragments making up each contig. If one contig has significantly more coverage than the others, it is a likely candidate for an overcollapsed region. Additionally, several unique contigs may overlap one contig in the same location, which is another indication that the contig may be overcollapsed (Figure 5.6).

    page115image9181280.png
    © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 5.6: Overcollapsed contigs are caused by repetetive regions of the genome which cannot be distin- guished from one another during sequencing. Branching patterns of alignment that arise during the process of merging fragments into contigs are a strong indication that one of the regions may be overcollapsed.

    After fragments have been assembled into contigs up to the point of a possible repeated section, the result is a graph in which the nodes are contigs, and the edges are links between unique contigs and overcollapsed contigs (Figure 5.7).

    page115image9187312.png
    © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 5.7: In this graph connecting contigs, repeated region X has indegree and outdegree equal to 2. The target seqence shown at the top can be inferred from the links in the graph.

    Laying out contig graph into scaffolds

    Once our fragments are assembled into contigs and contig graphs, we can use the larger mate pairs to link contigs into supercontigs or scaffolds. Mate pairs are useful both to orient the contigs and to place them in the correct order. If the mate pairs are long enough, they can often span repetitive regions and help resolve the ambiguities described in the previous section (Figure 5.8).

    page116image9183776.png
    © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 5.8: Mate pairs help us determine the relative order of contigs in order to link them into into supercontigs.

    Unlike contigs, supercontigs may contain some gaps in the sequence due to the fact that the mate pairs connecting the contigs are only sequenced at the ends. Since we generally know how long a given mate pair is we can estimate how many base pairs are missing, but due to the randomness of the cuts in shotgun sequencing, we may not have the data available to fill in the exact sequence. Filling in every single gap can be extremely expensive, so even the most completely assembled genomes usually contain some gaps.

    Deriving consensus sequence

    The goal of genome assembly is to create one continuous sequence, so after the reads have been aligned into contigs, we need to resolve any differences between them. As mentioned above, some of the overlapping reads may not be identical due to sequencing errors or polymorphism. We can often determine when there has been a sequencing error when one base disagrees with all the other bases aligned to it. Taking into account the quality scores on each of the bases, we can usually resolve these conflicts fairly easily. This method of conflict resolution is called weighted voting (Figure 5.9). Another alternative is to ignore the frequencies of each base and take the maximum quality letter as the consensus. Sometimes, you will want to keep all of the bases that form a polymorphic set because it can be important information. In this case, we would be unable to use these methods to derive a consensus sequence.

    TAG. TTACACAGATTATGACTTCATG CTA.png
    Figure 5.9: We derive the multiple alignment consensus sequence by weighted voting at each base.

    In some cases, it is not possible to derive a consensus if, for example, the genome is heterozygous and there are equal numbers of two different bases at one location. In this case, the assembler must choose a representative.

    Did You Know?

    Since polymorphism can significantly complicate the assembly of diploid genomes, some researchers induce several generations of inbreeding in the selected species to reduce the amount of heterozygosity before attempting to sequence the genome.

    In this section, we saw an algorithm to do genome assembly given reads. However, this algorithm works well when the reads are 500 - 900 bases long or more, which is typical of Sanger sequencing. Alternate genome assembly algorithms are required is the reads we get from our sequencing methods are much shorter.


    This page titled 5.2: Genome Assembly I- Overlap-Layout-Consensus Approach is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Manolis Kellis et al. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.