2.6: Multiple alignment

Last updated
Save as PDF

Page ID: 40916

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

Aligning three sequences

Now that we have seen how to align a pair of sequences, it is natural to extend this idea to multiple sequences. Suppose we would like to find the optimal alignment of 3 sequences. How might we proceed?

Recall that when we align two sequences S and T, we choose the maximum of three possibilities for the final position of the alignment (sequence T aligned against a gap, sequence S aligned against a gap, or sequence S aligned against sequence T):

\[F_{i, j}=\max \left\{\begin{array}{l}
F_{i, j-1}+d \\
F_{i-1, j}+d \\
F_{i-1, j-1}+s\left(S_{i}, T_{j}\right)
\end{array}\right. \nonumber \]

For three sequences S,T, and U, there are seven possibilities for the final position of the alignment. That is, there are three ways to have two gaps in the final position, three ways to have one gap, and one way to have all three sequences aligned \(\left(\left(\begin{array}{l}
3 \\
1
\end{array}\right)+\left(\begin{array}{l}
3 \\
2
\end{array}\right)+\left(\begin{array}{l}
3 \\
3
\end{array}\right)=7\right)\). The update rule is now:

\[F_{i, j, k}=\max \left\{\begin{array}{l}
F_{i-1, j, k}+s\left(S_{i},-,-\right) \\
F_{i, j-1, k}+s\left(-, T_{j},-\right) \\
F_{i, j, k-1}+s\left(-,-, U_{k}\right) \\
F_{i-1, j-1, k}+s\left(S_{i}, T_{j},-\right) \\
F_{i-1, j, k-1}+s\left(S_{i},-, U_{k}\right) \\
F_{i, j-1, k-1}+s\left(-, T_{j}, U_{k}\right) \\
F_{i-1, j-1, k-1}+s\left(S_{i}, T_{j}, U_{k}\right)
\end{array}\right. \nonumber \]

where s is the function describing gap, match, and mismatch scores.

This approach, however, is exponential in the number of sequences we are aligning. If we have k sequences of length \(n\), computing the optimal alignment using a k-dimensional dynamic programming matrix takes \(O\left((2 n)^{k}\right)\) time (the factor of 2 results from the fact that a k-cube has 2^k vertices, so we need to take the maximum of 2^k − 1 neighboring cells for each entry in the score matrix). As you can imagine, this algorithm quickly becomes impractical as the number of sequences increases.

Heuristic multiple alignment

One commonly used approach for multiple sequence alignment is called progressive multiple alignment. As- sume that we know the evolutionary tree relating each of our sequences. Then we begin by performing a pairwise alignment of the two most closely-related sequences. This initial alignment is called the seed alignment. We then proceed to align the next closest sequence to the seed, and this new alignment replaces the seed. This process continues until the final alignment is produced.

In practice, we generally do not know the evolutionary tree (or guide tree), this technique is usually paired with some sort of clustering algorithm that may use a low-resolution similarity measure to generate an estimation of the tree.

While the running time of this heuristic approach is much improved over the previous method (polynomial in the number of sequences rather than exponential), we can no longer guarantee that the final alignment is optimal.

Note that we have not yet explained how to align a sequence against an existing alignment. One possible approach would be to perform pairwise alignments of the new sequence with each sequence already in the seed alignment (we assume that any position in the seed alignment that is already a gap will remain one). Then we can add the new sequence onto the seed alignment based on the best pairwise alignment (this approach was previously described by Feng and Doolittle[4]). Alternatively, we can devise a function for scoring the alignment of a sequence with another alignment (such scoring functions are often based on the pairwise sum of the scores at each position).

Design of better multiple sequence alignment tools is an active area of research. Section 2.7 details some of the current work in this field.