8.5: Using HMMs to align sequences with affine gap penalties

$$\newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} }$$

$$\newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}}$$

$$\newcommand{\id}{\mathrm{id}}$$ $$\newcommand{\Span}{\mathrm{span}}$$

( \newcommand{\kernel}{\mathrm{null}\,}\) $$\newcommand{\range}{\mathrm{range}\,}$$

$$\newcommand{\RealPart}{\mathrm{Re}}$$ $$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$

$$\newcommand{\Argument}{\mathrm{Arg}}$$ $$\newcommand{\norm}[1]{\| #1 \|}$$

$$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$$

$$\newcommand{\Span}{\mathrm{span}}$$

$$\newcommand{\id}{\mathrm{id}}$$

$$\newcommand{\Span}{\mathrm{span}}$$

$$\newcommand{\kernel}{\mathrm{null}\,}$$

$$\newcommand{\range}{\mathrm{range}\,}$$

$$\newcommand{\RealPart}{\mathrm{Re}}$$

$$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$

$$\newcommand{\Argument}{\mathrm{Arg}}$$

$$\newcommand{\norm}[1]{\| #1 \|}$$

$$\newcommand{\inner}[2]{\langle #1, #2 \rangle}$$

$$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\AA}{\unicode[.8,0]{x212B}}$$

$$\newcommand{\vectorA}[1]{\vec{#1}} % arrow$$

$$\newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow$$

$$\newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} }$$

$$\newcommand{\vectorC}[1]{\textbf{#1}}$$

$$\newcommand{\vectorD}[1]{\overrightarrow{#1}}$$

$$\newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}}$$

$$\newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}}$$

$$\newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} }$$

$$\newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}}$$

We can use HMM to align sequences with affine gap penalties. Recall that affine gap penalties penalizes more to open/ start the gap than to extend it, thus the penalty of a gap of length g is r(g) = -d -(g-1)*e, where d is the penalty to open the gap and e is the penalty to extend an already open gap.

We will look into aligning two sequences with the affine gap penalty. We are given two sequences are X and Y, the scoring matrix S (S(xi,yj) = score of matching xi with yj), gap opening penalty of d and gap extension penalty of e. We can map this problem into an HMM problem by using the following states, transition probabilities and emission probabilities.

States:

There are three states involves: M (matching xi with yj), X (aligning xi with a gap), Y (aligning yj with a gap). Also, alongside each transition, there’s an update of the i,j indices. Whenever we are in state M, (i,j) = (i,j) + (1,1). In state X, (i,j) = (i,j) + (1,0). In state Y, (i,j) = (i,j) + (0,1).

Transition probabilities:
There are 7 transition probabilities to consider as shown in figure 6. P(next State = M | current = M) = S(xi,yj)
P(next State = X | current = M) = d
P(next State = Y | current = M ) = d
P(next State = X | current = X) = e
P(next State = M | current = X ) = S(xi,yj)
P(next State = Y | current = Y) = e
P(next State = M | current = Y) = S(xi,yj)

We can also save the transition probabilities in a transition matrix A = [aij], where aij = P(next State = j | current = i) and $$\sum_{j}$$jAij = 1

Emission probabilities:
The emission probabilities are:
From state M: pxiyi = p(xi aligned to yj)

From state X: qxi = p(xi aligned to gap)

From state Y: qyi= p(yjaligned to gap)

Example:

Y = ’HLAESK’
The alignment generated by the model is: MMXXMYM
Which corresponds to:
For classification purposes, Posterior decoding ’path’ is more informative than Viterbi path as it is a more refined measure of which hidden states generated x. However, it may give an invalid sequence of states, for example when not all j $->$ k transitions may be possible, it might have state(i) = j and state(i+1) =k