8.5: Using HMMs to align sequences with affine gap penalties
- Page ID
- 40964
We can use HMM to align sequences with affine gap penalties. Recall that affine gap penalties penalizes more to open/ start the gap than to extend it, thus the penalty of a gap of length g is r(g) = -d -(g-1)*e, where d is the penalty to open the gap and e is the penalty to extend an already open gap.
We will look into aligning two sequences with the affine gap penalty. We are given two sequences are X and Y, the scoring matrix S (S(xi,yj) = score of matching xi with yj), gap opening penalty of d and gap extension penalty of e. We can map this problem into an HMM problem by using the following states, transition probabilities and emission probabilities.
States:
There are three states involves: M (matching xi with yj), X (aligning xi with a gap), Y (aligning yj with a gap). Also, alongside each transition, there’s an update of the i,j indices. Whenever we are in state M, (i,j) = (i,j) + (1,1). In state X, (i,j) = (i,j) + (1,0). In state Y, (i,j) = (i,j) + (0,1).
Transition probabilities:
There are 7 transition probabilities to consider as shown in figure 6. P(next State = M | current = M) = S(xi,yj)
P(next State = X | current = M) = d
P(next State = Y | current = M ) = d
P(next State = X | current = X) = e
P(next State = M | current = X ) = S(xi,yj)
P(next State = Y | current = Y) = e
P(next State = M | current = Y) = S(xi,yj)
We can also save the transition probabilities in a transition matrix A = [aij], where aij = P(next State = j | current = i) and \(\sum_{j}\)jAij = 1
Emission probabilities:
The emission probabilities are:
From state M: pxiyi = p(xi aligned to yj)
From state X: qxi = p(xi aligned to gap)
From state Y: qyi= p(yjaligned to gap)
Example:
X = ’VLSPADK’
Y = ’HLAESK’
The alignment generated by the model is: MMXXMYM
Which corresponds to:
X = ’VLSPAD K’
Y = ’HL_ _AESK’
Did You Know?
For classification purposes, Posterior decoding ’path’ is more informative than Viterbi path as it is a more refined measure of which hidden states generated x. However, it may give an invalid sequence of states, for example when not all j $->$ k transitions may be possible, it might have state(i) = j and state(i+1) =k