
7.5: Algorithmic Settings for HMMs


We use HMMs for three types of operation: scoring, decoding, and learning. We will talk about scoring and decoding in this lecture. Each of these operations can be carried out over a single path or over all possible paths. For single-path operations, our focus is on finding the path with maximum probability; for all-paths operations, we are interested in the sequence of observations or emissions itself, regardless of which path produced it.

    Scoring

    Scoring over a single path

    The Dishonest Casino problem and Prediction of GC-rich Regions problem described in section 7.4 are both examples of finding the probability score corresponding to a single path. For a single path we define the scoring problem as follows:

• Input: A sequence of observations \(x = x_1 x_2 \ldots x_n\) generated by an HMM \(M(Q, A, p, V, E)\) and a path of states \(\pi = \pi_1 \pi_2 \ldots \pi_n\).

• Output: The joint probability \(P(x, \pi)\) of observing x if the hidden state sequence is π.

Figure 7.15: The six algorithmic settings for HMMs

    The single path calculation is essentially the likelihood of observing the given sequence over a particular path using the following formula:

\[P(x, \pi)=P(x \mid \pi) P(\pi) \nonumber \]

We have already seen examples of single-path scoring in the Dishonest Casino and GC-rich region problems.
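As a minimal sketch of this single-path score, the snippet below multiplies the initial, transition, and emission probabilities along a given path. It uses the dishonest-casino setting with illustrative parameter values; the init, trans, and emit tables are assumptions made for this example, not values from the text.

```python
# A minimal sketch of single-path scoring: P(x, pi) = P(x | pi) P(pi).
# The dishonest-casino parameters below are illustrative, not taken from the text.

init  = {"F": 0.5, "L": 0.5}                              # initial state probabilities (vector p)
trans = {"F": {"F": 0.95, "L": 0.05},                     # transition probabilities a_kl
         "L": {"F": 0.10, "L": 0.90}}
emit  = {"F": {s: 1 / 6 for s in "123456"},               # fair die: uniform emissions
         "L": {**{s: 0.1 for s in "12345"}, "6": 0.5}}    # loaded die: biased toward 6

def joint_probability(x, pi):
    """Return P(x, pi) for an observation string x and a state path pi of equal length."""
    p = init[pi[0]] * emit[pi[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[pi[i - 1]][pi[i]] * emit[pi[i]][x[i]]
    return p

print(joint_probability("266", "FLL"))   # probability of observing 2,6,6 along the path F,L,L
```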

    Scoring over all paths

    We define the all paths version of scoring problem as follows:

• Input: A sequence of observations \(x = x_1 x_2 \ldots x_n\) generated by an HMM \(M(Q, A, p, V, E)\).

• Output: The total probability \(P(x)\) of observing x, summed over all possible sequences of hidden states π.

    The probability over all paths π of hidden states of the given sequence of observations is given by the following formula.

    \[P(x)=\sum_{\pi} P(x, \pi) \nonumber \]

    We use this score when we are interested in knowing the likelihood of a particular sequence for a given HMM. However, naively computing this sum requires considering an exponential number of possible paths. Later in the lecture we will see how to compute this quantity in polynomial time.
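To make the exponential cost of the naive sum concrete, the sketch below enumerates every one of the \(K^N\) paths explicitly. It reuses the illustrative parameter tables and the joint_probability helper from the single-path sketch above, and is only feasible for very short sequences.

```python
from itertools import product

# Naive all-paths scoring: P(x) = sum over all K^N paths of P(x, pi).
# Reuses joint_probability and the illustrative parameters from the previous sketch.

def total_probability_bruteforce(x, states=("F", "L")):
    """Exponential-time sum of P(x, pi) over every possible hidden-state path."""
    return sum(joint_probability(x, "".join(pi)) for pi in product(states, repeat=len(x)))

print(total_probability_bruteforce("266"))
```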

Decoding

    Decoding answers the question: Given some observed sequence, what path gives us the maximum likelihood of observing this sequence? Formally we define the problem as follows:

    • Decoding over a single path:
– Input: A sequence of observations \(x = x_1 x_2 \ldots x_N\) generated by an HMM \(M(Q, A, p, V, E)\).

    – Output: The most probable path of states, \(\pi^{*}=\pi_{1}^{*} \pi_{2}^{*} \ldots \pi_{N}^{*}\)

    • Decoding over all paths:

– Input: A sequence of observations \(x = x_1 x_2 \ldots x_N\) generated by an HMM \(M(Q, A, p, V, E)\).

    – Output: The path of states, \(\pi^{*}=\pi_{1}^{*} \pi_{2}^{*} \ldots \pi_{N}^{*}\) that contains the most likely state at each time point.

    In this lecture, we will look only at the problem of decoding over a single path. The problem of decoding over all paths will be discussed in the next lecture.

For the single path decoding problem, we can imagine a brute force approach where we calculate the joint probability of the given emission sequence with every possible path and then pick the path with the maximum joint probability. The problem is that there are exponentially many paths, so a brute force search for the maximum likelihood path is very time consuming and impractical. Dynamic programming can be used to solve this problem. Let us formulate the problem in the dynamic programming framework.

We would like to find the most likely sequence of states given the observations. As inputs, we are given the model parameters: \(e_i(s)\), the emission probabilities for each state, and \(a_{ij}\), the transition probabilities. The sequence of emissions x is also given. The goal is to find the sequence of hidden states, π, that maximizes the joint probability with the given sequence of emissions. That is,

    \(\pi^{*}=\arg \max _{\pi} P(x, \pi) \nonumber\)

    Given the emitted sequence x we can evaluate any path through hidden states. However, we are looking for the best path. We start by looking for the optimal substructure of this problem.

For a best path, we can say that the best path through a given state must contain within it the following:

• The best path to the previous state
• The best transition from the previous state to this state
• The best path from this state to the end state

    Therefore the best path can be obtained based on the best path of the previous states, i.e., we can find a recurrence for the best path. The Viterbi algorithm is a dynamic programming algorithm that is commonly used to obtain the best path.

    Most probable state path: the Viterbi algorithm

Suppose \(v_k(i)\) is the known probability of the most likely path ending at position (or time instant) i in state k, for each k. Then we can compute the corresponding probabilities at time i + 1 by means of the following recurrence.

    \[ v_{l}(i+1)=e_{l}\left(x_{i+1}\right) \max _{k}\left(a_{k l} v_{k}(i)\right)\nonumber \]

The most probable path \(\pi^*\), or the maximum \(P(x, \pi)\), can be found recursively. Assuming we know \(v_j(i-1)\), the score of the maximum path up to time i − 1, we only need to extend the computation by one time step. The new maximum-score path for each state depends on

    • The maximum score of the previous states

    • The transition probability
    • The emission probability.

In other words, the new maximum score for a particular state at time i is obtained by maximizing, over all possible previous states, the product of the previous state's maximum score, the transition probability into the current state, and the emission probability at the current time.

    All sequences have to start in state 0 (the begin state). By keeping pointers backwards, the actual state sequence can be found by backtracking. The solution of this Dynamic Programming problem is very similar to the alignment algorithms that were presented in previous lectures.

    The steps of the Viterbi algorithm [2] are summarized below:
    1. Initialization \( (i=0): v_{0}(0)=1, v_{k}(0)=0 \text { for } k>0\)

2. Recursion \((i=1 \ldots N): v_{k}(i)=e_{k}\left(x_{i}\right) \max _{j}\left(a_{j k} v_{j}(i-1)\right) ; \quad \operatorname{ptr}_{i}(k)=\arg \max _{j}\left(a_{j k} v_{j}(i-1)\right)\)

3. Termination: \(P\left(x, \pi^{*}\right)=\max _{k} v_{k}(N) ; \quad \pi_{N}^{*}=\arg \max _{k} v_{k}(N) \)

4. Traceback \((i=N \ldots 1): \pi_{i-1}^{*}=\operatorname{ptr}_{i}\left(\pi_{i}^{*}\right)\)

Figure 7.16: The Viterbi algorithm

As we can see in Figure 7.16, we fill the matrix from left to right and then trace back. Each position in the matrix has K states to consider and there are \(K \times N\) cells in the matrix, so the required computation time is \(O(K^{2} N)\) and the required space is \(O(KN)\) to remember the pointers. In practice, we use log scores for the computation. Note that the running time has been reduced from exponential to polynomial.
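A compact sketch of the initialization, recursion, termination, and traceback steps is given below, using log scores as just described. It reuses the illustrative casino parameters (init, trans, emit) from the scoring sketch above and treats init as the begin-state transition probabilities \(a_{0k}\); it is a sketch under those assumptions, not a definitive implementation.

```python
import math

def viterbi(x, states, init, trans, emit):
    """Return (log P(x, pi*), pi*) using the Viterbi recurrence, computed with log scores."""
    # Initialization: best log score for each state after the first emission.
    v = [{k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]
    # Recursion: v_k(i) = e_k(x_i) * max_j a_jk v_j(i-1), kept as sums of logs.
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for k in states:
            best_j = max(states, key=lambda j: v[i - 1][j] + math.log(trans[j][k]))
            v[i][k] = math.log(emit[k][x[i]]) + v[i - 1][best_j] + math.log(trans[best_j][k])
            ptr[i][k] = best_j                      # backpointer to the best previous state
    # Termination: pick the best final state, then trace the pointers backwards.
    last = max(states, key=lambda k: v[-1][k])
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return v[-1][last], "".join(reversed(path))

print(viterbi("266662", ("F", "L"), init, trans, emit))
```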

    Evaluation

Evaluation is about answering the question: how well does our model capture the actual data? Given a sequence x, many paths can generate this sequence. The question is how likely the sequence is given the model. In other words, is this a good model? Or, how well does the model capture the characteristics of a particular sequence? We use evaluation of HMMs to answer these questions. Additionally, evaluation lets us compare different models.

    Let us first provide a formal definition of the Evaluation problem.
• Input: A sequence of observations \(x = x_1 x_2 \ldots x_N\) and an HMM \(M(Q, A, p, V, E)\).
    • Output: The probability that x was generated by M summed over all paths.
We know that if we are given an HMM, we can generate a sequence of length n using the following steps (a small sampling sketch follows the list):

• Start at state \(\pi_1\) according to probability \(a_{0 \pi_1}\) (obtained from the initial-state vector p).
• Emit letter \(x_1\) according to emission probability \(e_{\pi_1}(x_1)\).
• Go to state \(\pi_2\) according to the transition probability \(a_{\pi_1 \pi_2}\).
• Keep doing this until we emit \(x_N\).
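
A minimal sampling sketch of this generative procedure is shown below; it reuses the illustrative casino parameters (init, trans, emit) introduced in the scoring sketch and draws states and symbols with Python's random module.

```python
import random

def sample_sequence(n, states, init, trans, emit):
    """Generate an (observations, path) pair of length n by following the steps above."""
    # Start at state pi_1 according to the initial distribution, then emit x_1.
    state = random.choices(states, weights=[init[k] for k in states])[0]
    path = [state]
    obs = [random.choices(list(emit[state]), weights=list(emit[state].values()))[0]]
    for _ in range(n - 1):
        # Move to the next state via the transition probabilities, then emit a symbol from it.
        state = random.choices(states, weights=[trans[state][k] for k in states])[0]
        path.append(state)
        obs.append(random.choices(list(emit[state]), weights=list(emit[state].values()))[0])
    return "".join(obs), "".join(path)

print(sample_sequence(10, ("F", "L"), init, trans, emit))
```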

Thus we can emit any sequence and calculate its likelihood. However, many state sequences can emit the same x. How, then, do we calculate the total probability of generating a given x over all paths? That is, our goal is to obtain the following probability:

    \[P(x \mid M)=P(x)=\sum_{\pi} P(x, \pi)=\sum_{\pi} P(x \mid \pi) P(\pi)\nonumber\]

The challenge in obtaining this probability is that there are too many paths (an exponential number), each with an associated probability. One approach would be to use just the Viterbi path and ignore the others, since we already know how to obtain it. But its probability is only one of many contributions and is typically very small; it is a good approximation only if the Viterbi path carries most of the probability density. In other cases it gives an inaccurate approximation. The correct approach is to calculate the exact sum iteratively using dynamic programming. The algorithm that does this is known as the Forward Algorithm.

    The Forward Algorithm

First we derive the formula for the forward probability \(f_l(i)\).

    \[\begin{aligned}
f_{l}(i) &=P\left(x_{1} \ldots x_{i}, \pi_{i}=l\right) \\
    &=\sum_{\pi_{1} \ldots \pi_{i-1}} P\left(x_{1} \ldots x_{i-1}, \pi_{1}, \ldots, \pi_{i-2}, \pi_{i-1}, \pi_{i}=l\right) e_{l}\left(x_{i}\right) \\
    &=\sum_{k} \sum_{\pi_{1} \ldots \pi_{i-2}} P\left(x_{1} \ldots x_{i-1}, \pi_{1}, \ldots, \pi_{i-2}, \pi_{i-1}=k\right) a_{k l} e_{l}\left(x_{i}\right) \\
    &=\sum_{k} f_{k}(i-1) a_{k l} e_{l}\left(x_{i}\right) \\
    &=e_{l}\left(x_{i}\right) \sum_{k} f_{k}(i-1) a_{k l}
    \end{aligned}\]

    The full algorithm[2] is summarized below:
    • Initialization \((i=0): f_{0}(0)=1, f_{k}(0)=0 \text { for } k>0\)

    • Iteration \((i=1 \ldots N): f_{k}(i)=e_{k}\left(x_{i}\right) \sum_{j} f_{j}(i-1) a_{j k}\)

• Termination: \(P(x)=\sum_{k} f_{k}(N)\)

Figure 7.17: The Forward algorithm

From Figure 7.17, it can be seen that the Forward algorithm is very similar to the Viterbi algorithm; summation is simply used instead of maximization. Here we can reuse computations from the previous problem, including the emission terms, the transition terms, and the sums over previous states. The required computation time is \(O(K^{2} N)\) and the required space is \(O(KN)\). The drawback of this algorithm is that, in practice, working in log space is awkward because the log of a sum does not decompose into a sum of logs; therefore, approximations and scaling of probabilities are used instead.
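A sketch of the forward recursion is given below; to sidestep the underflow issue just mentioned, it works in log space and combines terms with an exact log-sum-exp step (one alternative to the rescaling used in practice). It again reuses the illustrative casino parameters from the earlier sketches.

```python
import math

def log_sum_exp(terms):
    """Numerically stable log(sum(exp(t) for t in terms))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def forward_log(x, states, init, trans, emit):
    """Return log P(x) via the forward recursion, computed entirely in log space."""
    # Initialization: f_k(1) = a_0k * e_k(x_1), stored as logs.
    f = {k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}
    for i in range(1, len(x)):
        # Iteration: f_k(i) = e_k(x_i) * sum_j f_j(i-1) * a_jk, as log scores.
        f_new = {}
        for k in states:
            f_new[k] = math.log(emit[k][x[i]]) + log_sum_exp(
                [f[j] + math.log(trans[j][k]) for j in states])
        f = f_new
    # Termination: log P(x) = log sum_k f_k(N).
    return log_sum_exp(list(f.values()))

print(forward_log("266662", ("F", "L"), init, trans, emit))
```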

