
7.7: Further Reading, What Have We Learned?


    Length Distributions of States and Generalized Hidden Markov Models

    Given a Markov chain in which the transition from any state to the end state has probability τ, the probability of generating a sequence of length L (and then finishing with a transition to the end state) is given by:

    \[ \tau(1-\tau)^{L-1} \nonumber \]
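    As a quick check, summing this geometric distribution over all lengths confirms that it is normalized and that the expected sequence length is \(1/\tau\):

    \[ \sum_{L=1}^{\infty} \tau(1-\tau)^{L-1} = 1, \qquad E[L] = \sum_{L=1}^{\infty} L\,\tau(1-\tau)^{L-1} = \frac{1}{\tau} \nonumber \]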

    Similarly, in the HMMs that we have been examining, the lengths of states are exponentially distributed, which is not appropriate for many purposes. (For example, in a genomic sequence, an exponential distribution does not accurately capture the lengths of genes, exons, introns, etc.) How can we construct a model that does not output state sequences with an exponential distribution of lengths? Suppose we want to make sure that our sequence has length exactly 5. We might construct a sequence of five states with only a single path permitted by the transition probabilities. If we include a self-loop in one of the states, we will output sequences of minimum length 5, with the longer sequences exponentially distributed. Suppose we have a chain of n states, with every path starting in state \(\pi_1\) and transitioning to an end state after \(\pi_n\). Also assume that the transition probability from state \(\pi_i\) to \(\pi_{i+1}\) is \(1-p\), while the self-transition probability of state \(\pi_i\) is \(p\). The probability that a sequence generated by this Markov chain has length L is given by:

    \[ \binom{L-1}{n-1} p^{L-n}(1-p)^{n} \nonumber \]

    This is called the negative binomial distribution.
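    As an illustrative sketch (the function names and the example values of \(n\) and \(p\) below are assumptions, not taken from the text), the short simulation below draws lengths from such a chain of self-looping states and compares the empirical frequencies with the negative binomial formula above:

```python
import random
from math import comb

def sample_length(n, p):
    """Simulate a chain of n states, each with self-transition probability p;
    return the total number of symbols emitted before reaching the end state."""
    length = 0
    for _ in range(n):
        length += 1                 # emit one symbol on entering the state
        while random.random() < p:  # stay in the same state with probability p
            length += 1
    return length

def negative_binomial_pmf(L, n, p):
    """P(length = L) = C(L-1, n-1) * p^(L-n) * (1-p)^n."""
    return comb(L - 1, n - 1) * p ** (L - n) * (1 - p) ** n

n, p, trials = 5, 0.3, 100_000
samples = [sample_length(n, p) for _ in range(trials)]
for L in range(n, n + 5):
    empirical = samples.count(L) / trials
    print(L, round(empirical, 4), round(negative_binomial_pmf(L, n, p), 4))
```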

    More generally, we can adapt HMMs to produce output sequences of arbitrary length. In a Generalized Hidden Markov Model [1] (also known as a hidden semi-Markov model), the output of each state is a string of symbols, rather than an individual symbol. The length as well as content of this output string can be chosen based on a probability distribution. Many gene finding tools are based on generalized hidden Markov models.
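    As a minimal sketch of the idea (the states, duration distributions, and alphabet below are made-up illustrations, not taken from any particular gene finder), each state of a generalized HMM can emit a whole segment whose length is drawn from an explicit duration distribution:

```python
import random

# Toy generalized HMM: each state carries an explicit duration distribution
# and emits a string of that length, rather than a single symbol per visit.
STATES = {
    "exon":   {"durations": {50: 0.2, 100: 0.5, 150: 0.3}, "alphabet": "ACGT"},
    "intron": {"durations": {80: 0.4, 200: 0.6},           "alphabet": "ACGT"},
}
TRANSITIONS = {"exon": "intron", "intron": "exon"}  # deterministic toy transitions

def sample_duration(durations):
    """Draw a segment length from the state's duration distribution."""
    lengths, probs = zip(*durations.items())
    return random.choices(lengths, weights=probs)[0]

def generate(start_state, n_segments):
    """Generate (state, emitted segment) pairs from the toy generalized HMM."""
    state, output = start_state, []
    for _ in range(n_segments):
        length = sample_duration(STATES[state]["durations"])
        segment = "".join(random.choices(STATES[state]["alphabet"], k=length))
        output.append((state, segment))
        state = TRANSITIONS[state]
    return output

for state, segment in generate("exon", 4):
    print(state, len(segment))
```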

    Conditional random fields

    The conditional random field (CRF) is a discriminative undirected probabilistic graphical model that is used as an alternative to HMMs. It encodes known relationships between observations and constructs consistent interpretations, and it is often used for labeling or parsing sequential data; it is widely used in gene finding. The following resources can be helpful for learning more about CRFs:

    • Lecture on Conditional Random Fields from the Probabilistic Graphical Models course: class.coursera.org/pgm/lecture/preview/33. For background, you might also want to watch the two previous segments, on pairwise Markov networks and general Gibbs distributions.
    • Conditional random fields in biology: www.cis.upenn.edu/~pereira/papers/crf.pdf
    • Conditional Random Fields tutorial: http://people.cs.umass.edu/~mccallum...f-tutorial.pdf

    What Have We Learned?

    In this section, the main topics we covered are the following:

    • First, we introduced the motivation behind adopting Hidden Markov Models in our analysis of genome annotation.

    • Second, we formalized Markov chains and HMMs in the context of a weather prediction example.
    • Third, we got a sense of how to apply HMMs to real-world data by looking at the dishonest casino and CG-rich region problems.
    • Fourth, we systematically introduced the algorithmic settings of HMMs and went into detail on three of them:

      – Scoring: scoring over a single path
      – Scoring: scoring over all paths
      – Decoding: the Viterbi algorithm for determining the most likely path (see the sketch after this list)

    • Finally, we discussed the possibility of introducing memory into the analysis of HMMs and provided further reading for interested readers.
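    As a compact reminder of the decoding step summarized above, the following sketch implements Viterbi decoding in log space for a toy two-state model in the spirit of the dishonest casino; the model and its probabilities are illustrative assumptions, not values from the chapter:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence,
    computed in log space to avoid numerical underflow."""
    # V[t][s] = (best log-probability of a path ending in state s at time t, predecessor)
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r][0] + math.log(trans_p[r][s]))
            score = V[t - 1][prev][0] + math.log(trans_p[prev][s]) + math.log(emit_p[s][obs[t]])
            V[t][s] = (score, prev)
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return list(reversed(path))

# Toy dishonest-casino-style model (illustrative numbers only).
states = ["fair", "loaded"]
start_p = {"fair": 0.5, "loaded": 0.5}
trans_p = {"fair": {"fair": 0.95, "loaded": 0.05}, "loaded": {"fair": 0.1, "loaded": 0.9}}
emit_p = {"fair": {str(d): 1 / 6 for d in range(1, 7)},
          "loaded": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5}}
print(viterbi(list("1266666"), states, start_p, trans_p, emit_p))
```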

    Bibliography

    [1] Introduction to GHMMs: www.cs.tau.ac.il/~rshamir/algmb/00/scribe00/html/lec07/node28.html.

    [2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Eleventh edition, 2006.

