Biology LibreTexts

7.2: Motivation


    You have a new DNA sequence. Now what?

    1. Align it:
      • with things we know about (database search).
      • with unknown things (assembly/clustering).
    2. Visualize it: “Genomics rule #1”: Look at your data!
      • Look for nonstandard nucleotide compositions.
      • Look for k-mer frequencies that are associated with protein coding regions, recurrent data, high GC content, etc.
      • Look for motifs, evolutionary signatures.
      • Translate and look for open reading frames, stop codons, etc.
      • Look for patterns, then develop machine learning tools to determine reasonable probabilistic models. For example, after examining a number of 4-mers (quadruples of nucleotides), we might color-code them to see where they occur most frequently.
    3. Model it:
      1. Make a hypothesis.
      2. Build a generative model to describe the hypothesis.
      3. Use that model to find sequences of similar type.
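    Several of the "look at your data" checks in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not a method prescribed by the lecture: the function names (`gc_content`, `kmer_counts`, `sliding_gc`, `stop_codon_positions`) and the toy sequence are assumptions made for this example, and only the forward strand with the standard stop codons TAA, TAG, and TGA is considered.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code, forward strand only


def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (0.0 if seq is empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def sliding_gc(seq, window, step=1):
    """GC content of each window of the given size, moving by step bases."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def stop_codon_positions(seq, frame):
    """Start positions of stop codons when seq is read in the given frame (0, 1, or 2)."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


if __name__ == "__main__":
    toy = "ATGTAAATGA"  # hypothetical toy sequence, not real data
    print(gc_content(toy))
    print(kmer_counts(toy, 2).most_common(3))
    print(sliding_gc(toy, window=4, step=2))
    print([stop_codon_positions(toy, f) for f in range(3)])
```

    Plotting the output of `sliding_gc` along a sequence, or comparing `kmer_counts` between regions, is one simple way to spot unusual composition before committing to a model.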

    We are not necessarily looking for sequences that share a common ancestor. Rather, we are interested in sequences with similar properties. We do not yet know how to model whole genomes, but we can model small aspects of them. The task requires understanding the properties of genome regions and computationally building generative models to represent hypotheses about them. For a given sequence, we want to annotate each region as intron, exon, intergenic, promoter, or some other classifiable type.

    Figure 7.1: Modeling biological sequences © source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Building this framework will give us the ability to:

    • Emit (generate) sequences of similar type according to the generative model
    • Recognize the hidden state that has most likely generated the observation
    • Learn (train) the model from large datasets, using both previously labeled data (supervised learning) and unlabeled data (unsupervised learning).

    In this lecture we discuss algorithms for emission and recognition.
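    To make emission and recognition concrete, here is a toy sketch using plain Markov chains over nucleotides. It is a simplification, not the HMM algorithms of the lecture: there are no hidden states, and "recognition" is reduced to asking which of two models explains a sequence better. The names (`GC_RICH`, `BACKGROUND`, `emit`, `log_likelihood`, `log_odds`) and all probabilities are invented for illustration, and the first base is assumed to be drawn uniformly.

```python
import math
import random

ALPHABET = "ACGT"

# Hypothetical transition tables: model[prev][next] = P(next | prev).
# Every row is identical here to keep the toy small; a realistic
# first-order model would have a different row for each previous base.
GC_RICH = {prev: {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15} for prev in ALPHABET}
BACKGROUND = {prev: {base: 0.25 for base in ALPHABET} for prev in ALPHABET}


def emit(model, length, rng):
    """Generate a sequence of the given length by sampling from the model."""
    if length <= 0:
        return ""
    seq = [rng.choice(ALPHABET)]  # assumption: uniform start distribution
    while len(seq) < length:
        row = model[seq[-1]]
        weights = [row[base] for base in ALPHABET]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq)


def log_likelihood(model, seq):
    """Log probability of seq under the model (uniform first base)."""
    if not seq:
        return 0.0
    total = math.log(1.0 / len(ALPHABET))
    for prev, nxt in zip(seq, seq[1:]):
        total += math.log(model[prev][nxt])
    return total


def log_odds(seq):
    """Positive if GC_RICH explains seq better than BACKGROUND, negative otherwise."""
    return log_likelihood(GC_RICH, seq) - log_likelihood(BACKGROUND, seq)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    sample = emit(GC_RICH, 30, rng)
    print(sample, log_odds(sample))
    print("ATATATATAT", log_odds("ATATATATAT"))
```

    The log-odds score is a degree of belief rather than a yes/no answer: its sign says which model is favored and its magnitude says how strongly.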

    Why probabilistic sequence modeling?

    • Biological data is noisy.
    • Probabilistic models let us update previous knowledge about biological sequences.
    • Probability provides a calculus for manipulating models.
    • Answers are not limited to yes/no; a model can provide degrees of belief.
    • Many common computational tools are based on probabilistic models.
    • Our tools: Markov chains and hidden Markov models (HMMs).

    This page titled 7.2: Motivation is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Manolis Kellis et al. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.