# 8.4: Learning

- Page ID
- 40963

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

\( \newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\)

( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\id}{\mathrm{id}}\)

\( \newcommand{\Span}{\mathrm{span}}\)

\( \newcommand{\kernel}{\mathrm{null}\,}\)

\( \newcommand{\range}{\mathrm{range}\,}\)

\( \newcommand{\RealPart}{\mathrm{Re}}\)

\( \newcommand{\ImaginaryPart}{\mathrm{Im}}\)

\( \newcommand{\Argument}{\mathrm{Arg}}\)

\( \newcommand{\norm}[1]{\| #1 \|}\)

\( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\)

\( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\AA}{\unicode[.8,0]{x212B}}\)

\( \newcommand{\vectorA}[1]{\vec{#1}} % arrow\)

\( \newcommand{\vectorAt}[1]{\vec{\text{#1}}} % arrow\)

\( \newcommand{\vectorB}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vectorC}[1]{\textbf{#1}} \)

\( \newcommand{\vectorD}[1]{\overrightarrow{#1}} \)

\( \newcommand{\vectorDt}[1]{\overrightarrow{\text{#1}}} \)

\( \newcommand{\vectE}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash{\mathbf {#1}}}} \)

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \)

\( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)

We saw how to score and decode an HMM-generated sequence in two different ways. However, these methods assumed that we already knew the emission and transition probabilities. While we are always free to hazard a guess at these, we may sometimes want to use a more data-driven, empirical approach to deriving these parameters. Fortunately, the HMM framework enables the learning of these probabilities when provided a set of training data and a set architecture for the model.

When the training data is labelled, estimation of the probabilities is a form of supervised learning. One such instance would occur if we were given a DNA sequence of one million nucleotides in which the CpG islands had all been experimentally annotated and were asked to use this to estimate our model parameters.

In contrast, when the training data is unlabelled, the estimation problem is a form of unsupervised learning. Continuing with the CpG island example, this situation would occur if the provided DNA sequence contained no island annotation whatsoever and we needed to both estimate model parameters and identify

the islands.

## Supervised Learning

When provided with labelled data, the idea of estimating model parameters is straightforward. Suppose that you are given a labelled sequence x_{1}, . . . , x_{N} as well as the true hidden state sequence π_{1}, . . . , π_{N}. Intuitively, one might expect that the probabilities that maximize the data’s likelihood are the actual probabilities that one observes within the data. This is indeed the case and can be formalized by defining Akl to be the number of times hidden state k transitions to l and E_{k}(b) to be the number of times b is emitted from hidden state k. The parameters θ that maximize P(x|θ) are simply obtained by counting as follows:

\[

\begin{aligned}

a_{k l} &=\frac{A_{k l}}{\sum_{i} A_{k i}} \\

e_{k}(b) &=\frac{E_{k}(b)}{\sum_{c} E_{k}(c)}

\end{aligned}

\]

One example training set is shown in Figure 8.5. In this example, it is obvious that the probability of transitioning from B to P is \(\begin{equation}

\frac{1}{3+1}=\frac{1}{4}

\end{equation}\) (there are 3 B to B transitions and 1 B to P transitions) and the probability of emitting a G from the B state is \(\begin{equation}

\frac{2}{2+2+1}=\frac{2}{5}

\end{equation}\) (there are 2 G's emitted from the B state, 2 C's and 1 A)

Notice, however, that in the above example the emission probability of character T from state B is 0because no such emissions were encountered in the training set. A zero probability, either for transitioning or emitting, is particularly problematic because it leads to an infinite log penalty. In reality, however, the zero probability may merely have arisen due to over-fitting or a small sample size. To rectify this issue and maintain flexibility within our model, we can collect more data on which to train, reducing the possibility that the zero probability is due to a small sample size. Another possibility is to use ’pseudocounts’ instead of absolute counts: artificially adding some number of counts to our training data which we think more accurately represent the actual parameters and help counteract sample size errors.

\[\begin{equation}

\begin{array}{l}

A_{k l}^{*}=A_{k l}+r_{k l} \\

E_{k}(b)^{*}=E_{k}(b)+r_{k}(b)

\end{array}

\end{equation}. \nonumber \]

Larger pseudocount parameters correspond to a strong prior belief about the parameters, reflected in the fact that these pseudocounts, derived from your priors, are comparatively overwhelming the observations, your training data. Likewise, small pseudocount parameters (r << 1) are more often used when our priors are relatively weak and we are aiming not to overwhelm the empirical data but only to avoid excessively harsh probabilities of 0.

## Unsupervised Learning

Unsupervised learning involves estimating parameters based on unlabelled data. This may seem impossible - how can we take data about which we know nothing and use it to ”learn”? - but an iterative approach can yield surprisingly good results, and is the typical choice in these cases. This can be thought of loosely as an evolutionary algorithm: from some initial choice of parameters, the algorithm assesses how well the parameters explain or relate to the data, uses some step in that assessment to make improvements on the parameters, and then assesses the new parameters, producing incremental improvements in the parameters at every step just as the fitness or lack thereof of a particular organism in its environment produces incremental increases over evolutionary time as advantageous alleles are passed on preferentially.

Suppose we have some sort of prior belief about what each emission and transition probability should be. Given these parameters, we can use a decoding method to infer the hidden states underlying the provided data sequence. Using this particular decoding parse, we can then re-estimate the transition and emission counts and probabilities in a process similar to that used for supervised learning. If we repeat this procedure until the improvement in the data’s likelihood remains relatively stable, the data sequence should ultimately drive the parameters to their appropriate values.

FAQ

Q: Why does unsupervised learning even work? Or is it magic?

A: Unsupervised learning works because we have the sequence (input data) and this guides every step of the iteration; to go from a labelled sequence to a set of parameters, the later are guided by the input and its annotation, while to annotate the input data, the parameters and the sequence guide the procedure.

For HMMs in particular, two main methods of unsupervised learning are useful.

**Expectation Maximization using Viterbi training**

The first method, **Viterbi training**, is relatively simple but not entirely rigorous. After picking some initial best-guess model parameters, it proceeds as follows:

**E step:** Perform Viterbi decoding to find π^{⋆}

**M step:** Calculate the new parameters \(A_{k l}^{*}, E_{k}(b)^{*}\) using the simple counting formalism in supervised learning (Maximization step)

**Iteration: **Repeat the E and M steps until the likelihood P(x|θ) converges

Although Viterbi training converges rapidly, its resulting parameter estimations are usually inferior to those of the Baum-Welch Algorithm. This result stems from the fact that Viterbi training only considers the most probable hidden path instead of the collection of all possible hidden paths.

**Expectation Maximization: The Baum-Welch Algorithm**

The more rigorous approach to unsupervised learning involves an application of Expectation Maximization to HMM’s. In general, EM proceeds in the following manner:

**Init: **Initialize the parameters to some best-guess state

**E step: **Estimate the expected probability of hidden states given the latest parameters and observed sequence (Expectation step)

**M step: **Choose new maximum likelihood parameters using the probability distribution of hidden states (Maximization step)

**Iteration: **Repeat the E and M steps until the likelihood of the data given the parameters converges

The power of EM lies in the fact that P (x|θ) is guaranteed to increase with each iteration of the algorithm. Therefore, when this probability converges, a local maximum has been reached. As a result, if we utilize a variety of initialization states, we will most likely be able to identify the global maximum, i.e. the best parameters θ. The Baum-Welch algorithm generalizes EM to HMM’s. In particular, it uses the forward and backward algorithms to calculate P(x|θ) and to estimate A_{kl }and E_{k}(b). The algorithm proceeds as follows:

**Initialization** 1. Initialize the parameters to some best-guess state

**Iteration **1. Run the forward algorithm

2. Run the backward algorithm

3. Calculate the new log-likelihood P(x|θ)

4. Calculate Akl and Ek(b)

5. Calculate akl and ek(b) using the pseudocount formulas 6. Repeat until P(x|θ) converges

Previously, we discussed how to compute P(x|θ) using either the forward or backward algorithm’s final results. But how do we estimate A_{kl} and E_{k}(b)? Let’s consider the expected number of transitions from state k to state l given a current set of parameters θ. We can express this expectation as

\[A_{k l}=\sum_{t} P\left(\pi_{t}=k, \pi_{t+1}=l \mid x, \theta\right)=\sum_{t} \frac{P\left(\pi_{t}=k, \pi_{t+1}=l, x \mid \theta\right)}{P(x \mid \theta)}\nonumber \]

Exploiting the Markov property and the definitions of the emission and transition probabilities leads to the following derivation:

\[\begin{aligned}

A_{k l} &=\sum_{t} \frac{P\left(x_{1} \ldots x_{t}, \pi_{t}=k, \pi_{t+1}=l, x_{t+1} \ldots x_{N} \mid \theta\right)}{P(x \mid \theta)} \\

&=\sum_{t} \frac{P\left(x_{1} \ldots x_{t}, \pi_{t}=k\right) * P\left(\pi_{t+1}=l, x_{t+1} \ldots x_{N} \mid \pi_{t}, \theta\right)}{P(x \mid \theta)} \\

&=\sum_{t} \frac{f_{k}(t) * P\left(\pi_{t+1}=l \mid \pi_{t}=k\right) * P\left(x_{t+1} \mid \pi_{t+1}=l\right) * P\left(x_{t+2} \ldots x_{N} \mid \pi_{t+1}=l, \theta\right)}{P(x \mid \theta)} \\

\Rightarrow A_{k l} &=\sum_{t} \frac{f_{k}(t) * a_{k l} * e_{l}\left(x_{t+1}\right) * b_{l}(t+1)}{P(x \mid \theta)}

\end{aligned}\]

A similar derivation leads to the following expression for E_{k}(b):

\[E_{k}(b)=\sum_{i \mid x_{i}=b} \frac{f_{k}(t) * b_{k}(t)}{P(x \mid \theta)}\]

Therefore, by running the forward and backward algorithms, we have all of the information necessary to calculate P(x|θ) and to update the emission and transition probabilities during each iteration. Because these updates are constant time operations once P(x|θ), f_{k}(t) and b_{k}(t) have been computed, the total time complexity for this version of unsupervised learning is θ(K^{2}NS), where S is the total number of iterations.

FAQ

Q: How do you encode your prior beliefs when learning with Baum-Welch?

A: Those prior beliefs are encoded in the initializations of the forward and backward algorithms