CpG islands are regions of a genome that are enriched in CpG dinucleotides, that is, a C immediately followed by a G on the same strand. Typically, when this dinucleotide is present within a genome, its cytosine becomes methylated. Deamination of the methylated cytosine, which occurs at some base frequency, turns it into a thymine. Because thymine is another natural nucleotide, the cell cannot as easily recognize the change as damage, and the result is a C to T mutation. This increased mutation frequency depletes CpG dinucleotides over evolutionary time and renders them relatively rare. Because the methylation can occur on either strand, CpGs usually mutate into a TpG or a CpA. However, when CpG dinucleotides are situated within an active promoter, methylation is suppressed and they are able to persist. Similarly, CpGs in regions important to cell function are conserved due to evolutionary pressure. As a result, detecting CpG islands can highlight promoter regions, other transcriptionally active regions, or sites of purifying selection within a genome.
Did You Know?
CpG stands for [C]ytosine - [p]hosphate backbone - [G]uanine. The ’p’ implies that we are referring to the same strand of the double helix, rather than a G-C base pair occurring across the helix.
Given their biological significance, CpG islands are prime candidates for modelling. Initially, one may attempt to identify these islands by scanning the genome for fixed-width intervals rich in GC. The efficacy of this approach hinges on the choice of window size, and no single choice works well: too small a window may not capture all of a particular CpG island, while too large a window misses many smaller but bona fide CpG islands. Examining the genome on a per-codon basis also leads to difficulties, because CpG pairs do not necessarily code for amino acids and thus may not lie within a single codon. HMMs are much better suited to modelling this scenario because, as we shall shortly see in the section on unsupervised learning, HMMs can adapt their underlying parameters to maximize the likelihood of the data.
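The window-size problem can be seen in a minimal sketch of the fixed-window scan. The function name `gc_rich_windows`, the threshold of 0.75, and the toy sequence are all hypothetical choices made for illustration; they are not taken from any standard CpG island definition.

```python
def gc_rich_windows(seq, window, threshold, step=1):
    """Return (start, gc_fraction) for every fixed-width window whose
    G+C fraction is at least `threshold`."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        gc = (w.count("C") + w.count("G")) / window
        if gc >= threshold:
            hits.append((start, gc))
    return hits


# Toy sequence (hypothetical): an 8-base GC-rich stretch inside AT-rich flanks.
toy = "AT" * 20 + "CGCGCGCG" + "AT" * 20

small = gc_rich_windows(toy, window=8, threshold=0.75)
large = gc_rich_windows(toy, window=40, threshold=0.75)
print("window=8 hits:", [start for start, _ in small])
print("window=40 hits:", [start for start, _ in large])
```

With the small window the short GC-rich stretch is flagged, while the large window dilutes it below the threshold and reports nothing. Note also that this scan only measures GC content; it never checks whether the C's and G's occur as adjacent pairs.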
Not all HMMs, however, are well suited to this particular task. An HMM that considers only the single-nucleotide frequencies of C’s and G’s will fail to capture the nature of CpG islands. Consider one such HMM with the following two hidden states:
• ’+’ state: representing CpG islands
• ’-’ state: representing non-islands
Each of these two states then emits A, C, G and T bases with a certain probability. Although the CpG islands in this model can be enriched with C’s and G’s by increasing their respective emission probabilities, this model will fail to capture the fact that the C’s and G’s predominantly occur in pairs.
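To make this limitation concrete, the sketch below scores two sequences with the same base composition under a single ’+’ state. The emission probabilities are hypothetical values chosen only for illustration. Because each base is emitted independently given the state, a path that stays in the ’+’ state assigns both sequences the same probability, even though one contains four CpG dinucleotides and the other only one.

```python
import math

# Hypothetical emission probabilities for the '+' (island) state.
EMIT_PLUS = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}


def emission_log_prob(seq, emit):
    """Log probability of emitting `seq` while staying in one state,
    ignoring transitions: a sum of independent per-base terms."""
    return sum(math.log(emit[base]) for base in seq)


def count_cpg(seq):
    """Number of positions where a C is immediately followed by a G."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a + b == "CG")


paired = "CGCGCGCG"    # four C's and four G's, arranged as CpG pairs
shuffled = "CCCCGGGG"  # same composition, pairs broken up

for s in (paired, shuffled):
    print(s, count_cpg(s), emission_log_prob(s, EMIT_PLUS))
```

The two log probabilities agree because the score depends only on how many times each base occurs, not on the order in which the bases appear.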
Because of the Markov property that governs HMMs, the only information available at each time step must be contained within the current state. Therefore, to encode memory within a Markov chain, we need to augment the state space. To do so, the individual ’+’ and ’-’ states can be replaced with 4 ’+’ states and 4 ’-’ states: A+, C+, G+, T+, A-, C-, G-, T- (Figure 8.4). There are 2 ways to interpret these states, and the choice results in different emission probabilities:
- In one model, the state A+, for instance, means that we are currently in a CpG island and the previous character was an A. The emission probabilities here will carry most of the information and the transitions will be fairly degenerate.
- In the other model, the state A+, for instance, means that we are currently in a CpG island and the current character is an A. The emission probability here will be 1 for A and 0 for all other letters, so the emissions will be fairly degenerate and the transition probabilities will bear most of the information in the model. We will assume this model from now on.
Did You Know?
The number of possible transitions is the square of the number of states. This gives a rough idea of how increasing HMM “memory” (and hence the number of states) scales.
The memory of this system derives from the fact that each state can only emit one character and therefore “remembers” its emitted character. Furthermore, the dinucleotide nature of the CpG islands is incorporated within the transition probabilities. In particular, the transition probability from the C+ state to the G+ state is significantly higher than that from the C- state to the G- state, reflecting the fact that these pairs occur more often within the islands.
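A minimal sketch of the expanded model follows. It estimates the within-island and background transition tables by counting adjacent base pairs in labelled example sequences, then assembles the 8-state HMM with deterministic emissions. The example sequences, the pseudocount, and in particular the way switches between ’+’ and ’-’ are handled (a single hypothetical `p_switch`, with the destination base drawn from the destination class's table) are illustrative assumptions rather than the only possible parameterization.

```python
from collections import Counter

BASES = "ACGT"


def dinucleotide_transitions(seqs, pseudocount=1.0):
    """Estimate P(next base | current base) by counting adjacent pairs."""
    counts = Counter()
    for s in seqs:
        counts.update(zip(s, s[1:]))
    table = {}
    for a in BASES:
        total = sum(counts[a, b] + pseudocount for b in BASES)
        table[a] = {b: (counts[a, b] + pseudocount) / total for b in BASES}
    return table


def build_expanded_hmm(plus, minus, p_switch):
    """Assemble the 8-state model: each state emits exactly one base."""
    states = [b + sign for sign in "+-" for b in BASES]
    emit = {s: {b: 1.0 if b == s[0] else 0.0 for b in BASES} for s in states}
    trans = {s: {} for s in states}
    for a in BASES:
        for b in BASES:
            trans[a + "+"][b + "+"] = (1 - p_switch) * plus[a][b]
            trans[a + "+"][b + "-"] = p_switch * minus[a][b]   # assumed form
            trans[a + "-"][b + "-"] = (1 - p_switch) * minus[a][b]
            trans[a + "-"][b + "+"] = p_switch * plus[a][b]    # assumed form
    return states, trans, emit


# Hypothetical labelled training examples, for illustration only.
island_examples = ["CGCGGCGCATCGCG", "GCGCGACGCGTCGC"]
background_examples = ["ATTATGCATTTACA", "TTGAATCATGTAAT"]

plus = dinucleotide_transitions(island_examples)
minus = dinucleotide_transitions(background_examples)
states, trans, emit = build_expanded_hmm(plus, minus, p_switch=0.05)

print("C+ -> G+:", trans["C+"]["G+"])
print("C- -> G-:", trans["C-"]["G-"])
```

On these toy examples the C+ to G+ entry comes out larger than the C- to G- entry, which is the pattern described above. The emission tables for A+ and A- are identical, so the observed base alone never reveals whether the hidden state belongs to an island.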
Q: Since each state emits only one character, can we then say this reduces to a Markov Chain instead of a HMM?
A: No. Even though the emissions indicate the letter of the hidden state, they do not indicate if the state is a CpG island or not: both an A- and an A+ state emit only the observable A.
Q: How do we incorporate our knowledge about the system while training HMMs, e.g., knowing that some emission probabilities are 0 in the CpG island detection case?
A: We could either force our knowledge on the model by fixing some parameters and leaving others free to vary, or we could let the HMM loose on the data and let it discover those relationships. As a matter of fact, there are even methods that simplify the model by forcing a subset of parameters to be 0 but allowing the HMM to choose which subset.
Given the above framework, we can use posterior decoding to analyze each base within a genome and determine whether it is most likely a constituent of a CpG island or not. But having constructed the expanded HMM, how can we verify that it is in fact better than the single-nucleotide model? We previously demonstrated that the forward or backward algorithm can be used to calculate P(x) for a given model. If the likelihood of our dataset is higher under the second model than under the first, the second model most likely captures the underlying behavior more effectively.
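The comparison can be sketched with a scaled forward algorithm that returns log P(x) for any model given as plain dictionaries. The two models below, a one-state uniform baseline and a two-state ’+’/’-’ model, use hypothetical parameters chosen only to illustrate the mechanics; the expanded 8-state model could be passed to the same function in the same format.

```python
import math


def forward_log_likelihood(obs, states, init, trans, emit):
    """Return log P(obs) under an HMM using the forward algorithm.

    init[s], trans[r][s] and emit[s][symbol] are probabilities. The forward
    values are rescaled at every step to avoid numerical underflow, and the
    logs of the scaling factors add up to log P(obs)."""
    if not obs:
        return 0.0
    alpha = {s: init[s] * emit[s].get(obs[0], 0.0) for s in states}
    log_px = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = {
                s: emit[s].get(obs[t], 0.0)
                * sum(alpha[r] * trans[r][s] for r in states)
                for s in states
            }
        scale = sum(alpha.values())
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_px += math.log(scale)
        alpha = {s: v / scale for s, v in alpha.items()}
    return log_px


# Baseline: one state emitting all four bases uniformly.
states1 = ["u"]
init1 = {"u": 1.0}
trans1 = {"u": {"u": 1.0}}
emit1 = {"u": {b: 0.25 for b in "ACGT"}}

# Two-state model with hypothetical parameters.
states2 = ["+", "-"]
init2 = {"+": 0.5, "-": 0.5}
trans2 = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
emit2 = {
    "+": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "-": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

x = "ATTACGCGCGCGTTAT"
print("uniform  :", forward_log_likelihood(x, states1, init1, trans1, emit1))
print("two-state:", forward_log_likelihood(x, states2, init2, trans2, emit2))
```

The model with the larger log-likelihood explains the sequence better. To keep the comparison fair, the likelihoods are best computed on data that was not used to fit either model, since a model with more parameters can always fit its own training data at least as well.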
However, there is one risk in complicating the model, which is overfitting. Increasing the number of parameters of an HMM makes it more likely to overfit the data and less accurate in capturing the underlying behavior. A common remedy in machine learning is regularization, which here amounts to constraining the model so that it has fewer free parameters. In this case, it is possible to reduce the number of parameters to learn by constraining all transition probabilities from a + state to a - state to share one value, and all transition probabilities from a - state to a + state to share another. The transitions back and forth between the + and - states are what we are interested in modeling, and the actual bases at which a transition occurs are not that important to our model. The constrained model therefore has fewer parameters to learn, which leads to a simpler model and can help to avoid overfitting.
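The constraint can be sketched as follows. Every entry from a + state to a - state is set to one shared value `p_leave`, every entry from a - state to a + state to one shared value `p_enter`, and the within-class entries are rescaled so that each row still sums to 1. The function names, the uniform placeholder tables, and the parameter values are hypothetical, and the parameter count assumes that each row of a transition matrix with n entries contributes n - 1 free parameters.

```python
BASES = "ACGT"


def tied_transition_matrix(plus, minus, p_leave, p_enter):
    """Build an 8-state transition matrix with tied cross-class entries.

    plus[a][b] and minus[a][b] are within-class conditionals whose rows sum
    to 1. Since a row contains four cross-class entries, 4 * p_leave and
    4 * p_enter must not exceed 1."""
    if not (0.0 <= 4 * p_leave <= 1.0 and 0.0 <= 4 * p_enter <= 1.0):
        raise ValueError("tied probabilities must satisfy 0 <= 4p <= 1")
    trans = {b + sign: {} for sign in "+-" for b in BASES}
    for a in BASES:
        for b in BASES:
            trans[a + "+"][b + "+"] = (1 - 4 * p_leave) * plus[a][b]
            trans[a + "+"][b + "-"] = p_leave   # same value for every pair
            trans[a + "-"][b + "-"] = (1 - 4 * p_enter) * minus[a][b]
            trans[a + "-"][b + "+"] = p_enter   # same value for every pair
    return trans


def free_parameter_counts(n_bases=4):
    """Free transition parameters: (unconstrained model, tied model)."""
    n_states = 2 * n_bases
    full = n_states * (n_states - 1)
    tied = 2 * n_bases * (n_bases - 1) + 2
    return full, tied


# Placeholder within-class tables (uniform), for illustration only.
uniform = {a: {b: 0.25 for b in BASES} for a in BASES}
trans = tied_transition_matrix(uniform, uniform, p_leave=0.01, p_enter=0.005)
print(free_parameter_counts())
```

Under this counting, the unconstrained 8-state model has 56 free transition parameters, while the tied model has 26: 12 within the + states, 12 within the - states, and the 2 shared switching probabilities.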
Q: Are there other ways to encode the memory for CpG island detection?
A: Other ideas that may be experimented with include:
- Emit dinucleotides and figure out a way to deal with overlap.
- Add a special state that goes from C to G.