Skip to main content
Biology LibreTexts

10.5: RNA Folding Problem and Approaches

  • Page ID
    40978
  • \( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

    Finally, we get to the point where we want to study the RNA structure. The goal here is to predict the secondary structure of the RNA, given its primary structure (or its sequence). The good news is we can find the optimal structure using dynamic programming. Now in order to set up our dynamic programming framework we would need a scoring scheme, which we would create using the contribution of each base pairing to the physical stability of the molecule. In other words, we want to create a structure with minimum free energy, in in our simple model we would assign each base pair an energy value. 10.7


    +10 +10 +10.png
    Figure 10.7: Example of a scoring scheme for base pair matches. Note that G-U can form a wobble pair in RNA.

    The optimum structure is going to be the one with a minimum free energy and by convention negative energy is stabilizing, and positive energy is non-stabilizing. Using this framework, we can use dynamic programming (DP) to calculate the optimal structure because 1) this scoring scheme is additive 2) we disallowed pseudo knots, which means we can divide the RNA into two smaller ones which are independent, and solve the problem for these smaller RNAs.

    We want to find a DP matrix Eij, in which we calculate the minimum free energy for subsequence i to j. The first approach to this is Nussinov’s algorithm.

    Nussinov’s algorithm

    The recursion formula for this problem was first described by Nussinov in 1978.

    The intuition behind this algorithm is as follows: given a subsequence [i,j], there is either no edge connecting to the ith base (meaning it is unpaired) or there is some edge connecting the ith base to the kth base where i < k ≤ j (meaning the ith base is paired to the kth base). In the case were the ith base is unpaired, the energy of the subsequence, Ei,j , simply reduces to the energy of the subsequence from i + 1 to j, Ei+1,j. This is the first term of the Nussinov recurrence relation. If the ith base is paired to the kth base, however, then Ei,j reduces to the energy contribution of the i,k pairing, βi,k, plus the energy of the subsequences formed by dividing [i + 1, j] around k, Ei+1,k−1 and Ek+1,j . Choosing the k which minimizes that value yields the second term of the Nussinov recurrence relation. The optimal subsequence energy, therefore, is the minimum of the subsequence energy when the ith base is paired with the optimal kth base and when the ith base is unpaired. This produces the overall relation described in figure 10.8.

    page198image51623536.png
    Figure 10.8: The recursion formula for Nussinov algorithm, along with a graphical depiction of how it works.

    From this recurrence relation, we can see that the DP matrix will contain entries for all i,j where 1≤i≤nandi≤j≤nandnisthelengthoftheRNAsequence. In other words,the matrix will be n ∗ n and only contain entries in the upper right triangle. The matrix is first initialized such that all values on the diagonal are equal to zero. We then iterate over i = n − 1...1 and j = i + 1...n (bottom to top, left to right) and fill each entry according to the recurrence relation. The overall score is the score of the [1, n] subsequence, which is the upper right corner of the matrix. Figure 10.9 illustrates this procedure.

    When we calculate the minimum free energy, we are often interested in the corresponding fold. In order to recover the optimal fold from the DP algorithm, a traceback matrix is used to store pointers from each entry to its parent entry. Figure 10.10 describes the backtracking algorithm.

    This model is very simplistic and there are some limitations to it. Nussinov’s algorithm, as implemented naively, does not take into account some of the limiting aspects of RNA folding. Most importantly, it does not consider stacking interactions between neighboring pairs, a vital factor (even more so than hydrogen bonds) in RNA folding. Figure 10.11

    Therefore, it is desirable to integrate biophysical factors into our prediction. One improvement, for instance, is to assign energies to graph faces (structural elements in figure 10.12), rather than single base pairs. The total energy of the structure then becomes the sum of the energies of the substructures. The stacking energies can be calculated by melting oligonucleotides experimentally.


    page199image51797312.png
    © Stefan Washietl. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 10.9: The diagonal is initialized to 0. Then, the table is filled bottom to top, left to right ac- cording to the recurrence relation. In this example, complimentary base pairings are scored as -1 and non-complementary pairings are scored as 0. The optimal score for the entire sequence is found in the upper right corner.

    Single base-pair.png
    © Stefan Washietl. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    Figure 10.11: Stacking between neighboring base pairs in RNA. The flat aromatic structure of the base causes quantum interactions between stacked bases and changes its physical stability.

    Zuker Algorithm

    Therefore, we use a variant which includes stacking energies to calculate the RNA structure. This is called the Zuker algorithm. Like Nussinovs, it assumes that the optimal structure is the one with the lowest equilibrium free energy. Nevertheless, it includes the total energy contributions from the various substructures which is partially determined by the stacking energy. Some modern RNA folding algorithms use this algorithm for RNA structure predictions.

    In the Zuker algorithm, we have four different cases to deal with. Figure 10.13 shows a graphical outline of the decomposition steps. The procedure requires four matrices. Fij contains the free energy of the overall optimal structure of the subsequence xij. The newly added base can be unpaired or it can form a pair. For

    page200image51872992.png
    Figure 10.12: Various internal substructures in a folded RNA. A hairpin is consisted of a terminal loop connected to a paired region, an internal loop is an unpaired region within the paired region. A Bulge is a special case of an interior loop with a single mis-pair. a Multi loop is a loop which consists of multiple of these components (in this example two hairpins and a paired region, all connected to a loop). © Stefan Washietl. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    the latter case, we introduce the helper matrix Cij, that contains the free energy of the optimal substructure of xij under the constraint that i and j are paired. This structure closed by a base-pair can either be a hairpin, an interior loop or a multi-loop.

    The hairpin case is trivial because no further decomposition is necessary. The interior loop case is also simple because it reduces again to the same decomposition step. The multi-loop step is more complicated. The energy of a multi loop depends on the number of components, i.e. substructures that emanate from the loop. To implicitly keep track of this number, there is a need for two additional helper matrices. Mij holds the free energy of the optimal structure of xij under the constraint that xij is part of a multi loop with at least one component. Mij1 holds the free energy of the optimal structure of xij under the constraint that xij is part of a multi-loop and has exactly one component closed by pair (i,k) with i < k < j. The idea is to decompose a multi loop in two arbitrary parts of which the first is a multi-loop with at least one component and the second a multi-loop with exactly one component and starting with a base-pair.

    These two parts corresponding to M and M1 can further be decomposed into substructures that we already know, i.e. unpaired intervals, substructures closed by a base-pair,or (shorter) multi-loops. (The recursions are also summarized in 10.13.


    r(i.n., min Ickic Cw+ I(.k.9). min Mi-t.u+ Maris-i+9.png
    Figure 10.13: F describes the unpaired case, C is described by one of the three conditions : hairpin,interior loop, or a composition of structures i.e. a multi loop. M1 is a multi loop with only one component, where are M might have multiple of them. The | icon is notation for “or”. © Stefan Washietl. All rights reserved. This content is excluded from our Creative Commons license. For more information, see http://ocw.mit.edu/help/faq-fair-use/.

    In reality, however, at room temperature (or cell temperature), RNA is not actually in one single state, but rather varies in a Thermodynamic ensemble of structure. Base pairs can break their bonds quite easily, and although we might find an absolute optimum in terms of free energy, it might be the case that there is another sub-optimal structure which is very different from what e predicted and has an important role in the cell. To fix the problem we can calculate the base pair probabilities to get the ensemble of structures, and then we can have a much better idea of what the RNA structure probably looks like. In order to do this, we utilize the Boltzman factor:

    \[\operatorname{Prob}(\mathcal{S})=\frac{\exp (-\Delta G(\mathcal{S}) / R T)}{Z}\nonumber\]

    This gives us the probability of a given structure, in a thermodynamic system. We need to normalize the temperature using the partition function Z,which is the weighted sum of all structures, based on their Boltzman factor:

    \[z=\sum_{s} \exp (-\Delta G(\mathcal{S}) / R T)\nonumber\]

    We can also represent this ensemble graphically, using a dot plot to visualize the base pair probabilities. To calculate the specific probability for a base pair (i, j) , we need to calculate the partition function, which is given by the following formula :

    \[p_{i j}=\frac{\widehat{Z}_{i} Z_{i+1,-1} \exp \left(-\beta_{j} / R T\right)}{Z}\nonumber\]

    To calculate Z (the partition function over the whole structure), we use the recursion similar to the Nussinovs Algorithm (known as McCaskill Algorithm).The inner partition function is calculated using the formula:

    \[Z_{i j}=Z_{i+1, j}+\sum_{i+1 \leq k \leq j \atop n_{k}=1} Z_{i+1, k-1} Z_{k+1, j} \exp \left(-\beta_{i k} / R T\right)\nonumber\]

    With each of the additions corresponding to a different split in our sequence as the next figure illustrates. Note that the addition are multiplied to the energy functions since it is expressed as a exponential.

    page201image52118544.png
     

    Similarly the outer partition function is calculated with a the same idea using the formula:

    page201image52118752.png

    corresponding to different splits in the area outside the base pairs (i,j).


    page201image52118960.png

    This page titled 10.5: RNA Folding Problem and Approaches is shared under a CC BY-NC-SA 4.0 license and was authored, remixed, and/or curated by Manolis Kellis et al. (MIT OpenCourseWare) via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.