
26.4: Character-Based Methods


Figure 26.14: An overview of character-based methods

In character-based methods, the goal is first to create a valid algorithm for scoring the probability that a given tree would produce the observed sequences at its leaves, and then to search through the space of possible trees for a tree that maximizes that probability. Good algorithms exist for tree scoring, and while searching the space of trees is theoretically NP-hard (due to the large number of possible trees), tractable heuristic search methods can in many cases find good trees. We will first discuss tree-scoring algorithms, then search techniques.

    Scoring

    There are two main algorithms for tree scoring. The first approach, which we will call parsimony reconstruction, is based on Occam’s razor, and scores a topology based on the minimum number of mutations it implies, given the (known) sequences at the leaves. This method is simple, intuitive, and fast. The second approach is a maximum likelihood method which scores trees by explicitly modeling the probability of observing the sequences at the leaves given a tree topology.

    Parsimony

Conceptually, this method is simple: it assigns a base to each position at each ancestral node such that the number of substitutions is minimized. The score is then just the sum, over all base pairs, of that minimal number of mutations at each position. (Recall that the eventual goal is to find a tree that minimizes that score.)

To reconstruct the ancestral sequences at internal nodes of the tree, the algorithm first scans up from the (known) leaf sequences, assigning a set of possible bases to each internal node based on its children. Next, it iterates down the tree, picking bases out of the allowed set at each node, this time based on the node's parent. The following figures illustrate this algorithm in detail (note that there are 2N − 1 total nodes, indexed from the root, such that the known leaf nodes have indices N through 2N − 1):

Figure 26.15: Parsimony scoring: union and intersection
Figure 26.16: Parsimony traceback to find ancestral nucleotides
Figure 26.17: Parsimony scoring by dynamic programming

As we mentioned before, this method is simple and fast. However, this simplicity can distort the scores it assigns. For one thing, the algorithm presented here assumes that a given base pair undergoes a substitution along at most one branch from a given node, which may lead it to ignore highly probable internal sequences that violate this assumption. Furthermore, this method does not explicitly model the time represented by each edge, and thus cannot account for the increased chance of a substitution along edges that represent a long temporal duration, or for the possibility of different mutation rates across the tree. Maximum likelihood methods largely resolve these shortcomings and are thus more commonly used for tree scoring.

    Maximum Likelihood - Peeling Algorithm

As with general maximum likelihood methods, this algorithm scores a tree according to the (log) joint probability of observing the data and the given tree, i.e. P(D,T). As in the parsimony method, the peeling algorithm assumes that all sites evolve independently and considers each base pair separately: it calculates the probability of observing the given characters at that base pair in the leaf nodes, given the tree and a set of branch lengths, summing over the possible characters at the internal nodes; it then multiplies these probabilities over all base pairs to obtain the total probability of observing the sequences. Note that the explicit modeling of branch lengths is a difference from the previous approach.

Figure 26.18: A tree to be scored using the peeling algorithm. n = 4

Here each node i has a character \(x_i\), and \(t_i\) is the length of the branch connecting it to its parent. Note that we already know the values \(x_1, x_2, \ldots, x_n\) at the leaves, so they are constants, but \(x_{n+1}, \ldots, x_{2n-1}\) are unknown characters at ancestral nodes, variables over which we will sum. (Also note that we have adopted a leaves-to-root indexing scheme for the nodes, the opposite of the scheme we used before.) We want to compute \(P(x_1 x_2 \cdots x_n \mid T)\). For this we sum over all possible combinations of values at the ancestral nodes; this is called marginalization. In this particular example,

    \[P\left(x_{1} x_{2} x_{3} x_{4} \mid T\right)=\sum_{x_{5}} \sum_{x_{6}} \sum_{x_{7}} P\left(x_{1} x_{2} \cdots x_{7} \mid T\right)\nonumber\]

There are \(4^{n-1}\) terms in this sum, but we can use the following factorization trick. Since each branch evolves independently, the joint probability factorizes along the tree:

\[P\left(x_{1} x_{2} x_{3} x_{4} \mid T\right)=\sum_{x_{5}} \sum_{x_{6}} \sum_{x_{7}} P\left(x_{7}\right) P\left(x_{5} \mid x_{7}, t_{5}\right) P\left(x_{1} \mid x_{5}, t_{1}\right) P\left(x_{2} \mid x_{5}, t_{2}\right) P\left(x_{6} \mid x_{7}, t_{6}\right) P\left(x_{3} \mid x_{6}, t_{3}\right) P\left(x_{4} \mid x_{6}, t_{4}\right)\nonumber\]

Here the probability P(b|c,t) denotes the probability of base c mutating to base b in time t, obtained from the Jukes-Cantor model or one of the more advanced models discussed earlier. Next, we can move the factors that are independent of each summation variable outside that summation. That gives:

    \[=\sum_{x_{7}}\left[P\left(x_{7}\right)\left(\sum_{x_{5}} P\left(x_{5} \mid x_{7}, t_{5}\right) P\left(x_{1} \mid x_{5}, t_{1}\right) P\left(x_{2} \mid x_{5}, t_{2}\right)\right)\left(\sum_{x_{6}} P\left(x_{6} \mid x_{7}, t_{6}\right) P\left(x_{3} \mid x_{6}, t_{3}\right) P\left(x_{4} \mid x_{6}, t_{4}\right)\right)\right]\nonumber\]
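For concreteness, under the Jukes-Cantor model mentioned above, with \(\alpha\) the rate of each specific substitution, these transition probabilities have the closed form

\[P(b \mid c, t)= \begin{cases}\frac{1}{4}+\frac{3}{4} e^{-4 \alpha t} & \text { if } b=c \\ \frac{1}{4}-\frac{1}{4} e^{-4 \alpha t} & \text { if } b \neq c\end{cases}\nonumber\]

so a net change becomes more likely along longer branches, approaching the uniform distribution of 1/4 as \(t \to \infty\).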

Let \(T_i\) be the subtree below node i. In this case, our \((2n-1) \times 4\) dynamic programming array computes L[i, b], the probability \(P(T_i \mid x_i = b)\) of observing \(T_i\) if node i contains base b. Then we want to compute the probability of observing \(T = T_{2n-1}\), which is

    \[\sum_{b} P\left(x_{2 n-1}=b\right) L[2 n-1, b]\nonumber\]

Note that for each ancestral node i with children j, k, we have

    \[L[i, b]=\left(\sum_{c} P\left(c \mid b, t_{j}\right) L[j, c]\right)\left(\sum_{c} P\left(c \mid b, t_{k}\right) L[k, c]\right)\nonumber\]

This is subject to the initial conditions at the leaf nodes, i.e. for \(i \leq n\):

\[L[i, b]= \begin{cases}1 & \text { if } x_{i}=b \\ 0 & \text { otherwise }\end{cases}\nonumber\]

Figure 26.19: The recurrence

Note that we still need the root probabilities \(P(x_{2n-1} = b)\). These are usually taken to be uniform or drawn from some prior distribution, and the choice does not greatly affect the results. The final step is, of course, to multiply the probabilities for the individual sites to obtain the probability of observing the set of entire sequences. In addition, once we have scored the tree given the tree structure and the set of branch lengths, we can multiply the resulting score by prior probabilities of the tree structure and the branch lengths, which are often generated using explicit modeling of evolutionary processes, such as the Yule process or birth-death models like the Moran process. The result of this final multiplication is called the a posteriori probability, in the language of Bayesian inference. The overall complexity of this algorithm is \(O(nmk^2)\), where n is the number of leaves (taxa), m is the sequence length, and k is the number of characters (4 for nucleotides).
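The whole computation for one site can be written compactly. The following Python sketch implements the recurrence with a uniform root prior, using the Jukes-Cantor closed form above for P(b|c,t); the tree encoding, function names, and branch lengths are hypothetical choices for illustration.

```python
import math

BASES = "ACGT"

def jc_prob(b, c, t, alpha=1.0):
    """Jukes-Cantor P(b | c, t): probability that base c becomes b after time t."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if b == c else 0.25 - 0.25 * e

def peel(node):
    """Compute L[b] = P(leaves below node | node has base b) for one site.
    A node is either a leaf base such as 'A', or a tuple
    (left_child, t_left, right_child, t_right)."""
    if isinstance(node, str):                       # initial conditions at a leaf
        return {b: 1.0 if b == node else 0.0 for b in BASES}
    left, t_l, right, t_r = node
    L_l, L_r = peel(left), peel(right)
    return {b: sum(jc_prob(c, b, t_l) * L_l[c] for c in BASES)
             * sum(jc_prob(c, b, t_r) * L_r[c] for c in BASES)
            for b in BASES}

def site_likelihood(root):
    """Marginalize over the root base with a uniform prior P(x_root = b) = 1/4."""
    L = peel(root)
    return sum(0.25 * L[b] for b in BASES)

# The four-leaf tree of Figure 26.18, with made-up branch lengths.
tree = (("A", 0.1, "C", 0.2), 0.05, ("C", 0.1, "G", 0.3), 0.05)
print(site_likelihood(tree))
```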

This algorithm has a number of advantages and disadvantages.

Advantages:

1. Inherently statistical and evolutionary model-based.
2. Usually the most consistent of the methods available.
3. Usable for both character and rate analyses.
4. Can be used to infer the sequences of extinct ancestors.
5. Accounts for branch-length effects in unbalanced trees.
6. Applicable to nucleotide or amino acid sequences, as well as other types of data.

Disadvantages:

1. Not as simple and intuitive as many other methods.
2. Computationally intensive (limited by the number of taxa and the sequence length).
3. Like parsimony, can be fooled by high levels of homoplasy.
4. Violations of model assumptions can lead to incorrect trees.

    Search

A comprehensive search over the space of all trees would be extremely costly. The number of full rooted trees with n + 1 leaves is the n-th Catalan number

\[C_{n}=\frac{1}{n+1}\binom{2n}{n} \approx \frac{4^{n}}{n^{3 / 2} \sqrt{\pi}}\nonumber\]
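A quick computation gives a feel for this growth; the snippet below is purely illustrative.

```python
from math import comb

# C_n = C(2n, n) / (n + 1) counts the full rooted trees with n + 1 leaves.
for n in (5, 10, 20, 50):
    print(n, comb(2 * n, n) // (n + 1))
# 5 42
# 10 16796
# 20 6564120420
# 50 1978261657756160653623774456
```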

Moreover, we must compute the maximum likelihood set of branch lengths for each of these trees. Thus, maximizing the score over all possible trees is an NP-hard problem. Fortunately, heuristic search algorithms can generally identify good solutions in the tree space. The general framework for such search algorithms is as follows:

Initialization: Take some tree as the starting point of the iteration (chosen randomly, according to some prior, or from the distance-based direct algorithms).

    Proposal: Propose a new tree by randomly modifying the current tree slightly.

    Score: Score the new proposal according to the methods described above.

Select: Randomly select either the new tree or the old tree, with probabilities determined by the score (likelihood) ratio.

Iterate: Return to the proposal step unless some termination criterion is met (e.g., a threshold score or a maximum number of steps reached).

The basic idea here is the heuristic assumption that the scores of closely related trees are similar, so that good solutions may be obtained by successive local optimization, which is expected to converge towards an overall good solution.
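A minimal sketch of this framework, assuming user-supplied `propose` and `log_score` functions (hypothetical names), might look as follows; the acceptance rule anticipates the Selection discussion below.

```python
import math
import random

def heuristic_tree_search(initial_tree, propose, log_score, n_steps=10000):
    """Generic proposal/score/select loop for searching tree space.
    `propose` returns a slightly modified copy of a tree (e.g. one NNI move);
    `log_score` returns the log-likelihood of a tree."""
    current, current_ll = initial_tree, log_score(initial_tree)
    best, best_ll = current, current_ll
    for _ in range(n_steps):
        candidate = propose(current)
        candidate_ll = log_score(candidate)
        # Always keep improvements; keep a worse tree with probability equal
        # to the likelihood ratio, so the walk can escape local optima.
        if (candidate_ll >= current_ll
                or random.random() < math.exp(candidate_ll - current_ll)):
            current, current_ll = candidate, candidate_ll
        if current_ll > best_ll:                  # remember the best tree seen
            best, best_ll = current, current_ll
    return best, best_ll
```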

    Tree Proposal

One method for modifying trees is Nearest Neighbor Interchange (NNI), illustrated below.

Figure 26.20: A unit step of the Nearest Neighbor Interchange scheme
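As a toy illustration, the following sketch enumerates the two NNI rearrangements across the central edge of a four-leaf tree encoded as nested tuples; a full implementation would apply the same swap at any internal edge chosen at random.

```python
def nni_neighbors(tree):
    """Return the two nearest-neighbor interchanges across the central edge
    of a tree ((a, b), (c, d)): swap one subtree from each side."""
    (a, b), (c, d) = tree
    return [((a, c), (b, d)), ((a, d), (b, c))]

print(nni_neighbors((("A", "B"), ("C", "D"))))
# [(('A', 'C'), ('B', 'D')), (('A', 'D'), ('B', 'C'))]
```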

Another common method, not described here, is Tree Bisection and Reconnection (TBR). The important criteria for such proposal rules are:

1. The tree space should be connected, i.e. any pair of trees should be obtainable from each other by successive proposals.

2. An individual new proposal should be sufficiently close to the original, so that it is more likely to be a good solution by virtue of its proximity to an already discovered good solution. If individual steps are too big, the algorithm may move away from an already discovered solution (this also depends on the selection step). In particular, note that the measure of similarity by which we measure these step sizes is precisely the difference in the likelihood scores assigned to the two trees.

    Selection

Choosing whether or not to adopt a given proposal, like the process of generating the proposal itself, is inherently heuristic and varies. Some general rules of thumb are:

    1. If the new one has a better score, always accept it.

2. If it has a worse score, there should still be some probability of selecting it; otherwise the algorithm will quickly fixate at a local optimum, ignoring better alternatives a little farther away.

3. The probability of selecting a worse new proposal should not be too high; otherwise the algorithm risks abandoning a known good solution.

It is the trade-off between rules 2 and 3 that determines a good selection rule. Metropolis-Hastings is a Markov chain Monte Carlo (MCMC) method that defines specific rules for exploring the state space in a way that makes the walk a sample from the posterior distribution. These algorithms work reasonably well in practice, but there is no guarantee of finding the correct tree. For this reason a method known as bootstrapping is used: the algorithm is run over and over on subsets of the base pairs in the leaf sequences, and global trees that match the topologies generated from these subsequences are favored.
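A minimal sketch of generating such replicates, assuming the common variant of the bootstrap that resamples alignment columns with replacement (function and variable names are hypothetical):

```python
import random

def bootstrap_replicates(alignment, n_replicates=100):
    """Yield resampled alignments; each one is fed back through the full
    tree search, and topologies recurring across replicates gain support."""
    m = len(alignment[0])                       # alignment length in columns
    for _ in range(n_replicates):
        cols = [random.randrange(m) for _ in range(m)]
        yield ["".join(seq[j] for j in cols) for seq in alignment]

for replicate in bootstrap_replicates(["ACGT", "ACGA", "TCGA"], n_replicates=2):
    print(replicate)
```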

