# 11.5: Sampling and birth-death models

It is important to think about sampling when fitting birth-death models to phylogenetic trees. If any species are missing from your phylogenetic tree and you do not account for them, your parameter estimates will be biased. This is because missing species are disproportionately likely to connect to the tree on short, rather than long, branches. If we randomly sample lineages from a tree and then fit a model that assumes complete sampling, we will end up badly underestimating both speciation and extinction rates (and wrongly inferring slowdowns; see chapter 12).

Fortunately, the mathematics for incomplete sampling of reconstructed phylogenetic trees has also been worked out. There are two ways to do this, depending on how the tree is actually sampled. If we consider the missing species to be random with respect to the taxa included in the tree, then we can use a uniform sampling fraction to account for them. By contrast, we are often in the situation where the tips in our tree are single representatives of diverse clades (e.g. genera), and we usually know the diversity of each of these unsampled clades. I will follow Höhna et al. (2011) and Höhna (2014) and refer to this second approach as *representative sampling* (and the previous alternative as *uniform sampling*).

For the uniform sampling approach, we use the framework above of calculating backwards through time, but modify the starting points for each tip in the tree to reflect *f*, the probability of sampling a species (following FitzJohn et al. (2009)):

(eq. 11.22)

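*D*_{N}(0)=*f*

*E*(0)=1 − *f*

That is, an extant lineage appears in the tree with probability *f* and goes unobserved with probability 1 − *f*.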
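To see how the sampling probability enters the backwards-in-time calculation, here is a minimal sketch in Python. It takes *f* as the probability that an extant species is sampled, so that a lineage alive at the present leaves no sampled descendant with probability 1 − *f* (the convention of FitzJohn et al. 2009). It assumes the standard constant-rate equation for the probability of leaving no sampled descendants, d*E*/d*t* = *μ* − (*μ* + *λ*)*E* + *λE*², with time measured backwards from the present. The function names, the Runge-Kutta integrator, and the closed-form expression (written here with a negative exponent, which may differ in notation from the equations in this section) are illustrative choices, not taken from the cited papers.

```python
import math

def extinction_rhs(E, lam, mu):
    """Right-hand side of dE/dt = mu - (mu + lam) E + lam E^2 (assumed standard form)."""
    return mu - (mu + lam) * E + lam * E * E

def integrate_E(lam, mu, f, t_end, steps=10000):
    """Integrate E from the present (t = 0) back to t_end with classical RK4.

    The starting value E(0) = 1 - f is the only place the sampling
    probability f enters this calculation.
    """
    E = 1.0 - f
    h = t_end / steps
    for _ in range(steps):
        k1 = extinction_rhs(E, lam, mu)
        k2 = extinction_rhs(E + 0.5 * h * k1, lam, mu)
        k3 = extinction_rhs(E + 0.5 * h * k2, lam, mu)
        k4 = extinction_rhs(E + h * k3, lam, mu)
        E += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return E

def closed_form_E(lam, mu, f, t):
    """Closed-form solution of the same ODE with E(0) = 1 - f (requires lam != mu)."""
    r = lam - mu
    return 1.0 - f * r / (f * lam - (mu - lam * (1.0 - f)) * math.exp(-r * t))

if __name__ == "__main__":
    # Hypothetical rates and sampling fraction, chosen only for illustration.
    lam, mu, f, t = 0.5, 0.2, 0.6, 5.0
    print(integrate_E(lam, mu, f, t), closed_form_E(lam, mu, f, t))
```

The numerical and closed-form values should agree closely, and lowering *f* raises *E*(*t*) at every time point, because more lineages go unobserved.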

Repeating the calculations above along branches and at nodes, but with the starting conditions above, we obtain the following likelihood (FitzJohn et al. 2009):

(eq. 11.23)

$$ \begin{aligned} L(t_1, t_2, \dots, t_n) = \lambda^{n-1} \big[ \prod_{k = 1}^{2n-2} e^{(\lambda-\mu)(t_{k,b} - t_{k,t})} \cdot \\ \frac{(f \lambda - (\mu - \lambda(1-f))e^{(\lambda - \mu)t_{k,t}})^2}{(f \lambda - (\mu - \lambda(1-f))e^{(\lambda - \mu)t_{k,b}})^2} \big] \end{aligned} $$

Again, the above formula is proportional to the full likelihood, which is:

(eq. 11.24)

$$ L(\tau) = (n-1)! \frac{\lambda^{n-2} \big[ \prod_{k = 1}^{2n-2} e^{(\lambda-\mu)(t_{k,b} - t_{k,t})} \cdot \frac{(f \lambda - (\mu - \lambda(1-f))e^{(\lambda - \mu)t_{k,t}})^2}{(f \lambda - (\mu - \lambda(1-f))e^{(\lambda - \mu)t_{k,b}})^2} \big]}{[1-E(t_{root})]^2} $$

and:

(eq. 11.25)

$$ E(t_{root}) = 1 - \frac{\lambda-\mu}{\lambda - (\lambda-\mu)e^{(\lambda - \mu)t_{root}}} $$
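As a quick consistency check on equation 11.23 (a worked substitution, not part of the cited derivations), setting *f* = 1 removes the sampling terms from each branch factor:

$$ \left. f \lambda - (\mu - \lambda(1-f))e^{(\lambda - \mu)t} \right|_{f=1} = \lambda - \mu e^{(\lambda - \mu)t} $$

so the likelihood then depends only on *λ* and *μ*, as we would expect when every extant species is included in the tree.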

For representative sampling, one approach is to consider the data as divided into two parts, phylogenetic and taxonomic. The taxonomic part is the stem age and extant diversity of the unsampled clades, while the phylogenetic part is the relationships among those clades. Following Rabosky and Lovette (2007), we can then calculate:

(eq. 11.26)

*L*_{total} = *L*_{phylogenetic} ⋅ *L*_{taxonomic}

where *L*_{phylogenetic} can be calculated using equation 11.18, and *L*_{taxonomic} is calculated for each unsampled clade using equation 10.16, with the per-clade values then multiplied together to get the overall taxonomic likelihood.
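The combination of the two parts can be sketched in Python as follows. Because likelihoods are usually handled on the log scale, the product becomes a sum. For the taxonomic part, this sketch assumes the geometric distribution of clade size given a stem age and conditioned on the clade surviving to the present, with *β* = (e^{*rt*} − 1)/(e^{*rt*} − *ε*), *r* = *λ* − *μ*, and *ε* = *μ*/*λ*. That form may differ in detail from equation 10.16, and the function names and example values are hypothetical.

```python
import math

def log_prob_clade_size(n, stem_age, lam, mu):
    """log P(n extant species | stem age, clade survives); assumes lam > mu >= 0."""
    r = lam - mu
    eps = mu / lam
    ert = math.exp(r * stem_age)
    beta = (ert - 1.0) / (ert - eps)
    # Geometric form: P(n) = (1 - beta) * beta^(n - 1)
    return math.log(1.0 - beta) + (n - 1) * math.log(beta)

def total_log_likelihood(log_l_phylo, clades, lam, mu):
    """Add the phylogenetic log-likelihood to the summed per-clade taxonomic terms.

    log_l_phylo: log-likelihood of the backbone tree, computed elsewhere.
    clades: iterable of (extant_diversity, stem_age) pairs, one per tip clade.
    """
    log_l_tax = sum(log_prob_clade_size(n, age, lam, mu) for n, age in clades)
    return log_l_phylo + log_l_tax

if __name__ == "__main__":
    # Hypothetical backbone log-likelihood and three tip clades.
    clades = [(12, 8.0), (3, 5.5), (40, 10.2)]
    print(total_log_likelihood(-25.0, clades, lam=0.4, mu=0.1))
```

In practice both parts would be evaluated at the same *λ* and *μ* inside an optimizer or sampler, so that the backbone tree and the clade diversities jointly inform the rate estimates.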

There are two extensions to this approach that are worth mentioning. One is the diversified sampling ("DS") model of Höhna et al. (2011). This model makes a different assumption: when sampling *n* taxa from an overall set of *m*, the deepest *n* − 1 nodes have been included. This approach allows users to fit a model with representative sampling but without requiring assignment of extant diversity to each clade. Another approach, by Stadler and Smrckova (2016), calculates likelihoods for representatively sampled trees and can fit models of time-varying speciation and extinction rates (see chapter 12).