Functional elements in Drosophila
In a 2007 paper [1], Stark et al. identified evolutionary signatures of different functional elements and predicted function using conserved signatures. One important finding is that across evolutionary time, genes tend to remain in a similar location. This is illustrated by Figure 4.2, which shows the result of a multiple alignment on orthologous segments of genomes from twelve Drosophila species. Each genome is represented by a horizontal blue line, where the top line represents the reference sequence. Grey lines connect orthologous functional elements, and it is clear that their positions are generally conserved across the different species.
Q: Why is it significant that the position of orthologous elements is conserved?
A: The fact that positions are conserved is what allows us to make comparisons across species. Otherwise, we would not be able to align non-coding regions reliably.
Drosophila is a useful genus to study because the twelve fruit fly species are, in fact, more divergent from one another than the mammals are. This brings us to an interesting side note: which species to select when looking at conservation signatures. You do not want very similar species (such as humans and chimpanzees, which share 98% of the genome), because too few substitutions have accumulated to distinguish constrained regions from neutral regions that simply have not changed yet. When comparing species to humans, the right level of divergence to look at is the mammals. Specifically, most research in this field uses 29 eutherian mammals (placental mammals; no marsupials or monotremes). Another thing to take into account is the branch length between any two species. The ideal subjects of study are several relatively closely related species (each separated by a short branch), which avoids the problems of interpretation that arise on long branches, such as back-mutations.
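To see why long branches are hard to interpret, consider the following minimal sketch. It assumes the Jukes-Cantor substitution model (an assumption made here for illustration; the text does not name a model), under which repeated and back-mutations at the same site hide part of the true divergence. The function names are hypothetical.

```python
import math

def jc_observed(d):
    """Expected fraction of sites that differ between two sequences
    separated by d substitutions per site, under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

def jc_distance(p):
    """Invert jc_observed: estimate substitutions per site from the
    observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75); the signal is saturated beyond that")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

if __name__ == "__main__":
    # Short branches: observed differences track true substitutions closely.
    # Long branches: observed differences level off toward 0.75.
    for d in (0.05, 0.5, 2.0, 5.0):
        print(f"true d = {d:4.2f}  observed differences = {jc_observed(d):.3f}")
```

On short branches the observed difference is almost equal to the true number of substitutions, while on long branches it approaches 0.75 and small measurement errors translate into large errors in the inferred distance.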
Rates and patterns of selection
Now that we have established that there is structure to the evolution of genomic sequences, we can begin analyzing specific features of the conservation. For this section, let us consider genomic data at the level of individual nucleotides. Later on in this chapter we will see that we can also analyze amino acid sequences.
We can estimate the intensity of selective constraint, ω, by building a probabilistic model of the substitution rate inferred from genome alignment data. A Maximum Likelihood (ML) estimate of ω gives us the rate of selection as well as a log-odds score that the rate is non-neutral.
One property we can measure in this way is the rate of nucleotide substitution in a genome. Figure 4.3 shows two nucleotide sequences from a collection of mammals. One of the sequences is subject to normal rates of change, while the other demonstrates a reduced rate. Hence we may hypothesize that the latter sequence is subject to a greater level of evolutionary constraint, and may represent a more biologically important section of the genome.
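The idea of an ML estimate of ω with an accompanying log-odds score can be sketched with a deliberately simplified model. The sketch below assumes that the number of substitutions observed in each alignment column is Poisson-distributed with mean ω times the number expected under neutral evolution. Real methods fit a continuous-time substitution model on the phylogeny; the Poisson shortcut, the function names, and the example counts are all assumptions for illustration.

```python
import math

def poisson_loglik(counts, omega, neutral_expected):
    """Log-likelihood of per-column substitution counts when each count is
    Poisson with mean omega * neutral_expected."""
    mean = omega * neutral_expected
    return sum(k * math.log(mean) - mean - math.lgamma(k + 1) for k in counts)

def estimate_omega(counts, neutral_expected):
    """Return (omega_hat, log_odds).

    omega_hat is the ML rate scaling relative to neutral (1.0 = neutral).
    log_odds compares the likelihood at omega_hat with the likelihood at
    omega = 1, so it is never negative; large values suggest the window
    is not evolving neutrally.
    """
    omega_hat = sum(counts) / (len(counts) * neutral_expected)
    # Guard against log(0) when no substitutions were observed at all.
    omega_for_lik = max(omega_hat, 1e-9)
    log_odds = (poisson_loglik(counts, omega_for_lik, neutral_expected)
                - poisson_loglik(counts, 1.0, neutral_expected))
    return omega_hat, log_odds

if __name__ == "__main__":
    # Hypothetical window of 10 columns with 1.0 substitution expected
    # per column under neutrality, but only 2 observed in total.
    window = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    omega_hat, log_odds = estimate_omega(window, neutral_expected=1.0)
    print(f"omega_hat = {omega_hat:.2f}, log-odds vs neutral = {log_odds:.2f}")
```

An estimate of ω well below 1 together with a high log-odds score corresponds to the reduced-rate sequence described above.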
We can further detect unusual patterns of selection, π, by using a probabilistic model whose stationary distribution differs from the background distribution. The ML estimate of π provides us with a Probability Weight Matrix (PWM) for each k-mer in the genome, as well as a log-odds score for substitutions that are unusual (e.g., one base changing to one and only one other base). As Figure 4.4 shows, the specific letters matter: some bases selectively change to only one (or two) other bases, and the specific base a position changes to may suggest the function of the sequence.
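A minimal sketch of scoring a pattern of selection for a single alignment column follows. It assumes a uniform background distribution and treats the species as independent samples, ignoring the phylogeny; both are simplifications made here for illustration, and the function names are hypothetical. Stacking the estimated distributions for k consecutive columns would give a k-column weight matrix.

```python
import math
from collections import Counter

# Assumed background distribution (uniform); a real analysis would
# estimate it from neutral sequence.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def base_counts(column):
    """Count A/C/G/T in an alignment column, ignoring gaps and other symbols."""
    return Counter(b for b in column.upper() if b in "ACGT")

def estimate_pi(column):
    """ML estimate of the base distribution at this column (observed frequencies)."""
    counts = base_counts(column)
    n = sum(counts.values())
    return {b: counts[b] / n for b in "ACGT"}

def pattern_log_odds(column, background=BACKGROUND):
    """Log-likelihood ratio of the estimated distribution versus the background."""
    counts = base_counts(column)
    pi = estimate_pi(column)
    return sum(counts[b] * math.log(pi[b] / background[b])
               for b in "ACGT" if counts[b] > 0)

if __name__ == "__main__":
    # Hypothetical column from eight species: the site only ever holds A or G.
    column = "AAGAGAAG"
    print(estimate_pi(column))
    print(f"log-odds vs background = {pattern_log_odds(column):.2f}")
```

A column that toggles between only two bases scores well above a column in which all four bases appear equally often, which scores zero.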
We can increase our power to detect constrained elements by looking at more species, as shown in Figure 4.5, where we see a dramatic increase in the power to detect small constrained elements.
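One way to see why more species help is a back-of-the-envelope calculation. Assume, as a simplification, that a neutral site escapes substitution across the whole tree with probability exp(-T), where T is the total branch length in substitutions per site, and that sites are independent. A run of k perfectly conserved neutral sites then occurs by chance with probability exp(-kT). The branch lengths below are illustrative values, not measurements, and the function name is hypothetical.

```python
import math

def min_detectable_length(total_branch_length, alpha=0.01):
    """Smallest k such that a run of k fully conserved neutral sites
    has chance probability exp(-k * T) <= alpha."""
    return math.ceil(-math.log(alpha) / total_branch_length)

if __name__ == "__main__":
    # Adding species adds branches, which increases the total branch length T.
    for t in (0.5, 1.0, 2.0, 4.0):
        k = min_detectable_length(t)
        print(f"total branch length {t:3.1f} -> shortest detectable element: {k} bp")
```

Under these assumptions, the shortest element that can be distinguished from chance conservation shrinks in proportion to 1/T as species are added.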