A recurring theme of this work is to take a global computational approach to analyzing elements of genes and RNAs encoded in the genome and to use it to find interesting new biological phenomena. We can do this by seeing how individual examples diverge from the average case. For example, by examining many protein-coding genes, we can identify features representative of that class of loci. We can then devise highly accurate tests for distinguishing protein-coding from non-protein-coding genes. Often, these computational tests, based on thousands of examples, will be far more definitive than conventional low-throughput wet lab tests. (Such wet lab tests can include mass spectrometry to detect protein products, in cases where we want to know whether a particular locus is protein coding.)
Motivation and Challenge
As the cost of genome sequencing continues to drop, the availability of sequenced genome data has exploded. However, analysis of the data has not kept pace, and many interesting biological phenomena still lie undiscovered in the endless strings of ATGCs. The goal of comparative genomics is to leverage the vast amounts of information available to look for biological patterns.
As the name suggests, comparative genomics does not focus on a single genome. The problem with focusing purely on a single genome is that key evolutionary signatures are missed. Comparative genomics solves this problem by comparing genomes from many species that evolved from a common ancestor. As evolution changes a species' genome, it leaves behind traces of its action. We will see later in this chapter that evolution discriminates between portions of a genome on the basis of biological function. By exploiting this correlation between evolutionary fingerprints and the biological role of a genomic subsequence, comparative genomics can direct wet lab research to interesting portions of the genome and uncover new biological phenomena.
Q: Why do mutations accumulate only in certain regions of the genome, whereas other regions are conserved?
A: In non-functional regions of DNA, accumulated mutations persist because there is no function for them to disrupt. In functional regions, mutations can lead to decreased fitness; these mutations are then removed from the species by natural selection.
We can glean much information about evolution through studying genomics, and, similarly, we can learn about the genome through studying evolution. For example, using the principle of “survival of the fittest,” we can compare related species to discover which portions of the genome are functional elements. The evolutionary process introduces mutations into any genome. In non-functional regions of DNA, accumulated mutations persist because there is no function for them to disrupt. In functional regions, however, mutations often lead to decreased fitness, so they are unlikely to be passed on to future generations. As time progresses, evolutionarily unfit organisms are unlikely to survive, and their genes thin out. By comparing the genomes of surviving related species, we can see which portions constitute functional elements and which constitute “junk DNA.”
To date, various important biological markers and phenomena have been discovered through comparative genomics methods. For example, CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats), found in bacteria and archaea, were first discovered through comparative genomics. Follow-up experiments revealed that they provide adaptive immunity to plasmids and phages. Another example, which we will look at later in this chapter, is the phenomenon of stop-codon read-through, where stop codons are occasionally ignored during the translation phase of protein biosynthesis. Without comparative genomics to guide them, experimentalists might have overlooked both of these features for many years.
Without a system for interpreting and identifying important features in genomes, all of the DNA sequences on earth are just a meaningless sea of data. Neither computer science nor biology can be ignored in comparative genomics. Without knowledge of biology, one might miss the signatures of synonymous substitutions or frameshift mutations. On the other hand, ignoring computational approaches would make it impossible to parse the ever larger datasets emerging from sequencing centers. Comparative genomics requires a rare combination of multidisciplinary skills and insight.
This is a particularly exciting time to enter the field of comparative genomics: the field is mature enough that tools and data are available to make discoveries, yet young enough that important findings will likely continue to be made for many years.
Importance of many closely related genomes
In order to resolve significant biological features we need both sufficient similarity to enable comparison and sufficient divergence to identify signatures of change over evolutionary time. This is difficult to achieve in a pairwise comparison. We improve the resolution of our analysis by extending analysis to many genomes simultaneously with some clusters of similar organisms and some dissimilar organisms. A simple analogy is one of observing an orchestra. If you place a single microphone, it will be difficult to decipher the signal coming from the entire system, because it will be overwhelmed by the local noise from the single point of observation, the nearest instrument. If you place many microphones distributed across the orchestra at reasonable distances, then you get a much better perspective not only on the overall signal, but also on the structure of the local noise. Similarly, by sequencing many genomes across the tree of life we are able to distinguish the biological signals of functional elements from the noise of neutral mutations. This is because nature selects for conservation of functional elements across large phylogenetic distances while constantly introducing noise through mutagenic processes operating at shorter time scales.
In this chapter, we will assume that we already have a complete genome-wide alignment of multiple closely related species, spanning both coding and non-coding regions. In practice, constructing complete genome assemblies and whole-genome alignments is a very challenging problem; that will be the topic of the next chapter.
Q: Why is there more resolving power when the evolutionary distance or branch length between species increases?
A: If we are comparing two species that are very close to each other, such as human and chimp, we expect to see few or no mutations. This gives us little discriminative power, because we see no difference between the number of mutations in functional elements and the number in non-functional elements. As we increase the evolutionary time between species, we expect to see more mutations everywhere, but what we actually see is a notably lower number of mutations in certain regions of the genome. We can conclude that these regions are functional. Therefore, our confidence in putative functional elements increases as branch length increases.
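A back-of-the-envelope calculation can make this answer concrete. The sketch below assumes a deliberately simple model that is not taken from this chapter: sites evolve independently, and substitutions at a neutral site follow a Poisson process, so a site stays unchanged over a total branch length t (in expected substitutions per site) with probability e^(-t). The function name and the example branch lengths are illustrative only.

```python
import math

def prob_window_unchanged(window_len, branch_length):
    """Chance that every site in a neutral window is unchanged.

    Assumes independent sites and Poisson-distributed substitutions,
    with branch_length in expected substitutions per site.
    """
    per_site = math.exp(-branch_length)
    return per_site ** window_len

# Illustrative (made-up) total branch lengths, in substitutions per site.
for t in (0.01, 0.5, 4.0):
    p = prob_window_unchanged(8, t)
    print(f"branch length {t}: P(8-mer unchanged by chance) = {p:.3g}")
```

Under these assumptions, an unchanged 8-mer is the expected outcome at a branch length of 0.01 (probability roughly 0.92), so observing it says little about function. At a branch length of 0.5 the probability drops to about 0.018, and at 4.0 it is vanishingly small, so a perfectly conserved window is much more likely to reflect selection.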
Q: Why is it better to have many closely related species for the same branch length rather than one distantly related species?
A: As branch length increases between distantly related species, even functional elements may no longer be conserved. Furthermore, reliably aligning genes from distantly related species is difficult, if not impossible, with current tools such as BLAST. Many closely related species can supply the same total branch length while each genome remains similar enough to the others to align reliably.
Comparative genomics and evolutionary signatures
Given a genome-wide alignment, we can then analyze the level of conservation of functional elements in each of the genomes considered. Using the UCSC Genome Browser, one can see a level of conservation for every gene in the human genome, derived from aligning the genomes of many other species. In Figure 4.1 below, the DNA sequence is represented on the x-axis, and each “row” represents a different species. The y-axis within each row represents the amount of conservation for that species in that part of the chromosome (though other species that are not shown were also used to calculate conservation). Higher bars correspond to greater conservation.
From this figure, we can see that there are blocks of conservation separated by regions that are not conserved. The 12 exons (highlighted by red rectangles) are mostly conserved across species, but certain exons are sometimes missing; for example, zebrafish is missing exon 9. However, we also see a spike in some species (circled in red) that does not correspond to a known protein-coding gene. This tells us that some intronic regions have also been evolutionarily conserved, since DNA regions that do not code for proteins can still be important as functional elements, such as RNA genes, microRNAs, and regulatory motifs. By observing how regions are conserved, instead of just looking at the amount of conservation, we can observe ‘evolutionary signatures’ of conservation for different functional elements.
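To illustrate how blocks of conservation can be read off an alignment, here is a minimal sketch. It is a simplification for illustration and not how the UCSC browser computes its conservation track: each column is scored by the fraction of sequences that carry the most common base, and runs of high-scoring columns are reported as blocks. The function names, the threshold, the minimum block length, and the toy alignment are all assumptions.

```python
def column_conservation(alignment):
    """Fraction of sequences carrying the most common base in each column.

    `alignment` is a list of equal-length strings; '-' marks a gap and
    never counts as the most common base.
    """
    n_cols = len(alignment[0])
    scores = []
    for col in range(n_cols):
        bases = [seq[col] for seq in alignment if seq[col] != "-"]
        if not bases:
            scores.append(0.0)
            continue
        top = max(set(bases), key=bases.count)
        scores.append(bases.count(top) / len(alignment))
    return scores

def conserved_blocks(scores, threshold=0.9, min_len=3):
    """Return (start, end) half-open intervals where scores stay >= threshold."""
    blocks, start = [], None
    # The -inf sentinel closes a block that runs to the end of the alignment.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks

# Toy alignment (made up): two conserved stretches around a variable one.
toy = [
    "ATGGCCTTAGACGT",
    "ATGGCCATCGACGT",
    "ATGGCCGAAGACGT",
    "ATGGCCCGTGACGT",
]
scores = column_conservation(toy)
print(conserved_blocks(scores))
```

On this toy input the two flanking stretches are reported as blocks and the three variable middle columns are not. A score of this kind only captures the amount of conservation; it says nothing about how a region is conserved.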
The pattern of mutations, insertions, and deletions can help us distinguish different types of functional elements in the genome. Different functional elements are under different selective pressures, and by considering which selective pressures each element is under, we can develop evolutionary signatures characteristic of each function. For example, protein-coding genes exhibit different evolutionary signatures from regulatory motifs and other classes of elements.
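As one hedged illustration of what such a signature might look like in code, the sketch below tallies two simple statistics for a pair of aligned sequences: which codon position each substitution falls on, and how long each gap is. The working assumption, commonly made for protein-coding regions and stated here for illustration rather than drawn from the text above, is that substitutions concentrate at the third codon position and gap lengths are multiples of three, so the reading frame is preserved. The function names and toy sequences are hypothetical.

```python
import re
from collections import Counter

def substitution_positions(ref, other, frame=0):
    """Count mismatches by codon position (0, 1, 2) relative to `frame`.

    Columns with a gap in either sequence are skipped. Position is taken
    from the alignment column, so this assumes gaps do not shift the frame.
    """
    counts = Counter()
    for i, (a, b) in enumerate(zip(ref, other)):
        if a != "-" and b != "-" and a != b:
            counts[(i - frame) % 3] += 1
    return counts

def gap_lengths(seq):
    """Lengths of each run of '-' in an aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", seq)]

# Toy pair (made up): codons ATG GCT GAA ACC GGA TAA versus a diverged copy.
ref = "ATGGCTGAAACCGGATAA"
other = "ATGGCCGAG---GGTTAA"

print(substitution_positions(ref, other))
print(gap_lengths(other))
```

In this toy pair every substitution lands on the third codon position and the single gap spans exactly one codon, which is the kind of pattern one would look for in a coding region. A non-coding element would not be expected to show this three-base periodicity.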
Q: Given an alignment of genes from multiple species, what can you measure to determine the level of conservation of a specific gene or set of genes?
A: One simple method is to look at the alignment score for each gene. To distinguish highly conserved protein-coding segments from non-protein-coding segments, one may also look at codon conservation. In both approaches, however, we have to take into account where the species being compared sit in the phylogenetic tree. A lower pairwise score between two distantly related species than between two closely related species does not necessarily imply lower conservation, because more divergence is expected over the greater distance.
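The point about phylogenetic position can be sketched numerically. The example below compares observed pairwise identity with the identity expected for neutrally evolving DNA at a given distance. It assumes the Jukes-Cantor substitution model, which is a choice made here for illustration and not a method described in this chapter; the sequences, species labels, and distances are made up, and a 20-base window is far too short for real inference.

```python
import math

def percent_identity(a, b):
    """Fraction of ungapped aligned columns where the two sequences match."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def expected_neutral_identity(distance):
    """Expected identity of neutral DNA under the Jukes-Cantor model.

    `distance` is the branch length in expected substitutions per site.
    """
    return 0.25 + 0.75 * math.exp(-4.0 * distance / 3.0)

def excess_identity(a, b, distance):
    """Observed identity minus what neutral drift alone would predict."""
    return percent_identity(a, b) - expected_neutral_identity(distance)

# Hypothetical aligned sequences and distances (substitutions per site).
reference     = "ATGGCCTTAGACGTACCGTA"
close_species = "ATGGCCTTAGACGTACCGTT"  # 1 mismatch, assumed distance 0.02
far_species   = "ATAGCCTCAGACATACCCTA"  # 4 mismatches, assumed distance 0.8

print(excess_identity(reference, close_species, 0.02))
print(excess_identity(reference, far_species, 0.8))
```

Here the distant pair has the lower raw identity (0.80 versus 0.95), yet it exceeds its neutral expectation by a wide margin, while the close pair does not. Under these assumptions, the distant comparison is the one that provides evidence of constraint.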