Regulatory motifs are another class of functional element that is highly conserved across many genomes. A regulatory motif is a short, highly conserved sequence of nucleotides that occurs many times throughout the genome and serves some regulatory function. For instance, these motifs might characterize enhancers, promoters, or other regulatory elements.
Computationally Detecting Regulatory Motifs
Computational methods have been developed to measure the conservation of known regulatory motifs across the genome and to find new, unannotated motifs de novo. Known motifs are often found in highly conserved regions, so we can increase our statistical power by first testing for conservation and then looking for the signatures of regulatory motifs within the conserved regions.
Evaluating the pattern of conservation for known motifs versus the “null model” of regions without motifs gives the following signature:
| | Gal4 (known motif region) | All intergenic regions |
|---|---|---|
| Intergenic : coding | 13% : 3% | 2% : 7% |
| Upstream : downstream | 12 : 0 | 1 : 1 |
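The known motif is conserved far more often in intergenic regions than in coding regions (13% versus 3%) and almost exclusively upstream of genes (12 : 0), whereas the null model of all intergenic regions shows neither bias (2% versus 7%, and 1 : 1).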
To discover novel motifs, we can use the following pipeline:
- Pick a motif “seed” consisting of two groups of three non-degenerate characters with a variable-size gap in the middle.
- Use a conservation ratio to rank the seed motifs.
- Expand the seed motifs to fill in the bases around the seeds using a hill-climbing algorithm.
- Cluster the resulting motifs to remove redundancy.
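The first two steps of this pipeline can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published method: it assumes an ungapped toy alignment in which every species' sequence has the same length and shares coordinates with the reference, and it defines the conservation ratio as the fraction of reference instances of a seed that are matched exactly, at the same position, in every other species. The function name `rank_seeds` and its parameters are hypothetical.

```python
from collections import defaultdict


def rank_seeds(reference, others, gaps=range(0, 4), min_count=2):
    """Enumerate gapped seeds (3 bases, gap, 3 bases) found in the
    reference and rank them by a toy conservation ratio.

    reference : sequence of the reference species
    others    : aligned sequences of the other species (same coordinates)
    gaps      : gap sizes to try between the two 3-base groups
    min_count : ignore seeds with fewer reference instances than this
    """
    counts = defaultdict(lambda: [0, 0])  # seed -> [total, conserved]
    for gap in gaps:
        span = 6 + gap
        for i in range(len(reference) - span + 1):
            left = reference[i:i + 3]
            right = reference[i + 3 + gap:i + span]
            # Non-degenerate characters only: skip N or other symbols.
            if not set(left + right) <= set("ACGT"):
                continue
            key = (left, gap, right)
            counts[key][0] += 1
            # Assumption: "conserved" means an exact match of both
            # 3-base groups at the same position in every other species.
            if all(s[i:i + 3] == left and s[i + 3 + gap:i + span] == right
                   for s in others):
                counts[key][1] += 1
    ranked = [(conserved / total, total, key)
              for key, (total, conserved) in counts.items()
              if total >= min_count]
    # Highest conservation ratio first; ties broken by instance count.
    ranked.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return ranked


# Toy data (made up for illustration): two instances of CGG-NN-CCG.
reference = "AACGGTTCCGAATTCGGAACCGTT"
species_2 = "GTCGGACCCGTCCACGGTGCCGAG"
for ratio, total, seed in rank_seeds(reference, [species_2], gaps=[2]):
    print(seed, total, ratio)
```

The expansion and clustering steps are not shown. In this sketch, a hill-climbing expansion would repeatedly try extending or relaxing a seed and keep a change only when the conservation ratio improves.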
Motif discovery followed by clustering has revealed many classes of motifs, such as tissue-specific motifs, function-specific motifs, and modules of cooperating motifs.
Individual Instances of Regulatory Motifs
To evaluate individual candidate motif instances, we can first calculate a branch length score for a region suspected to be a regulatory motif, and then use this score to obtain a confidence level for how likely the instance is to be a real motif.
The branch length score (BLS) sums evidence for a given motif over the branches of a phylogenetic tree. Given the pattern of presence or absence of a motif in each species in the tree, the score is the total branch length of the subtree connecting the species that contain the motif, expressed as a fraction of the total branch length of the tree. If all species have the motif, the BLS is 100%. Note that a motif shared by more distantly related species receives a higher score, since the connecting subtree spans a longer evolutionary distance. If a predicted motif has persisted over such a long evolutionary time, it is more likely to be a functional element than a region conserved by random chance.
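The BLS computation can be sketched as follows. This is a minimal sketch under stated assumptions: the tree is given as a hypothetical parent map (each non-root node maps to its parent and branch length), the root has at least two children, and the species names and branch lengths are made up for illustration. A branch belongs to the connecting subtree exactly when some, but not all, of the motif-containing species lie below it.

```python
def branch_length_score(parents, present):
    """Fraction of total tree branch length covered by the subtree
    connecting the species in `present`.

    parents : dict mapping each non-root node to (parent, branch_length)
    present : iterable of leaf names in which the motif was found
    """
    present = set(present)
    below = {}  # node -> number of motif-containing species at or below it
    for leaf in present:
        node = leaf
        below[node] = below.get(node, 0) + 1
        while node in parents:  # walk up to the root
            node = parents[node][0]
            below[node] = below.get(node, 0) + 1
    total = sum(length for _, length in parents.values())
    # A branch is in the connecting subtree if it separates at least one
    # motif-containing species from at least one other.
    connected = sum(length for node, (_, length) in parents.items()
                    if 0 < below.get(node, 0) < len(present))
    return connected / total


# Hypothetical four-species tree: ((A:0.1, B:0.1):0.2, (C:0.3, D:0.5):0.4)
toy_tree = {
    "A": ("AB", 0.1), "B": ("AB", 0.1), "AB": ("root", 0.2),
    "C": ("CD", 0.3), "D": ("CD", 0.5), "CD": ("root", 0.4),
}
print(branch_length_score(toy_tree, ["A", "B"]))  # close relatives
print(branch_length_score(toy_tree, ["A", "C"]))  # distant relatives
print(branch_length_score(toy_tree, ["A", "B", "C", "D"]))
```

On this toy tree, the closely related pair A and B scores about 0.125, the more distant pair A and C scores about 0.625, and a motif present in all four species scores 1, which illustrates why conservation across distant species counts for more.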
To create a null model, we can choose control motifs. The control motifs should have the same composition as the original motif, should not be too similar to each other, and should be dissimilar from known motifs. We can then obtain a confidence score by comparing the fraction of motif instances conserved at a given BLS with the fraction of control motif instances conserved at the same BLS.
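A rough sketch of both steps is given below. Several choices here are assumptions rather than details from the text: control motifs are produced by shuffling the original motif (which preserves composition), dissimilarity is measured by Hamming distance with an arbitrary threshold, "at a given BLS" is read as "at or above a BLS threshold", and the confidence is taken to be one minus the ratio of the control fraction to the motif fraction. The names `control_motifs` and `confidence` are hypothetical, and the BLS lists are assumed to be non-empty.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def control_motifs(motif, known, n=5, min_dist=3, seed=0, max_tries=10000):
    """Shuffle `motif` to get up to n controls with the same composition
    that differ from the motif, from known motifs, and from each other."""
    rng = random.Random(seed)
    controls = []
    for _ in range(max_tries):
        if len(controls) == n:
            break
        letters = list(motif)
        rng.shuffle(letters)
        candidate = "".join(letters)
        others = [motif, *known, *controls]
        # Assumption: motifs of a different length count as dissimilar.
        if all(len(o) != len(candidate) or hamming(candidate, o) >= min_dist
               for o in others):
            controls.append(candidate)
    return controls


def confidence(motif_bls, control_bls, threshold):
    """Assumed formula: 1 - (control fraction / motif fraction), where each
    fraction counts instances with BLS at or above `threshold`."""
    motif_frac = sum(s >= threshold for s in motif_bls) / len(motif_bls)
    control_frac = sum(s >= threshold for s in control_bls) / len(control_bls)
    if motif_frac == 0:
        return 0.0
    return max(0.0, 1.0 - control_frac / motif_frac)


# Made-up example values, for illustration only.
print(control_motifs("CGGAATCCG", known=["TGACTCA"]))
print(confidence([0.9, 0.8, 0.2, 0.7], [0.9, 0.1, 0.2, 0.3], threshold=0.5))
```

Under this reading, a confidence near 1 means that few control instances reach the chosen BLS compared with real motif instances, while a confidence near 0 means the motif is conserved no more often than its shuffled controls.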