We have already explored dynamic programming, sequence alignment, sequence classification and modeling, hidden Markov models, and expectation maximization. In this chapter, we will see how these techniques also help identify novel motifs and elucidate their functions.
The regulatory code: Transcription Factors and Motifs
Motifs are short (6-8 bases long), recurring patterns that have well-defined biological functions. Motifs include DNA patterns in enhancer regions or promoter motifs, as well as motifs in RNA sequences such as splicing signals. As we have discussed, genetic activity is regulated in response to environmental variations. Motifs are responsible for recruiting Transcription Factors, or regulatory proteins, to the appropriate target gene. Motifs can also be recognized by microRNAs, which bind to motifs through sequence complementarity; nucleosomes, which recognize motifs based on their GC content; and other RNAs, which use a combination of DNA sequence and structure. Once bound, these regulators can activate or repress the expression of the associated gene.
Transcription factors (TFs) can use several mechanisms in order to control gene expression, including acetylation and deacetylation of histone proteins, recruitment of cofactor molecules to the TF-DNA complex, and stabilization or disruption of RNA-DNA interfaces during transcription. They often regulate a group of genes that are involved in similar cellular processes. Thus, genes that contain the same motif in their upstream regions are likely to be related in their functions. In fact, many regulatory motifs are identified by analyzing the regions upstream of genes known to have similar functions.
Motifs have become exceedingly useful for defining genetic regulatory networks and deciphering the functions of individual genes. With our current computational abilities, regulatory motif discovery and analysis have progressed considerably and remain at the forefront of genomic studies.
Challenges of motif discovery
Before we can get into algorithms for motif discovery, we must first understand the characteristics of motifs, especially those that make motifs difficult to find. As mentioned above, motifs are generally very short, usually only 6-8 base pairs long. Additionally, motifs can be degenerate, meaning that only the nucleotides at certain locations within the motif affect the motif’s function. This degeneracy arises because transcription factors are free to interact with their corresponding motifs in ways more complex than a simple complementarity relation. As seen in Figure 17.1, many proteins interact with the motif not by opening up the DNA to check for base complementarity, but instead by scanning the spaces, or grooves, between the two sugar-phosphate backbones. Depending on the physical structure of the transcription factor, the protein may only be sensitive to the difference between purines and pyrimidines, or between weak and strong bases, as opposed to identifying specific base pairs. The topology of the transcription factor may even mean that certain nucleotides are not contacted at all, allowing those bases to act as wildcards.
This issue of degeneracy within a motif poses a challenging problem. If we were only looking for a fixed k-mer, we could simply search for the k-mer in all the sequences we are looking at using local alignment tools. However, the motif may vary from sequence to sequence. Because of this, a string of nucleotides that is known to be a regulatory motif is said to be an instance of a motif, because it represents one of potentially many different combinations of nucleotides that fulfill the function of the motif.
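One common way to write down a degenerate motif is with the IUPAC nucleotide codes, where for example R stands for a purine, Y for a pyrimidine, S and W for strong and weak bases, and N for any base. The sketch below is only an illustration of how instances of such a motif could be enumerated in a sequence; the function name `find_instances`, the consensus `TGANTCA`, and the test sequence are all made up for this example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "S": "[CG]",    # strong (three hydrogen bonds)
    "W": "[AT]",    # weak (two hydrogen bonds)
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",  # wildcard: any base
}

def find_instances(consensus, sequence):
    """Return (position, instance) for every match of a degenerate consensus.

    Hypothetical helper for illustration; overlapping matches are reported.
    """
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # A lookahead lets the regex engine report overlapping matches.
    regex = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]

# Illustrative data only: position 4 of the consensus is a wildcard.
hits = find_instances("TGANTCA", "ccTGACTCAggTGAGTCAttTGATTGA")
for position, instance in hits:
    print(position, instance)
```

Here the two strings TGACTCA and TGAGTCA are both reported as instances of the same motif, while TGATTGA is not, because it differs at a position the consensus fixes.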
In our approaches, we make two assumptions about the data. First, we assume that there are no pairwise correlations between bases, i.e. that each base is independent of every other base. While such correlations do exist in real life, considering them in our analysis would lead to an exponential growth of the parameter space being considered, and consequently we would run the risk of overfitting our data. The second assumption we make is that all motifs have fixed lengths; indeed, this approximation simplifies the problem greatly. Even with these two assumptions, however, motif finding is still a very challenging problem. The relatively small size of motifs, along with their great variety, makes it fairly difficult to locate them. In addition, a motif’s location relative to the corresponding gene is far from fixed; the motif can be upstream or downstream, and the distance between the gene and the motif also varies. Indeed, sometimes the motif is as far as 10k to 10M base pairs from the gene.
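To make the overfitting concern concrete, here is a rough parameter count, assuming a motif of fixed length k over the four-letter DNA alphabet. The symbols N_independent and N_full are introduced here only for illustration. Under the independence assumption, each position needs 3 free parameters (the fourth base frequency is determined by the other three), whereas a model that allowed arbitrary dependencies among all positions would need one probability per possible k-mer:

```latex
N_{\text{independent}} = 3k, \qquad N_{\text{full}} = 4^{k} - 1
% For k = 8: N_independent = 24, while N_full = 65535.
```

With only a handful of known instances per motif, the second kind of model cannot be estimated reliably.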
Motifs summarize TF sequence specificity
Because motif instances exhibit great variety, we generally use a Position Weight Matrix (PWM) to characterize the motif. This matrix gives the frequency of each base at each location in the motif. The figure below shows an example PWM, where p_ck corresponds to the frequency of base c in position k within the motif, with p_c0 denoting the distribution of bases in non-motif regions.
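As a minimal sketch of how such a matrix might be estimated and used, the code below builds p_ck from a set of aligned instances and scores a candidate window by its log-odds against the background p_c0. The pseudocount, the uniform background, the function names, and the example instances are assumptions made for this illustration, not part of the definition above.

```python
import math

BASES = "ACGT"

def build_pwm(instances, pseudocount=1.0):
    """Estimate p_ck from aligned, equal-length motif instances.

    pwm[k][c] is the frequency of base c at position k (0-based here).
    The pseudocount is an assumption that avoids zero probabilities.
    """
    length = len(instances[0])
    pwm = []
    for k in range(length):
        counts = {c: pseudocount for c in BASES}
        for instance in instances:
            counts[instance[k]] += 1
        total = sum(counts.values())
        pwm.append({c: counts[c] / total for c in BASES})
    return pwm

def log_odds(pwm, background, window):
    """Sum over positions of log2(p_ck / p_c0); relies on position independence."""
    return sum(math.log2(pwm[k][c] / background[c]) for k, c in enumerate(window))

def best_window(pwm, background, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: log_odds(pwm, background, sequence[i:i + width]))
    return best, log_odds(pwm, background, sequence[best:best + width])

# Illustrative data only.
instances = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
background = {c: 0.25 for c in BASES}  # assumed uniform p_c0
pwm = build_pwm(instances)
start, score = best_window(pwm, background, "GGGGTGACTCAGGGG")
print(start, round(score, 2))
```

Because the score is a sum of per-position terms, it depends directly on the assumption that bases within the motif are independent of one another.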
We now define the problem of motif finding more rigorously. We assume that we are given a set of co-regulated and functionally related genes. Many motifs were previously discovered through footprinting experiments, which isolate sequences bound by specific transcription factors and are therefore more likely to correspond to motifs. There are several computational methods that can be used to locate motifs:
- Perform a local alignment across the set of sequences and explore the alignments that resulted in a very high alignment score.
- Model the promoter regions using a Hidden Markov Model and then use a generative model to find non-random sequences.
- Reduce the search space by applying prior knowledge for what motifs should look like.
- Search for conserved blocks between different sequences.
- Examine the frequency of kmers across regions highly likely to contain a motif.
- Use probabilistic methods, such as EM, Gibbs sampling, or a greedy algorithm.
The fifth method, using relative kmer frequencies to discover motifs, presents a few challenges to consider. For example, there could be many common words that occur in these regions that are in fact not regulatory motifs but instead different sets of instructions. Furthermore, given a list of words that could be a motif, it is not certain that the most likely motif is the most common word; for instance, while motifs are generally overrepresented in promoter regions, transcription factors may be unable to bind if an excess of motifs is present. One possible solution to this problem is to find kmers with maximum relative frequency in promoter regions as compared to background regions. This strategy is commonly performed as a post-processing step to narrow down the number of possible motifs.
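The relative-frequency idea can be sketched as follows: count every kmer in the promoter set and in a background set, then rank kmers by the ratio of the two frequencies. This is a minimal illustration under stated assumptions; the function names, the pseudocount smoothing, and the toy sequences are hypothetical, and a real analysis would also need to handle reverse complements and statistical significance.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping kmer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts, sum(counts.values())

def enriched_kmers(promoters, background, k, pseudocount=1.0, top=5):
    """Rank kmers by promoter frequency divided by background frequency.

    The pseudocount (an assumption) keeps the ratio finite for kmers
    that never occur in the background set.
    """
    fg, fg_total = kmer_counts(promoters, k)
    bg, bg_total = kmer_counts(background, k)
    n_possible = 4 ** k
    ratios = {}
    for kmer, count in fg.items():
        fg_freq = (count + pseudocount) / (fg_total + pseudocount * n_possible)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount * n_possible)
        ratios[kmer] = fg_freq / bg_freq
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:top]

# Toy data only: the promoters share the word TGACTCA, the background does not.
promoters = ["AATGACTCAGG", "CCTGACTCATT", "GTGACTCAAC"]
background = ["AAGGCCTTAAGG", "CCTTGGAACCTT", "GGAACCTTGGAA"]
for kmer, ratio in enriched_kmers(promoters, background, k=7, top=3):
    print(kmer, round(ratio, 3))
```

Ranking by the ratio rather than by raw count is what keeps words that are common everywhere in the genome from dominating the list.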
In the next section, we will talk more about these probabilistic algorithms as well as methods to use kmer frequency for motif discovery. We will also come back to the idea of using kmers to find motifs in the context of using evolutionary conservation for motif discovery.