Biology LibreTexts

9.12: Locating information within DNA

Given that genes exist within a genome, for them to be useful there need to be mechanisms by which specific genes can be recognized and expressed (transcribed)282. Recognizing genes involves a two-component system. The first component consists of regulatory nucleotide sequences that provide a molecular address: they identify a specific region of a DNA molecule, as well as which strand of the DNA should be transcribed. The second component consists of proteins that recognize (and bind to) specific DNA sequences. The regulatory region of a gene can be simple and relatively short, or long and complex. In some human genes, the regulatory region is spread over thousands of base pairs of DNA, located “upstream” or “downstream” of, and within, the coding region283. This is possible because, over long distances and in association with proteins, DNA can fold back on itself.

The proteins that bind to regulatory sequences are known as transcription factors284. In early genetic studies, two general types of mutations were found that could influence the activity of a gene. “cis” mutations are located near the gene’s coding (transcribed) region; these mutations alter the gene’s regulatory regions. “trans” mutations map to other (distant) sites; they turn out to alter the genes that encode the transcription factors involved in the target gene’s regulation. Transcription factors can act either positively, to recruit and activate DNA-dependent RNA polymerase, or negatively, to block RNA polymerase binding and activity. Post-translational modifications and the binding of allosteric factors can also alter the activity of a transcription factor. The pattern of transcription factor binding within a regulatory region can also influence whether a gene is expressed or not.

Genes that efficiently recruit and activate RNA polymerase will make many copies of the transcribed RNA and are said to be highly expressed. Generally, high levels of mRNA will lead to high levels of the encoded polypeptide. A mutation in a gene encoding a transcription factor can influence the expression of many genes, while a mutation in a gene’s regulatory sequence will directly affect only that gene’s expression, unless of course the gene encodes a transcription factor or its activity influences the regulatory circuitry of the cell.

Transcription regulatory proteins recognize specific DNA sequences by interacting with the edges of base pairs accessible through the major or minor grooves of the DNA helix. There are a number of different types of transcription factors, with structurally distinct DNA-binding domains; they can be grouped into various (presumably evolutionarily related) families285. The binding affinity of a particular transcription factor for a particular regulatory sequence will be influenced by the DNA sequence as well as by the binding of other proteins in the molecular neighborhood. We can compare the affinities of different proteins for different binding sites by using an assay in which the protein and short DNA molecules containing a particular nucleotide sequence are mixed in a 1:1 molar ratio, that is, equal numbers of protein and DNA molecules:

DNA sequence + protein ⇆ DNA:protein.
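To put numbers on this reaction, here is a minimal sketch in Python. It assumes simple one-to-one binding described by a single dissociation constant, a quantity the text itself does not name; the function name `fraction_bound` and the concentrations used are illustrative only.

```python
import math

def fraction_bound(total_conc, kd):
    """Fraction of DNA bound at equilibrium when DNA and protein are mixed 1:1.

    Assumes simple one-to-one binding, DNA + protein <-> DNA:protein, with
    dissociation constant kd = [DNA][protein] / [DNA:protein].
    total_conc is the total concentration of each partner (same units as kd).
    """
    if total_conc <= 0 or kd < 0:
        raise ValueError("total_conc must be positive and kd non-negative")
    # With x = [DNA:protein]: kd = (C - x)^2 / x, so x^2 - (2C + kd) x + C^2 = 0.
    # The physically valid root is x = 2 C^2 / (b + s), written this way to
    # avoid subtracting two nearly equal numbers when kd is large.
    b = 2 * total_conc + kd
    s = math.sqrt(b * b - 4 * total_conc ** 2)
    return 2 * total_conc / (b + s)

# Hypothetical dissociation constants, in the same units as the concentration.
for kd in (1e-6, 1.0, 1e6):
    print(kd, round(100 * fraction_bound(1.0, kd), 2), "% bound")
```

Under these assumptions, a dissociation constant far below the mixing concentration gives a value near 100%, and one far above it gives a value near 0%.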

After the binding reaction has reached equilibrium, we can measure the percentage of the DNA bound to the protein. If the protein binds with high affinity the value is close to 100%; if it binds with low affinity the value is close to 0%. In this way we can empirically determine the relative binding specificities (binding affinity for a particular sequence) of various proteins, assuming that we can generate DNA molecules of specific length and sequence (which we can) and purify proteins that remain properly folded in a native, rather than denatured or inactive, configuration (which may or may not be simple)286. What we discover is that transcription factors do not recognize unique nucleotide sequences, but rather have a range of affinities for related sequences. This binding preference is characteristic of each transcription factor protein; it involves both the length of the DNA sequence recognized and the pattern of nucleotides within that sequence. A simple approach to this problem treats the binding information present at each nucleotide position as independent of all others in the binding sequence, which is certainly not accurate but is close enough for most situations. These data are often presented as a “sequence logo”287. In such a plot, we indicate the amount of binding information, measured in bits, at each position along the length of the binding site. Where there is no preference, that is, where any of the four nucleotides is acceptable, the information present at that site is 0. Where either of two nucleotides is equally acceptable, the information is 1, and where only one particular nucleotide is acceptable, the information content is 2. Different transcription factor proteins produce different preference plots. As you might predict, mutations in a transcription factor binding site can have dramatically different effects. At sites containing no specific information (0), a mutation will have no effect, whereas at sites of high information (2), any change from the preferred nucleotide will likely produce a severe effect on binding affinity, and can lead to a dramatic change in gene expression.
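The values 0, 1, and 2 quoted above are what the Shannon information, in bits, gives for a four-letter alphabet. The following Python sketch computes it for one position, assuming the nucleotide frequencies at that position are known and ignoring the small-sample correction that logo software usually applies. The function name and the three example positions are invented for illustration.

```python
import math

def information_bits(freqs):
    """Information content (bits) at one binding-site position.

    freqs maps each nucleotide (A, C, G, T) to its observed frequency at that
    position among bound sequences; the frequencies should sum to 1.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a four-letter alphabet

# Hypothetical positions, invented for illustration only.
no_preference = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
two_allowed = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
one_allowed = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

for name, position in [("no preference", no_preference),
                       ("two allowed", two_allowed),
                       ("one allowed", one_allowed)]:
    print(name, round(information_bits(position), 3))
```

Real binding sites usually fall between these idealized cases, so most positions in a logo carry a fractional number of bits.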

This is not to say that proteins cannot be extremely specific in their binding to nucleic acid sequences. There is a class of proteins, the restriction endonucleases and site-specific DNA modification enzymes (methylases), that bind to unique nucleotide sequences. For example, the restriction endonuclease EcoRI binds to (and cleaves) the nucleotide sequence GAATTC; change any one of these bases and there is no significant binding and no cleavage. So the fact that transcription factors’ binding specificities are more flexible suggests that there is a reason for such flexibility, although exactly what that reason is remains conjectural.
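The all-or-nothing recognition described here amounts to exact string matching. A minimal Python sketch follows; the helper names and the example sequence are invented for illustration, and the model ignores everything about the enzyme except its recognition sequence.

```python
ECORI_SITE = "GAATTC"

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def find_exact_sites(dna, site=ECORI_SITE):
    """Return the 0-based start positions of every exact occurrence of site."""
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions

# Hypothetical sequence: one exact site, plus GAATTG, which differs by one base.
example = "TTGAATTCAAGAATTGCC"
print(find_exact_sites(example))
print(reverse_complement(ECORI_SITE) == ECORI_SITE)
```

Because GAATTC reads the same on both strands, scanning one strand is enough to locate every site. The single-base variant GAATTG in the example is not reported, in contrast to the graded preferences of transcription factors.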

An important point to take away from this discussion is that most transcription factor proteins also bind to generic DNA sequences with low affinity. This “non-sequence-specific” binding is transient, and such protein:DNA interactions are rapidly broken by thermal motion. That said, since there are huge numbers of such non-sequence-specific binding sites within a cell’s DNA, most of the time transcription factors are found transiently associated with DNA.

To be effective in recruiting a functional RNA polymerase complex to specific sites along a DNA molecule, the binding of a protein to a specific DNA sequence must be relatively long lasting. A common approach to achieving this outcome is for the transcription factor to be multivalent, that is, to bind to multiple (typically two) sequence elements. This has the effect that if the transcription factor dissociates from one binding site, it remains tethered to the other; since it is held close to the DNA, it is more likely to rebind to its original site. In contrast, a protein with a single binding site is more likely to diffuse away before rebinding can occur. A related behavior is that the low-affinity binding of proteins to DNA leads to one-dimensional diffusion along the length of the bound DNA molecule288. This enables a transcription factor protein to bind to DNA and then move back and forth along the DNA molecule until it encounters, and binds to, a high-affinity site (or until it dissociates completely). This type of “facilitated target search” behavior can greatly reduce the time it takes for a protein to find a high-affinity binding site among the millions of low-affinity sites present in the genome289.
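The sliding behavior can be illustrated with a toy random walk in Python. This is only a sketch under invented assumptions: DNA is a row of numbered positions, the protein steps left or right with equal probability, and it dissociates with a fixed probability per step. The function name and every parameter value are hypothetical and not calibrated to any real protein.

```python
import random

def slide_until_outcome(start, target, length, p_off, rng, max_steps=100_000):
    """One-dimensional random walk of a loosely bound protein along DNA.

    At each step the protein either dissociates (probability p_off) or moves
    one position left or right. Returns ("found", steps) if it reaches the
    high-affinity target, ("dissociated", steps) if it falls off first, or
    ("timeout", max_steps) if neither happens within max_steps.
    """
    position = start
    for step in range(max_steps):
        if position == target:
            return "found", step
        if rng.random() < p_off:
            return "dissociated", step
        position += rng.choice((-1, 1))
        position = max(0, min(length - 1, position))  # stay on the DNA at its ends
    return "timeout", max_steps

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
trials = 300
for distance in (5, 50):
    found = sum(
        slide_until_outcome(500 + distance, 500, 1000, 0.001, rng)[0] == "found"
        for _ in range(trials)
    )
    print(f"start {distance} positions away: found in {found / trials:.0%} of trials")
```

In this toy model a protein that lands near the target tends to reach it before falling off, while one that lands far away tends to dissociate first, which is why sliding helps most over short stretches of DNA.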

As the conditions in which an organism lives get more complex, gene expression needs to be more dynamic. This is particularly the case in multicellular eukaryotes, where different cell types need to express different genes, or different versions (splice variants) of genes. One approach is to have different gene regulatory regions that bind different sets of transcription factors. Such regulatory factors not only bind to DNA, they also interact with one another. We can imagine that the binding affinity of a particular transcription factor will be influenced by the presence of another transcription factor already bound to an adjacent or overlapping site on the DNA. Similarly, the structure of a protein can change when it is bound to DNA, and such a change can lead to interactions with DNA:protein complexes located at more distant sites, known as enhancers. Such regulatory elements can be part of multiple regulatory systems.

For example, consider the following situation. Two genes share a common enhancer; depending upon which interaction occurs, gene a or gene b, but not both, could be active. The end result is that combinations of transcription factors are involved in turning gene expression on and off. In some cases, the same protein can act either positively or negatively, depending upon context, that is, the specific gene regulatory sequences accessible, the other transcription factors expressed, and their various post-translational modifications. Here it is worth noting that the organization of regulatory and coding sequences in DNA imposes directionality on the system. A transcription factor bound to DNA in one orientation or at one position may block the binding of other proteins (or RNA polymerase), while bound to another site it might stabilize protein (RNA polymerase) binding. Similarly, DNA-binding proteins can interact with other proteins to control chromatin configurations that can facilitate or block accessibility to regulatory sequences. While it is common to see a particular transcription factor protein labelled as either a transcriptional activator or a repressor, in reality the activity of a protein often reflects the specific gene and its interactions with various accessory factors, all of which can influence gene expression.

The exact place where RNA polymerase starts transcribing RNA is known as the transcription start site. Where it falls off the DNA, and so stops transcribing RNA, is known as the transcription termination site. As transcription initiates, the RNA polymerase moves away from the transcription start site. Once the RNA polymerase complex moves far enough away (clears the start site), there is room for another polymerase complex to associate with the DNA, through interactions with transcription factors. Assuming that the regulatory region and its associated factors remain intact, the time to load a new polymerase will be shorter than the time it takes to build up a new regulatory complex from scratch. This is one reason that transcription is often found to occur in bursts: a number of RNAs are synthesized from a particular gene in a short time period, followed by a period of transcriptional silence. As mentioned above, a similar bursting behavior is observed in protein synthesis.
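The argument for bursting can be sketched as a two-state toy model in Python. The assumptions are invented for illustration: time advances in discrete steps, the regulatory complex is either assembled or not, and polymerase loads only while it is assembled. The function name and all probabilities are hypothetical.

```python
import random

def simulate_bursts(steps, p_assemble, p_disassemble, p_load, rng):
    """Two-state toy model of transcription initiation.

    The regulatory complex is either assembled or not. While assembled, a new
    polymerase loads with probability p_load per time step; the complex falls
    apart with probability p_disassemble and re-forms with p_assemble.
    Returns the list of time steps at which a transcript was started.
    """
    assembled = False
    starts = []
    for t in range(steps):
        if assembled:
            if rng.random() < p_load:
                starts.append(t)
            if rng.random() < p_disassemble:
                assembled = False
        elif rng.random() < p_assemble:
            assembled = True
    return starts

rng = random.Random(1)  # fixed seed so repeated runs give the same numbers
starts = simulate_bursts(2000, p_assemble=0.005, p_disassemble=0.05,
                         p_load=0.5, rng=rng)
gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
print(len(starts), "transcripts; longest silent gap:", max(gaps, default=0))
```

Because loading is fast relative to assembly in this sketch (`p_load` is much larger than `p_assemble`), transcript start times cluster into short runs separated by long silent intervals.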


282 As an aside, there are many transcribed DNA sequences that do not appear to encode polypeptides or regulatory RNAs. It is not clear whether this transcription is an error due to molecular-level noise, or whether such RNAs play a physiological role.

283 Regulatory regions located far from the gene’s transcribed region are known as enhancer elements.

284 In prokaryotes transcription factors are often referred to as sigma (σ) factors.

285 Determining the specificity of protein-DNA interactions:

286 Of course, we are assuming that the physiologically significant aspect of protein binding involves only the DNA, rather than DNA in the context of chromatin, and we are ignoring the effects of other proteins; still, it is a good initial assumption.

287 Sequence logos: a new way to display consensus sequences:

288 As illustrated in the PhET applet:

289 Physics of protein-DNA interactions: mechanisms of facilitated target search:


  • Michael W. Klymkowsky (University of Colorado Boulder) and Melanie M. Cooper (Michigan State University) with significant contributions by Emina Begovic & some editorial assistance of Rebecca Klymkowsky.