The DNA Double Helix and its Replication
In this module, we discuss the replication of DNA—one of the key requirements for a living system to regenerate and create the next generation. Let us first briefly consider the problem through a literary analogy.
The human genome comprises roughly 6.5 billion base pairs of DNA if one considers the full diploid genome (i.e., if you count the DNA inherited from both parents). Six point five billion looks like this: 6,500,000,000. That's a large number. To get a better idea of what that number means, imagine that our DNA is a set of written instructions for constructing one of us. By analogy, we can then compare it to another written document. For this example, we begin by considering Tolstoy's War and Peace, a novel many people know for its length. Wikipedia estimates that War and Peace contains about 560,000 words. A second written work many are familiar with is the seven-volume Harry Potter series by J.K. Rowling. This work checks in at ~1,080,000 words (Referenced Statistics on Wikipedia). If we assume that the length of the average English word is five characters, the two literary works are 2.8 million and 5.4 million characters, respectively. Therefore, even all seven volumes of "Harry Potter" have over 1000x fewer characters than our own genomes. The number of characters in these novels is, however, much closer to the number of nucleotides in a typical bacterial genome.
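The arithmetic behind this comparison can be checked with a short sketch. The word counts and the five-character average are the figures quoted above; the variable and function names are illustrative only.

```python
# Figures quoted in the text; the 5-character average word length is the
# text's own simplifying assumption, not a measured value.
AVG_CHARS_PER_WORD = 5
WORD_COUNTS = {
    "War and Peace": 560_000,
    "Harry Potter (7 volumes)": 1_080_000,
}
DIPLOID_GENOME_BP = 6_500_000_000  # base pairs in the diploid human genome


def character_count(words: int) -> int:
    """Estimate the number of characters from a word count."""
    return words * AVG_CHARS_PER_WORD


for title, words in WORD_COUNTS.items():
    chars = character_count(words)
    ratio = DIPLOID_GENOME_BP / chars
    print(f"{title}: {chars:,} characters; the genome is ~{ratio:,.0f}x larger")
```

Under these assumptions, the genome is more than 1000 times longer than either work, which is the point the analogy is making.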
Now imagine for a moment developing a machine or mechanical process (not an electronic process) that reads and copies these books. Or imagine yourself copying these texts. How fast could you do it? How many mistakes are you likely to make? Do you expect there to be a trade-off between the speed at which you can copy and the accuracy? What type of resources does this process need? How much energy does the process require? Now imagine copying something 1000x larger! Oh, and just for good measure, your imaginary mechanical device needs to do its work on text that is ~25Å wide (i.e., 0.0000000025 meters wide). By comparison, a typical ten point font is ~0.00025 meters wide, about 100,000x larger than the width of a DNA base pair.
With that in mind, it is worth noting that a human cell can take about 24 hours to divide (DNA replication must therefore be a little faster). A healthy E. coli cell may take only 20 minutes to divide (including replicating its ~4.5 million base pair genome). Both the human and bacterium do this while typically making few enough mistakes that the subsequent generation remains viable and recognizable. That should seem rather amazing! Now consider that we estimate the human body comprises ~10 trillion cells (10,000,000,000,000) and that it may have between two and ten times that number of microbial residents. That's a lot of cell division to consider.
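To get a feel for the speeds involved, the numbers above can be turned into a rough minimum average copying rate. This is only a back-of-the-envelope sketch: it assumes that copying fills the entire division time and that one round of replication happens per division, which is a simplification. The two-fork option is also an assumption, based on one origin of replication producing two forks.

```python
def min_average_rate(genome_bp: float, seconds: float, forks: int = 1) -> float:
    """Base pairs copied per second (per fork) if copying fills the whole time."""
    return genome_bp / seconds / forks


ECOLI_BP = 4_500_000             # genome size quoted in the text
ECOLI_DIVISION_S = 20 * 60       # 20 minutes, in seconds
HUMAN_BP = 6_500_000_000         # diploid genome size quoted in the text
HUMAN_DIVISION_S = 24 * 60 * 60  # 24 hours, in seconds

ecoli_total = min_average_rate(ECOLI_BP, ECOLI_DIVISION_S)
ecoli_per_fork = min_average_rate(ECOLI_BP, ECOLI_DIVISION_S, forks=2)
human_total = min_average_rate(HUMAN_BP, HUMAN_DIVISION_S)

print(f"E. coli, whole genome: {ecoli_total:,.0f} bp/s")
print(f"E. coli, per fork (two forks assumed): {ecoli_per_fork:,.0f} bp/s")
print(f"Human cell, whole genome: {human_total:,.0f} bp/s")
```

Even under these generous assumptions, the E. coli machinery would need to copy thousands of base pairs every second, which gives some sense of the speed and accuracy problem the cell has to solve.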
If the cell is to replicate—its ultimate goal—a copy of the DNA must be created. So one clear problem statement/question is "how can the cell effectively copy its DNA?" Given the analogy above, here are some relevant sub-questions: What are the chemical and physical properties that enable DNA to be copied? With what fidelity must the organism copy its genome? What speed must it be copied at? Where does the energy come from for this task and how much is necessary? Where do the "raw materials" come from? How do the molecular machines involved in this process couple the assembly of raw materials and the energy required to build a new DNA molecule together? The list could, of course, go on.
In the following discussion and in the lecture, we examine how the process of DNA replication is accomplished while keeping in mind some driving questions. As you go through the reading and lecture materials, try to be constantly aware of these and other questions associated with this process. Use these questions as guideposts for organizing your thoughts and try to find matches between the "facts" that you think we might expect you to know and the driving questions.
The DNA double helix
To build some extra context, we also need a little empirically determined knowledge. Perhaps one of the best-known and popular features of the hereditary form of the DNA molecule is that it has a double-helical secondary structure. Our appreciation of the double-helical structure of DNA dates to the 1950s. For more on this story, see the short film here.
Models of the structure of DNA revealed that the molecule comprises two strands of covalently linked nucleotides that are twisted around each other to form a right-handed helix. In each strand, nucleotides are covalently joined to two other nucleotides (except at the very ends of a linear strand) via phosphodiester bonds that link the sugars via the 5' and 3' hydroxyl groups (panel b in Figure 1). Recall that the labels 5' and 3' refer to the carbons on the sugar molecule. These sugar and phosphate chains form a contiguous set of covalent links that are often referred to as the "backbone" of the structure. In a linear molecule, each strand has two free ends. We call one free end the 5' end because the unlinked functional group that is typically involved in joining nucleotides is the phosphate linked to the 5' carbon. We call the other end of the strand the 3' end because the unlinked functional group that is typically involved in joining nucleotides is the hydroxyl group linked to the 3' carbon of the sugar. Since the two ends of the strand are not symmetrical, it is easy to designate a direction on the strand: one can, for instance, say that they are reading from the 5' end to the 3' end to indicate that they are "walking" along the strand starting at the 5' end and moving towards the 3' end. This direction (5' to 3') is the convention used by most biologists. One can read in the opposite direction (3' to 5') provided we make the direction explicit. We find the two strands of covalently linked nucleotides to be anti-parallel to one another in the double helix; that is, the orientation/direction of one strand is opposite to that of the other strand (panel b in Figure 1). The backbone is structurally on the "outside" of the double helix, creating a band of negative charges on the surface.
The nitrogenous bases of each of the antiparallel strands stack on the inside of the structure and oppose one another in a way that allows hydrogen bonds between unique purine/pyrimidine pairs (A pairing with T and G pairing with C) to form. We call these specific base pairings complementary base pairs. Thus, we refer to the paired strands of a double helix as complementary strands.
Complementary strands carry redundant information. Because of the strict chemical pairing, if you know the sequence of one strand, you necessarily know the sequence of its complement. Take, for example, the sequence 5′- C A T A T G G G A T G - 3′. Note how the sequence is annotated with the orientation (indicated by 5' and 3' labels). The complement of this sequence, written according to the 5' to 3' convention, is: 5′- C A T C C C A T A T G - 3′. If you aren't convinced, write these two sequences out across from one another in your notes, writing them as antiparallel strands. Note that the twisting of the two complementary strands around each other results in the formation of structural features called the major and minor grooves that will become more important when we discuss the binding of proteins to DNA (panel c in Figure 1).
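The pairing rule above can also be written as a short sketch. The function name is illustrative; the only biology it encodes is A-T and G-C pairing and the antiparallel arrangement, which is why the sequence is reversed before it is complemented.

```python
# Complementary base pairs: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq_5_to_3: str) -> str:
    """Return the complementary strand, also written 5' to 3'.

    The strands are antiparallel, so the complement is read in the
    opposite direction: walk the input from its 3' end to its 5' end.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq_5_to_3))


template = "CATATGGGATG"
print(reverse_complement(template))  # CATCCCATATG
```

Applying the function twice returns the original sequence, which is another way of seeing that the two strands carry redundant information.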
Most of the BIS2A instructors will expect you to recognize the key structural features depicted in the figure below and to be able to create a basic figure of the structure of DNA yourself.
Figure 1. DNA has (a) a double helix structure and (b) phosphodiester bonds. The (c) major and minor grooves are binding sites for DNA binding proteins during processes such as transcription (the creation of RNA from a DNA template) and replication.
Possible NB Discussion Point
Take a moment to review the nitrogenous bases in Figure 1. Identify functional groups as described in class. For each functional group identified, describe what type of chemistry you expect it to be involved in. Try to identify whether the functional group can act as a hydrogen bond donor, an acceptor, or both.
At around the time the double-helical structure was described, three hypotheses for the mode of DNA replication were being considered. The models for replication were known as: the conservative model, the semi-conservative model, and the dispersive model.
1. Conservative: The conservative model of replication postulated that each whole double-stranded molecule could act as a template for the synthesis of a new double-stranded molecule. If one were to put a chemical tag on the template DNA molecule, then after replication none of that tag would be found on the new copy.
2. Semi-conservative: This hypothesis stipulated that each individual strand of a DNA molecule could serve as a template for a new strand with which it would then associate. In this case, if a chemical label were placed on a double-stranded DNA molecule, one strand in each of the copies would keep the label.
3. Dispersive: This model proposed that a copied double helix would piecewise combine continuous segments of "old" and "new" strands. If a chemical label were placed on a DNA molecule that was copied using a dispersive mechanism, one would find discrete segments of the resulting copy that were labeled on both strands, separated by completely unlabeled parts.
Meselson and Stahl resolved the issue in 1958 when they reported the results of a now famous experiment (described on Wikipedia), which showed that DNA replication is semi-conservative (Figure 2): each strand is used as a template for the creation of a new strand. To learn more about this experiment, watch The Meselson-Stahl Experiment.
Figure 2. DNA has an antiparallel double helix structure, the nucleotide bases are hydrogen bonded together, and each strand complements the other. DNA is replicated in a semi-conservative manner: each strand is used as the template for a newly made strand.
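The three replication models described above make different predictions about where a chemical label ends up, and a small bookkeeping sketch can make those predictions explicit. This is a toy model, not a description of the actual experiment: each double helix is represented as a pair of numbers giving the fraction of each strand that carries the original label, and the dispersive rule assumes (for simplicity) that the label is shared evenly among all four daughter strands.

```python
from fractions import Fraction

ZERO = Fraction(0)


def conservative(duplex):
    # The original duplex stays intact; the copy is entirely new material.
    return [duplex, (ZERO, ZERO)]


def semi_conservative(duplex):
    # Each old strand pairs with one newly made, unlabeled strand.
    old_1, old_2 = duplex
    return [(old_1, ZERO), (old_2, ZERO)]


def dispersive(duplex):
    # Assumption: label is split evenly across all four daughter strands.
    share = sum(duplex) / 4
    return [(share, share), (share, share)]


def label_per_duplex(model, generations):
    """Fraction of label in each duplex after the given number of rounds."""
    population = [(Fraction(1), Fraction(1))]  # start fully labeled
    for _ in range(generations):
        population = [d for duplex in population for d in model(duplex)]
    return sorted(sum(duplex) / 2 for duplex in population)


for model in (conservative, semi_conservative, dispersive):
    fractions = [str(f) for f in label_per_duplex(model, 2)]
    print(f"{model.__name__}: {fractions}")
```

In this toy model, the conservative scheme differs from the other two after one round, while the semi-conservative and dispersive schemes only separate after a second round, when the semi-conservative scheme produces completely unlabeled molecules alongside half-labeled ones.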
Having established some basic structural features and the need for a semi-conservative mechanism, it is important to understand some of what we know about the process and to think about what questions one might want to answer if they are to better understand what is going on.
Since DNA replication is a process, we can invoke the energy story rubric to think about it. Recall that the energy story rubric is there to help us think systematically about processes (how things go from A to B). In this case the process in question is the act of starting with one double-stranded DNA molecule and ending up with two double-stranded molecules. So, we will ask a variety of questions: What does the system look like at the beginning (matter and energy) of replication? How are matter and energy transferred in the system, and what catalyzes the transfers? What does the system look like at the end of the process? We can also ask questions regarding specific events that MUST happen during the process. For instance, since DNA is a long molecule and it is sometimes circular, we can ask basic questions like, where does the process of replication start? Where does it end? We can also ask practical questions about the process like, what happens when a double-stranded structure is unwound?
We consider some of these key questions in the text and in class and encourage you to do the same.
Requirements for DNA replication
Let's start by listing some basic functional requirements for DNA replication that we can infer just by thinking about the process that must happen and/or be required for the replication to happen. So, what do we need?
• We know that DNA is composed of nucleotides. If we want to create a new strand, we will need a source of nucleotides.
• We can infer that building a new strand of DNA will require an energy source—we should try to find this.
• We can infer that there must be a process for finding a place to start replication.
• We can infer that there will be one or more enzymes that help catalyze the process of replication.
• We can also infer that, since this is a biochemical process, it will make some mistakes.
Nucleotide structure review
Recall some basic structural features of the nucleotide building blocks of DNA. The nucleotides start off as nucleotide triphosphates. Nucleotides are composed of a nitrogenous base, deoxyribose (five-carbon sugar), and a phosphate group. We name the nucleotide according to its nitrogenous base, purines such as adenine (A) and guanine (G), or pyrimidines such as cytosine (C) and thymine (T). Recall the structures below. Note that the nucleotide Adenosine triphosphate (ATP) is a precursor of the deoxyribonucleotide (dATP) which is incorporated into DNA.
Figure 3. Each nucleotide is made up of a sugar (ribose or deoxyribose depending on whether it builds RNA or DNA, respectively), a phosphate group, and a nitrogenous base. The purines have a double ring structure with a six-membered ring fused to a five-membered ring. Pyrimidines are smaller in size; they have a single six-membered ring structure. The carbon atoms of the five-carbon sugar are numbered 1', 2', 3', 4', and 5' (1' is read as “one prime”). The phosphate residue is attached to the hydroxyl group of the 5' carbon of one sugar of one nucleotide and the hydroxyl group of the 3' carbon of the sugar of the next nucleotide, thereby forming a 5'-3' phosphodiester bond.
Initiation of replication
Where along the DNA does the replication machinery start DNA replication?
With millions, if not billions, of nucleotides to copy, how does the DNA polymerase know where to start? This process turns out not to be random. There are specific nucleotide sequences called origins of replication along the DNA at which replication begins. Once this site is identified, however, there is a problem. The DNA double helix is held together by base stacking interactions and hydrogen bonds. If each strand must be read and copied individually, there must be some mechanism responsible for helping to dissociate the two strands from one another. Energetically, this is an endergonic process. Where does the energy come from, and how is this reaction catalyzed? Basic reasoning should lead to the hypothesis that a protein catalyst is likely involved, and that this enzyme either creates new bonds that are energetically more favorable (exergonic) than the bonds it breaks AND/OR couples the use of an external energy source to help dissociate the strands.
It turns out that the details of this process and the proteins involved differ depending on the specific organism in question, and many of the molecular-level details are not completely understood. There are, however, some common features in the replication of eukaryotes, bacteria, and archaea, and one of these features is that the process involves multiple different types of proteins in replicating DNA. First, proteins called "initiators" can bind DNA at or very near origins of replication. The interaction of the initiator proteins with the DNA helps to destabilize the double helix and also helps to recruit other proteins, including an enzyme called a DNA helicase, to the DNA. Here the energy required to destabilize the DNA double helix seems to come from the formation of new associations between the DNA and the initiator proteins and among the initiator proteins themselves. The DNA helicase is a multi-subunit protein important in the process of replication because it couples the exergonic hydrolysis of ATP to the unwinding of the DNA double helix. Additional proteins must be recruited to the initiation complex (the collection of proteins involved in initiating replication). These include, but are not limited to, additional enzymes called primase and DNA polymerase. While the initiators depart soon after the initiation of replication, the rest of the proteins work in concert to execute the process of DNA replication. This complex of enzymes functions at Y-shaped structures in the DNA called replication forks (Figure 4). For any replication event, two replication forks can form at each origin of replication, extending in both directions. Multiple origins of replication can be found on eukaryotic chromosomes and in some archaea, while the genome of the bacterium E. coli seems to encode one origin of replication.
Figure 4. At the origin of replication, a replication bubble forms. The replication bubble is composed of two replication forks, each traveling in opposite directions along the DNA. It is understood that the replication forks include all the enzymes required for replication to occur—they are just not drawn explicitly in the figure to provide room to illustrate the relationships between the template and new DNA strands.
Attribution: BIS2A team original image
Elongation of replication
The melting open of the DNA double helix and the assembly of the DNA replication complex are just the first steps in the process of replication. Now the process of creating a new strand actually needs to get started. Here, we encounter additional challenges. The first obvious issue is that of determining which of the two strands should get copied at any replication fork (i.e., which strand will serve as a template for semi-conservative synthesis? Are both strands equally viable alternatives?). There is also the problem of getting the process of new strand synthesis started. Can the DNA polymerase start the new strand on its own? We will discuss the answer to the latter question, along with some rationale and consequences, later. The key idea to note at this point is that it has been experimentally determined that DNA polymerase can NOT start strand synthesis on its own. Rather, DNA polymerase requires a short stretch of double-stranded structure followed by a single-stranded template. The enzyme primase creates a short oligonucleotide polymer of RNA (not DNA) called a primer (these are depicted by short green lines in the figures above and below). DNA polymerase uses the primer to nucleate and grow a new strand.
During the process of strand elongation, the DNA polymerase polymerizes a new covalently linked strand of DNA nucleotides (in bacteria this specific enzyme may be called DNA polymerase III; in eukaryotes, polymerase nomenclature is more complex and the roles of several polymerase proteins are not completely understood). It turns out that one strand is favored over the other to serve as a template. DNA polymerase will "read" the template strand from 3' to 5' and synthesize a new strand in the 5' to 3' direction. Hypotheses to explain this universal observation usually center on the energetics associated with the addition of a new nucleotide and arguments associated with DNA repair that we will describe shortly. Let us, therefore, briefly consider the reaction involving the addition of a single nucleotide. The primer provides an important 3' hydroxyl on which to begin synthesis. The next deoxyribonucleotide triphosphate enters the binding site of the DNA polymerase and, as shown in Figure 5 below, is oriented by the polymerase such that a hydrolysis of the 5' triphosphate can occur. This reaction releases pyrophosphate and couples the exergonic hydrolysis of the phosphoanhydride to the synthesis of a phosphodiester bond between the 5' phosphate of the incoming nucleotide and the 3' hydroxyl group of the primer. This process repeats until deoxyribonucleotide triphosphates run out or the replication complex falls off of the DNA. In effect, DNA polymerase adds the phosphate group (5') from the incoming nucleotide to the existing hydroxyl group (3') of the previously added nucleotide.
Correct base pairing, or selection of the correct nucleotide to add at each step, is accomplished by structural constraints felt by the DNA polymerase and the energetically favorable hydrogen bonds formed between complementary nucleotides. The process is energetically driven by the hydrolysis of the incoming 5' triphosphate and the energetically favorable interactions formed between nucleotides in the growing double helix (base stacking and complementary base pairing hydrogen bonds). Note that the energetics of nucleotide addition do not technically prevent a strand from growing in the 3' to 5' direction. The key difference in this “backwards” synthesis scheme is that the energy “source” for synthesis would need to come from a nucleotide already incorporated into the growing strand rather than from the new incoming nucleotide (why this might be an important selective disadvantage is discussed briefly). After elongation has started, a different DNA polymerase (in bacteria we usually call this enzyme DNA polymerase I) comes in to remove the RNA primer and to synthesize the remaining bit of missing DNA.
As discussed in more detail in class, the movement of the replication fork induces winding of the DNA in both directions of replication. Another ATP-consuming enzyme called topoisomerase helps to relieve this stress.
Figure 5. DNA polymerase catalyzes the addition of the 5' phosphate group from an incoming nucleotide to the 3' hydroxyl group of the previous nucleotide. This process creates a phosphodiester bond between the nucleotides while hydrolyzing the phosphoanhydride bond in the nucleotide.
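The nucleotide-addition cycle described above can be mimicked with a toy sketch. It is a bookkeeping exercise rather than a model of the enzyme: the function and variable names are hypothetical, the primer is written with DNA letters for simplicity (a real primer is RNA), and the dNTP pool is just a dictionary of counts. The sketch reads the template from its 3' end towards its 5' end, adds each complementary nucleotide to the 3' end of the growing strand, and counts one pyrophosphate released per nucleotide added.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extend_primer(template_5_to_3: str, primer_5_to_3: str, dntp_pool: dict):
    """Grow a new strand 5' to 3' from a primer paired at the template's 3' end.

    Returns the new strand (written 5' to 3') and the number of
    pyrophosphates released. Synthesis stops early if the pool runs out
    of the nucleotide required next.
    """
    new_strand = list(primer_5_to_3)
    pyrophosphates = 0
    # The part of the template not yet paired with the primer.
    unpaired = template_5_to_3[: len(template_5_to_3) - len(primer_5_to_3)]
    # The template is read 3' to 5', so walk the unpaired region backwards.
    for template_base in reversed(unpaired):
        incoming = COMPLEMENT[template_base]
        if dntp_pool.get(incoming, 0) == 0:
            break  # the required dNTP has run out
        dntp_pool[incoming] -= 1
        new_strand.append(incoming)  # added to the 3' end of the new strand
        pyrophosphates += 1
    return "".join(new_strand), pyrophosphates


pool = {"A": 10, "T": 10, "G": 10, "C": 10}
strand, released = extend_primer("CATATGGGATG", "CAT", pool)
print(strand, released)
```

With a plentiful pool, the finished strand is the full complement of the template; with a depleted pool, synthesis stalls at the first position whose nucleotide is missing.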
Leading and lagging strand
The discussion above about strand elongation describes the process of new strand synthesis if that strand is synthesized in the same direction as the replication fork is, or appears to be, moving along the DNA. This strand can be synthesized continuously and is called the leading strand. However, both strands of the original DNA double helix must be copied. Since the DNA polymerase can only synthesize DNA in a 5' to 3' direction, the polymerization of the strand opposite the leading strand must occur in the direction opposite to that in which the helicase, or the front of the replication fork, is traveling. This strand is called the lagging strand and, due to geometric constraints, it must be synthesized through a series of RNA priming and DNA synthesis events into short segments called Okazaki fragments. As noted, the initiation of synthesis of each Okazaki fragment requires a primase to synthesize an RNA primer, and each of these RNA primers must be removed and replaced with DNA nucleotides by a different DNA polymerase. The covalent bonds between adjacent Okazaki fragments cannot be made by the DNA polymerase and must therefore be formed by yet another enzyme called DNA ligase. The geometry of lagging strand synthesis is difficult to visualize and will be covered in class.
Figure 6. The lagging strand is created in multiple segments. A replication fork shows the leading and lagging strand. A replication bubble shows the leading and lagging strands.
Attribution: BIS2A team original image
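The bookkeeping implied by lagging strand synthesis can be sketched in a few lines. The fragment length used here is a hypothetical round number chosen only for illustration, and the function name is made up for this example. The sketch simply counts events for one stretch of lagging strand: one RNA primer per Okazaki fragment, and one ligase-sealed join between each pair of adjacent fragments.

```python
import math


def lagging_strand_events(template_length: int, fragment_length: int) -> dict:
    """Count priming and ligation events for one stretch of lagging strand."""
    if template_length <= 0 or fragment_length <= 0:
        raise ValueError("lengths must be positive")
    fragments = math.ceil(template_length / fragment_length)
    return {
        "okazaki_fragments": fragments,
        "rna_primers": fragments,       # one primer starts each fragment
        "ligase_joins": fragments - 1,  # joins between adjacent fragments
    }


# Hypothetical numbers: 10,000 nucleotides copied in 1,000-nucleotide fragments.
print(lagging_strand_events(10_000, 1_000))
```

By contrast, a continuously synthesized leading strand over the same stretch would, in this simplified picture, need only a single primer.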