# Lecture 12: DNA structure, replication & recombination Central Dogma

## The scope of the problem

In this module we discuss the replication of DNA—one of the key requirements for a living system to regenerate and create the next generation. Let us first briefly consider the scope of the problem by way of a literary analogy.

The human genome consists of roughly 6.5 billion base pairs of DNA if one considers the full diploid genome (i.e., if you count the DNA inherited from both parents). Six point five billion looks like this: 6,500,000,000. That's a large number. To get a better idea of what that number means, imagine that our DNA is a set of written instructions for constructing one of us. By analogy we can then compare it to another written document. For this example we begin by considering Tolstoy's War and Peace, a novel many people know for its voluminous nature. Wikipedia estimates that War and Peace contains about 560,000 words. A second written work many are familiar with is the seven-volume Harry Potter series by J.K. Rowling, which checks in at ~1,080,000 words (statistics referenced on Wikipedia). If we assume that the average English word is five characters long, the two literary works are 2.8 million and 5.4 million characters in length, respectively. Therefore, even all seven volumes of Harry Potter have over 1000x fewer characters than our own genomes. The number of characters in these novels is, however, much closer to the number of nucleotides in a typical bacterial genome.
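The comparison above is simple arithmetic, and it can be checked with a few lines of Python. This is only a sketch: the word counts and the five-characters-per-word average are the figures quoted in the text, and the function names are made up for illustration.

```python
# Figures quoted in the text; the five-characters-per-word average is the
# text's own simplifying assumption.
GENOME_BP = 6_500_000_000  # diploid human genome, in base pairs
CHARS_PER_WORD = 5
WORD_COUNTS = {
    "War and Peace": 560_000,
    "Harry Potter (7 volumes)": 1_080_000,
}


def character_count(words, chars_per_word=CHARS_PER_WORD):
    """Approximate number of characters in a book of the given word count."""
    return words * chars_per_word


def genome_to_book_ratio(words):
    """How many times longer the genome is than the book, in characters."""
    return GENOME_BP / character_count(words)


for title, words in WORD_COUNTS.items():
    print(f"{title}: {character_count(words):,} characters; "
          f"the genome is ~{genome_to_book_ratio(words):,.0f}x longer")
```

Running it reproduces the 2.8 million and 5.4 million character counts and shows that the genome is more than 1000 times longer than either work.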

Now imagine for a moment developing a machine or mechanical process (not an electronic process) that is responsible for reading and copying these books. Or imagine yourself copying these texts. How fast could you do it? How many mistakes are you likely to make? Do you expect there to be a trade-off between the speed at which you can copy and the accuracy? What type of resources does this process need? How much energy is required? Now imagine copying something 1000x larger! Oh, and just for good measure, your imaginary mechanical device needs to do its work on text that is ~25Å wide (i.e., 0.0000000025 meters wide). By comparison, a typical ten point font is ~0.00025 meters wide, about 100,000x larger than the width of a DNA base pair.

With that in mind, it is worth noting that a human cell can take about 24 hours to divide (DNA replication must therefore be a little faster). A healthy E. coli cell may take only 20 minutes to divide (including replicating its ~4.5 million base pair genome). Both the human and the bacterium do this while typically making few enough mistakes that the subsequent generation remains viable and recognizable. That should seem rather amazing! Now consider that the human body is estimated to consist of ~10 trillion cells (10,000,000,000,000) and that it may have between two and ten times that number of microbial residents. That's a lot of cell division to consider.

Design challenge

If the cell is to replicate—its ultimate goal—a copy of the DNA must be created. So one clear problem statement/question is "how can the cell effectively copy its DNA?" Given the analogy above, here are some relevant subquestions: What are the chemical and physical properties that enable DNA to be copied? With what fidelity must the DNA be copied? What speed must it be copied at? Where does the energy come from for this task and how much is necessary? Where do the "raw materials" come from? How do the molecular machines involved in this process couple the assembly of raw materials and the energy required to build a new DNA molecule together? The list could, of course, go on.

In the following discussion and in lecture we will be interested in starting to examine how the process of DNA replication is accomplished while keeping in mind some of the driving questions. As you go through the reading and lecture materials, try to constantly be aware of these and other questions associated with this process. Use these questions as guideposts for organizing your thoughts and try to find matches between the "facts" that you think you might be expected to know and the driving questions.

## The DNA double helix

To build some extra context we also need a little bit of empirically determined knowledge. Perhaps one of the best known and most popular features of the hereditary form of the DNA molecule is that it has a double helical tertiary structure. The appreciation of this dates to the 1950s. The story of this discovery has been widely recounted, and the details are beyond the scope of this text. Briefly, Francis Crick and James Watson are credited with determining the structure of DNA. Rosalind Franklin is now also widely credited with generating critical X-ray diffraction data that enabled Watson and Crick to piece together the puzzle of the DNA molecule.

Complementary strands carry redundant information. Because of the strict chemical pairing, if you know the sequence of one strand you necessarily know the sequence of its complement. Take for example the sequence 5′- C A T A T G G G A T G - 3′. Note how the sequence is annotated with its orientation (indicated by the 5' and 3' labels). The complement of this sequence, written according to the 5' to 3' convention, is: 5′- C A T C C C A T A T G - 3′. If you aren't convinced, write these two sequences out across from one another in your notes, making sure to write them as antiparallel strands. Note that the twisting of the two complementary strands around each other results in the formation of structural features called the major and minor grooves that will become more important when we discuss the binding of proteins to DNA (panel c in Figure 1).
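The pairing rule can be checked mechanically. The short sketch below (the function name is ours, not part of the course material) complements each base and reverses the order so that the result is also written 5' to 3'.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq_5_to_3):
    """Return the complementary strand, also written 5' to 3'.

    The strands are antiparallel, so the sequence is read backwards while
    each base is swapped for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq_5_to_3))


print(reverse_complement("CATATGGGATG"))  # CATCCCATATG
```

Applying the function twice returns the original sequence, which is another way of seeing that the two strands carry the same information.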

Most of the BIS2A instructors will expect you to recognize key structural features depicted in the figure below and that you will be able to create a basic figure of the structure of DNA yourself.

Figure 1. DNA has (a) a double helix structure and (b) phosphodiester bonds. The (c) major and minor grooves are binding sites for DNA binding proteins during processes such as transcription (the creation of RNA from a DNA template) and replication.

At around the same time, three hypotheses for the modes of DNA replication were being considered. The models for replication were known as: the conservative model, the semi-conservative model, and the dispersive model.

1. Conservative: The conservative model of replication postulated that each whole double-stranded molecule could act as a template for the synthesis of a completely new double-stranded molecule. That is, if one were to put a chemical tag on the template DNA molecule after replication none of that tag would be found on the new copy.
2. Semi-conservative: This hypothesis stipulated that each individual strand of a DNA molecule could serve as a template for a new strand, with which it would then associate. In this case, if a chemical label were placed on a double-stranded DNA molecule, one strand on each of the copies would retain the label.
3. Dispersive: This model proposed that a copied double helix would be a piecewise combination of continuous segments of "old" and "new" strands. If a chemical label were placed on a DNA molecule that was copied using a dispersive mechanism, one would find discrete segments of the resulting copy that were labeled on both strands separated by completely unlabeled parts.

Meselson and Stahl resolved the issue in 1958 when they reported results of a now famous experiment (described on Wikipedia) which showed that DNA replication is semi-conservative (Figure 2), where each strand is used as a template for the creation of the new strand. To learn more about this experiment watch The Meselson-Stahl Experiment.

Figure 2. DNA has an antiparallel double helix structure: the nucleotide bases are hydrogen bonded together and each strand complements the other. DNA is replicated in a semi-conservative manner, with each strand used as the template for a newly made strand.
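The labeling prediction of the semi-conservative model can be traced with a toy simulation. This is only a sketch of the bookkeeping, not of the actual experiment: a molecule is represented as a pair of strands tagged "old" (labeled) or "new" (unlabeled), and all names are illustrative.

```python
from collections import Counter


def replicate_semiconservative(molecules):
    """Copy every molecule: each existing strand is kept and paired with
    one freshly made, unlabeled ("new") strand."""
    daughters = []
    for strand_a, strand_b in molecules:
        daughters.append((strand_a, "new"))
        daughters.append((strand_b, "new"))
    return daughters


def label_classes(molecules):
    """Count molecules by how many labeled ("old") strands they contain."""
    return Counter(sum(strand == "old" for strand in molecule)
                   for molecule in molecules)


# Start with a single molecule labeled on both strands.
population = [("old", "old")]
for generation in range(1, 4):
    population = replicate_semiconservative(population)
    print(generation, dict(label_classes(population)))
```

After one round every molecule carries exactly one labeled strand. In later rounds the two labeled strands persist while the number of fully unlabeled molecules grows, which is the pattern the semi-conservative hypothesis predicts.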

## DNA replication

Having established some basic structural features and the need for a semi-conservative mechanism, it is important to begin understanding some of what is known about the process and to think about what questions one might want to answer if they are to better understand what is going on.

Since DNA replication is a process, we can invoke the energy story rubric to think about it. Recall that the energy story rubric is there to help us think systematically about processes (how things go from A to B). In this case the process in question is the act of starting with one double-stranded DNA molecule and ending up with two double-stranded molecules. So, we will ask a variety of questions: What does the system look like at the beginning (matter and energy) of replication? How are matter and energy transferred in the system and what catalyzes the transfers? What does the system look like at the end of the process? We can also ask questions regarding specific events that MUST happen during the process. For instance, since DNA is a long molecule and it is sometimes circular, we can ask basic questions like, where does the process of replication start? Where does it end? We can also ask practical questions about the process like, what happens when a double-stranded structure is unwound?

We consider some of these key questions in the text and in class and encourage you to do the same.

### Requirements for DNA replication

Let's start by listing some basic functional requirements for DNA replication that we can infer just by thinking about the process that must happen and/or be required for the replication to happen. So, what do we need?

• We know that DNA is composed of nucleotides. If we are going to create a new strand, we will need a source of nucleotides.
• We can infer that building a new strand of DNA will require an energy source—we should try to find this.
• We can infer that there must be a process for finding a place to start replication.
• We can infer that there will be one or more enzymes that help catalyze the process of replication.
• We can also infer that since this is a biochemical process, that some mistakes will be made.

### Nucleotide structure review

Recall some basic structural features of the nucleotide building blocks of DNA. The nucleotides start off as nucleotide triphosphates. A nucleotide is composed of a nitrogenous base, deoxyribose (a five-carbon sugar), and a phosphate group. The nucleotide is named according to its nitrogenous base: purines such as adenine (A) and guanine (G), or pyrimidines such as cytosine (C) and thymine (T). Recall the structures below. Note that the nucleotide adenosine triphosphate (ATP) is a precursor of the deoxyribonucleotide (dATP) which is incorporated into DNA.

Figure 3. Each nucleotide is made up of a sugar (ribose or deoxyribose depending on whether it builds RNA or DNA, respectively), a phosphate group, and a nitrogenous base. The purines have a double ring structure with a six-membered ring fused to a five-membered ring. Pyrimidines are smaller in size; they have a single six-membered ring structure. The carbon atoms of the five-carbon sugar are numbered 1', 2', 3', 4', and 5' (1' is read as “one prime”). The phosphate residue is attached to the hydroxyl group of the 5' carbon of one sugar of one nucleotide and the hydroxyl group of the 3' carbon of the sugar of the next nucleotide, thereby forming a 5'-3' phosphodiester bond.

## Initiation of replication

### Where along the DNA does the replication machinery start DNA replication?

With millions, if not billions, of nucleotides to copy how does the DNA polymerase know where to start? Not surprisingly, this process turns out not to be random. There are specific nucleotide sequences called origins of replication along the length of the DNA at which replication begins. Once this site is identified, however, there is a problem. The DNA double helix is held together by base stacking interactions and hydrogen bonds. If each strand is to be read and copied individually, there must be some mechanism responsible for helping to dissociate the two strands from one another. Energetically, this is an endergonic process. Where does the energy come from and how is this reaction catalyzed? Basic reasoning should, at this point, lead to the hypothesis that a protein catalyst is likely involved, and this enzyme either creates new bonds that are energetically more favorable (exergonic) than the bonds it breaks AND/OR it is able to couple the use of an external energy source to help dissociate the strands.

It turns out that the details of this process and the proteins involved differ depending on the specific organism in question, and many of the molecular level details are yet to be completely understood. There are, however, some common features in the replication of eukaryotes, bacteria, and archaea, and one of these features is that multiple different types of proteins are involved in replicating DNA. First, proteins generally called "initiators" have the capacity to bind DNA at or very near origins of replication. The interaction of the initiator proteins with the DNA helps to destabilize the double helix and also helps to recruit other proteins, including an enzyme called a DNA helicase, to the DNA. In this case the energy required to destabilize the DNA double helix seems to come from the formation of new associations between DNA and the initiator proteins and among the proteins themselves. The DNA helicase is a multi-subunit protein that is important in the process of replication because it couples the exergonic hydrolysis of ATP to the unwinding of the DNA double helix. Additional proteins must be recruited to the initiation complex (the collection of proteins involved in initiating replication). These include, but are not limited to, additional enzymes called primase and DNA polymerase. While the initiators are lost soon after the initiation of replication, the rest of the proteins work in concert to execute the process of DNA replication. This complex of enzymes functions at Y-shaped structures in the DNA called replication forks (Figure 4). For any replication event, two replication forks may be formed at each origin of replication, extending in both directions. Multiple origins of replication can be found on eukaryotic chromosomes and in some archaea, while the genome of the bacterium E. coli seems to encode one origin of replication.

Note: possible discussion

Why would different organisms have different numbers of replication origins? What could the benefit be to having more than one? Is there a drawback to having more than one?

Note: possible discussion

Given what needs to happen at origins of replication, can you use logic to infer and propose for discussion some potential features that distinguish replication origins from other segments of DNA?

Figure 4. At the origin of replication, a replication bubble forms. The replication bubble is composed of two replication forks, each traveling in opposite directions along the DNA. It is understood that the replication forks include all of the enzymes required for replication to occur; they are just not drawn explicitly in the figure to provide room to illustrate the relationships between the template and new DNA strands.

## Elongation of replication

Melting open the DNA double helix and assembling the DNA replication complex are just the first steps in the process of replication. Now the process of creating a new strand actually needs to get started. Here additional challenges are encountered. The first obvious issue is that of determining which of the two strands should be copied at any replication fork (i.e., Which strand will serve as a template for semi-conservative synthesis? Are both strands equally viable alternatives?). There is also the problem of actually getting the process of new strand synthesis started. Can the DNA polymerase initiate the new strand on its own? The answer to the latter question, and some of the rationale and consequences, will be discussed later. The key idea to note at this point is that it has been experimentally determined that DNA polymerase can NOT initiate strand synthesis on its own. Rather, DNA polymerase requires a short stretch of double-stranded structure followed by single-stranded template. The creation of a short oligonucleotide is carried out by the enzyme primase. This protein creates a short polymer of RNA (not DNA) called a primer (these are depicted by short green lines in the figures above and below) that can be used by DNA polymerase to nucleate a new growing strand.

During the process of strand elongation, the DNA polymerase polymerizes a new covalently-linked strand of DNA nucleotides (in bacteria this specific enzyme may be called DNA polymerase III; in eukaryotes, polymerase nomenclature is more complex and the roles of several polymerase proteins are not completely understood). It turns out that one of the strands is favored exclusively over the other to serve as a template. DNA polymerase will "read" the template strand from 3' to 5' and synthesize a new strand in the 5' to 3' direction. Hypotheses to explain this universal observation usually center around the energetics associated with the addition of a new nucleotide and arguments associated with DNA repair that we will describe shortly. Let us, therefore, briefly consider the reaction involving the addition of a single nucleotide. The primer provides an important 3' hydroxyl on which to begin synthesis. The next deoxyribonucleotide triphosphate enters the binding site of the DNA polymerase and, as shown in Figure 5 below, is oriented by the polymerase such that a hydrolysis of the 5' triphosphate can occur, releasing pyrophosphate and coupling this exergonic reaction to the synthesis of a phosphodiester bond between the 5' phosphate of the incoming nucleotide and the 3' hydroxyl group of the primer. This process can be repeated until deoxyribonucleotide triphosphates run out or the replication complex falls off of the DNA. In effect, DNA polymerase adds the phosphate group (5') from the incoming nucleotide to the existing hydroxyl group (3') of the previously added nucleotide.

Correct base pairing, or selection of the correct nucleotide to add at each step, is accomplished by structural constraints felt by the DNA polymerase and the energetically favorable hydrogen bonds formed between complementary nucleotides. The process is energetically driven by the hydrolysis of the incoming 5' triphosphate and the energetically favorable interactions formed by the inter-nucleotide interactions in the growing double helix (base stacking and complementary base pairing hydrogen bonds). Note that the energetics of nucleotide addition do not technically preclude a strand growing in the 3' to 5' direction. The key difference in this scheme is that the energy "source" for synthesis would need to come from a nucleotide already incorporated into the growing strand rather than from the new incoming nucleotide (why this might be an important selective disadvantage is discussed briefly). After elongation has started, a different DNA polymerase (in bacteria this is usually called DNA Polymerase I) comes in to remove the RNA primer and to synthesize the remaining bit of missing DNA.
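The directionality described above can be sketched as a toy primer-extension routine. Several simplifications are assumed: the primer is written with DNA letters even though real primers are RNA, the dNTP pool is a simple counter, and the function and parameter names are illustrative only.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def extend_primer(template_3_to_5, primer_5_to_3, dntp_pool):
    """Extend a primer 5'->3' along a template that is read 3'->5'.

    dntp_pool maps each base to the number of dNTPs available. Synthesis
    stops when the template ends or when a required dNTP has run out.
    """
    new_strand = list(primer_5_to_3)
    pool = dict(dntp_pool)  # copy so the caller's pool is left untouched
    # The primer already covers the first len(primer) template positions.
    for template_base in template_3_to_5[len(primer_5_to_3):]:
        incoming = COMPLEMENT[template_base]
        if pool.get(incoming, 0) == 0:
            break  # no matching dNTP left: elongation stalls
        pool[incoming] -= 1
        # The incoming 5' phosphate is joined to the free 3' hydroxyl,
        # so the strand always grows at its 3' end.
        new_strand.append(incoming)
    return "".join(new_strand)


plenty = {"A": 10, "T": 10, "G": 10, "C": 10}
print(extend_primer("GTAGGGTATAC", "CAT", plenty))  # CATCCCATATG
```

The template here is the example sequence 5'-CATATGGGATG-3' written in the 3' to 5' direction, so the product is its complement. Passing a pool with too few nucleotides of one type makes the loop stop early, mirroring the statement that elongation continues until deoxyribonucleotide triphosphates run out.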

As will be discussed in more detail in class, the movement of the replication fork induces winding of the DNA in both directions of replication. Another ATP consuming enzyme called topoisomerase helps to relieve this stress.

Figure 5. DNA polymerase catalyzes the addition of the 5' phosphate group from an incoming nucleotide to the 3' hydroxyl group of the previous nucleotide. This process creates a phosphodiester bond between the nucleotides while hydrolyzing the phosphoanhydride bond in the nucleotide.
Source: http://bio1151.nicerweb.com/Locked/m...h16/elong.html

Note: possible discussion

Create an energy story for the addition of a nucleotide onto a polymer as shown in the figure above. This will be an explicit learning goal from some of your BIS2A instructors.

The discussion above about strand elongation describes the process of new strand synthesis if that strand happens to be synthesized in the same direction as the replication fork is, or appears to be, moving along the DNA. This strand can be synthesized continuously and is called the leading strand. However, both strands of the original DNA double helix must be copied. Since the DNA polymerase can only synthesize DNA in a 5' to 3' direction, the polymerization of the strand opposite the leading strand must occur in the direction opposite to that in which the helicase, or front of the replication fork, is traveling. This strand is called the lagging strand and, due to geometric constraints, must be synthesized through a series of RNA priming and DNA synthesis events as short segments called Okazaki fragments. As noted, the initiation of synthesis of each Okazaki fragment requires a primase to synthesize an RNA primer, and each of these RNA primers must ultimately be removed and replaced with DNA nucleotides by a different DNA polymerase. The covalent bonds between the Okazaki fragments cannot be made by the DNA polymerase and must therefore be formed by yet another enzyme called DNA ligase. The geometry of lagging strand synthesis is difficult to visualize and will be covered in class.

Figure 6. The lagging strand is created in multiple segments. A replication fork shows the leading and lagging strand. A replication bubble shows the leading and lagging strands.
BIS2A Team original image
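To get a feel for how much extra work the lagging strand involves, the events described above can be counted for a stretch of template. This is a rough sketch: the fragment length is a hypothetical parameter chosen for illustration, not a value given in this text, and the function name is ours.

```python
import math


def lagging_strand_events(template_length, fragment_length):
    """Count priming and ligation events for one stretch of lagging strand.

    Each Okazaki fragment needs its own RNA primer, and neighbouring
    fragments must be joined by DNA ligase after the primers are replaced.
    Joins to DNA outside this stretch are ignored.
    """
    fragments = math.ceil(template_length / fragment_length)
    return {
        "okazaki_fragments": fragments,
        "rna_primers": fragments,
        "ligations": max(fragments - 1, 0),
    }


# Hypothetical numbers: 10,000 nucleotides copied in 1,500-nucleotide pieces.
print(lagging_strand_events(10_000, 1_500))
```

By contrast, the leading strand over the same stretch needs only a single primer, which is why it can be described as continuous.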

## Termination of replication

### Telomeres and telomerase

The end of replication in circular bacterial chromosomes poses few practical problems. However, the ends of linear eukaryotic chromosomes pose a specific problem for DNA replication. Because DNA polymerase can add nucleotides in only one direction (5' to 3'), the leading strand allows for continuous synthesis until the end of the chromosome is reached; however, as the replication complex arrives at the end of the lagging strand there is no place for the primase to "land" and synthesize an RNA primer so that the synthesis of the missing lagging strand DNA fragment at the end of the chromosome can be initiated by the DNA polymerase. Without some mechanism to help fill this gap, this chromosomal end will remain unpaired and will be lost to nucleases. Over time, and several rounds of replication, this would result in the ends of linear chromosomes getting progressively shorter, ultimately compromising the ability of the organism to survive. These ends of the linear chromosomes are known as telomeres, and nearly all eukaryotic species have evolved repetitive sequences there that do not code for a specific gene. As a consequence, these "non-coding" telomeres act as replication buffers and are shortened with each round of DNA replication instead of critical genes. For example, in humans, a six base-pair sequence, TTAGGG, is repeated 100 to 1000 times at the end of most chromosomes. Beyond this buffering role, the discovery of the enzyme telomerase helped explain how chromosome ends are maintained. Telomerase is an enzyme composed of protein and RNA. Telomerase attaches to the end of the chromosome by complementary base pairing between the RNA component of telomerase and the DNA template. The RNA then serves as a template for a short extension of the DNA strand it is paired with. This process can be repeated numerous times.
Once the lagging strand template is sufficiently elongated by telomerase, primase will create a primer followed by DNA polymerase which can now add nucleotides that are complementary to the ends of the chromosomes. Thus, the ends of the chromosomes are replicated.
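The buffering role of the telomere can be made concrete with a little arithmetic. The repeat sequence and the range of 100 to 1000 copies come from the text; the number of base pairs lost per division is a hypothetical placeholder chosen only to illustrate the idea, and the function names are ours.

```python
REPEAT = "TTAGGG"  # human telomeric repeat quoted in the text


def telomere_length_bp(repeat_count, repeat=REPEAT):
    """Length of a telomere made of repeat_count copies of the repeat."""
    return repeat_count * len(repeat)


def divisions_until_exhausted(telomere_bp, loss_per_division_bp):
    """Whole rounds of replication the buffer can absorb with no telomerase."""
    return telomere_bp // loss_per_division_bp


shortest = telomere_length_bp(100)    # 600 bp
longest = telomere_length_bp(1000)    # 6000 bp
print(shortest, longest)

# ASSUMPTION: 100 bp lost per division is an illustrative value only.
print(divisions_until_exhausted(longest, loss_per_division_bp=100))
```

Whatever the real loss per division is, the calculation shows the general point: without telomerase the buffer is finite, so the number of divisions a cell lineage can undergo before reaching critical sequence is limited.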

Figure 7. The ends of linear chromosomes are maintained by the action of the telomerase enzyme.

Telomerase is not active in adult somatic cells. Adult somatic cells that undergo cell division continue to have their telomeres shortened. This essentially means that telomere shortening is associated with aging. In 2010, scientists found that telomerase can reverse some age-related conditions in mice, and this may have potential in regenerative medicine.1 Telomerase-deficient mice were used in these studies; these mice have tissue atrophy, stem-cell depletion, organ system failure, and impaired tissue injury responses. Telomerase reactivation in these mice caused extension of telomeres, reduced DNA damage, reversed neurodegeneration, and improved functioning of the testes, spleen, and intestines. Thus, telomerase reactivation may have potential for treating age-related diseases in humans.

## Differences in DNA replication rates between bacteria and eukaryotes

DNA replication has been extremely well studied in bacteria, primarily because of the small size of the genome and large number of variants available. E. coli has 4.6 million base pairs in a single circular chromosome, and all of it gets replicated in approximately 42 minutes, starting from a single origin of replication and proceeding around the chromosome in both directions. This means that approximately 1000 nucleotides are added per second. The process is much more rapid than in eukaryotes. Table 1 summarizes the differences between bacterial and eukaryotic replications.
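The quoted rate follows directly from the genome size and the replication time. The sketch below redoes the calculation, assuming the two replication forks share the work equally; the function name is illustrative.

```python
def nucleotides_per_second(genome_bp, minutes, forks=2):
    """Average number of base pairs copied per second at each fork."""
    return genome_bp / (minutes * 60) / forks


# E. coli figures from the text: 4.6 million bp copied in about 42 minutes.
total_rate = nucleotides_per_second(4_600_000, 42, forks=1)
per_fork = nucleotides_per_second(4_600_000, 42, forks=2)
print(f"whole chromosome: ~{total_rate:.0f} bp/s")
print(f"per fork: ~{per_fork:.0f} bp/s")
```

The chromosome as a whole is copied at roughly 1800 base pairs per second, which works out to a little over 900 per fork, consistent with the figure of approximately 1000 nucleotides per second.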

Table 1. Differences between prokaryotic and eukaryotic replication

| Property | Prokaryotes | Eukaryotes |
| --- | --- | --- |
| Origin of replication | Single | Multiple |
| Rate of polymerization per polymerase | 1000 nucleotides/s | 50 to 100 nucleotides/s |
| Chromosome structure | Circular | Linear |
| Telomerase | Not present | Present |


When the cell begins the task of replicating the DNA, it does so in response to environmental signals that tell the cell it is time to divide. The ideal goal of DNA replication is to produce two identical copies of the double-stranded DNA template and to do it in an amount of time that does not pose an unduly high evolutionarily selective cost. This is a daunting task when you consider that there are ~6,500,000,000 base pairs in the human genome and ~4,500,000 base pairs in the genome of a typical E. coli strain and that Nature has determined that the cells must replicate within 24 hours and 20 minutes, respectively. In either case, many individual biochemical reactions need to take place.
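A rough calculation shows why a genome of this size cannot be copied from a single origin in the time available. The sketch below uses the genome size and 24-hour window quoted here, together with the eukaryotic polymerization rates of 50 to 100 nucleotides per second from Table 1. Treating each fork as a single polymerase working for the full 24 hours is a simplifying assumption, so the result is only a lower bound.

```python
import math

HUMAN_GENOME_BP = 6_500_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def minimum_forks(genome_bp, rate_nt_per_s, seconds=SECONDS_PER_DAY):
    """Fewest simultaneously active forks needed to finish in the time given."""
    return math.ceil(genome_bp / (rate_nt_per_s * seconds))


for rate in (50, 100):
    print(f"{rate} nt/s -> at least {minimum_forks(HUMAN_GENOME_BP, rate)} forks")
```

Even under these generous assumptions, hundreds to thousands of forks must be active at once, which fits with eukaryotic chromosomes having multiple origins of replication.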

While ideally replication would happen with perfect fidelity, DNA replication, like all other biochemical processes, is imperfect—bases may be left out, extra bases may be added, or bases may be added that do not properly base-pair. In many organisms, many of the mistakes that occur during DNA replication are promptly corrected by DNA polymerase itself via a mechanism known as proofreading. In proofreading, the DNA polymerase "reads" each newly added base via sensing the presence or absence of small structural anomalies before adding the next base to the growing strand. In doing so, a correction can be made.

If the polymerase detects that a newly added base has paired correctly with the base in the template strand, the next nucleotide is added. If, however, a wrong nucleotide is added to the growing polymer, the misshaped double helix will cause the DNA polymerase to stall, and the newly made strand will be ejected from the polymerizing site on the polymerase and will enter into an exonuclease site. In this site, DNA polymerase is able to cleave off the last several nucleotides that were added to the polymer. Once the incorrect nucleotides have been removed, new ones will be added again. This proofreading capability comes with some trade-offs: using an error-correcting/more accurate polymerase requires time (the trade-off is speed of replication) and energy (always an important cost to consider). The slower you go, the more accurate you can be. Going too slow, however, may keep you from replicating as fast as your competition, so figuring out the balance is key.

Errors that are not corrected by proofreading become what are known as mutations.

Figure 1. Proofreading by DNA polymerase corrects errors during replication.
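The check-then-continue logic of proofreading can be sketched in a few lines. This toy model makes two simplifications: the "polymerase" is handed a fixed list of incoming bases, some of which may be wrong, and it removes only the single mismatched base rather than the last several nucleotides. All names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_with_proofreading(template_3_to_5, incoming_bases):
    """Build a new strand, excising any base that fails to pair.

    Returns the new strand (5'->3') and the number of excised bases.
    """
    new_strand = []
    excised = 0
    offers = iter(incoming_bases)
    for template_base in template_3_to_5:
        for base in offers:
            new_strand.append(base)  # the base is added to the 3' end
            if base == COMPLEMENT[template_base]:
                break  # pairs correctly: move on to the next position
            new_strand.pop()  # mismatch detected: exonuclease removes it
            excised += 1
        else:
            break  # ran out of incoming bases before this position was filled
    return "".join(new_strand), excised


# The second base offered (G) does not pair with T and is removed.
print(synthesize_with_proofreading("GTA", "CGAT"))  # ('CAT', 1)
```

Each excision costs an extra cycle of addition and removal, which is one way to picture the speed and energy trade-off described above.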

Suggested discussion

Why would DNA replication need to be fast? Consider the environment the DNA is in, and compare that to the structure of DNA while being replicated.

Suggested discussion

What are the pros and cons of DNA polymerase's proofreading capabilities?

## Replication mistakes and DNA repair

Although DNA replication is typically a highly accurate process, and proofreading by DNA polymerases helps to keep the error rate low, mistakes still occur. In addition to errors of replication, environmental damage may also occur to the DNA. Such uncorrected errors of replication or environmental DNA damage may lead to serious consequences. Therefore, Nature has evolved several mechanisms for repairing damaged or incorrectly synthesized DNA.

### Mismatch repair

Some errors are not corrected during replication but are instead corrected after replication is completed; this type of repair is known as mismatch repair. Specific enzymes recognize the incorrectly added nucleotide and excise it, replacing it with the correct base. But how do mismatch repair enzymes recognize which of the two bases is the incorrect one?

In E. coli, after replication, the nitrogenous base adenine acquires a methyl group; this means that directly after replication the parental DNA strand will have methyl groups, whereas the newly synthesized strand lacks them. Thus, mismatch repair enzymes are able to scan the DNA and remove the wrongly incorporated bases from the newly synthesized, non-methylated strand by using the methylated strand as the "correct" template from which to incorporate a new nucleotide. In eukaryotes, the mechanism is not as well understood, but it is believed to involve recognition of unsealed nicks in the new strand, as well as a short-term, continuing association of some of the replication proteins with the new daughter strand after replication has completed.

Figure 2. In mismatch repair, the incorrectly added base is detected after replication. The mismatch repair proteins detect this base and remove it from the newly synthesized strand by nuclease action. The gap is now filled with the correctly paired base.
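The strand-discrimination idea can be illustrated with a short sketch in which the parental (methylated) strand is always treated as the reference. Methylation itself is not modeled; the caller simply says which strand is parental. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq_5_to_3):
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq_5_to_3))


def mismatch_repair(parental_5_to_3, new_5_to_3):
    """Correct the new strand using the parental strand as the template.

    Returns the repaired new strand and the positions that were replaced.
    Both strands are assumed to have the same length.
    """
    expected = reverse_complement(parental_5_to_3)
    repaired = []
    corrected_positions = []
    for position, (found, wanted) in enumerate(zip(new_5_to_3, expected)):
        if found != wanted:
            corrected_positions.append(position)
        repaired.append(wanted)
    return "".join(repaired), corrected_positions


# The new strand carries a T where a C should pair with the parental G.
print(mismatch_repair("CATATGGGATG", "CATCCTATATG"))  # ('CATCCCATATG', [5])
```

If the roles of the two strands were swapped, the same code would "repair" the parental strand to match the error, which is why the cell needs a reliable way to tell old from new.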

### Nucleotide excision repair

Nucleotide excision repair enzymes replace incorrect bases by making a cut on both the 3' and 5' ends of the incorrect base. The entire segment of DNA is removed and replaced with correctly paired nucleotides by the action of a DNA polymerase. Once the bases are filled in, the remaining gap is sealed with a phosphodiester linkage catalyzed by the enzyme DNA ligase. This repair mechanism is often employed when UV exposure causes the formation of pyrimidine dimers.

Figure 3. Nucleotide excision repairs thymine dimers. When exposed to UV, thymines lying adjacent to each other can form thymine dimers. In normal cells, they are excised and replaced.
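Because thymine dimers form between adjacent thymines on the same strand, the positions at risk can be read directly from a sequence. The sketch below simply lists them; the example sequence and function name are made up for illustration.

```python
def adjacent_thymine_sites(seq):
    """Return the index of the first T in every pair of adjacent thymines."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "TT"]


# A run of three Ts contains two overlapping adjacent pairs.
print(adjacent_thymine_sites("ATTGCTTTA"))  # [1, 5, 6]
```

Each listed position marks a place where UV exposure could create a dimer that nucleotide excision repair would then need to remove.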

## Consequences of errors in replication, transcription, and translation

Cells have evolved a variety of ways to make sure DNA errors are both detected and corrected. We have already discussed several of them. But why did so many different mechanisms evolve, from proofreading by the various DNA-dependent DNA polymerases to the complex repair systems? Such mechanisms did not evolve for errors in transcription or translation. If you are familiar with the processes of transcription and/or translation, think about what the consequences would be of an error in transcription. Would such an error affect the offspring? Would it be lethal to the cell? Ask the same questions about errors in translation. What would happen if the wrong amino acid were accidentally put into the growing polypeptide during translation? How do these contrast with errors in DNA replication? If you are not familiar with transcription or translation, don't fret. We'll learn those soon and return to this question again.

## The flow of genetic information

In bacteria, archaea, and eukaryotes, the primary role of DNA is to store heritable information that encodes the instruction set required for creating the organism in question. While we have gotten much better at quickly reading the chemical composition (the sequence of nucleotides in a genome and some of the chemical modifications that are made to it), we still don't know how to reliably decode all of the information encoded within and all of the mechanisms by which it is read and ultimately expressed.

There are, however, some core principles and mechanisms associated with the reading and expression of the genetic code whose basic steps (even though many details remain unsolved) are understood and that need to be part of the conceptual toolkit for all biologists. Two of these processes are transcription and translation, which are the copying of parts of the genetic code written in DNA into molecules of the related polymer RNA and the reading and encoding of the RNA code into proteins, respectively.

In BIS2A, we focus largely on developing an understanding of the process of transcription (recall that an Energy Story is simply a rubric for describing a process) and its role in the expression of genetic information. We motivate our discussion of transcription by focusing on functional problems (bringing in parts of our problem-solving/design challenge rubric) that must be solved for the process to take place. We then go on to describe how the process is used by Nature to create a variety of functional RNA molecules (that may have various structural, catalytic, or regulatory roles), including so-called messenger RNA (mRNA) molecules that carry the information required to synthesize proteins. Likewise, we focus on challenges and questions associated with the process of translation, the process by which the ribosomes synthesize proteins.

The basic flow of genetic information in biological systems is often depicted in a scheme known as "the central dogma" (see figure below). This scheme states that information encoded in DNA flows into RNA via transcription and ultimately to proteins via translation. Processes like reverse transcription (the creation of DNA from an RNA template) and replication also represent mechanisms for propagating information in different forms. This scheme, however, doesn't say anything per se about how information is encoded or about the mechanisms by which regulatory signals move between the various layers of molecule types depicted in the model. Therefore, while the scheme below is a nearly required part of the lexicon of any biologist, perhaps left over from old tradition, students should also be aware that mechanisms of information flow are more complex (we'll learn about some as we go) and that "the central dogma" represents only some core pathways.

Figure 1. The flow of genetic information.
Attribution: Marc T. Facciotti (original work)

## Genotype to phenotype

An important concept in the following sections is the relationship between genetic information, the genotype, and the result of expressing it, the phenotype. These two terms and the mechanisms that link the two will be discussed repeatedly over the next few weeks—start becoming proficient with using this vocabulary.

Figure 2. The information stored in DNA is in the sequence of the individual nucleotides when read from 5' to 3' direction. Conversion of the information from DNA into RNA (a process called transcription) produces the second form that information takes in the cell. The mRNA is used as the template for the creation of the amino acid sequence of proteins (in translation). Here, two different sets of information are shown. The DNA sequence is slightly different, resulting in two different mRNAs produced, followed by two different proteins, and ultimately, two different coat colors for the mice.

Genotype refers to the information stored in the DNA of the organism, the sequence of the nucleotides, and the compilation of its genes. Phenotype refers to any physical characteristic that you can measure, such as height, weight, amount of ATP produced, ability to metabolize lactose, response to environmental stimuli, etc. Differences in genotype, even slight, can lead to different phenotypes that are subject to natural selection. The figure above depicts this idea. Also note that, while the relationship between genotype and phenotype is classically discussed in the context of multicellular organisms, this nomenclature and the underlying concepts apply to all organisms, even single-celled organisms like bacteria and archaea.
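The path from a small difference in genotype to a different gene product can be sketched in a few lines of Python. This is a simplified illustration, not a model of the mouse coat-color genes in the figure: the two alleles are invented sequences, the input is assumed to be the coding (non-template) strand, and the codon table contains only the handful of standard codons the example needs.

```python
# Only the codons needed for this example; the full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met", "GCU": "Ala", "GAU": "Asp", "GGU": "Gly",
    "AAA": "Lys", "UAA": "Stop",
}

def transcribe(coding_strand):
    """Return the mRNA for a coding (non-template) strand written 5' to 3'."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Read codons from the start of the mRNA until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        protein.append(residue)
    return "-".join(protein)

allele_1 = "ATGGCTGATAAATAA"   # hypothetical sequence
allele_2 = "ATGGCTGGTAAATAA"   # differs from allele_1 at one nucleotide

for allele in (allele_1, allele_2):
    print(allele, "->", translate(transcribe(allele)))
```

A single A-to-G change in the third codon swaps aspartate for glycine in the product. Whether such a change alters the phenotype depends on what the protein does, which the code does not attempt to capture.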

Note: possible discussion

Can something you cannot see "by eye" be considered a phenotype?

Note: possible discussion

Can single-celled organisms have multiple simultaneous phenotypes? If so, can you propose an example? If not, why?

## Genes

What is a gene? A gene is a segment of DNA in an organism's genome that encodes a functional RNA (such as rRNA, tRNA, etc.) or protein product (enzymes, tubulin, etc.). A generic gene contains elements encoding regulatory regions and a region encoding a transcribed unit.

Genes can acquire mutations, defined as changes in the composition and/or sequence of the nucleotides, in either the coding or regulatory regions. These mutations can lead to several possible outcomes: (1) nothing measurable happens as a result; (2) the gene is no longer expressed; or (3) the expression or behavior of the gene product(s) is different. In a population of organisms sharing the same gene, different variants of the gene are known as alleles. Different alleles can lead to differences in phenotypes of individuals and contribute to the diversity in biology that is under selective pressure.

Start learning these vocabulary terms and associated concepts. You will then be somewhat familiar with them when we start diving into them in more detail over the next lectures.

Figure 3. A gene consists of a coding region for an RNA or protein product accompanied by its regulatory regions. The coding region is transcribed into RNA which is then translated into protein.

## Genomics

The study of nucleic acids began with the discovery of DNA, progressed to the study of genes and small fragments, and has now exploded to the field of genomics. Genomics is the study of entire genomes, including the complete set of genes, their nucleotide sequence and organization, and their interactions both within a species and with other species. The advances in genomics have been made possible by DNA sequencing technology. Just as information technology has led to Google Maps, enabling us to get detailed information about locations around the globe, genomic information is used to create similar maps of the DNA of different organisms.

## Mapping genomes

Genome mapping is the process of finding the location of genes on each chromosome. The maps that are created are comparable to the maps that we use to navigate streets. A genetic map is an illustration that lists genes and their location on a chromosome. Genetic maps provide the big picture (similar to a map of interstate highways) and use genetic markers (similar to landmarks). A genetic marker is a gene or sequence on a chromosome that shows genetic linkage with a trait of interest. The genetic marker tends to be inherited with the gene of interest. One measure of distance between them is the recombination frequency during meiosis; early geneticists called this linkage analysis.
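As a worked example of how recombination frequency becomes a distance, the sketch below computes the percentage of recombinant offspring from a cross. The offspring counts are hypothetical. The conversion of one percent recombination to one map unit is the usual convention in linkage analysis and is an addition to the text above; it is a good approximation only for markers that are fairly close together.

```python
def recombination_frequency(recombinant, total):
    """Percentage of offspring that carry a recombinant combination of markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * recombinant / total

# Hypothetical cross: 1000 offspring scored, 80 of them recombinant.
rf = recombination_frequency(80, 1000)
print(f"{rf:.1f}% recombinants, about {rf:.1f} map units apart")
```

The lower the percentage, the more tightly the marker and the gene of interest are linked, which is why a closely linked marker is a useful landmark.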

Physical maps get into the intimate details of smaller regions of the chromosomes (similar to a detailed road map). A physical map is a representation of the physical distance, in nucleotides, between genes or genetic markers. Both genetic linkage maps and physical maps are required to build a complete picture of the genome. Having a complete map of the genome makes it easier for researchers to study individual genes. Human genome maps help researchers in their efforts to identify human disease-causing genes related to illnesses such as cancer, heart disease, and cystic fibrosis, to name a few. In addition, genome mapping can be used to help identify organisms with beneficial traits, such as microbes with the ability to clean up pollutants or even prevent pollution. Research involving plant genome mapping may lead to agricultural methods that produce higher crop yields or to the development of plants that adapt better to climate change.

Figure 1. This is a physical map of the human X chromosome.

Credit: modification of work by NCBI, NIH

Genetic maps provide the outline, and physical maps provide the details. It is easy to understand why both types of genome-mapping techniques are important to show the big picture. Information obtained from each technique is used in combination to study the genome. Genomic mapping is used with different model organisms that are used for research. Genome mapping is still an ongoing process, and as more advanced techniques are developed, more advances are expected. Genome mapping is similar to completing a complicated puzzle using every piece of available data. Mapping information generated in laboratories all over the world is entered into central databases, such as the National Center for Biotechnology Information (NCBI). Efforts are made to make the information more easily accessible to researchers and the general public. Just as we use global positioning systems instead of paper maps to navigate through roadways, NCBI allows us to use a genome viewer tool to simplify the data mining process.

## Whole genome sequencing

Although there have been significant advances in the medical sciences in recent years, doctors are still confounded by many diseases, and researchers are using whole genome sequencing to get to the bottom of the problem. Whole genome sequencing is a process that determines the DNA sequence of an entire genome. Whole genome sequencing is a brute-force approach to problem solving when there is a genetic basis at the core of a disease. Several laboratories now provide services to sequence, analyze, and interpret entire genomes.

In 2010, whole genome sequencing was used to save a young boy whose intestines had multiple mysterious abscesses. The child had several colon operations with no relief. Finally, a whole genome sequence revealed a defect in a pathway that controls apoptosis (programmed cell death). A bone marrow transplant was used to overcome this genetic disorder, leading to a cure for the boy. He was the first person to be successfully diagnosed using whole genome sequencing.

The first genomes to be sequenced, such as those belonging to viruses, bacteria, and yeast, were smaller in terms of the number of nucleotides than the genomes of multicellular organisms. The genomes of other model organisms, such as the mouse (Mus musculus), the fruit fly (Drosophila melanogaster), and the nematode (Caenorhabditis elegans) are now known. A great deal of basic research is performed in model organisms because the information can be applied to other organisms. A model organism is a species that is studied as a model to understand the biological processes in other species that can be represented by the model organism. For example, fruit flies are able to metabolize alcohol like humans, so the genes affecting sensitivity to alcohol have been studied in fruit flies in an effort to understand the variation in sensitivity to alcohol in humans. Having entire genomes sequenced helps with the research efforts in these model organisms.

Figure 2. Much basic research is done with model organisms, such as the mouse, Mus musculus; the fruit fly, Drosophila melanogaster; the nematode, Caenorhabditis elegans; the yeast, Saccharomyces cerevisiae; and the common weed, Arabidopsis thaliana.

Credit: "mouse": modification of work by Florean Fortescue; "nematodes": modification of work by "snickclunk"/Flickr; "common weed": modification of work by Peggy Greb, USDA; scale-bar data from Matt Russell

The first human genome sequence was published in 2003. The number of whole genomes that have been sequenced steadily increases and now includes hundreds of species and thousands of individual human genomes.

## Applying genomics

The introduction of DNA sequencing and whole genome sequencing projects, particularly the Human Genome Project, has expanded the applicability of DNA sequence information. Genomics is now being used in a wide variety of fields, such as metagenomics, pharmacogenomics, and mitochondrial genomics. The most commonly known application of genomics is to understand and find cures for diseases.

### Predicting disease risk at the individual level

Predicting the risk of disease involves screening and identifying currently healthy individuals by genome analysis at the individual level. Intervention with lifestyle changes and drugs can be recommended before disease onset. However, this approach is most applicable when the problem arises from a single gene mutation. Such defects only account for about five percent of diseases found in developed countries. Most of the common diseases, such as heart disease, are multifactorial or polygenic, which refers to a phenotypic characteristic that is determined by two or more genes, and also environmental factors such as diet. In April 2010, scientists at Stanford University published the genome analysis of a healthy individual (Stephen Quake, a scientist at Stanford University, who had his genome sequenced); the analysis predicted his propensity to acquire various diseases. A risk assessment was done to analyze Quake’s percentage of risk for 55 different medical conditions. A rare genetic mutation was found that showed him to be at risk for sudden heart attack. He was also predicted to have a 23 percent risk of developing prostate cancer and a 1.4 percent risk of developing Alzheimer’s disease. The scientists used databases and several publications to analyze the genomic data. Even though genomic sequencing is becoming more affordable and analytical tools are becoming more reliable, ethical issues surrounding genomic analysis at a population level remain to be addressed. For example, could such data be legitimately used to charge more or less for insurance or to affect credit ratings?

### Genome-wide association studies

Since 2005, it has been possible to conduct a type of study called a genome-wide association study, or GWAS. A GWAS is a method that identifies differences between individuals in single nucleotide polymorphisms (SNPs) that may be involved in causing diseases. The method is particularly suited to diseases that may be affected by one or many genetic changes throughout the genome. It is very difficult to identify the genes involved in such a disease using family history information. The GWAS method relies on a genetic database that has been in development since 2002 called the International HapMap Project. The HapMap Project sequenced the genomes of several hundred individuals from around the world and identified groups of SNPs. The groups include SNPs that are located near each other on chromosomes, so they tend to stay together through recombination. The fact that the group stays together means that identifying one marker SNP is all that is needed to identify all the SNPs in the group. There are several million SNPs identified, but identifying them in other individuals who have not had their complete genome sequenced is much easier because only the marker SNPs need to be identified.

In a common design for a GWAS, two groups of individuals are chosen; one group has the disease, and the other group does not. The individuals in each group are matched in other characteristics to reduce the effect of confounding variables causing differences between the two groups. For example, the genotypes may differ because the two groups are mostly taken from different parts of the world. Once the individuals are chosen, and typically their numbers are a thousand or more for the study to work, samples of their DNA are obtained. The DNA is analyzed using automated systems to identify large differences in the percentage of particular SNPs between the two groups. Often the study examines a million or more SNPs in the DNA. The results of GWAS can be used in two ways: the genetic differences may be used as markers for susceptibility to the disease in undiagnosed individuals, and the particular genes identified can be targets for research into the molecular pathway of the disease and potential therapies. An offshoot of the discovery of gene associations with disease has been the formation of companies that provide so-called “personal genomics”, which will identify risk levels for various diseases based on an individual’s SNP complement. The science behind these services is controversial.
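The comparison at the heart of this design can be sketched for a single SNP. The example below uses a Pearson chi-square statistic on a 2x2 table of allele counts, which is one common choice and an assumption here, since the text does not name a specific test. All counts are hypothetical.

```python
def allele_frequency(allele_count, total_alleles):
    """Fraction of all sampled alleles that are the allele of interest."""
    return allele_count / total_alleles

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]].

    Rows are cases and controls; columns are counts of the two alleles.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Hypothetical counts of allele "A" vs allele "G" at one SNP.
cases = (600, 1400)      # 1000 people with the disease -> 2000 alleles
controls = (400, 1600)   # 1000 people without it       -> 2000 alleles

freq_cases = allele_frequency(cases[0], sum(cases))
freq_controls = allele_frequency(controls[0], sum(controls))
statistic = chi_square_2x2(*cases, *controls)
print(f"A allele: {freq_cases:.2f} in cases, {freq_controls:.2f} in controls")
print(f"chi-square = {statistic:.1f}")
```

Because a real study repeats this comparison for a million or more SNPs, some large statistics will arise by chance alone, so the threshold for calling an association must be far stricter than it would be for a single test.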

Because GWAS looks for associations between genes and disease, these studies provide data for other research into causes, rather than answering specific questions themselves. An association between a gene difference and a disease does not necessarily mean there is a cause-and-effect relationship. However, some studies have provided useful information about the genetic causes of diseases. For example, three different studies in 2005 identified a gene for a protein involved in regulating inflammation in the body that is associated with age-related macular degeneration, a disease that causes blindness. This opened up new possibilities for research into the cause of this disease. A large number of genes have been identified to be associated with Crohn’s disease using GWAS, and some of these have suggested new hypothetical mechanisms for the cause of the disease.

### Pharmacogenomics

Pharmacogenomics involves evaluating the effectiveness and safety of drugs on the basis of information from an individual's genomic sequence. Personal genome sequence information can be used to prescribe medications that will be most effective and least toxic on the basis of the individual patient’s genotype. Studying changes in gene expression could provide information about the gene transcription profile in the presence of the drug, which can be used as an early indicator of the potential for toxic effects. For example, genes involved in cellular growth and controlled cell death, when disturbed, could lead to the growth of cancerous cells. Genome-wide studies can also help to find new genes involved in drug toxicity. The gene signatures may not be completely accurate, but can be tested further before pathologic symptoms arise.

### Metagenomics

Traditionally, microbiology has been taught with the view that microorganisms are best studied under pure culture conditions, which involves isolating a single type of cell and culturing it in the laboratory. Because microorganisms can go through several generations in a matter of hours, their gene expression profiles adapt to the new laboratory environment very quickly. On the other hand, many species resist being cultured in isolation. Most microorganisms do not live as isolated entities, but in microbial communities known as biofilms. For all of these reasons, pure culture is not always the best way to study microorganisms. Metagenomics is the study of the collective genomes of multiple species that grow and interact in an environmental niche. Metagenomics can be used to identify new species more rapidly and to analyze the effect of pollutants on the environment. Metagenomics techniques can now also be applied to communities of higher eukaryotes, such as fish.

Figure 3. Metagenomics involves isolating DNA from multiple species within an environmental niche. The DNA is cut up and sequenced, allowing entire genome sequences of multiple species to be reconstructed from the sequences of overlapping pieces.
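The idea of rebuilding a sequence from overlapping pieces can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions: the reads are error-free, come from a single short invented sequence, and contain no repeats. Real metagenomic assembly must handle sequencing errors, repeats, and reads from many genomes at once, and uses more sophisticated methods.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # no overlaps left; remaining pieces stay separate
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

fragments = ["ATGGCGT", "GCGTACG", "TACGTTA"]   # hypothetical reads
print(greedy_assemble(fragments))
```

Reads that share no overlap are returned as separate pieces, which loosely corresponds to fragments that come from different genomes or from regions that were not sequenced deeply enough.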

### Creation of new biofuels

Knowledge of the genomics of microorganisms is being used to find better ways to harness biofuels from algae and cyanobacteria. The primary sources of fuel today are coal, oil, wood, and other plant products such as ethanol. Although plants are renewable resources, there is still a need to find more alternative renewable sources of energy to meet our population’s energy demands. The microbial world is one of the largest resources for genes that encode new enzymes and produce new organic compounds, and it remains largely untapped. This vast genetic resource holds the potential to provide new sources of biofuels.

Figure 4. Renewable fuels were tested in Navy ships and aircraft at the first Naval Energy Forum.

Credit: modification of work by John F. Williams, US Navy

### Mitochondrial genomics

Mitochondria are intracellular organelles that contain their own DNA. Mitochondrial DNA mutates at a rapid rate and is often used to study evolutionary relationships. Another feature that makes studying the mitochondrial genome interesting is that in most multicellular organisms, the mitochondrial DNA is passed on from the mother during the process of fertilization. For this reason, mitochondrial genomics is often used to trace genealogy.

### Genomics in forensic analysis

Information and clues obtained from DNA samples found at crime scenes have been used as evidence in court cases, and genetic markers have been used in forensic analysis. Genomic analysis has also become useful in this field. In 2001, the first use of genomics in forensics was published. It was a collaborative effort between academic research institutions and the FBI to solve the mysterious cases of anthrax that were spread through the US Postal Service. Anthrax bacteria were made into an infectious powder and mailed to news media and two U.S. Senators. The powder infected the administrative staff and postal workers who opened or handled the letters. Five people died, and 17 were sickened by the bacteria. Using microbial genomics, researchers determined that a specific strain of anthrax was used in all the mailings; eventually, the source was traced to a scientist at a national biodefense laboratory in Maryland.

Figure 5. Bacillus anthracis is the organism that causes anthrax.

Credit: modification of work by CDC; scale-bar data from Matt Russell

### Genomics in agriculture

Genomics can reduce the trials and failures involved in scientific research to a certain extent, which could improve the quality and quantity of crop yields in agriculture. Linking traits to genes or gene signatures helps to improve crop breeding to generate hybrids with the most desirable qualities. Scientists use genomic data to identify desirable traits, and then transfer those traits to a different organism to create a new genetically modified organism, as described in the previous module. Scientists are discovering how genomics can improve the quality and quantity of agricultural production. For example, scientists could use desirable traits to create a useful product or enhance an existing product, such as making a drought-sensitive crop more tolerant of the dry season.

Figure 6. Transgenic agricultural plants can be made to resist disease. These transgenic plums are resistant to the plum pox virus.

Credit: Scott Bauer, USDA ARS

## Proteomics

Proteins are the final products of genes that perform the function encoded by the gene. Proteins are composed of amino acids and play important roles in the cell. All enzymes (except ribozymes) are proteins and act as catalysts that affect the rate of reactions. Proteins are also regulatory molecules, and some are hormones. Transport proteins, such as hemoglobin, help transport oxygen to various organs. Antibodies that defend against foreign particles are also proteins. In the diseased state, protein function can be impaired because of changes at the genetic level or because of direct impact on a specific protein.

A proteome is the entire set of proteins produced by a cell type. Proteomes can be studied using the knowledge of genomes because genes code for mRNAs, and the mRNAs encode proteins. The study of the function of proteomes is called proteomics. Proteomics complements genomics and is useful when scientists want to test their hypotheses that were based on genes. Even though all cells in a multicellular organism have the same set of genes, the set of proteins produced in different tissues is different and dependent on gene expression. Thus, the genome is constant, but the proteome varies and is dynamic within an organism. In addition, RNAs can be alternatively spliced (cut and pasted to create novel combinations and novel proteins), and many proteins are modified after translation. Although the genome provides a blueprint, the final architecture depends on several factors that can change the progression of events that generate the proteome.

Genomes and proteomes of patients suffering from specific diseases are being studied to understand the genetic basis of the disease. The most prominent disease being studied with proteomic approaches is cancer (Figure 7). Proteomic approaches are being used to improve the screening and early detection of cancer; this is achieved by identifying proteins whose expression is affected by the disease process. An individual protein is called a biomarker, whereas a set of proteins with altered expression levels is called a protein signature. For a biomarker or protein signature to be useful as a candidate for early screening and detection of a cancer, it must be secreted in bodily fluids such as sweat, blood, or urine, so that large-scale screenings can be performed in a noninvasive fashion.

The current problem with using biomarkers for the early detection of cancer is the high rate of false-negative results. A false-negative result is a negative test result that should have been positive. In other words, many cases of cancer go undetected, which makes biomarkers unreliable. Some examples of protein biomarkers used in cancer detection are CA-125 for ovarian cancer and PSA for prostate cancer. Protein signatures may be more reliable than biomarkers to detect cancer cells. Proteomics is also being used to develop individualized treatment plans, which involves the prediction of whether or not an individual will respond to specific drugs and the side effects that the individual may have. Proteomics is also being used to predict the possibility of disease recurrence.
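The definition of a false negative translates directly into a rate. The short sketch below computes it for a hypothetical screening result; the counts are invented for illustration and do not describe CA-125, PSA, or any real test. The term sensitivity (one minus the false-negative rate) is standard terminology added here.

```python
def false_negative_rate(true_positives, false_negatives):
    """Fraction of people with the disease whose test came back negative."""
    with_disease = true_positives + false_negatives
    if with_disease == 0:
        raise ValueError("no diseased individuals in the sample")
    return false_negatives / with_disease

# Hypothetical screening of 200 people who do have the cancer.
detected, missed = 150, 50
fnr = false_negative_rate(detected, missed)
sensitivity = 1 - fnr
print(f"false-negative rate = {fnr:.2f}, sensitivity = {sensitivity:.2f}")
```

In this made-up example one in four cancers would go undetected, which is the kind of outcome that makes a single biomarker unreliable for screening.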

Figure 7. This machine is preparing to do a proteomic pattern analysis to identify specific cancers so that an accurate cancer prognosis can be made.

Credit: Dorie Hightower, NCI, NIH

The National Cancer Institute has developed programs to improve the detection and treatment of cancer. The Clinical Proteomic Technologies for Cancer and the Early Detection Research Network are efforts to identify protein signatures specific to different types of cancers. The Biomedical Proteomics Program is designed to identify protein signatures and design effective therapies for cancer patients.

## Section summary

Genome mapping is similar to solving a big, complicated puzzle with pieces of information coming from laboratories all over the world. Genetic maps provide an outline for the location of genes within a genome, and they estimate the distance between genes and genetic markers on the basis of the recombination frequency during meiosis. Physical maps provide detailed information about the physical distance between the genes. The most detailed information is available through sequence mapping. Information from all mapping and sequencing sources is combined to study an entire genome.

Whole genome sequencing is the latest available resource to treat genetic diseases. Some doctors are using whole genome sequencing to save lives. Genomics has many industrial applications including biofuel development, agriculture, pharmaceuticals, and pollution control.

Imagination is the only barrier to the applicability of genomics. Genomics is being applied to most fields of biology; it can be used for personalized medicine, prediction of disease risks at an individual level, the study of drug interactions before the conduction of clinical trials, and the study of microorganisms in the environment as opposed to the laboratory. It is also being applied to the generation of new biofuels, genealogical assessment using mitochondria, advances in forensic science, and improvements in agriculture.

Proteomics is the study of the entire set of proteins expressed by a given type of cell under certain environmental conditions. In a multicellular organism, different cell types will have different proteomes, and these will vary with changes in the environment. Unlike a genome, a proteome is dynamic and under constant flux, which makes it more complicated and more useful than the knowledge of genomes alone.