5.1 DNA Isolation, Sequencing, and Synthesis
Genomic DNA vs Complimentary DNA
Genomic deoxyribonucleic acid is chromosomal DNA, in contrast to extra-chromosomal DNA such as that found in the mitochondria of mammals or plasmid structures in bacteria (Figure 5.1). Plasmids will be discussed in more detail in section 5.3 during the discussion of gene cloning and expression. It is also then abbreviated as gDNA. Most organisms have the same genomic DNA in every cell; however, only certain genes are active in each cell to allow for cell function and differentiation within the body.
The genome of an organism (encoded by the genomic DNA) is the (biological) information of heredity which is passed from one generation of organism to the next. That genome is transcribed to produce various RNAs, which are necessary for the function of the organism. Precursor mRNA (pre-mRNA) is transcribed by RNA polymerase II in the nucleus. pre-mRNA is then processed by splicing to remove introns, leaving the exons in the mature messenger RNA (mRNA). Additional processing includes the addition of a 5' cap and a poly(A) tail to the pre-mRNA. The mature mRNA may then be transported to the cytosol and translated by the ribosome into a protein. Other types of RNA include ribosomal RNA (rRNA) and transfer RNA (tRNA). These types are transcribed by RNA polymerase II and RNA polymerase III, respectively, and are essential for protein synthesis. However 5s rRNA is the only rRNA which is transcribed by RNA Polymerase III.
In genetics, complementary DNA (cDNA) is DNA synthesized from a single-stranded RNA (e.g., messenger RNA (mRNA) or microRNA) template in a reaction catalyzed by the enzyme reverse transcriptase (Figure 5.1). Reverse transcriptase is an enzyme found in retroviruses such as HIV that have RNA as their core genetic material. Upon entering the host cell the RNA is reverse transcribed to produce a copy of cDNA that can then integrate into the hosts genomic DNA. In biotechnology, reverse transcriptase is often used to create cDNA from the mRNA expressed in specific cells or tissues. In this way, the eukaryotic genes can be cloned without any introns housed in the structure. This is especially useful if the goal is to express the protein in a prokaryotic (bacterial) host. Recall that bacterial DNA does not house any intron sequences within its chromosomal DNA. Thus, if you are using a prokaryotic system to express eukaryotic proteins, you must use cDNA, as the prokaryotic system will not be able to remove intron sequences following gene transcription.
The term cDNA is also used, typically in a bioinformatics context, to refer to an mRNA transcript's sequence, expressed as DNA bases (GCAT) rather than RNA bases (GCAU).
- cDNA is derived from mRNA, so it contains only exons but no introns.
Figure 5.1. Genomic DNA (gDNA) vs Complimentary DNA (cDNA). Lefthand Diagram shows the processing of genomic DNA within a cell to produce a protein (Upper blue panel shows the structural elements common to eukaryotic genes. The process of gene transcription produces a messenger RNA (mRNA) molecule that must be modified post-translationally, gray panel, to remove the non-coding intron sequences and add the 5'-CAP and Poly-A-Tail sections. The mature mRNA is transported from the nucleus to the cytoplasm where it is translated by the ribosome into the protein sequence, red panel.) Righthand Diagram shows that the isolation of mRNA from a cell can be used to synthesize cDNA using the enzyme reverse transcriptase. The resulting cDNA only contains elements from the mature mRNA, including the exons and poly-A tail.
Image modified from Wikipedia
DNA Isolation Techniques
DNA isolation is a process of purification of DNA from sample using a combination of physical and chemical methods. The first isolation of DNA was done in 1869 by Friedrich Miescher. Currently it is a routine procedure in molecular biology or forensic analyses. For the chemical method, there are many different kits used for extraction, and selecting the correct one will save time on kit optimization and extraction procedures. PCR sensitivity detection is considered to show the variation between the commercial kits.
There are three standard DNA purification techniques described below:
- Cells which are to be studied need to be collected.
- Breaking the cell membranes open to expose the DNA along with the cytoplasm within (cell lysis).
- Lipids from the cell membrane and the nucleus are broken down with detergents and surfactants.
- Breaking proteins by adding a protease (optional).
- Breaking RNA by adding an RNase (optional).
- The solution is treated with concentrated salt solution (saline) to make debris such as broken proteins, lipids and RNA to clump together.
- Centrifugation of the solution, which separates the clumped cellular debris from the DNA.
- DNA purification from detergents, proteins, salts and reagents used during cell lysis step. The most commonly used procedures are:
- Ethanol precipitation usually by ice-cold ethanol or isopropanol. Since DNA is insoluble in these alcohols, it will aggregate together, giving a pellet upon centrifugation. Precipitation of DNA is improved by increasing of ionic strength, usually by adding sodium acetate.
- Phenol–chloroform extraction in which phenol denatures proteins in the sample. After centrifugation of the sample, denaturated proteins stay in the organic phase while aqueous phase containing nucleic acid is mixed with the chloroform that removes phenol residues from solution.
- Minicolumn purification that relies on the fact that the nucleic acids may bind (adsorption) to the solid phase (silica or other) depending on the pH and the salt concentration of the buffer (Figure 5.2).
Cellular and histone proteins bound to the DNA can be removed either by adding a protease or by having precipitated the proteins with sodium or ammonium acetate, or extracted them with a phenol-chloroform mixture prior to the DNA-precipitation.
After isolation, the DNA is dissolved in slightly alkaline buffer, usually in a Tris-EDTA buffer, or in ultra-pure water. Modifications made to these standard techniques are often done if the tissue being used is difficult to breakdown, if contaminants persist in the lysis solution that inhibit further reactions, or if the sample is extremely minimal, as is often the case in forensic investigations. In addition, different commercial kits will be tailored for the isolation of larger genomic DNA or smaller plasmid DNA.
Figure 5.2 Silica Spin Column Used for DNA Purification. Spin column-based nucleic acid purification is a solid phase extraction method to quickly purify nucleic acids. This method relies on the fact that nucleic acid will bind to the solid phase of silica under certain conditions and then released when those conditions are altered. For binding, a buffer solution is added to the DNA lysate along with ethanol or isopropanol. This forms the binding solution. The binding solution is transferred to a spin column and the column is put in a centrifuge. The centrifuge forces the binding solution through a silica gel membrane that is inside the spin column. If the pH and salt concentration of the binding solution are optimal, the nucleic acid will bind to the silica gel membrane as the solution passes through. To wash non-specific cellular components from the column, the flow-through is removed and a wash buffer is added to the column. The column is put in a centrifuge again, forcing the wash buffer through the membrane. This removes any remaining impurities from the membrane, leaving only the nucleic acid bound to the silica gel. To elute, the wash buffer is removed and a low salt elution buffer (or simply water) is added to the column. The column is put in a centrifuge again, forcing the elution buffer through the membrane. The elution buffer displaces the nucleic acid from the column allowing it to be collected in the flow through. Unlike RNA which degrades very quickly, DNA is quite stable and can be stored for long periods of time at -20oC.
Image by Squidonius
DNA Sequencing Techniques
DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as medical diagnosis, biotechnology, forensic biology, virology and biological systematics. Comparing healthy and mutated DNA sequences can diagnose different diseases including various cancers, characterize antibody repertoire, and can be used to guide patient treatment. Having a quick way to sequence DNA allows for faster and more individualized medical care to be administered, and for more organisms to be identified and cataloged.
The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes, of numerous types and species of life, including the human genome and other complete DNA sequences of many animal, plant, and microbial species.
The first DNA sequences were obtained in the early 1970s by academic researchers using laborious methods based on two-dimensional chromatography. Following the development of fluorescence-based sequencing methods with a DNA sequencer, DNA sequencing has become easier and orders of magnitude faster.
The canonical structure of DNA has four bases: thymine (T), adenine (A), cytosine (C), and guanine (G). DNA sequencing is the determination of the physical order of these bases in a molecule of DNA. However, DNA bases are often modified by epigenetic processes to control gene expression. Thus, many other modified bases that may be present in a DNA molecule than the standard four bases. For example, in some viruses (specifically, bacteriophage), cytosine may be replaced by hydroxymethyl- or hydroxymethylglucose cytosine. In eukaryotic DNA, variant bases with methyl groups or phosphosulfate may be found (Figure 5.3). Depending on the sequencing technique, a particular modification, e.g., the 5mC (5 -methylcytosine) common in humans, may or may not be detected.
Figure 5.3 DNA modifications with epigenetic regulatory functions and their interdependencies. Cytosine (C) is methylated to 5-methylcytosine (5mC) by DNA methyltransferases (DNMT) and then further oxidized to 5hmC, 5fC and 5caC by Tet dioxygenases. 5-Hydroxyuracil (5hmU) is produced by Tet-catalysed oxidation of thymine (T). N6-methyladenine (6mA) is likely catalysed by DNA N6 adenine methyltransferases (DAMT-1 in C. elegans), even though the biochemical activity of these enzymes remains to be characterized. The Tet-like ALKB enzymes NMAD (N6-methyl adenine demethylase 1) and DMAD (DNA 6mA demethylase) have been shown to be involved in 6mA demethylation in C. elegans and in Drosophila, respectively, possibly by using a conserved dioxygenase mechanism.
Early DNA sequencing methods
The first method for determining DNA sequences involved a location-specific primer extension strategy established by Ray Wu at Cornell University in 1970. DNA polymerase catalysis and specific nucleotide labeling, both of which figure prominently in current sequencing schemes, were used to sequence the cohesive ends of lambda phage DNA. Between 1970 and 1973, Wu, R Padmanabhan and colleagues demonstrated that this method can be employed to determine any DNA sequence using synthetic location-specific primers. Frederick Sanger then adopted this primer-extension strategy to develop more rapid DNA sequencing methods at the MRC Centre, Cambridge, UK and published a method for "DNA sequencing with chain-terminating inhibitors" in 1977. Walter Gilbert and Allan Maxam at Harvard also developed sequencing methods, including one for "DNA sequencing by chemical degradation". In 1973, Gilbert and Maxam reported the sequence of 24 basepairs using a method known as wandering-spot analysis. Advancements in sequencing were aided by the concurrent development of recombinant DNA technology, allowing DNA samples to be isolated from sources other than viruses.
Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and purification of the DNA fragment to be sequenced. Chemical treatment then generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The concentration of the modifying chemicals is controlled to introduce on average one modification per DNA molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side in denaturing acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred.
The technical aspects of Maxam-Gilbert sequencing caused it to go out of favor once the Sanger sequencing method had been well established, as described below.
Sanger Sequencing Method
The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the method of choice, owing to its relative ease and reliability. When invented, the chain-terminator method used fewer toxic chemicals and lower amounts of radioactivity than the Maxam-Gilbert method. Because of its comparative ease, the Sanger method was soon automated and was the method used in the first generation of DNA sequencers.
The classical chain-termination method requires a single-stranded DNA template, a DNA primer, a DNA polymerase, normal deoxynucleotidetriphosphates (dNTPs), and modified di-deoxynucleotidetriphosphates (ddNTPs), the latter of which terminate DNA strand elongation. These chain-terminating nucleotides lack a 3'-OH group required for the formation of a phosphodiester bond between two nucleotides, causing DNA polymerase to cease extension of DNA when a modified ddNTP is incorporated. The ddNTPs may be radioactively or fluorescently labelled for detection in automated sequencing machines.
The DNA sample is divided into four separate sequencing reactions, containing all four of the standard deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA polymerase. To each reaction is added only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP), while the other added nucleotides are ordinary ones (Figure 5.4).
Figure 5.4. Fluorescent ddNTPs for Sanger Sequencing. Dideoxynucleotides are utilized for sequencing as they cannot be extended further once they are incorporated into the nacent DNA.
Image by Fibonachi
The dideoxynucleotide concentration should be approximately 100-fold lower than that of the corresponding deoxynucleotide (e.g. 0.005mM ddTTP : 0.5mM dTTP) to allow enough fragments to be produced while still transcribing the complete sequence. In total, four separate reactions are needed in this process to test all four ddNTPs (Figure 5.5).
Figure 5.5. The Sanger (chain-termination) method for DNA sequencing. (1) A primer is annealed to a sequence, (2) Reagents are added to the primer and template, including: DNA polymerase, dNTPs, and a small amount of all four dideoxynucleotides (ddNTPs) labeled with fluorophores. During primer elongation, the random insertion of a ddNTP instead of a dNTP terminates synthesis of the chain because DNA polymerase cannot react with the missing hydroxyl. This produces all possible lengths of chains. (3) The products are separated on a single lane capillary gel, where the resulting bands are read by a imaging system. (4) This produces several hundred thousand nucleotides a day, data which require storage and subsequent computational analysis.
Image by Estevezj
Following rounds of template DNA extension from the bound primer, the resulting DNA fragments are heat denatured and separated by size using gel electrophoresis. This technique was frequently performed using a denaturing polyacrylamide-urea gel with each of the four reactions run in one of four individual lanes (lanes A, T, G, C). The DNA bands may then be visualized by autoradiography or UV light and the DNA sequence can be directly read off the X-ray film or gel image (Figure 5.6).
Figure 5.6. Traditional Sanger Sequencing Gel. Sequence visualized by autoradiography. Each lane contains a single reactions that has all four regular nucleotides and a small amount of one of the dideoxynucleotides (ddNTPs). Overtime, the ddNTPs will be incorporated at each position containing that specific nucleotide. The gel can then be read from the bottom to the top, as the smallest fragments (those fragments terminated the closest to the primer at the 5'-end) will run the farthest distance in the gel. The sequence of this fragment is:
Image by John Schmidt
Automation of the Sanger sequencing method was made possible when the shift from radioactively tagged nucleotides to fluorescently tagged nucleotides was made. Within the automated sequencers, capillary gel electrophoresis is performed rather than separating the samples using gel electrophoresis. The output from capillary electrophoresis are fluorescent peak trace chromatograms (Figure 5.7). Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA samples in a single batch. Batch runs may occur up to 24 times a day greatly enhancing the speed with which samples may be sequenced and analyzed. Common challenges of DNA sequencing with the Sanger method include poor quality in the first 15-40 bases of the sequence due to primer binding and deteriorating quality of sequencing traces after 400-500 bases.
Figure 5.7 Side by Side Comparison of Gel Electrophoresis and Capillary Electrophoresis. Lefthand Diagram shows the traditional autoradiogram of Sanger sequencing samples. The Righthand Diagram shows the same reactions using fluorescently tagged ddNTPs separated by capillary electrophoresis. The chromatogram output is shown on the far right.
Image by Abizar
Sanger sequencing is the method which prevailed from the 1980s until ~2005. Over that period, great advances were made in the technique, such as fluorescent labeling, capillary electrophoresis, and general automation. These developments allowed much more efficient sequencing, leading to lower costs. The Sanger method, in mass production form, is the technology which produced the first human genome in 2001, ushering in the age of genomics.
Microfluidic Sanger Sequencing
Microfluidic Sanger sequencing is a lab-on-a-chip application for DNA sequencing, in which the Sanger sequencing steps (thermal cycling, sample purification, and capillary electrophoresis) are integrated on a wafer-scale chip using nanoliter-scale sample volumes (Figure 5.8). This technology generates long and accurate sequence reads, while obviating many of the significant shortcomings of the conventional Sanger method (e.g. high consumption of expensive reagents, reliance on expensive equipment, personnel-intensive manipulations, etc.) by integrating and automating the Sanger sequencing steps.
Figure 5.8 Lab-On-A-Chip Technologies. Example of a microfluidic lab-on-a chip-device sitting on a polystyrene dish. Stainless steel needles inserted into the device serve as access points for fluids into small channels within the device, which are about the size of a human hair.
Next Generation Sequencing
Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all term used to describe a number of different modern sequencing technologies. These technologies allow for sequencing of DNA and RNA much more quickly and cheaply than the previously used Sanger sequencing, and as such revolutionized the study of genomics and molecular biology. Such technologies include:
Illumina Sequencing - In NGS, vast numbers of short reads are sequenced in a single stroke using the lab-on-a-chip technology described above. To do this, the input sample must be cleaved into short sections. In Illumina sequencing, 100-150bp reads are used. Somewhat longer fragments are ligated to generic adaptors and annealed to a slide using the adaptors. PCR is carried out to amplify each read, creating a spot with many copies of the same read. They are then separated into single stranded DNA to be sequenced (Figure 5.9).
Figure 5.9 Procedure for Illumina Sequencing. (A) The slide with PCR amplified fragments of DNA is flooded with nucleotides and DNA polymerase. These nucleotides are fluorescently labelled with each color corresponding to a specific base. The reactions also have a terminator present, so that only one base is added at a time. (B) An image is taken of the slide. In each reaction location, there will be a fluorescent signal indicating that a specific base that has been added. (C) The data is recorded and the slide is then prepared for the next cycle. In preparation, the terminators are removed, which will allow the next base to be added, and the fluorescent signal is cleaved, preventing the fluorescent signal from contaminating the next image. The process is repeated, adding one nucleotide at a time (G, A, T, or C) and imaging in between. All of the sequence reads will be the same length as single bases are added at each cycle.
Image modified from EMBL-EBI
Roche 454-Sequencing is similar to the Illumina process but can sequence much longer reads. Like Illumina, it does this by sequencing multiple reads at once by reading optical signals as bases are added.
As in Illumina, the DNA or RNA is fragmented into shorter reads, in this case up to 1kb (1,000bp). Generic adaptors are added to the ends and these are annealed to beads, one DNA fragment per bead. The fragments are then amplified by PCR using adaptor-specifc primers. Each bead is then placed in a single well of a slide. So each well will contain a single bead, covered in many PCR copies of a single sequence. The wells also contain DNA polymerase and sequencing buffers (Figure 5.10).
5.10 Procedure for Roche 454 Sequencing. (A) Once the PCR product is attached to the bead, the slide is flooded with one of the four NTP species. Where this nucleotide is next in the sequence, it is added to the sequence read. If that single base repeats, then more will be added. So if we flood with Guanine bases, and the next in the sequence is G, one G will be added, however if the next part of the sequence is GGGG, then four Gs will be added. (B) The addition of each nucleotide releases a light signal. These locations of signals are detected and used to determine which beads the nucleotides are added to. (C) The NTP mix is washed away. The next NTP mix is now added and the process repeated, cycling through the four NTPs. All of the sequence reads from 454 sequencing will be different lengths, because different numbers of bases will be added with each cycle.
Image modified by EMBL-EBI
Newer technologies such as the Ion Torrent Technology and the MinION System detect sequence data using electrical signals on a semiconductor chip, rather than optically reading dye-labeled nucleotides. This is possible as the addition of a dNTP to the DNA polymer causes the release of an H+ ion (Figure 5.11). As in other kinds of NGS, the input DNA or RNA is fragmented, this time ~200bp. Adaptors are added and one molecule is placed onto a bead. The molecules are amplified on the bead by emulsion PCR. Each bead is placed into a single well of a slide.
Figure 5.11 Ion Torrent Sequencing Technology. (A) Similar to 454 sequencing, the slide is flooded with a single species of dNTP, along with buffers and polymerase. The pH is monitored in each of the wells after the addition of the specific dNTP. The pH will decrease when the dNTP is incorporated into the polymer causing the release of a proton (H+). The changes in pH allow us to determine if that base and how many thereof, were added to the sequence read. (B) The dNTPs are washed away and the process is repeated cycling through the different dNTP species. (C) The pH change, if any, is used to determine how many bases (if any) were added with each cycle.
Image modified from EMBL-EBI
Such ion technologies that do not require optical detection, have enabled the production of small hand-held DNA sequencing devices that can be plugged into the USB drive on a laptop computer and utilized in the field under real time collection conditions (Figure 5.12).
Figure 5.12 The MinION Portable Real-Time Sequencing Device. The MinION can produce up to 30 Gb of DNA sequence data per sample
Image by Oxford Nanopore Technologies
The four main advantages of NGS over classical Sanger sequencing are:
NGS is significantly cheaper, quicker, needs significantly less DNA and is more accurate and reliable than Sanger sequencing. Let us look at this more closely. For Sanger sequencing, a large amount of template DNA is needed for each read. Several strands of template DNA are needed for each base being sequenced (i.e. for a 100bp sequence you'd need many hundreds of copies, for a 1000bp sequence you'd need many thousands of copies), as a strand that terminates on each base is needed to construct a full sequence. In NGS, a sequence can be obtained from a single strand. In both kinds of sequencing multiple staggered copies are taken for contig construction and sequence validation.
NGS is quicker than Sanger sequencing in two ways. Firstly, the chemical reaction may be combined with the signal detection in some versions of NGS, whereas in Sanger sequencing these are two separate processes. Secondly and more significantly, only one read (maximum ~1kb) can be taken at a time in Sanger sequencing, whereas NGS is massively parallel, allowing 300Gb of DNA to be read on a single run on a single chip.
The reduced time, manpower and reagents in NGS mean that the costs are much lower. The first human genome sequence cost in the region of $2.7 billion in 2003. Using modern Sanger sequencing methods, aided by data from the known sequence, a full human genome still cost $300,000 in 2006. Sequencing a human genome with NGS today costs roughly $1,000.
Repeats are intrinsic to NGS, as each read is amplified before sequencing, and because it relies on many short overlapping reads, so each section of DNA or RNA is sequenced multiple times. Also, because it is so much quicker and cheaper, it is possible to do more repeats than with Sanger sequencing. More repeats means greater coverage, which leads to a more accurate and reliable sequence, even if individual reads are less accurate for NGS.
Sanger sequencing can be used to give much longer sequence reads. However, the parallel nature of NGS means that longer reads can be constructed from many contiguous short reads.
DNA Synthesis Techniques
DNA synthesis is the natural or artificial creation of deoxyribonucleic acid (DNA) molecules. The term DNA synthesis can refer to DNA replication (which will be covered in more detail in Chapter XX), polymerase chain reaction (PCR) or gene synthesis (physically creating artificial gene sequences).
Polymerase Chain Reaction (PCR)
Polymerase chain reaction (PCR) refers to a technique employed widely in the basic and biomedical sciences. PCR is a laboratory technique utilized to amplify specific segments of DNA for a wide range of laboratory and/or clinical applications. Building on the work of Panet and Khorana’s successful amplification of DNA in-vitro, Kary Mullis and coworkers developed PCR in the early 1980s, having been met with a Nobel prize only a decade later. Allowing for more than the billion-fold amplification of specific target regions, it has become instrumental in many applications including the cloning of genes, the diagnosis of infectious diseases, and the screening of prenatal infants for deleterious genetic abnormalities.
The main components of PCR are a template, primers, free nucleotide bases, and the DNA polymerase enzyme. The DNA template contains the specific region that you wish to amplify, such as the DNA extracted from a piece of hair for example. Primers, or oligonucleotides, are short strands of single-stranded DNA complementary to the 3' end of each target region. Both a forward and a reverse primer are required, one for each complementary strand of DNA. DNA polymerase is the enzyme that carries out DNA replication. Thermostable analogues of DNA polymerase I, such as Taq polymerase, which was originally found in a bacterium that grows in hot springs, is a common choice due to its resistance to the heating and cooling cycles necessary for PCR.
PCR takes advantage of the complementary base pairing, double-stranded nature, and melting temperature of DNA molecules. This process involves cycling through 3 sequential rounds of temperature dependent reactions: DNA melting (denaturation), annealing and enzyme-driven DNA replication (elongation). Denaturation begins by heating the reaction to about 95oC, disrupting the hydrogen bonds that hold the two strands of template DNA together. Next, the reaction is reduced to around 50 to 65oC, depending on the physicochemical variables of the primers, enabling annealing of complementary base pairs. The primers, which are added to the solution in excess, bind to the beginning of the 3' end of each template strand and prevent re-hybridization of the template strand with itself. Lastly, enzyme-driven DNA replication, or elongation, begins by setting the reaction temperature to the amount which optimizes the activity of DNA polymerase, which is around 75 to 80oC. At this point, DNA polymerase, which needs double-stranded DNA to begin replication, synthesizes a new DNA strand by assembling free-nucleotides in solution in the 3' to 5' direction to produce 2 full sets of complementary strands. The newly synthesized DNA is now identical to the template strand and will be used as such in the progressive PCR cycles (Video 5.1).
[video width="1280" height="720" mp4="https://wou.edu/chemistry/files/2019/09/Figs1.mp4"][/video]
Video 5.1: Polymerase Chain Reaction (PCR). (1) Steps of Traditional PCR Process. (2) Non-specific fluorophore detection in qPCR, and (3) Specific hybridization probe detection in qPCR.
Video by Mousa Ghannam
Given that previously synthesized DNA strands serve as templates, the amplification of DNA using PCR increases at an exponential rate, where the copies of DNA double at the end of each replication step. The exponential replication of the target DNA eventually plateaus around 30 to 40 cycles mainly due to reagent limitation, but can also be due to inhibitors of the polymerase reaction found in the sample, self-annealing of the accumulating product, and accumulation of pyrophosphate molecules.
At its advent, PCR technology was limited to qualitative and or semi-quantitative analysis due to limitations on the ability to quantitate nucleic acids. At that time, to verify if the target gene was amplified successfully, the DNA product was separated by size via agarose gel electrophoresis. Ethidium bromide, a molecule that fluoresces when bound to dsDNA, could give a rough estimate of DNA amount by roughly comparing the brightness of separated bands, but was not sensitive enough for rigorous quantitative analysis.
Improvements in fluorophore development and instrumentation led to thermocyclers that no longer required measurement of only end-product DNA. This process, known as real-time PCR, or quantitative PCR (qPCR), has allowed for the detection of dsDNA during amplification. qPCR thermocyclers are equipped with the ability to excite fluorophores at specific wavelengths, detect their emission with a photodetector, and record the values. The sensitive collection of numerical values during amplification has strongly enhanced quantitative analytical power.
There are two main types of fluorophores used in qPCR: those that bind specifically to a given target sequence and those that do not. The sensitivity of fluorophores has been an important aspect of qPCR development. One of the most effective and widely used non-specific markers, SYBR Green, after binding to the minor groove of dsDNA, exhibits a 1000-fold increase in fluorescence compared to being free in solution (Video 5.1). However, if even more specificity is desired, a sequence-specific oligonucleotide, or hybridization probe, can be added, which binds to the target gene at some point in front of the primer (after the 3' end). These hybridization probes contain a reporter molecule at the 5' end and a quencher molecule at the 3' end. The quencher molecule effectively inhibits the reporter from fluorescing while the probe is intact. However, upon contact with DNA polymerase I, the hybridization probe is cleaved, allowing for the fluorescence of the dye (Video 5.1).
Since its advent, PCR technology has been creatively expanded upon, and reverse-transcription PCR (RT-PCR) is one of the most important advances. Real-time PCR is frequently confused with reverse-transcription PCR, but they are separate techniques. In RT-PCR, the DNA amplified is derived from mRNA by using reverse-transcriptase enzymes, to produce a cDNA copy of the gene. Using primers sequences for genes of interest, traditional PCR methods can be used with the cDNA to study the expression of genes qualitatively. Currently, reverse-transcription PCR is commonly used with real-time PCR, which allows one to quantitatively measure the relative change in gene expression across different samples.
Issues of Concern
One disadvantage of PCR technology is that it is extremely sensitive. Trace amounts of RNA or DNA contamination in the sample can produce extremely misleading results. Another disadvantage is that the primers designed for PCR requires sequence data, and therefore can only be used to identify the presence or absence of a known pathogen or gene. Another limitation is that sometimes the primers used for PCR can anneal non-specifically to sequences that are similar, but not identical, to the target gene.
Another potential issue of using PCR is the possibility of primer dimer (PD) formation. PD is a potential by-product and consists of primer molecules that have hybridized to each other due to the strings of complementary bases in the primers. The DNA polymerase amplifies the PD, leading to competition for PCR reagents that could be used to amplify the target sequences.
PCR amplification is an indispensable tool with various applications within medicine. Often, it is used to test for the presence of specific alleles, such as in the case of prospective parents screening for genetic carriers, but it can also be used to diagnose the presence of disease directly and for mutations in the developing embryo. For example, the first time PCR was used in this way was for the diagnosis of sickle cell anemia through the detection of a single gene mutation.
Additionally, PCR has greatly revolutionized the diagnostic potential for infectious diseases, as it can be used to rapidly determine the identity of microbes that were traditionally unable to be cultured, or that required weeks for growth. Pathogens routinely detected using PCR include Mycobacterium tuberculosis, human immunodeficiency virus, herpes simplex virus, syphilis, and countless other pathogens. Moreover, qPCR is not only used for testing the qualitative presence of microbes but also to quantify the bacterial, fungal, and viral loads.
The sensitivity of diagnostic tools for mutations to oncogenes and tumor suppression genes has been improved at least 10,000 fold due to PCR, allowing for earlier diagnosis of cancers like leukemia. PCR has also enabled more nuanced and individualized therapies for cancer patients. Additionally, PCR can be used for the tissue typing done that is vital to organ implantation and has even been proposed as a replacement for antibody-based tests for blood type. PCR also has clinical applications in the field of prenatal testing for various genetic diseases and/or clinical pathologies. Samples are obtained either via amniocentesis or chorionic villus sampling.
In forensic medicine, short pieces of repeating, highly polymorphic DNA, coined short tandem repeats (STRs), are amplified and used to compare specific variation within genes to differentiate individuals. Primers specific for the loci of these STRs are used and amplified using PCR. Various loci contain STRs in the human genome, and the statistical power of this technique is enhanced by checking multiple sites.
Artificial gene synthesis, sometimes known as DNA printing is a method in synthetic biology that is used to create artificial genes in the laboratory. Based on solid-phase DNA synthesis, it differs from molecular cloning and polymerase chain reaction (PCR) in that it does not have to begin with preexisting DNA sequences. Therefore, it is possible to make a completely synthetic double-stranded DNA molecule with no apparent limits on either nucleotide sequence or size.
The method has been used to generate functional bacterial or yeast chromosomes containing approximately one million base pairs. Creating novel nucleobase pairs in addition to the two base pairs in nature could greatly expand the genetic code.
Synthesis of the first complete gene, a yeast tRNA, was demonstrated by Har Gobind Khorana and coworkers in 1972. Synthesis of the first peptide- and protein-coding genes was performed in the laboratories of Herbert Boyer and Alexander Markham, respectively.
Commercial gene synthesis services are now available. Approaches are most often based on a combination of organic chemistry and molecular biology techniques and entire genes may be synthesized "de novo", without the need for template DNA. Gene synthesis is an important tool in many fields of recombinant DNA technology including heterologous gene expression, vaccine development, gene therapy and molecular engineering. The synthesis of nucleic acid sequences can be more economical than classical cloning and mutagenesis procedures. It is also a powerful and flexible engineering tool for creating and designing new DNA sequences and protein functions.
While the ability to make increasingly long stretches of DNA efficiently and at lower prices is a technological driver of this field, increasingly attention is being focused on improving the design of genes for specific purposes. Early in the genome sequencing era, gene synthesis was used as an (expensive) source of cDNAs that were predicted by genomic or partial cDNA information but were difficult to clone. As higher quality sources of sequence verified cloned cDNA have become available, this practice has become less urgent.
Producing large amounts of protein from gene sequences (or at least the protein coding regions of genes, the open reading frame) found in nature can sometimes prove difficult and is a problem of sufficient impact that scientific conferences have been devoted to the topic. Many of the most interesting proteins sought by molecular biologists are normally regulated to be expressed in very low amounts in wild type cells. Redesigning these genes offers a means to improve gene expression in many cases. Rewriting the open reading frame is possible because of the degeneracy of the genetic code. Thus it is possible to change up to about a third of the nucleotides in an open reading frame and still produce the same protein. The available number of alternate designs possible for a given protein is astronomical. For a typical protein sequence of 300 amino acids there are over 10150 codon combinations that will encode an identical protein. Codon optimization, or replacing rarely used codons with more common codons sometimes have dramatic effects. Further optimizations such as removing RNA secondary structures can also be included. At least in the case of E. coli, protein expression is maximized by predominantly using codons corresponding to tRNA that retain amino acid charging during starvation. Computer programs written to perform these, and other simultaneous optimizations are used to handle the enormous complexity of the task. A well optimized gene can improve protein expression 2 to 10 fold, and in some cases more than 100 fold improvements have been reported. Because of the large numbers of nucleotide changes made to the original DNA sequence, the only practical way to create the newly designed genes is to use gene synthesis.
Oligonucleotides are chemically synthesized using building blocks called nucleoside phosphoramidites. These can be normal or modified nucleosides which have protecting groups to prevent their amines, hydroxyl groups and phosphate groups from interacting incorrectly. One phosphoramidite is added at a time, the 5' hydroxyl group is deprotected and a new base is added and so on. The chain grows in the 3' to 5' direction, which is backwards relative to DNA biosynthesis in vivo. At the end, all the protecting groups are removed.
Figure 5.13 Four-step phosphoramidite oligodeoxynucleotide synthesis cycle. The phosphoramidite method, pioneered by Marvin Caruthers in the early 1980s, and enhanced by the application of solid-phase technology and automation, is now firmly established as the method of choice. Phosphoramidite oligonucleotide synthesis proceeds in the 3 to 5 direction (opposite to the 5 to 3 direction of DNA biosynthesis in DNA replication). One nucleotide is added per synthesis cycle. The phosphoramidite DNA synthesis cycle consists of a series of steps outlined in the figure
Nevertheless, being a chemical process, several incorrect interactions occur leading to some defective products. The longer the oligonucleotide sequence that is being synthesized, the more defects there are, thus this process is only practical for producing short sequences of nucleotides. The current practical limit is about 200 bp (base pairs) for an oligonucleotide with sufficient quality to be used directly for a biological application. HPLC can be used to isolate products with the proper sequence. Meanwhile, a large number of oligos can be synthesized in parallel on gene chips. For optimal performance in subsequent gene synthesis procedures they should be prepared individually and in larger scales.
DNA synthesis and synthetic biology
The significant drop in cost of gene synthesis in recent years due to increasing competition of companies providing this service has led to the ability to produce entire bacterial plasmids that have never existed in nature. The field of synthetic biology utilizes the technology to produce synthetic biological circuits, which are stretches of DNA manipulated to change gene expression within cells and cause the cell to produce a desired product.
The ability to synthetically produce DNA will enable the development of environmental, medical, and commercially relevant products. For example, in 2015, Novartis, in collaboration with the Synthetic Genomics Vaccines inc. and the US Biomedical Advanced Research and Development Authority, announced that they had effectively created a synthetic DNA influenza vaccine. New synthetic DNA vaccines hold promise to provide an alternative to current egg-produced conventional vaccines that can be plagued by low efficacy.
DNA vaccines are able to avoid many issues associated with egg-based vaccine production by generating viral proteins within host cells. To create a DNA vaccine, an antigen-encoding gene is cloned into a non-replicative expression plasmid, which is delivered to the host by traditional vaccination routes. Host cells which take up the plasmid express the vaccine antigen which can be presented to immune cells via the major histocompatibility complex (MHC) pathways. CD4+ T helper cell activation following MHC class II presentation of secreted DNA vaccine protein is critical for the production of antigen-specific antibodies (Figure 5.14).
Figure 5.14 Creation of a DNA Vaccine. An antigenic gene is synthesized and cloned into a plasmid vector. (Steps regarding the cloning process are described in more detail in section 5.3). The DNA vaccine is delivered to the host, where it will be expressed to produce and present antigen to the host immune system.
After two decades of research, DNA vaccine technology is gaining maturity—several veterinary DNA vaccines are currently licensed for West Nile virus and melanoma, and significantly, the first commercial DNA vaccine against H5N1 in chickens has recently been conditionally approved by the USDA. In addition, ongoing large animal trials of DNA vaccines against other diseases such as against HIV, hepatitis, and Zika virus offer valuable insights that can be applied to influenza DNA vaccine design. Promising approaches have arisen from the numerous studies evaluating different DNA vaccine formulations and delivery systems, but a strategy that consistently elicits protection against influenza in large animal models has not yet emerged. Successful plasmid delivery and the use of appropriate adjuvants remain key challenges that need to be addressed before influenza DNA vaccines become effective for human use.