1.4: Genetic Foundations
-
Evolution and Natural Selection
- Describe the theory of biological evolution by natural selection and explain how heritable genetic variation drives adaptation over generations.
- Differentiate between genotype and phenotype, and discuss how allelic variation determines phenotypic traits.
-
Molecular Basis of Heredity
- Explain how the chemical structure and mutability of DNA underpin natural selection and evolutionary change.
- Relate the roles of DNA, RNA, and proteins in transmitting and expressing genetic information.
-
The Genetic Code and Its Decipherment
- Describe the triplet nature of codons and explain how codon specificity determines the amino acid sequence in proteins.
- Summarize key experiments (e.g., those by Nirenberg, Matthaei, and Khorana) that led to the elucidation of the genetic code.
-
Central Dogma of Molecular Biology
- Outline the processes of DNA replication, transcription, and translation, and explain how they interconnect to express genetic information.
- Compare the processes in prokaryotes and eukaryotes, with emphasis on the roles of introns and splicing in eukaryotic gene expression.
-
Mutations and Protein Evolution
- Identify different types of mutations (e.g., missense, nonsense, frameshift) and discuss their potential effects on protein function and phenotype.
- Explain the concepts of homologs (orthologs and paralogs) and analogs, and discuss how protein sequence comparisons can shed light on evolutionary relationships.
-
Gene Expression and Regulation
- Discuss how gene expression is regulated at multiple levels, including transcriptional control and epigenetic modifications, and explain the impact on phenotype and evolution.
- Evaluate examples of how environmental factors can induce heritable changes in gene expression without altering the DNA sequence.
-
Modern Molecular Biology Techniques
- Explain the principles behind key DNA manipulation techniques such as restriction enzyme digestion, DNA cloning, PCR, and site-specific mutagenesis.
- Describe the methodologies used in DNA sequencing (e.g., Sanger sequencing and real-time methods) and DNA fingerprinting, and discuss how these techniques have revolutionized biological research and forensic science.
-
Bioinformatics and Comparative Genomics
- Understand how bioinformatic tools are used to compare DNA and protein sequences, infer phylogenetic relationships, and predict gene function.
- Discuss the challenges and limitations of sequence-based gene annotation and the importance of experimental validation.
By achieving these goals, students will develop a comprehensive understanding of the molecular principles underlying evolution, gene expression, and modern genetic technologies, thereby equipping them to engage with advanced topics in biochemistry and molecular biology.
Introduction
The development of complex biological organisms on our planet has arisen through the evolutionary mechanism of natural selection. The British naturalist Charles Darwin proposed the theory of biological evolution by natural selection in his book ‘On the Origin of Species’, published in 1859. Darwin defined evolution as “descent with modification,” the idea that species change over time, give rise to new species, and share a common ancestor. The mechanism that Darwin proposed for evolution is natural selection. Because resources are limited, organisms with heritable traits that favor survival and reproduction will tend to leave more offspring than their peers, increasing the frequency of those traits within a population over generations. Thus, natural selection causes populations to adapt, or become increasingly well-suited, to their environments over time. Natural selection depends on the environment and requires existing heritable variation in a group.
Natural selection acts on an organism’s phenotype or physical characteristics. Phenotype is determined by an organism’s genetic makeup (genotype) and the environment in which the organism lives. When different organisms in a population possess different versions of a gene for a certain trait, each version is known as an allele. It is primarily this genetic variation that underlies differences in phenotype. Only a single gene governs some traits, but the interactions of many genes influence most traits. A variation in one of the many genes contributing to a trait may have only a small effect on the phenotype; together, these genes can produce a continuum of possible phenotypic values.
For example, interactions between equine coat color genes determine a horse’s color. Many colors are possible, but changes in only a few genes produce all variations. Extension and agouti are particularly well-known genes with dramatic effects. For example, differences in the agouti gene can help determine whether a horse is bay or black. A change in the extension gene can, in turn, result in a horse being chestnut-colored instead (Figure 1.30). Yet other gene variants are responsible for many other coat color possibilities, including palomino, buckskin, and cremello horses.
Thus, natural selection ultimately depends on the heritability and mutability of genetic traits encoded in deoxyribonucleic acid (DNA). In Chapter 4, you will learn about the structural characteristics of DNA. Chapter 9 then focuses on the biochemical mechanisms of DNA replication, detailing the importance of DNA repair and the molecular mechanisms of genetic evolution.
Genetic Code
Notably, the phenotypic traits determined by an organism's genetic makeup are not directly controlled by the genetic material, DNA, but rather by the proteins encoded by the genes. In 1945, geneticist George Beadle proposed the one gene-one enzyme hypothesis, suggesting that each gene specifically encodes a single protein sequence. However, it would take 16 more years before the biochemical nature of this process was deduced. Efforts to understand how proteins are encoded began after DNA’s structure was discovered in 1953. George Gamow postulated that sets of three bases must be employed to encode the 20 standard amino acids living cells use to build proteins, since triplets drawn from four bases allow a maximum of \(4^3 = 64\) distinct codons.
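Gamow's counting argument can be checked directly. The short Python sketch below (the function name `count_words` is ours, chosen only for illustration) enumerates every ordered combination of one, two, or three bases. Pairs of bases give only 16 combinations, which is too few for 20 amino acids, whereas triplets give 64.

```python
from itertools import product

BASES = "ACGU"  # the four RNA bases

def count_words(n: int) -> int:
    """Count every ordered combination of n bases from a 4-letter alphabet."""
    return sum(1 for _ in product(BASES, repeat=n))

for n in (1, 2, 3):
    # Prints the word length followed by the number of possible words.
    print(n, count_words(n))
```

Because 64 exceeds 20, a triplet code leaves room for several codons to specify the same amino acid and for dedicated stop signals.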
The Crick, Brenner, Barnett, and Watts-Tobin experiment first demonstrated that codons have three DNA bases (Figure 1.31). Marshall Nirenberg and Heinrich J. Matthaei were the first to reveal the nature of a codon in 1961.
Figure 1.31 Codons Consist of Sets of Three Bases. A series of codons in part of a messenger RNA (mRNA) molecule. Each codon consists of three nucleotides, usually corresponding to a single amino acid. The nucleotides are abbreviated with the letters A, U, G, and C. This is mRNA, which uses U (uracil). DNA uses T (thymine) instead. This mRNA molecule will instruct a ribosome to synthesize a protein according to this code. Image by Sverdrup
They used a cell-free system to translate a poly-uracil RNA sequence (i.e., UUUUU…) and discovered that the polypeptide they had synthesized consisted only of the amino acid phenylalanine. Thus, they deduced that the codon UUU specified this amino acid.
This was followed by experiments in Severo Ochoa’s laboratory that demonstrated that the poly-adenine RNA sequence (AAAAA…) coded for the polypeptide poly-lysine and that the poly-cytosine RNA sequence (CCCCC…) coded for the polypeptide poly-proline. Therefore, the codon AAA specified the amino acid lysine, and the codon CCC specified the amino acid proline. Using various copolymers, most of the remaining codons were determined.
Subsequent work by Har Gobind Khorana identified the rest of the genetic code. Shortly thereafter, Robert W. Holley determined the structure of transfer RNA (tRNA), the adapter molecule that facilitates the translation of RNA into protein. This work built upon Ochoa’s earlier studies on the enzymology of RNA synthesis, for which Ochoa received the Nobel Prize in Physiology or Medicine in 1959.
Extending this work, Nirenberg and Philip Leder revealed the triplet nature of the code and deciphered its codons (Figure 1.32). In these experiments, various combinations of mRNA were passed through a filter containing ribosomes, the cellular components that translate RNA into protein. Unique triplets promoted the binding of specific tRNAs to the ribosome. Leder and Nirenberg determined the sequences of 54 out of 64 codons. Khorana, Holley, and Nirenberg received the 1968 Nobel Prize for their work.
The three stop codons were named by discoverers Richard Epstein and Charles Steinberg. “Amber” was named after their friend Harris Bernstein, whose last name means “amber” in German. The other two stop codons were named “ochre” and “opal” to keep the “color names” theme.
Each gene contains a reading frame defined by the initial triplet of nucleotides from which translation starts. It sets the frame for a run of successive, non-overlapping codons, known as an open reading frame (ORF). For example, the string 5′-AAATGAACG-3′, if read from the first position, contains the codons AAA, TGA, and ACG; if read from the second position, it contains the codons AAT and GAA; and if read from the third position, it contains the codons ATG and AAC. Every sequence can, thus, be read in its 5′ → 3′ direction in three reading frames, each producing a possibly distinct amino acid sequence: in the given example, Lys (K)-Trp (W)-Thr (T), Asn (N)-Glu (E), or Met (M)-Asn (N), respectively. (The Trp assignment for TGA applies to the vertebrate mitochondrial code; in the standard genetic code, TGA is a stop codon.) When DNA is double-stranded, six possible reading frames are defined: three in the forward orientation on one strand and three in reverse on the opposite strand. A start codon defines the protein-coding frame, typically the first AUG (ATG) codon in the RNA (or DNA) sequence.
Three stop codons terminate translation: UAG is amber, UGA is opal (sometimes also called umber), and UAA is ochre. Stop codons are also called “termination” or “nonsense” codons. They signal the release of the nascent polypeptide from the ribosome.
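The three forward reading frames of the example string can be reproduced with a few lines of Python. This is a minimal sketch, not a production tool: the helper names `reading_frames` and `translate` are our own, and the codon table is the standard genetic code written in the compact TCAG ordering. Under the standard code, TGA is a stop codon (shown as `*`); it encodes Trp only in variant codes such as the vertebrate mitochondrial code.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code as a dictionary; "*" marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def reading_frames(seq):
    """Return the complete codons of the three forward reading frames."""
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]

def translate(codons):
    """Translate a list of DNA codons into one-letter amino acid codes."""
    return "".join(CODON_TABLE[c] for c in codons)

for codons in reading_frames("AAATGAACG"):
    print(codons, translate(codons))
```

Leftover bases that do not fill a complete codon at the end of a frame are simply dropped in this sketch.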
During DNA replication, errors occasionally occur during the polymerization of the second strand. These errors, called mutations, can affect an organism’s phenotype, especially if they occur within a gene's protein-coding sequence. Error rates are typically one error in every 10–100 million bases due to the “proofreading” ability of DNA polymerases.
Missense mutations and nonsense mutations are types of point mutations that can cause genetic diseases, such as sickle-cell disease and thalassemia, respectively. Clinically important missense mutations generally change the properties of the coded amino acid residue among basic, acidic, polar, or non-polar states, whereas nonsense mutations result in a stop codon.
Mutations that disrupt the reading frame by indels (insertions or deletions) of non-multiple-of-3 nucleotide bases are known as frameshift mutations. These mutations typically result in a completely different translation from the original RNA and likely cause a stop codon to be read, thereby truncating the protein. These mutations may impair the protein’s function and are thus rare in in vivo protein-coding sequences. One reason inheritance of frameshift mutations is rare is that if the protein being translated is essential for growth under the organism's selective pressures, the absence of a functional protein may cause death before the organism becomes viable. Frameshift mutations can result in severe genetic diseases, such as Tay–Sachs disease.
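The mutation classes described above can be illustrated with a short Python sketch. The function names and the 15-base test sequence are hypothetical and chosen only for illustration; the GAG → GTG change is the Glu → Val substitution associated with sickle-cell disease. The sketch uses the standard genetic code and ignores edge cases such as mutations within a stop codon.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons starting at position 0; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def classify_point_mutation(ref_codon, alt_codon):
    """Label a single-codon substitution as silent, nonsense, or missense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"

print(classify_point_mutation("GAG", "GTG"))  # Glu -> Val
print(classify_point_mutation("GAG", "TAG"))  # Glu -> stop
print(classify_point_mutation("GAG", "GAA"))  # Glu -> Glu

# A frameshift: deleting one base changes every downstream codon.
original = "ATGGAGAAGTCTGCC"           # hypothetical toy sequence
shifted = original[:3] + original[4:]  # delete the base after the start codon
print(translate(original), translate(shifted))
```

Comparing the two translations in the last line shows why a single-base deletion is usually far more disruptive than a single-base substitution.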
Although most mutations that alter protein sequences are either harmful or neutral, some are beneficial. These mutations may enable the mutant organism to withstand particular environmental stresses better than wild-type organisms or reproduce more quickly. In these cases, a mutation will become more common in a population through natural selection. Different sequence variations of the same gene or protein within a single organism, within a population, or between different species are known as sequence polymorphisms. Larger-scale gene duplication events can also drive evolutionary change.
Similar Proteins
The evolution of proteins is studied by comparing the sequences and structures of proteins from many organisms representing distinct evolutionary clades. If the sequences or structures of two proteins are similar in a way that indicates they diverged from a common origin, they are called homologous proteins. More specifically, homologous proteins in two distinct species are called orthologs. In contrast, homologous proteins encoded by a single species' genome are referred to as paralogs. Unrelated genes with separate evolutionary origins that each encode proteins with similar functions are termed analogs (Figure 1.33).
Figure 1.33 Genetic Evolution of Protein Sequences. (Upper Panel) An ancestral gene duplicates to produce two paralogs (Gene A and B). A speciation event produces orthologs in the two daughter species. In a separate species, an unrelated gene (Gene C) has a similar function but a distinct evolutionary origin, and thus it is an analog. (Lower Panel) 3-D protein models were retrieved or modeled using SWISS-MODEL: Human Histone H1.1 (Q02539), Human Histone H1.2 (P16403), E. coli HNS (P0ACF8). Histone H1.1 from the chimpanzee (Pan troglodytes XP_016810512.1) was modeled using Human Histone H1.1 as a template. Note that the E. coli HNS protein is typically modeled as a dimer. Only a single monomer is shown here. Upper Image by Thomas Shafee
DNA sequencing techniques have rapidly improved over the last 15 to 20 years, making it possible to sequence the entire genomes of organisms and, thus, predict the entire proteome of an organism based on the translation of the sequenced genome, followed by the annotation of predicted ORFs using phylogenetic comparison of similar genes/proteins from other known organisms. This has given rise to the field of Bioinformatics, which utilizes computer science, mathematics, and statistical analysis to analyze the vast quantities of biological data generated in genome sequencing projects. The phylogenetic relationships, and hence ancestral relationships, of various genes, proteins, and, ultimately, organisms can be established through the statistical analysis of sequence alignments. Such phylogenetic trees have shown that protein sequence similarities closely reflect evolutionary relationships among organisms.
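One of the simplest statistics computed from a sequence alignment is percent identity. The Python sketch below assumes the two sequences have already been aligned to equal length, with `-` marking gaps; the function name and the short peptide strings are hypothetical. Real bioinformatic tools also build the alignment itself and use substitution matrices rather than simple match counting.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned (non-gap) positions where the residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

# Hypothetical toy peptides differing at one of five positions.
print(percent_identity("MKTAY", "MKSAY"))
```

Pairwise values like this one, computed across many species, are one kind of input from which distance-based phylogenetic trees can be built.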
Protein evolution refers to the changes in protein shape, function, and composition that occur over time. Through quantitative analysis and experimentation, scientists have strived to understand the rate and causes of protein evolution. Scientists could estimate protein evolution rates by using the amino acid sequences of hemoglobin and cytochrome c from multiple species. They found that the rates varied among proteins. Each protein has its own rate, which remains constant across phylogenies (i.e., hemoglobin evolves at a different rate than cytochrome c, but hemoglobins from humans, mice, and other species have comparable rates of evolution). Not all regions within a protein mutate at the same rate; functionally important areas mutate more slowly, and amino acid substitutions involving similar amino acids occur more often than dissimilar substitutions. Overall, the level of polymorphisms in proteins seems to be fairly constant. Several species (including humans, fruit flies, and mice) have similar levels of protein polymorphism.
Gene duplication events, followed by mutation, can also give rise to paralogs with distinct functions within an organism. This can make it difficult to annotate genomes based solely on sequence, as homologous protein sequences may not have similar functions in vivo. Approximately 10-25% of annotations made on sequence homology are estimated to be incorrect and require experimental validation. For example, human pancreatic ribonuclease is a digestive enzyme that breaks down nucleic acids. The angiogenin protein is a paralog of pancreatic ribonuclease and shares high sequence homology and a similar 3D shape (Figure 1.34). However, the functions of these proteins are quite different. Angiogenin induces vascularization by activating transcriptional processes in endothelial cells. However, if the function of only one of these homologs were known, it would be easy to mistakenly hypothesize that the homologous protein would be similar in function. Thus, care must be taken when using bioinformatic tools to avoid overestimating the predictive ability of sequence alignments.
The control of gene expression is crucial to all life processes, enabling the differentiation of cells to form various body structures and organs, as well as smaller, more reversible changes that allow an organism to respond to different environmental stimuli. In Chapter 12, you will explore the major biochemical mechanisms used to control gene expression within cells. This will include the discussion of a fairly new and exciting field of study known as epigenetics. In addition to the heritability of traits through the passage of genetic information, it is fast becoming clear that the environmental factors that an organism is exposed to throughout its life can affect gene expression without physically altering the DNA sequence, and that these changes in expression patterns can be long-lasting and can even be inherited in the following generations. The term epigenetics means ‘on top of’ or ‘in addition to’ genetics and focuses on the heritable gene expression patterns induced by an organism's exposure or experience within its environment.
For example, in human populations, stressful events such as starvation can have lasting imprints on children who are born under these conditions. These children have higher risks of obesity and metabolic disorders as adults, including the development of type 2 diabetes. These predispositions can be passed from the children born during starvation to their own future offspring, indicating that environmental events can influence gene expression patterns across multiple generations. In more controlled laboratory experiments using rats, it has been demonstrated that the more a mother rat licks and nurtures her offspring, the calmer and more relaxed the offspring will be as adults. Mother rats that are less nurturing and ignore their young have offspring that will grow up displaying higher levels of anxiety. These changes are not caused by genetic differences between the offspring but rather by differences in gene expression patterns. Calm and relaxed rats can be altered to show high anxiety by exposing them to agents that alter gene expression patterns. A future chapter will cover the mechanisms that control such heritable alterations in gene expression patterns.
Central Dogma of Biology
DNA stores an organism's genetic information and must be replicated during cell division. In transcription, this information is copied into RNA, which is then translated into a protein sequence. Collectively, these processes are referred to as the Central Dogma of Biology. A variant occurs when RNA is copied back into DNA, a process called reverse transcription. These processes are briefly described below and further detailed in subsequent chapters.
Replication
DNA must be duplicated, a process called replication, before a cell divides. DNA replication allows each daughter cell to inherit a full complement of chromosomes.
Transcription and Splicing
For a given gene, only one strand of the DNA serves as the template for transcription. An example is shown below. In this example, the bottom (blue) strand serves as the template strand, also known as the minus (-) strand or the antisense strand. The enzyme RNA polymerase reads this template and synthesizes a complementary mRNA in the 5' to 3' direction. The opposite DNA strand (red) is called the coding strand, the nontemplate strand, the plus (+) strand, or the sense strand.
The easiest way to find the corresponding mRNA sequence (shown in green below) is to read the coding, nontemplate, plus (+), or sense strand directly in the 5' to 3' direction, substituting U for T.
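The relationship between the two DNA strands and the mRNA can be sketched in Python. The function names and the nine-base example are hypothetical; both strands are written 5' to 3'. Reading the coding (nontemplate) strand with U in place of T gives the same mRNA as building the complement of the template strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))

def mrna_from_coding(coding_5to3):
    """The mRNA matches the coding strand, with U replacing T."""
    return coding_5to3.replace("T", "U")

def mrna_from_template(template_5to3):
    """The mRNA is complementary and antiparallel to the template strand."""
    return mrna_from_coding(reverse_complement(template_5to3))

coding = "ATGGCCTAA"                   # hypothetical coding strand
template = reverse_complement(coding)  # the strand RNA polymerase reads
print(template)
print(mrna_from_coding(coding))
print(mrna_from_template(template))
```

The last two lines print the same mRNA, which is why the coding strand is the convenient one to read.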

As we've learned more about the structure of DNA, RNA, and proteins, it has become clear that transcription and translation differ in eukaryotes and prokaryotes. Specifically, eukaryotes have intervening sequences of DNA (introns) within a given gene that separate coding fragments of DNA (exons). A primary transcript is made from the DNA, the introns are spliced out, and the exons are joined in a contiguous stretch to form messenger RNA, which leaves the nucleus. Translation occurs in the cytoplasm. Remember, prokaryotes do not have a nucleus.
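Splicing can be pictured as cutting the exons out of the primary transcript and joining them in order. The Python sketch below assumes the exon boundaries are already known and supplied as index pairs; the function name, the toy transcript, and its coordinates are hypothetical. The toy intron begins with GU and ends with AG, as most eukaryotic introns do.

```python
def splice(primary_transcript, exons):
    """Join exon segments, given as (start, end) half-open index pairs
    listed in 5' to 3' order, into a mature mRNA."""
    return "".join(primary_transcript[start:end] for start, end in exons)

exon1 = "AUGGCC"
intron = "GUAAGUUUUCAG"  # toy intron: starts with GU, ends with AG
exon2 = "AAGUAA"
pre_mrna = exon1 + intron + exon2

mature = splice(pre_mrna, [(0, 6), (18, 24)])
print(mature)
```

In the cell, the spliceosome locates these boundaries from sequence signals in the transcript itself rather than from supplied coordinates.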
Translation
Information in an mRNA sequence is decoded to form a protein. In this process, a triplet of nucleotides (a codon) in RNA encodes a single amino acid. Translation occurs on a large RNA-protein complex called the ribosome. An intermediary transfer RNA (tRNA) molecule becomes covalently linked to a single amino acid by an aminoacyl-tRNA synthetase enzyme, in a reaction that consumes ATP. This "charged" tRNA binds via its complementary anticodon to the triplet codon in the mRNA. The ribosome/tRNA complex ratchets down the mRNA, allowing a new "charged" tRNA complex to bind at an adjacent site. The two adjacent amino acids then form a peptide bond, a reaction driven by the energy stored in the aminoacyl-tRNA linkage. This process repeats until a "stop" codon appears in the mRNA sequence. The genetic code reveals the relationship between the triplet mRNA codon and the corresponding amino acid in the growing peptide chain.
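The decoding step can be sketched in Python as a table lookup that starts at the first AUG and stops at the first stop codon. This is a simplified model under stated assumptions: it uses the standard genetic code, ignores the ribosome, tRNAs, and any recoding events, and the function name and example mRNA are hypothetical. The example reuses the historical assignments UUU (Phe), AAA (Lys), and CCC (Pro).

```python
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code for RNA codons; "*" marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_mrna(mrna):
    """Translate from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: release the peptide
        peptide.append(amino_acid)
    return "".join(peptide)

# Hypothetical mRNA: 5' leader, AUG UUU AAA CCC, then the UAG stop codon.
print(translate_mrna("GGAUGUUUAAACCCUAGGG"))
```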
As mentioned in the Protein Chapter (amino acid section), two other amino acids occasionally appear in proteins (excluding those altered by post-translational modification). One is selenocysteine (Sec), found in Archaea, eubacteria, and animals. The other is pyrrolysine (Pyl), found in Archaea.
- Selenocysteine (Sec): First, the amino acid serine is attached to a tRNA specific for Sec to form Ser-tRNASec. In bacteria, an enzyme (selenocysteine synthase, or SelA) uses selenophosphate as the selenium donor to convert the bound serine to selenocysteine. In Archaea and eukaryotes, an enzyme (O-phosphoseryl-tRNA kinase, or PSTK) first phosphorylates the -OH on Ser. Then the enzyme O-phosphoseryl-tRNA:selenocysteinyl-tRNA synthase (SepSecS, also called selenocysteine synthase, SecS) converts the O-phosphoseryl-tRNA to form Sec-tRNASec. Selenocysteine-tRNA recognizes UGA, which is normally a stop codon. This unusual tRNA complex recognizes only a small fraction of the UGA stop codons in mRNA sequences.
- Pyrrolysine (Pyl): First, pyrrolysine is synthesized from two molecules of L-lysine by three enzymes (PylB, PylC, and PylD). Then, pyrrolysine is attached directly to its tRNA by the enzyme pyrrolysyl-tRNA synthetase (PylRS). The tRNAPyl has a CUA anticodon, which recognizes the UAG (amber) stop codon in mRNA.
What is a gene?
The definition of a gene can differ depending on whom you ask. The word "gene" has become a cultural icon of our time. Can our genes explain what it is to be human? The definition of a gene has changed with time. Eukaryotic genes contain exons (coding regions) and introns (intervening sequences) that are transcribed to produce a primary transcript. In a post-transcriptional process, introns are spliced out by the spliceosome to produce a messenger RNA (mRNA), which is translated into a protein sequence. (See diagram above.)
Over the last 100 years, as our understanding of biochemistry has increased, the definition of a gene has evolved from
- the basis of inheritable traits
- certain regions of chromosomes
- a segment of a chromosome that produces one enzyme
- a segment of a chromosome that produces one protein
- a segment of a chromosome that produces a functional product
The last definition was necessary because some gene products with functional roles (structural and catalytic) are RNA molecules. The last definition also includes regulatory regions of the chromosome involved in transcriptional control. Snyder and Gerstein have developed five criteria for gene identification, which is important as organisms' complete genomes are analyzed for genes.
- identification of an open reading frame (ORF) - this would include a series of codons bounded by start and stop codons. This becomes progressively more challenging if the gene has many exons embedded in long introns.
- specific DNA features within genes - these would include a bias towards certain codons found in genes or splice sites (to remove intron RNA)
- comparing putative gene sequences for homology with known genes from different organisms, but avoiding sequences that might be conserved regulatory regions.
- identification of RNA transcripts or expressed protein (which does not require DNA sequence analysis, as the top three steps do)
- inactivating (chemically or through specific mutagenesis) a gene product (RNA or protein).
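The first criterion above, finding an ORF, can be sketched as a scan of each forward reading frame for a start codon followed in frame by a stop codon. This is a deliberately minimal illustration: the function name, the `min_codons` threshold, and the test sequence are hypothetical, only one strand is scanned, and introns are ignored, which is exactly why ORF detection becomes harder for genes with many exons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each ATG...stop run on the forward strand.

    `end` is the index just past the stop codon; `min_codons` counts the
    codons before the stop, including the ATG.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and keep scanning
    return orfs

dna = "CCATGAAACCCTAGGATGTTTTGA"  # hypothetical toy sequence
for start, end, frame in find_orfs(dna):
    print(frame, dna[start:end])
```

A fuller implementation would also scan the reverse complement to cover all six reading frames and would apply a much larger length threshold to reduce false positives.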
New findings make it even more complicated to define a gene, especially if the transcripts of a "gene region" are studied. Cheng et al. studied all transcripts from 10 human chromosomes and 8 cell lines. They identified a substantial number of distinct transcripts, many of which overlapped. Splicing often occurs between nonadjacent introns. Transcripts were found from both strands and were from regions containing introns and exons. Other studies found up to 5% of transcripts continued through the end of the "gene" into other genes. In the mouse, 63% of the entire genome is transcribed, even though exons make up only 2% of it.
The Language of DNA
In this short chapter, you will briefly learn how modern molecular biologists manipulate DNA, the blueprint for all of life. The details will be found in subsequent chapters. The four-letter alphabet (A, G, C, and T) that makes up DNA represents a language that, when transcribed and translated, leads to the myriad of proteins that make us who we are as a species and as individuals. Let's continue with the metaphor that DNA is a language. To master that language, as with any other language, we need to be able to read, write, copy, and edit that language. If you were using a word processor to find one line in a hundred-page document or one article from one book out of the Library of Congress, you would also need a way to search the large print base available. You might want to compare two copies of a file to see if they differ. Scientists can read, write, copy, edit, search, and compare the language of the genome. These abilities, acquired over the last twenty years, have revolutionized our understanding of life and given us the potential to alter life for good or evil.
DNA in human chromosomes exists as one long double-stranded molecule. It is too long to study and manipulate physically in the lab. Using a battery of enzymes, the DNA of chromosomes can be chemically cleaved into smaller fragments, which are more readily manipulable. (Similar techniques are used to sequence proteins, which require overlapping polypeptide fragments to be made.) After the fragments are created, they must be separated to be studied. DNA fragments can be separated based on a structural feature that distinguishes them. Polarity cannot be used, since all DNA fragments have negatively charged phosphate groups in the sugar-phosphate backbone. Although each fragment would have a unique sequence, it would be hard to separate all the different fragments by, for instance, attaching some molecule that binds to a unique sequence in the major groove of a given fragment to a big bead and using that bead to separate that one unique fragment. You would need a different bead for each unique fragment! The best way to separate fragments is to base separation on fragment size using electrophoresis on an agarose or polyacrylamide gel.
A carbohydrate extract called agarose is made from algae. Water is added to the extract, which is then heated. The carbohydrate extract dissolves in the water to form a viscous solution. The agarose solution is poured into a mold (similar to warm Jell-O) and allowed to solidify. A plastic comb with wide teeth is placed in the agarose while it is still liquid. When the agarose is solid, the comb is removed, leaving little wells in its place. A solution of DNA fragments can be placed in the wells. The agarose slab with the sample is covered with a buffer solution, and electrodes are placed at each end of the slab. The negative electrode is placed near the well-end of the agarose slab, while the positive electrode is placed at the other end. If a voltage is applied across the agarose slab, the negatively charged DNA fragments will move through the agarose gel toward the positive electrode. This migration of charged molecules in solution toward an oppositely charged electrode is called electrophoresis.
Pretend you are one of the fragments. To you, the gel looks like a tangled cobweb. You sneak through the openings in the web as you move straight forward to the positive electrode. The larger the fragment, the slower you move because getting through the tangled web is hard. Conversely, the shorter the fragment, the faster you move. Using this technique and its many variants, oligonucleotides differing by just one nucleotide can be separated. In electrophoresis of DNA fragments, a fluorescent dye, ethidium bromide, is added to the buffer solution. This dye intercalates between DNA base pairs, so that the DNA in the agarose gel fluoresces orange when exposed to UV light.
Reading DNA
We will discuss one method for reading the DNA sequence. This method, developed by Sanger, won him a second Nobel Prize. To sequence a single-stranded piece of DNA, the complementary strand is synthesized. Four different reaction mixtures are set up. Each contains all four radioactive deoxynucleotides (dATP, dCTP, dGTP, and dTTP) required for the reaction, as well as DNA polymerase.
Additionally, dideoxyATP (ddATP) is added to one of the reaction tubes. Wherever the template calls for an A, either dATP or ddATP is added at random to the growing 3' end of the complementary strand. If ddATP is added, no further nucleotides can be added, since its 3' position has an H rather than an OH. That's why it's called dideoxy. The new chain is terminated. If dATP is added, the chain continues growing until another A nucleotide is needed. Hence, a series of discrete DNA fragments will be generated, all terminated upon addition of ddATP. The same scenario occurs in the other three tubes, which contain ddCTP, ddTTP, and ddGTP, respectively, in addition to the four deoxynucleotides. The fragments generated in each tube are placed in separate lanes for electrophoresis, where they separate by size.
Figure: Dideoxynucleotides

PROBLEM: As shown below, you will pretend to sequence a single-stranded piece of DNA. The enzyme DNA polymerase adds the new nucleotides to the primer, GACT, in the 5' to 3' direction. You will set up four reaction tubes. Each tube contains all the dXTPs. In addition, add ddATP to tube 1, ddTTP to tube 2, ddCTP to tube 3, and ddGTP to tube 4. For each separate reaction mixture, determine all the possible sequences made by writing the possible sequences on one of the unfinished complementary sequences below. Cut the completed sequences from the page, choose the size of the polynucleotide sequences made, and place them as they would migrate (based on size) in the appropriate lane of an imaginary gel, which you have drawn on a piece of paper. Lane 1 will contain the nucleotides produced in tube 1, and so on. Then draw lines under the positions of the cutout nucleotides to represent DNA bands in the gel. Read the sequence of the complementary DNA synthesized. Then write the sequence of the ssDNA that was to be sequenced.
5' T C A A C G A T C T G A 3' (STRAND TO SEQUENCE)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
3' G A C T 5' (primer)
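The paper exercise above can also be checked with a short script. The following is a minimal sketch, not part of the original problem: the function names are hypothetical, and it assumes an idealized reaction in which every possible chain-terminated product is formed. Sequences are written 5' to 3', so the primer 3' GACT 5' appears as TCAG.

```python
# Watson-Crick pairing used to build the complementary strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def sanger_lanes(template, primer):
    """Group chain-terminated products by the ddXTP that ended them.

    Both arguments are written 5' to 3'. Each product is the primer plus
    every base added up to and including the terminating dideoxynucleotide.
    Assumption: every possible termination product is present.
    """
    full_product = reverse_complement(template)
    if not full_product.startswith(primer):
        raise ValueError("primer does not anneal to the 3' end of the template")
    lanes = {"A": [], "T": [], "C": [], "G": []}
    for length in range(len(primer) + 1, len(full_product) + 1):
        product = full_product[:length]
        lanes[product[-1]].append(product)
    return lanes


def read_gel(lanes):
    """Read bands from shortest (fastest) to longest (slowest)."""
    bands = sorted(
        (len(product), base)
        for base, products in lanes.items()
        for product in products
    )
    return "".join(base for _, base in bands)


template = "TCAACGATCTGA"  # strand to sequence, 5' to 3'
primer = "TCAG"            # the primer 3' GACT 5', rewritten 5' to 3'

lanes = sanger_lanes(template, primer)
new_bases = read_gel(lanes)
recovered = reverse_complement(primer + new_bases)

for base in "ATCG":
    print("dd" + base + "TP lane, fragment lengths:", [len(p) for p in lanes[base]])
print("bases added to the primer (5' to 3'):", new_bases)
print("template recovered from the gel:", recovered)
```

Reading the bands from the bottom of the gel (shortest fragment) to the top gives the bases added to the primer in the order they were incorporated; the reverse complement of primer plus added bases is the strand that was sequenced.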
Since the DNA fragments have no detectable color, they cannot be directly visualized in the gel. Alternative methods are used. In the one described above, radiolabeled nucleotides were used. Once the sequencing gel is run, it can be dried, and the bands can be visualized by autoradiography (also called radioautography). An X-ray film is placed over the dried gel in a dark environment. The radiolabeled bands emit radiation, exposing the X-ray film directly above them. The film can be developed to detect the bands. In a newer technique, the primer is labeled with a fluorescent dye. If a different dye is used for each reaction mixture, all the mixtures can be run in a single gel lane. (If the dyes are instead attached to the ddXTPs themselves, only one reaction mix containing all four ddXTPs needs to be performed.) The gel can then be scanned by a laser, which detects fluorescence from the dyes at different wavelengths.
Figure: DNA sequencing using different fluorescent primers for each ddXTP reaction
A recent advance in sequencing enables the real-time determination of a sequence. The four deoxynucleotides are each labeled with a different fluorophore on the 5' phosphate (not the base as above). A tethered DNA polymerase elongates the DNA on a template, releasing the fluorophore into solution (i.e., the fluorophore is not incorporated into the DNA chain). The reaction occurs in a visualization chamber called a zero-mode waveguide, a cylindrical metallic chamber with a width of 70 nm and a volume of 20 zeptoliters (\(20 \times 10^{-21}\) L). It sits on a glass support, through which the sample is illuminated with a laser. Given the small volume, non-incorporated fluorescently tagged deoxynucleotides diffuse in and out on the microsecond timescale. When a deoxynucleotide is incorporated into DNA, its residence time is on the millisecond timescale. This enables prolonged fluorescence detection, resulting in a high signal-to-noise ratio. Newer technologies that sequence DNA by moving it through pores in membranes could bring sequencing down to $1000/genome or less.
Writing DNA
Oligonucleotides can be synthesized on a solid bead. Adding one nucleotide at a time allows control over the oligonucleotide's sequence and length.
Copying DNA
Several methods exist for copying a DNA sequence millions of times. Most methods use plasmids (found in bacteria) and viruses (which can infect any cell). The DNA of the plasmid or virus is engineered to contain a copy of a specific DNA sequence of interest. The plasmid or virus is then reintroduced into the cell, where it undergoes amplification.
Initially, DNA containing a gene or regulatory sequence of interest is cut at specific places with an enzyme called a restriction endonuclease, or restriction enzyme for short. The enzyme doesn't cleave DNA anywhere but at "restricted" places in the sequence, much as an endoprotease cleaves a protein after a given amino acid within a protein chain. The restriction endonuclease must cleave both strands of double-stranded DNA (dsDNA). It can cut the strands cleanly to leave blunt ends or, in a staggered fashion, to leave small ssDNA tails. Multiple such sites are scattered randomly throughout the genome. The gene of interest must be flanked on either side by such a sequence. The same enzyme cleaves the DNA of plasmids or viruses.
Figure: Cleaving DNA with the Restriction Enzyme EcoRI
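A staggered cut can be illustrated with a short script. This is a minimal sketch under stated assumptions: the input sequence is made up for illustration, the function name is hypothetical, and only the top strand (written 5' to 3') is tracked. EcoRI recognizes GAATTC and cuts between the G and the first A, which leaves a single-stranded AATT tail on each fragment.

```python
def digest(top_strand, site="GAATTC", cut_offset=1):
    """Cut the top strand after `cut_offset` bases of every recognition site.

    The defaults model EcoRI (G^AATTC). Only the top strand is returned;
    the bottom strand is cut at the mirror-image position, which is what
    produces the short single-stranded tails.
    """
    fragments = []
    start = 0
    position = top_strand.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(top_strand[start:cut])
        start = cut
        position = top_strand.find(site, position + 1)
    fragments.append(top_strand[start:])
    return fragments


# Hypothetical sequence containing two EcoRI sites.
sample = "AAGAATTCGGGGAATTCTT"
fragments = digest(sample)
print(fragments)  # ['AAG', 'AATTCGGGG', 'AATTCTT']
```

The middle fragment is the piece flanked by two sites, which is the situation required for the gene of interest. Because a plasmid cut with the same enzyme carries matching tails, the two can base-pair with each other.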

The foreign DNA fragment can then be added to the plasmid or viral DNA to form a recombinant DNA molecule. This DNA cloning technique is the basis for the entire field of recombinant DNA technology.
Figure: Cloning a Restriction Fragment into a Plasmid

The plasmid can be introduced into bacteria, which take it up through a process called transformation. The plasmid is replicated in the bacteria, which thereby copy the DNA fragment of interest. Typically, the plasmid carries a gene that confers antibiotic resistance on the bacteria. When the cells are grown in the presence of the antibiotic, only bacteria that carry the plasmid (and presumably the insert) will be able to grow. To isolate the desired fragment, the plasmids are isolated from bacteria and cleaved with the same restriction enzyme, after which the fragment can be purified. Additionally, the bacteria can be induced to express the protein encoded by the foreign gene. In lab 4, we will transform bacteria with a plasmid containing the gene for human adipocyte acid phosphatase beta, HAAP-B, and induce gene expression.
A similar method can be used to copy DNA, in which the foreign fragment is recombined with the DNA of a bacteriophage, a virus that infects bacteria, such as E. coli. The recombinant DNA can then be packaged into viruses, as shown below. When the virus infects the bacteria, it instructs the cells to make millions of new viruses, copying the foreign fragment of interest.
Sometimes, "cloning" or copying a fragment of DNA is not all that an investigator wants. For instance, if the genomic DNA comes from a human cell, the gene will contain introns. If you put this DNA into a plasmid or bacteriophage, the introns go with it. Bacteria can replicate this DNA, but often, one wants not just to copy (amplify) the DNA but also to transcribe it into RNA and then translate it into protein. Bacteria, however, cannot splice out the introns, so mature mRNA cannot be produced. If one could clone a version of the gene without the introns into the bacteria, this problem would not exist. One such method involves starting with the actual mRNA for a protein of interest. In this technique, a dsDNA copy is made from an ss-mRNA molecule. Such dsDNA is called cDNA, for complementary or copy DNA. This can then be cloned into a plasmid or bacteriophage vector and amplified as described above.
In the mid-'80s, a new method was developed to copy (amplify) DNA in a test tube. It doesn't require a plasmid or a virus. It requires only a DNA fragment and some primers (small oligonucleotides complementary to sections of DNA on each strand that straddle the section to be amplified). Add dATP, dCTP, dGTP, dTTP, and a heat-stable DNA polymerase from the organism Thermus aquaticus (which inhabits hot springs), and you're ready to proceed. The mixture is first heated to a temperature that separates the dsDNA strands. The temperature is lowered, allowing a large stoichiometric excess of the primers to anneal to the ssDNA. The heat-stable Taq polymerase then extends the primers. The temperature is raised again, allowing the strands of the newly formed dsDNA to separate. Upon cooling, the primers anneal to both the original and the newly synthesized DNA from the previous cycle, and DNA synthesis resumes. This cycle is repeated, as shown in the diagram. This chain reaction is called the polymerase chain reaction (PCR). Because the amount of target DNA doubles in each cycle, it is amplified about a million times in 20 cycles or a billion times in 30 cycles, which can be done in a few hours.
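The amplification figures follow from doubling the target once per cycle. The sketch below checks that arithmetic; the function name and the `efficiency` parameter are illustrative assumptions (an efficiency of 1.0 means every strand is copied in every cycle, which real reactions only approximate).

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized number of target copies after a given number of cycles.

    Each cycle multiplies the copy number by (1 + efficiency), so perfect
    doubling corresponds to efficiency = 1.0 (an assumption).
    """
    return initial_copies * (1.0 + efficiency) ** cycles


for cycles in (20, 30):
    print(cycles, "cycles:", f"{pcr_copies(1, cycles):,.0f}", "copies per starting molecule")
# 20 cycles: 1,048,576 copies per starting molecule
# 30 cycles: 1,073,741,824 copies per starting molecule
```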

Editing DNA
We will spend considerable time discussing how specific amino acids can be covalently modified to identify the presence of a particular amino acid or to modify the protein's activity. It is routine to use recombinant DNA technology to alter one or more nucleotides, thereby changing an amino acid or adding or deleting one or more amino acids. This technique, known as site-specific mutagenesis, is widely employed by protein chemists to assess the significance of a specific amino acid in a protein's folding, structure, and function. The techniques are described in the diagram below.

Searching DNA
Where on a chromosome is the gene that codes for a given protein? One way to find the gene is to synthesize a small oligonucleotide "probe" complementary to part of the gene's actual DNA sequence (determined from previous experiments). Attach a fluorescent molecule to the DNA probe. Then, use a cell preparation in which the chromosomes can be seen under the microscope. Base is added, which unwinds the double-stranded DNA helix. The fluorescent probe is then added, and it binds to the chromosome at the site of the gene to which it is complementary. Hybridization is the process whereby a single-stranded nucleotide sequence (the target) binds through H-bonds to another complementary nucleotide sequence (the probe).
What if you don't know the nucleotide sequence of the gene but know the protein's amino acid sequence, as in the example shown below? From the genetic code table, you could predict the sequence of all possible RNA molecules complementary to the DNA in the gene. Since some amino acids have more than one codon, many possible sequences of DNA could code for the protein fragment. The link below lists all corresponding mRNA sequences that could encode a short amino acid sequence. The 20-mer sequences of minimal nucleotide degeneracy should be used as genomic probes.
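The idea of choosing a probe region of minimal degeneracy can be sketched in a few lines. The codon counts below are those of the standard genetic code; the peptide, the window length, and the function names are hypothetical choices for illustration. A real 20-mer probe would span roughly seven residues, while the example uses a window of three to keep the arithmetic easy to follow.

```python
# Number of codons for each amino acid in the standard genetic code
# (one-letter amino acid symbols).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def degeneracy(peptide):
    """Number of distinct coding sequences that could encode the peptide."""
    total = 1
    for residue in peptide:
        total *= CODON_COUNT[residue]
    return total


def least_degenerate_window(peptide, window):
    """Return (degeneracy, start index, segment) for the best window."""
    best = None
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        candidate = (degeneracy(segment), start, segment)
        if best is None or candidate < best:
            best = candidate
    return best


peptide = "LSRMWKDC"  # hypothetical sequence, not from the text
print(degeneracy("LSR"))                    # 216 possible coding sequences
print(least_degenerate_window(peptide, 3))  # (2, 3, 'MWK')
```

Regions rich in Met and Trp (one codon each) give the fewest candidate probes, whereas Leu, Ser, and Arg (six codons each) multiply the number quickly.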
Comparing DNA
The DNA sequence of each individual must be unique to that individual, differing from that of every other individual in the world (except identical twins). The difference must be less than the difference between a human and a chimp, whose DNA sequences are about 98.5% identical. Let us say that each of us has a DNA sequence that is 99.9% identical to that of some "normal" reference human. Given that one copy of the human genome contains about 3 billion base pairs of DNA, we each differ from that reference by about 0.001 x 3,000,000,000, or about 3 million base pairs. This means that, on average, we have one nucleotide difference for each 1000 DNA base pairs. Some of these differences occur within genes, but most are likely located between genes, and many are clustered in areas of highly repetitive DNA at the ends of chromosomes (called telomeres) and in the middle (called centromeres).
Remember that restriction enzyme sites are also interspersed randomly along the DNA. If differences in DNA among individuals occur within sequences cleaved by restriction enzymes, then in some individuals a particular enzyme may not cleave at the usual site but instead at a more distal site. Hence, the sizes of the restriction enzyme fragments should differ among individuals. When cut with a battery of restriction enzymes, each person's DNA should yield a unique set of DNA fragments, each with a size specific to that individual. Each person's DNA has a unique Restriction Fragment Length Polymorphism (RFLP). How could you detect such a polymorphism?
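A toy example shows how a single nucleotide difference changes the fragment sizes. This is a sketch under stated assumptions: both sequences are invented, the function name is hypothetical, only the top strand is considered, and the enzyme is modeled as EcoRI (recognition site GAATTC, cut after the G).

```python
def fragment_lengths(top_strand, site="GAATTC", cut_offset=1):
    """Lengths of the pieces produced by cutting at every recognition site."""
    cuts = []
    position = top_strand.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = top_strand.find(site, position + 1)
    edges = [0] + cuts + [len(top_strand)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Two hypothetical individuals differing at one base inside the second site.
person_1 = "TTGAATTCAAAAGAATTCCC"
person_2 = "TTGAATTCAAAAGACTTCCC"

print(fragment_lengths(person_1))  # [3, 10, 7]
print(fragment_lengths(person_2))  # [3, 17]
```

In the second individual the altered site is no longer cut, so the 10- and 7-base fragments are replaced by one 17-base fragment. That size difference is the polymorphism.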
You know how to cut sample DNA with restriction enzymes and separate the fragments on an agarose gel. However, an additional step is required, as thousands of fragments could appear on the gel, resulting in a single large continuous smear. If, however, the fragments are reacted with a set of small, radioactive DNA probes that are complementary to certain highly polymorphic sections of DNA (like telomeric DNA) and then visualized, only a few discrete bands are observed. These discrete bands would differ from the bands seen in another individual's DNA treated in the same way. This technique, known as Southern blotting, works as shown below. DNA fragments are separated by electrophoresis in an agarose gel. The dsDNA fragments are unwound (denatured) by heating, then a piece of nitrocellulose filter paper is placed on top of the gel, and the DNA from the gel transfers to the filter paper. Then, a small radioactive oligonucleotide probe complementary to a polymorphic site on the DNA is added to the paper. It binds only to the fragments containing DNA complementary to the probe. The filter paper is dried, and a piece of X-ray film is placed over the sheet. A set of radioactive fragments (not complementary to the probe) is also run. They serve as markers to ensure that gel electrophoresis and transfer to filter paper were performed correctly.
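The selective step of the blot, in which only fragments complementary to the probe are seen, can be mimicked in code. This is a minimal sketch with invented data: the fragments and function names are hypothetical, each fragment is represented by its top strand, and the probe is assumed to be complementary to the human telomeric repeat TTAGGG. Because the fragments are denatured, the probe may bind either strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def southern_bands(fragments, probe):
    """Sizes of the fragments that the probe can hybridize to.

    A fragment is detected if either of its strands contains the sequence
    complementary to the probe. Only detected fragments expose the film.
    """
    target = reverse_complement(probe)
    detected = []
    for top_strand in fragments:
        bottom_strand = reverse_complement(top_strand)
        if target in top_strand or target in bottom_strand:
            detected.append(len(top_strand))
    return sorted(detected)


# Hypothetical restriction fragments (top strands, 5' to 3').
fragments = ["AAG", "AATTCGGTTAGGGTTAGGGG", "AATTCTT", "AATTCCCTAACCA"]
probe = "CCCTAA"  # complementary to the repeat TTAGGG

print(southern_bands(fragments, probe))  # [13, 20]
```

Four fragments are on the gel, but only the two that contain the repeat appear as bands, which is why the film shows discrete bands instead of a smear.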
When this technique is applied in forensic cases or paternity cases, it is referred to as DNA fingerprinting. With current techniques, investigators can unequivocally state that the odds of a particular pattern not belonging to a suspect are 1 in 1,000,000. The X-ray film shown below is a copy of real forensic evidence obtained from a rape case. The Southern blot results from suspect 1, suspect 2, the victim, and the forensic evidence are shown. Analyze the data.

Summary
This chapter weaves together fundamental concepts in molecular evolution and modern genetic technologies to provide a comprehensive view of how biological information is stored, transmitted, and modified. It begins by revisiting Darwin’s theory of evolution by natural selection, emphasizing that genetic variation—arising from mutations in DNA—is the engine driving the adaptation and diversification of species. The relationship between genotype and phenotype is explored through examples such as variations in equine coat color, illustrating how specific alleles combine to produce a continuum of observable traits.
The chapter then delves into the molecular basis of heredity, detailing the discovery and decipherment of the genetic code. It highlights key experiments that established the triplet nature of codons and their role in specifying the 20 standard amino acids, thereby forming the basis for protein synthesis. This discovery not only solidified the concept of the Central Dogma—where DNA is replicated, transcribed into RNA, and translated into protein—but also underscored the importance of non-protein-coding sequences and regulatory regions in complex organisms.
Moving forward, the text examines the mechanisms and consequences of mutations, including point mutations (missense and nonsense) and frameshift mutations. It discusses how these genetic alterations can impact protein function and lead to disease. The chapter also distinguishes between homologous proteins (orthologs and paralogs) and analogs, providing insight into protein evolution and the use of sequence comparisons to reconstruct phylogenetic relationships.
A significant portion of the chapter is devoted to modern molecular biology techniques that have revolutionized our ability to read, copy, edit, and compare DNA. Detailed explanations are provided for methods such as restriction enzyme digestion, DNA cloning, and polymerase chain reaction (PCR), as well as advanced DNA sequencing techniques—including Sanger sequencing and real-time nanopore sequencing. Additionally, the chapter covers DNA fingerprinting and Southern blotting as powerful tools for genetic analysis and forensic applications.
In summary, this chapter integrates evolutionary theory with molecular mechanisms and technological advances, equipping students with a deep understanding of how genetic information is encoded, maintained, and manipulated—knowledge that is crucial for exploring the frontiers of biochemistry and molecular biology.




