Biology LibreTexts

Lecture 14: Mutations: Genotype to Phenotype


Errors occurring during DNA replication are not the only way by which mutations can arise in DNA. Mutations, variations in the nucleotide sequence of a genome, can also occur because of physical damage to DNA. Such mutations may be of two types: induced or spontaneous. Induced mutations are those that result from an exposure to chemicals, UV rays, x-rays, or some other environmental agent. Spontaneous mutations occur without any exposure to any environmental agent; they are a result of spontaneous biochemical reactions taking place within the cell.

Mutations may have a wide range of effects. Some mutations are not expressed; these are known as silent mutations. Point mutations are those mutations that affect a single base pair. The most common nucleotide mutations are substitutions, in which one base is replaced by another. These can be of two types, either transitions or transversions. A transition substitution refers to a purine or pyrimidine being replaced by a base of the same kind; for example, a purine such as adenine may be replaced by the purine guanine. A transversion substitution refers to a purine being replaced by a pyrimidine, or vice versa; for example, cytosine, a pyrimidine, is replaced by adenine, a purine. Mutations can also be the result of the addition of one or more nucleotides, known as an insertion, or the removal of one or more nucleotides, known as a deletion. Sometimes a piece of DNA from one chromosome may be moved to another chromosome or to another region of the same chromosome; this is known as translocation.
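The transition/transversion distinction described above can be written as a short rule. The Python sketch below is only an illustration; the function name classify_substitution is made up for this example and does not come from any library.

```python
# Purines and pyrimidines found in DNA.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(original: str, replacement: str) -> str:
    """Label a single-base DNA substitution as a transition or a transversion.

    Hypothetical helper written for illustration only.
    """
    original, replacement = original.upper(), replacement.upper()
    valid_bases = PURINES | PYRIMIDINES
    if original not in valid_bases or replacement not in valid_bases:
        raise ValueError("bases must be one of A, C, G, T")
    if original == replacement:
        raise ValueError("no substitution: the two bases are identical")
    # Same chemical class (purine -> purine or pyrimidine -> pyrimidine)
    # is a transition; a change of class is a transversion.
    same_class = (original in PURINES) == (replacement in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # adenine -> guanine, both purines
print(classify_substitution("C", "A"))  # cytosine (pyrimidine) -> adenine (purine)
```

The two calls mirror the adenine-to-guanine and cytosine-to-adenine examples in the text.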

As we will revisit later, when a mutation occurs in a protein-coding region it may have several effects. Transition or transversion mutations may lead to no change in the protein sequence (silent mutations), change the amino acid sequence (missense mutations), or create a stop codon (nonsense mutations). Insertions and deletions in protein-coding sequences can lead to what are known as frameshift mutations. Missense mutations that lead to conservative changes result in the substitution of similar but not identical amino acids. For example, the acidic amino acid glutamate being substituted for the acidic amino acid aspartate would be considered conservative. In general, we do not expect these types of missense mutations to be as severe as a non-conservative amino acid change, such as a glutamate substituted for a valine. Drawing from our understanding of functional group chemistry, we can infer that this type of substitution may lead to severe functional consequences, depending upon the location of the mutation.
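To make the silent/missense/nonsense vocabulary concrete, here is a minimal Python sketch that compares the amino acid encoded by a codon before and after a single substitution. It assumes the standard genetic code (written here with DNA letters, T instead of U); the function name classify_codon_change and the example codons are chosen for illustration and are not taken from the lecture.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(original: str, mutant: str) -> str:
    """Classify a substitution within one codon by its effect on the protein.

    Hypothetical helper; it assumes the original codon encodes an amino acid.
    """
    before = CODON_TABLE[original.upper()]
    after = CODON_TABLE[mutant.upper()]
    if before == after:
        return "silent"      # same amino acid, protein sequence unchanged
    if after == "*":
        return "nonsense"    # the change creates a stop codon
    return "missense"        # a different amino acid is encoded


# GAG encodes glutamate (E) in the standard code.
print(classify_codon_change("GAG", "GAA"))  # still glutamate
print(classify_codon_change("GAG", "GAT"))  # aspartate: a conservative change
print(classify_codon_change("GAG", "GTG"))  # valine: a non-conservative change
print(classify_codon_change("GAG", "TAG"))  # stop codon
```

Note that the sketch labels both the glutamate-to-aspartate and glutamate-to-valine changes as missense; judging whether a missense change is conservative requires the chemical reasoning described in the text.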

Note: Vocabulary Watch

Note that the preceding paragraph had a lot of potentially new vocabulary - it would be a good idea to learn these terms.

Figure 1. Mutations can lead to changes in the protein sequence encoded by the DNA.

Suggested discussion

Based on your understanding of protein structure, which regions of a protein would you think are more sensitive to substitutions, even conservative amino acid substitutions? Why?

Suggested discussion

An insertion mutation that results in the insertion of three nucleotides is often less deleterious than a mutation that results in the insertion of one nucleotide. Why?


Mutations: Some nomenclature and considerations


Etymologically speaking, the term mutation simply means a change or alteration. In genetics, a mutation is a change in the genetic material (the DNA sequence) of an organism. By extension, a mutant is the organism in which a mutation has occurred. But what is the change compared to? The answer is that it depends. The comparison can be made to the direct progenitor (cell or organism) or to patterns seen in a population of the organism in question. It mostly depends on the specific context of the discussion. Since genetic studies often look at a population (or key subpopulations) of individuals, we begin by describing the term "wild-type".

Wild Type vs Mutant

What do we mean by "wild type"? Since the definition can depend on context, this concept is not entirely straightforward. Here are a few examples of definitions you may run into:

Possible meanings of "wild-type"

  1. An organism having an appearance that is characteristic of the species in a natural breeding population (e.g., a cheetah's spots and tear-like dark streaks that extend from the eyes to the mouth). 
  2. The form or forms of a gene most commonly occurring in nature in a given species. 
  3. A phenotype, genotype, or gene that predominates in a natural population of organisms or strain of organisms in contrast to that of natural or laboratory mutant forms. 
  4. The normal, as opposed to the mutant, gene or allele.

The common thread in all of the definitions listed above is a "norm" for a set of characteristics, with respect to a specific trait, across the overall population. In the "Pre-DNA sequencing Age", species were classified based on common phenotypes (what they looked like, where they lived, how they behaved, etc.). A "norm" was established for the species in question. For example, crows display a common set of characteristics: they are large, black birds that live in specific regions, eat certain types of food, and behave in a characteristic way. If we see one, we know it's a crow based on these characteristics. If we saw one with a white head, we would think that it is either a different bird (not a crow) or a mutant, a crow that has some alteration from the norm or wild type.

In this class we take what is common about those varying definitions and adopt the idea that "wild type" is simply a reference standard against which we can compare members of a population.

Suggested discussion

If you were assigning wild type traits to describe a dog, what would they be? What is the difference between a mutant trait and variation of a trait in a population of dogs? Is there a wild type for a dog that we could use as a standard? How would we begin to think about this concept with respect to dogs?

Figure 2. Mutations can lead to changes in the protein sequence encoded by the DNA that then impact the outward appearance of the organism. 

Mutations are simply changes from the "wild type", reference, or parental sequence for an organism. While the term "mutation" has colloquially negative connotations, we must remember that change is not inherently "bad". Indeed, mutations (changes in sequences) should not primarily be thought of as "bad" or "good", but rather simply as changes and a source of genetic and phenotypic diversity on which evolution by natural selection can act. Natural selection ultimately determines the long-term fate of mutations. If the mutation confers a selective advantage to the organism, the mutation will be selected for and may eventually become very common in the population. Conversely, if the mutation is deleterious, natural selection will tend to remove it from the population. If the mutation is neutral, that is, it provides neither a selective advantage nor a disadvantage, then it may persist in the population. Different forms of a gene in a population, including those associated with "wild type" and the respective mutants, are termed alleles.

Consequences of Mutations

For an individual, the consequence of a mutation may mean little, or it may mean life or death. Some deleterious mutations are null or knock-out mutations, which result in a loss of function of the gene product. These mutations can arise by a deletion of either the entire gene or a portion of the gene, or by a point mutation in a critical region of the gene that renders the gene product non-functional. These types of mutations are also referred to as loss-of-function mutations. Alternatively, mutations may lead to a modification of an existing function (e.g., the mutation may change the catalytic efficiency of an enzyme, its substrate specificity, or its structure). In rare cases a mutation may create a new or enhanced function for a gene product; this is often referred to as a gain-of-function mutation. Lastly, mutations may occur in non-coding regions of DNA. These mutations can have a variety of outcomes, including altered regulation of gene expression, changes in replication rates or structural properties of DNA, and other non-protein-associated effects.

Suggested discussion

In the discussion above, what types of scenarios would allow such a gain-of-function mutant to outcompete a wild-type individual within the population? How do you think mutations relate to evolution?

Mutations and cancer

Mutations can affect either somatic cells or germ cells. Sometimes mutations occur in DNA repair genes, in effect compromising the cell's ability to fix other mutations that may arise. If, as a result of mutations in DNA repair genes, many mutations accumulate in a somatic cell, they may lead to problems such as the uncontrolled cell division observed in cancer. Cancers, including forms of pancreatic cancer, colon cancer, and colorectal cancer, have been associated with mutations like these in DNA repair genes. If, by contrast, a mutation occurs in germ cells (sex cells), it can be passed on to the next generation, as in the case of heritable diseases like hemophilia and xeroderma pigmentosum. In the case of xeroderma pigmentosum, individuals with compromised DNA repair processes become very sensitive to UV radiation. In severe cases these individuals may get severe sunburns with just minutes of exposure to the sun. Nearly half of all children with this condition develop their first skin cancers by age 10. 

Consequences of errors in replication, transcription and translation

Something key to think about:

Cells have evolved a variety of ways to make sure DNA errors are both detected and corrected, from proofreading by the various DNA-dependent DNA polymerases to more complex repair systems. Why did so many different mechanisms evolve to repair errors in DNA? By contrast, comparably extensive error-correction mechanisms did NOT evolve for errors in transcription or translation. Why might this be? What would be the consequences of an error in transcription? Would such an error affect the offspring? Would it be lethal to the cell? Ask the same questions about the process of translation. What would happen if the wrong amino acid was accidentally put into the growing polypeptide during the translation of a protein? Contrast this with DNA replication. 

Mutations as instruments of change

Mutations are how populations can adapt to changing environmental pressures.

Mutations arise randomly in the genome of every organism, and this in turn creates genetic diversity and a plethora of different alleles per gene in every population on the planet. If mutations did not occur, and chromosomes were replicated and transmitted with 100% fidelity, how would cells and organisms adapt? Whether mutations are retained in a population depends largely on whether the mutation provides a selective advantage, poses some selective cost, or is, at the very least, not harmful. Indeed, mutations that appear neutral may persist in the population for many generations and only become meaningful when the population faces a new environmental challenge. At this point the previously neutral mutations may provide a selective advantage. 

Example: Antibiotic resistance

The bacterium E. coli is sensitive to an antibiotic called streptomycin, which inhibits protein synthesis by binding to the ribosome. The ribosomal protein S12 can be mutated such that streptomycin no longer binds to the ribosome and no longer inhibits protein synthesis. Wild-type cells and S12 mutants grow equally well, and the mutation appears to be neutral in the absence of the antibiotic. In the presence of the antibiotic, wild-type cells die and S12 mutants survive. This example shows how genetic diversity is important for the population to survive. If mutations did not randomly occur, then when the population is challenged by an environmental event, such as exposure to streptomycin, the entire population would die. For most populations this becomes a numbers game. If the mutation rate is 10^-6, then a population of 10^7 cells would have 10 mutants, a population of 10^8 would have 100 mutants, and so on.
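The "numbers game" above is a simple multiplication, sketched below in Python. The sketch assumes, as the example in the text does, that the mutation rate can be read as the fraction of cells in the population carrying the mutation; the function name expected_mutants is hypothetical.

```python
def expected_mutants(mutation_rate: float, population_size: float) -> float:
    """Expected number of mutant cells: rate multiplied by population size.

    Simplifying assumption: the rate is treated as the fraction of cells
    that carry the mutation of interest.
    """
    if not 0 <= mutation_rate <= 1:
        raise ValueError("mutation_rate must be between 0 and 1")
    if population_size < 0:
        raise ValueError("population_size cannot be negative")
    return mutation_rate * population_size


# Population sizes used in the text, with a mutation rate of one in a million.
for exponent in (7, 8):
    cells = 10 ** exponent
    print(f"10^{exponent} cells -> about {expected_mutants(1e-6, cells):.0f} mutants")
```

The same function can be applied to any other population size, since the expected count scales linearly with the number of cells.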

Uncorrected errors in DNA replication lead to mutation. In this example, an uncorrected error was passed on to a bacterial daughter cell. This error is in a gene that encodes part of the ribosome. The mutation results in a different final 3D structure of the ribosomal protein. While the wild-type ribosome can bind streptomycin (an antibiotic that kills the bacterial cell by inhibiting ribosome function), the mutant ribosome cannot. This bacterium is now resistant to streptomycin. 
Source: Bis2A Team original image

Suggested discussion

Based on our example, if you were to grow a culture of E. coli to a population density of 10^9 cells/ml, would you expect the entire population to be identical? How many mutants would you expect to see in 1 ml of culture?

An example: Lactate dehydrogenase

Lactate dehydrogenase (LDH) is the enzyme that catalyzes the reduction of pyruvate to lactic acid in fermentation. While virtually every organism has this activity, the corresponding enzyme, and therefore the gene, differs immensely between humans and bacteria. The proteins are clearly related: they perform the same basic function but differ in many ways, from substrate binding affinities and reaction rates to optimal salt and pH requirements. Each of these attributes has been evolutionarily tuned for each specific organism through multiple rounds of mutation and selection.

Suggested discussion

We can use comparative DNA sequence analysis to generate hypotheses about the evolutionary relationships between three or more organisms. One way to accomplish this is to compare the DNA or protein sequences of proteins found in each of the organisms we wish to compare. Let us, for example, imagine that we were to compare the sequences of LDH from three different organisms, Organism A, Organism B and Organism C. If we compare the LDH protein sequence from Organism A to that from Organism B we find a single amino acid difference. If we now look at Organism C, we find 2 amino acid differences between its LDH protein and the one in Organism A and one amino acid difference when the enzyme from Organism C is compared to the one in Organism B. Both organisms B and C share a common change compared to organism A.  

Figure 4. Schematic depicting the primary structures of LDH proteins from Organism A, Organism B, and Organism C. The letters in the center of each protein line diagram represent the amino acids at a given position and the proposed differences in each sequence. The N and C termini are noted as H2N and COOH, respectively.
Attribution: Marc T. Facciotti (original work)
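The comparison described above amounts to counting the positions at which two aligned sequences differ. The Python sketch below uses short, invented amino acid fragments (they are not real LDH sequences) chosen so that the pairwise counts match the scenario in the text: one difference between A and B, two between A and C, and one between B and C.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# Hypothetical aligned fragments, invented purely for illustration.
ldh_fragments = {
    "Organism A": "MKVAGLTE",
    "Organism B": "MKVSGLTE",  # one change relative to A (position 4)
    "Organism C": "MKVSGLQE",  # shares B's change and has one more (position 7)
}

for (name_1, seq_1), (name_2, seq_2) in combinations(ldh_fragments.items(), 2):
    print(f"{name_1} vs {name_2}: {count_differences(seq_1, seq_2)} difference(s)")
```

A table of pairwise counts like this is raw material for hypotheses about relatedness; it does not by itself say which sequence is ancestral.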

Question: Is Organism C more closely related to Organism A or B? The simplest explanation is that Organism A is the earliest form and that a mutation occurred, giving rise to Organism B. Over time a second mutation arose in the B lineage to give rise to the enzyme found in Organism C. This is the simplest explanation; however, we cannot rule out other possibilities. Can you think of other ways the different forms of the LDH enzyme could have arisen in these three organisms?




induced mutation:

mutation that results from exposure to chemicals or environmental agents


mutation:

variation in the nucleotide sequence of a genome

mismatch repair:

type of repair mechanism in which mismatched bases are removed after replication

nucleotide excision repair:

type of DNA repair mechanism in which the wrong base, along with a few nucleotides upstream or downstream, is removed


proofreading:

function of DNA pol in which it reads the newly added base before adding the next one

point mutation:

mutation that affects a single base

silent mutation:

mutation that is not expressed

spontaneous mutation:

mutation that takes place in the cells as a result of chemical reactions taking place naturally without exposure to any external agent

transition substitution:

when a purine is replaced with a purine or a pyrimidine is replaced with another pyrimidine

transversion substitution:

when a purine is replaced by a pyrimidine or a pyrimidine is replaced by a purine

Genomes as organismal blueprints 

A genome, not to be confused with a gnome, is an organism's complete collection of heritable information stored in DNA. Differences in information content help to explain the diversity of life we see all around us. Changes to the information encoded in the genome are the primary source of the phenotypic diversity around us (some of which we can see and some of which we can't). Natural selection filters this diversity, so these changes are drivers of evolution. This leads to questions. If every cell in a multicellular organism contains the same sequence of DNA, how can there be different cell types (e.g., how can a cell in a liver be so different from a cell in the brain if they both carry the same DNA)? And how do we read the information?

Determining a genome sequence

The information encoded in genomes provides important data for understanding life, its functions, its diversity, and its evolution. Therefore, it stands to reason that a reasonable place to begin studies in biology would be to read the information content encoded in the genome(s) in question. A good starting point is to determine the sequence of nucleotides (A, G, C, T) and their organization into one or more independently replicating units of DNA (e.g., think chromosomes and/or plasmids). For 30+ years after the discovery that DNA is the hereditary material, this was a daunting proposition. In the late 1980s, however, semi-automated tools for DNA sequencing were pioneered, and this began a revolution that has dramatically changed how we approach the study of life. Twenty years later, in the mid-2000s, we entered a period of accelerated technological progress in which advances in materials sciences (particularly, advances in our ability to make things on a very small scale), optics, electrical and computer engineering, bioengineering, and computer sciences all converged to bring us dramatic increases in our capacity to sequence DNA and correspondingly dramatic decreases in its cost. A famous example to illustrate this point is the change in the cost of sequencing the human genome. The first draft of the human genome took nearly 15 years and $3 billion to complete. Today, tens of human genomes can be sequenced in a single day on a single instrument at a cost of less than $1000 each (the cost and time continue to decrease). Today, companies like Illumina, Pacific Biosciences, Oxford Nanopore, and others offer competing technologies that are driving down the cost and increasing the volume, quality, speed, and portability of DNA sequencing.  

One of the very exciting elements of the DNA sequencing revolution is that it has required and continues to require contributions from biologists, chemists, materials scientists, electrical engineers, mechanical engineers, computer scientists and programmers, mathematicians and statisticians, product developers, and many other technical experts. The potential applications and implications of unlocking barriers to DNA sequencing have also engaged investors, business people, product developers, entrepreneurs, ethicists, policy makers, and many others to pursue new opportunities and to think about how to best and most responsibly use this growing technology. 

The technological advances in genome sequencing have resulted in a virtual flood of complete genome sequences being determined and deposited into publicly available databases. You can find many of them at the National Center for Biotechnology Information. The number of available, completely sequenced genomes is in the tens of thousands: over 2,000 eukaryotic genomes, over 600 archaeal genomes, and nearly 12,000 bacterial genomes. Tens of thousands more genome sequencing projects are in progress. With this many genome sequences available, or soon to be available, we can start asking many questions about what we see in these genomes. What patterns are common to all genomes? How many genes are encoded in genomes? How are these organized? How many different types of features can we find? What do the features that we find do? How different are the genomes from one another? Is there evidence that can tell us how genomes evolve? Let's briefly examine a few of these questions.

Diversity of genomes

Diversity of sizes, number of genes, and chromosomes

Let's start by examining the range of genome sizes. In the table below, we see a sampling of genomes from the database. We can see that the genomes of free living organisms range tremendously in size. The smallest known genome is encoded in 580,000 base pairs while the largest is 150 billion base pairs—for reference, recall that the human genome is 3.2 billion base pairs. That's a huge range of sizes. Similar disparities in the number of genes also exist.

Table 1. This table shows some genome data for various organisms. 2n = diploid number. Attribution: Marc T. Facciotti (own work, reproduced from


Examining Table 1 also reveals that some organisms carry more than one chromosome. Some genomes are also polyploid, meaning that they maintain multiple similar but not identical (homologous) copies of each chromosome. A diploid organism carries in its genome two homologous copies (usually one from Mom and one from Dad) of each chromosome. Humans are diploid. Our somatic cells carry 2 homologous copies of each of 23 chromosomes. We received 23 individual chromosomes from our mother and 23 from our father, for a total of 46. Some plants have higher ploidy. For example, a plant with four homologous copies of each chromosome is termed tetraploid. An organism with a single copy of each chromosome is termed haploid.

Structure of genomes

Table 1 also provides clues to other points of interest. For instance, if we compare the pufferfish genome to the chimpanzee genome, we note that they encode roughly the same number of genes (19,000), but they do so on dramatically differently sized genomes: 400 million base pairs versus 3.3 billion base pairs, respectively. That implies that the pufferfish genome must have much less space between its genes than the chimpanzee genome. Indeed, this is the case, and the difference in gene density is not unique to these two genomes. If we look at Figure 1, which attempts to represent a 50-kb part of the human genome, we notice that in addition to the protein-coding regions (indicated in red and pink), many other so-called "features" can be read from the genome. Many of these elements contain highly repetitive sequences.
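The gene-density comparison above can be checked with a quick calculation. The Python sketch below uses only the rounded figures quoted in the text (about 19,000 genes in each genome, 400 million versus 3.3 billion base pairs); the function name genes_per_megabase is made up for this illustration.

```python
def genes_per_megabase(gene_count: int, genome_size_bp: int) -> float:
    """Average number of genes per million base pairs of genome."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return gene_count / (genome_size_bp / 1_000_000)


# Rounded values as quoted in the text: (number of genes, genome size in bp).
genomes = {
    "pufferfish": (19_000, 400_000_000),
    "chimpanzee": (19_000, 3_300_000_000),
}

for organism, (genes, size_bp) in genomes.items():
    density = genes_per_megabase(genes, size_bp)
    print(f"{organism}: about {density:.1f} genes per million base pairs")
```

With these inputs the pufferfish genome comes out roughly eight times more gene-dense than the chimpanzee genome, which is the "much less space between its genes" point made in the text.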


Figure 1. This figure shows a 50-kb segment of the human β T-cell receptor locus on chromosome 7. This figure depicts a small region of the human genome and the types of "features" that can be read and decoded in the genome, including, but also in addition to, protein-coding sequences. Red and pink correspond to regions that encode proteins. Other colors represent different types of genomic elements. Attribution: Marc T. Facciotti (own work, reproduced from


If we now look at what fraction of the whole human genome each of these types of elements makes up (see Figure 2), we see that protein-coding genes make up only 48 million of the 3.2 billion bases of the haploid genome (about 1.5%).



Figure 2. This graph depicts how the many base pairs of DNA in the human haploid genome are distributed between various identifiable features. Note that only a small fraction of the genome is associated directly with protein-coding regions. Attribution: Marc T. Facciotti (own work, reproduced from sources noted in figure)


When we examine the frequency of repeat regions versus protein-coding regions in different species, we note large differences in the relative amounts of protein-coding and noncoding sequence.

Figure 3. This figure shows 50-kb segments of different genomes, illustrating the highly variable frequency of repeat versus protein-coding elements in different species. Attribution: Marc T. Facciotti (own work, reproduced from


Suggested discussion

Propose a hypothesis for why you think some genomes might have more or fewer noncoding sequences.


Dynamics of genome structure

Genomes change over time, and numerous different types of events can change their sequence.

1. Mutations accumulate either during DNA replication or through environmental exposure to chemical mutagens or radiation. These changes typically occur at the level of single nucleotides.  
2. Genome rearrangements describe a class of large-scale changes that can occur, and they include the following: (a) deletions, where segments of the chromosome are lost; (b) duplications, where regions of the chromosome are inadvertently duplicated; (c) insertions, where genetic material is inserted (note that sometimes this is acquired from viruses or the environment, and deletion/insertion pairs may happen across chromosomes); (d) inversions, where regions of the genome are flipped within the same chromosome; and (e) translocations, where segments of a chromosome are moved elsewhere in the same chromosome or to another chromosome. 
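The rearrangement types listed above can be pictured with a toy model in which a chromosome is a string and each letter stands for one segment (for example, one gene). The Python sketch below is only a cartoon under that assumption: all function names are hypothetical, the translocation is shown within a single chromosome, and a real inversion would also swap the DNA strands, which a string of segment labels cannot show.

```python
def delete(chrom: str, start: int, end: int) -> str:
    """Deletion: the segments in positions start..end-1 are lost."""
    return chrom[:start] + chrom[end:]


def duplicate(chrom: str, start: int, end: int) -> str:
    """Duplication: a region is copied right after itself."""
    return chrom[:end] + chrom[start:end] + chrom[end:]


def insert(chrom: str, position: int, fragment: str) -> str:
    """Insertion: new material is added at the given position."""
    return chrom[:position] + fragment + chrom[position:]


def invert(chrom: str, start: int, end: int) -> str:
    """Inversion: a region is flipped in place (segment order reversed)."""
    return chrom[:start] + chrom[start:end][::-1] + chrom[end:]


def translocate(chrom: str, start: int, end: int, new_position: int) -> str:
    """Translocation: a region is cut out and reinserted elsewhere."""
    segment = chrom[start:end]
    remainder = chrom[:start] + chrom[end:]
    return remainder[:new_position] + segment + remainder[new_position:]


chromosome = "ABCDEFGH"  # eight labeled segments, invented for illustration
print(delete(chromosome, 2, 4))          # C and D are lost
print(duplicate(chromosome, 2, 4))       # C and D appear twice
print(insert(chromosome, 2, "xy"))       # foreign material xy is added
print(invert(chromosome, 2, 5))          # C, D, E are flipped
print(translocate(chromosome, 0, 2, 4))  # A and B move downstream
```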

These changes happen at different rates, and some are facilitated by the activity of enzyme catalysts (e.g., transposases). 

The study of genomes

Comparative genomics

One of the most common things to do with a collection of genome sequences is to compare the sequences of multiple genomes to one another. In general terms, these types of activities fall under the umbrella of a field called comparative genomics.  

Comparing the genomes of people who suffer from an inheritable disease to the genomes of people who are not afflicted can help us to uncover the genetic basis for the malady. Comparing the gene content, order, and sequence of related microbes can help us find the genetic basis of why some microbes cause disease while their close cousins are virtually harmless. We can compare genomes to understand how a new species may have evolved. There are many possible analyses! The basis of these analyses is similar: look for differences across multiple genomes and try to associate those differences with different traits or behaviors in those organisms.  

Lastly, some people are comparing genome sequences to try to understand the evolutionary history of the organisms. Typically, these types of comparisons result in a graph known as a phylogenetic tree, which is a graphical model of the evolutionary relationship between the various species being compared. This field, not surprisingly, is called phylogenomics.

Metagenomics: who is living somewhere and what are they doing?

In addition to studying the genomes of individual species, the increasingly powerful DNA-sequencing technologies are making it possible to simultaneously sequence the genomes of environmental samples that are inhabited by many different species. This field is called metagenomics. These studies are typically focused on trying to understand what microbial species inhabit different environments. There is great interest in using DNA sequencing to study the populations of microbes in the gut and to watch how the population changes in response to different diets, to see if there is any association between the abundance of different microbes and various diseases, or to look for the presence of pathogens. People are using DNA sequencing of environmental metagenomic samples to explore which microbes inhabit different environments on Earth (from the deep sea, to soil, to air, to hypersaline ponds, to cat feces, to some of the common surfaces we touch every day).   

In addition to discovering "who lives where," the sequencing of microbial populations in different environments can also reveal what protein-coding genes are present in an environment. This can give investigators clues into what metabolic activities might be occurring in that environment. In addition to providing important information about what kind of chemistry might be happening in a specific environment, the catalog of genes that is accumulated can also serve as an important resource for the discovery of novel enzymes for applications in biotechnology.





The study of nucleic acids began with the discovery of DNA, progressed to the study of genes and small fragments, and has now exploded to the field of genomics. Genomics is the study of entire genomes, including the complete set of genes, their nucleotide sequence and organization, and their interactions both within a species and with other species. The advances in genomics have been made possible by DNA sequencing technology. Just as information technology has led to Google Maps, enabling us to get detailed information about locations around the globe, genomic information is used to create similar maps of the DNA of different organisms.

Mapping genomes

Genome mapping is the process of finding the location of genes on each chromosome. The maps that are created are comparable to the maps that we use to navigate streets. A genetic map is an illustration that lists genes and their location on a chromosome. Genetic maps provide the big picture (similar to a map of interstate highways) and use genetic markers (similar to landmarks). A genetic marker is a gene or sequence on a chromosome that shows genetic linkage with a trait of interest. The genetic marker tends to be inherited with the gene of interest. One measure of distance between them is the recombination frequency during meiosis; early geneticists called this linkage analysis.
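Recombination frequency, mentioned above as a measure of distance between a marker and a gene, is commonly calculated as the fraction of offspring that show a recombinant combination of the two. The Python sketch below illustrates that arithmetic; the offspring counts are invented and the function name is hypothetical.

```python
def recombination_frequency(recombinant_offspring: int, total_offspring: int) -> float:
    """Fraction of offspring that are recombinant for two linked loci."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinant_offspring <= total_offspring:
        raise ValueError("recombinant count must be between 0 and the total")
    return recombinant_offspring / total_offspring


# Invented counts: 120 recombinant offspring out of 1000 scored.
frequency = recombination_frequency(120, 1000)
print(f"Recombination frequency: {frequency:.1%}")
```

A lower frequency suggests that the marker and the gene sit closer together on the chromosome, which is why the marker tends to be inherited along with the gene of interest.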

Physical maps get into the intimate details of smaller regions of the chromosomes (similar to a detailed road map). A physical map is a representation of the physical distance, in nucleotides, between genes or genetic markers. Both genetic linkage maps and physical maps are required to build a complete picture of the genome. Having a complete map of the genome makes it easier for researchers to study individual genes. Human genome maps help researchers in their efforts to identify human disease-causing genes related to illnesses such as cancer, heart disease, and cystic fibrosis, to name a few. In addition, genome mapping can be used to help identify organisms with beneficial traits, such as microbes with the ability to clean up pollutants or even prevent pollution. Research involving plant genome mapping may lead to agricultural methods that produce higher crop yields or to the development of plants that adapt better to climate change.

Figure 1. This is a physical map of the human X chromosome.

Credit: modification of work by NCBI, NIH

Genetic maps provide the outline, and physical maps provide the details. It is easy to understand why both types of genome-mapping techniques are important to show the big picture. Information obtained from each technique is used in combination to study the genome. Genomic mapping is used with different model organisms that are used for research. Genome mapping is still an ongoing process, and as more advanced techniques are developed, more advances are expected. Genome mapping is similar to completing a complicated puzzle using every piece of available data. Mapping information generated in laboratories all over the world is entered into central databases, such as the National Center for Biotechnology Information (NCBI). Efforts are made to make the information more easily accessible to researchers and the general public. Just as we use global positioning systems instead of paper maps to navigate through roadways, NCBI allows us to use a genome viewer tool to simplify the data mining process.

Whole genome sequencing

Although there have been significant advances in the medical sciences in recent years, doctors are still confounded by many diseases, and researchers are using whole genome sequencing to get to the bottom of the problem. Whole genome sequencing is a process that determines the DNA sequence of an entire genome. Whole genome sequencing is a brute-force approach to problem solving when there is a genetic basis at the core of a disease. Several laboratories now provide services to sequence, analyze, and interpret entire genomes.

In 2010, whole genome sequencing was used to save a young boy whose intestines had multiple mysterious abscesses. The child had several colon operations with no relief. Finally, a whole genome sequence revealed a defect in a pathway that controls apoptosis (programmed cell death). A bone marrow transplant was used to overcome this genetic disorder, leading to a cure for the boy. He was the first person to be successfully diagnosed using whole genome sequencing.

The first genomes to be sequenced, such as those belonging to viruses, bacteria, and yeast, were smaller in terms of the number of nucleotides than the genomes of multicellular organisms. The genomes of other model organisms, such as the mouse (Mus musculus), the fruit fly (Drosophila melanogaster), and the nematode (Caenorhabditis elegans) are now known. A great deal of basic research is performed in model organisms because the information can be applied to other organisms. A model organism is a species that is studied as a model to understand the biological processes in other species that can be represented by the model organism. For example, fruit flies are able to metabolize alcohol like humans, so the genes affecting sensitivity to alcohol have been studied in fruit flies in an effort to understand the variation in sensitivity to alcohol in humans. Having entire genomes sequenced helps with the research efforts in these model organisms.

Figure 2. Much basic research is done with model organisms, such as the mouse, Mus musculus; the fruit fly, Drosophila melanogaster; the nematode, Caenorhabditis elegans; the yeast, Saccharomyces cerevisiae; and the common weed, Arabidopsis thaliana.

Credit: "mouse": modification of work by Florean Fortescue; "nematodes": modification of work by "snickclunk"/Flickr; "common weed": modification of work by Peggy Greb, USDA; scale-bar data from Matt Russell

A draft of the human genome sequence was published in 2001, and the essentially complete sequence followed in 2003. The number of whole genomes that have been sequenced steadily increases and now includes hundreds of species and thousands of individual human genomes.

Applying genomics

The introduction of DNA sequencing and whole genome sequencing projects, particularly the Human Genome Project, has expanded the applicability of DNA sequence information. Genomics is now being used in a wide variety of fields, such as metagenomics, pharmacogenomics, and mitochondrial genomics. The most commonly known application of genomics is to understand and find cures for diseases.

Predicting disease risk at the individual level

Predicting the risk of disease involves screening and identifying currently healthy individuals by genome analysis at the individual level. Intervention with lifestyle changes and drugs can be recommended before disease onset. However, this approach is most applicable when the problem arises from a single gene mutation. Such defects only account for about five percent of diseases found in developed countries. Most of the common diseases, such as heart disease, are multifactorial or polygenic, which refers to a phenotypic characteristic that is determined by two or more genes, and also environmental factors such as diet. In April 2010, scientists at Stanford University published the genome analysis of a healthy individual (Stephen Quake, a scientist at Stanford University, who had his genome sequenced); the analysis predicted his propensity to acquire various diseases. A risk assessment was done to analyze Quake’s percentage of risk for 55 different medical conditions. A rare genetic mutation was found that showed him to be at risk for sudden heart attack. He was also predicted to have a 23 percent risk of developing prostate cancer and a 1.4 percent risk of developing Alzheimer’s disease. The scientists used databases and several publications to analyze the genomic data. Even though genomic sequencing is becoming more affordable and analytical tools are becoming more reliable, ethical issues surrounding genomic analysis at a population level remain to be addressed. For example, could such data be legitimately used to charge more or less for insurance or to affect credit ratings?

Genome-wide association studies

Since 2005, it has been possible to conduct a type of study called a genome-wide association study, or GWAS. A GWAS is a method that identifies differences between individuals in single nucleotide polymorphisms (SNPs) that may be involved in causing diseases. The method is particularly suited to diseases that may be affected by one or many genetic changes throughout the genome. It is very difficult to identify the genes involved in such a disease using family history information. The GWAS method relies on a genetic database that has been in development since 2002 called the International HapMap Project. The HapMap Project sequenced the genomes of several hundred individuals from around the world and identified groups of SNPs. The groups include SNPs that are located near each other on chromosomes, so they tend to stay together through recombination. Because the group stays together, identifying one marker SNP is all that is needed to identify all the SNPs in the group. Several million SNPs have been identified, but detecting them in individuals whose complete genomes have not been sequenced is much easier because only the marker SNPs need to be identified.
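The marker-SNP idea can be illustrated with a small Python sketch. It is only a toy model: the SNP names, alleles, and the `HAPLOTYPE_BLOCK` table are hypothetical, and the sketch assumes the SNPs in a group are always inherited together, whereas real analyses treat these relationships statistically.

```python
# Hypothetical group of linked SNPs. Each allele observed at the marker SNP
# is assumed to travel with one fixed set of alleles at the neighboring SNPs.
HAPLOTYPE_BLOCK = {
    "A": {"snp_1": "C", "snp_2": "T", "snp_3": "G"},
    "G": {"snp_1": "T", "snp_2": "T", "snp_3": "A"},
}


def infer_linked_alleles(marker_allele, block=HAPLOTYPE_BLOCK):
    """Return the alleles assumed to accompany the observed marker allele."""
    if marker_allele not in block:
        raise ValueError(f"marker allele {marker_allele!r} is not in this block")
    # Return a copy so callers cannot modify the reference table.
    return dict(block[marker_allele])


# Only the marker SNP is measured; the other three SNPs are inferred.
inferred = infer_linked_alleles("A")
print(inferred)
```

In this sketch a single measurement stands in for four SNPs, which is why testing marker SNPs is so much cheaper than sequencing a complete genome.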

In a common design for a GWAS, two groups of individuals are chosen; one group has the disease, and the other group does not. The individuals in each group are matched in other characteristics to reduce the effect of confounding variables causing differences between the two groups. For example, if the two groups were mostly drawn from different parts of the world, their genotypes might differ for reasons that have nothing to do with the disease. Once the individuals are chosen (typically a thousand or more are needed for the study to work), samples of their DNA are obtained. The DNA is analyzed using automated systems to identify large differences in the percentage of particular SNPs between the two groups. Often the study examines a million or more SNPs in the DNA. The results of GWAS can be used in two ways: the genetic differences may be used as markers for susceptibility to the disease in undiagnosed individuals, and the particular genes identified can be targets for research into the molecular pathway of the disease and potential therapies. An offshoot of the discovery of gene associations with disease has been the formation of companies that provide so-called “personal genomics” services, which identify risk levels for various diseases based on an individual’s SNP complement. The science behind these services is controversial.
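The comparison between the two groups can be sketched in Python. This is a minimal illustration, not the pipeline a real study uses: the SNP names and allele counts are made up, and the Pearson chi-square statistic is one common way to score a difference in allele frequency, chosen here as an assumption.

```python
def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Assumes every row and column total is greater than zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def rank_snps(counts):
    """Sort SNPs from the largest to the smallest case/control difference."""
    scores = {
        snp: allele_chi_square(*groups["cases"], *groups["controls"])
        for snp, groups in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (alternate, reference) allele counts for two SNPs.
snp_counts = {
    "snp_A": {"cases": (600, 1400), "controls": (400, 1600)},
    "snp_B": {"cases": (500, 1500), "controls": (500, 1500)},
}

ranked = rank_snps(snp_counts)
for snp, score in ranked:
    print(snp, round(score, 2))
```

A high score flags a SNP whose frequency differs between the groups. Because a real study tests a million or more SNPs, it also has to correct for the number of tests, which this sketch leaves out.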

Because GWAS looks for associations between genes and disease, these studies provide data for other research into causes, rather than answering specific questions themselves. An association between a gene difference and a disease does not necessarily mean there is a cause-and-effect relationship. However, some studies have provided useful information about the genetic causes of diseases. For example, three different studies in 2005 identified a gene for a protein involved in regulating inflammation in the body that is associated with age-related macular degeneration, a disease that causes blindness. This opened up new possibilities for research into the cause of this disease. A large number of genes have been found to be associated with Crohn’s disease using GWAS, and some of these have suggested new hypothetical mechanisms for the cause of the disease.


Pharmacogenomics

Pharmacogenomics involves evaluating the effectiveness and safety of drugs on the basis of information from an individual's genomic sequence. Personal genome sequence information can be used to prescribe medications that will be most effective and least toxic on the basis of the individual patient’s genotype. Studying changes in gene expression could provide information about the gene transcription profile in the presence of the drug, which can be used as an early indicator of the potential for toxic effects. For example, genes involved in cellular growth and controlled cell death, when disturbed, could lead to the growth of cancerous cells. Genome-wide studies can also help to find new genes involved in drug toxicity. The gene signatures may not be completely accurate, but can be tested further before pathologic symptoms arise.


Metagenomics

Traditionally, microbiology has been taught with the view that microorganisms are best studied under pure culture conditions, which involves isolating a single type of cell and culturing it in the laboratory. Because microorganisms can go through several generations in a matter of hours, their gene expression profiles adapt to the new laboratory environment very quickly. On the other hand, many species resist being cultured in isolation. Most microorganisms do not live as isolated entities, but in microbial communities known as biofilms. For all of these reasons, pure culture is not always the best way to study microorganisms. Metagenomics is the study of the collective genomes of multiple species that grow and interact in an environmental niche. Metagenomics can be used to identify new species more rapidly and to analyze the effect of pollutants on the environment. Metagenomics techniques can now also be applied to communities of higher eukaryotes, such as fish.

Figure 3. Metagenomics involves isolating DNA from multiple species within an environmental niche. The DNA is cut up and sequenced, allowing entire genome sequences of multiple species to be reconstructed from the sequences of overlapping pieces.
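The idea of rebuilding sequences from overlapping pieces can be sketched in Python. This is a toy greedy approach chosen for illustration, not the method used by real assembly software: the reads are invented, sequencing errors are ignored, and the `min_len` cutoff of three bases is an arbitrary assumption.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps; what is left are separate sequences
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical reads: three overlap one another, the fourth overlaps nothing.
reads = ["ATGCGT", "CGTTAC", "TACGGA", "TTTTTTTT"]
contigs = greedy_assemble(reads)
print(sorted(contigs))
```

Reads that share no overlap stay separate, which loosely mirrors how DNA from different species in one sample ends up in different reconstructed sequences.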

Creation of new biofuels

Knowledge of the genomics of microorganisms is being used to find better ways to harness biofuels from algae and cyanobacteria. The primary sources of fuel today are coal, oil, wood, and other plant products such as ethanol. Although plants are renewable resources, there is still a need to find more alternative renewable sources of energy to meet our population’s energy demands. The microbial world is one of the largest resources for genes that encode new enzymes and produce new organic compounds, and it remains largely untapped. This vast genetic resource holds the potential to provide new sources of biofuels.

Figure 4. Renewable fuels were tested in Navy ships and aircraft at the first Naval Energy Forum.

Credit: modification of work by John F. Williams, US Navy

Mitochondrial genomics

Mitochondria are intracellular organelles that contain their own DNA. Mitochondrial DNA mutates at a rapid rate and is often used to study evolutionary relationships. Another feature that makes studying the mitochondrial genome interesting is that in most multicellular organisms, the mitochondrial DNA is passed on from the mother during the process of fertilization. For this reason, mitochondrial genomics is often used to trace genealogy.

Genomics in forensic analysis

Information and clues obtained from DNA samples found at crime scenes have been used as evidence in court cases, and genetic markers have been used in forensic analysis. Genomic analysis has also become useful in this field. In 2001, the first use of genomics in forensics was published. It was a collaborative effort between academic research institutions and the FBI to solve the mysterious cases of anthrax spread through letters carried by the US Postal Service. Anthrax bacteria were made into an infectious powder and mailed to news media and two US Senators. The powder infected the administrative staff and postal workers who opened or handled the letters. Five people died, and 17 were sickened by the bacteria. Using microbial genomics, researchers determined that a specific strain of anthrax was used in all the mailings; eventually, the source was traced to a scientist at a national biodefense laboratory in Maryland.

Figure 5. Bacillus anthracis is the organism that causes anthrax.

Credit: modification of work by CDC; scale-bar data from Matt Russell

Genomics in agriculture

Genomics can reduce some of the trial and error involved in scientific research, which could improve the quality and quantity of crop yields in agriculture. Linking traits to genes or gene signatures helps to improve crop breeding to generate hybrids with the most desirable qualities. Scientists use genomic data to identify desirable traits, and then transfer those traits to a different organism to create a new genetically modified organism, as described in the previous module. For example, scientists could use desirable traits to create a useful product or enhance an existing product, such as making a drought-sensitive crop more tolerant of the dry season.

Figure 6. Transgenic agricultural plants can be made to resist disease. These transgenic plums are resistant to the plum pox virus.

Credit: Scott Bauer, USDA ARS


Proteomics

Proteins are the final products of genes and carry out the functions that the genes encode. Proteins are composed of amino acids and play important roles in the cell. All enzymes (except ribozymes) are proteins and act as catalysts that affect the rate of reactions. Proteins are also regulatory molecules, and some are hormones. Transport proteins, such as hemoglobin, help transport oxygen to various organs. Antibodies that defend against foreign particles are also proteins. In the diseased state, protein function can be impaired because of changes at the genetic level or because of direct impact on a specific protein.

A proteome is the entire set of proteins produced by a cell type. Proteomes can be studied using the knowledge of genomes because genes code for mRNAs, and the mRNAs encode proteins. The study of the function of proteomes is called proteomics. Proteomics complements genomics and is useful when scientists want to test their hypotheses that were based on genes. Even though all cells in a multicellular organism have the same set of genes, the set of proteins produced in different tissues is different and dependent on gene expression. Thus, the genome is constant, but the proteome varies and is dynamic within an organism. In addition, RNAs can be alternatively spliced (cut and pasted to create novel combinations and novel proteins), and many proteins are modified after translation. Although the genome provides a blueprint, the final architecture depends on several factors that can change the progression of events that generate the proteome.

Genomes and proteomes of patients suffering from specific diseases are being studied to understand the genetic basis of the disease. The most prominent disease being studied with proteomic approaches is cancer (Figure 7). Proteomic approaches are being used to improve the screening and early detection of cancer; this is achieved by identifying proteins whose expression is affected by the disease process. An individual protein is called a biomarker, whereas a set of proteins with altered expression levels is called a protein signature. For a biomarker or protein signature to be useful as a candidate for early screening and detection of a cancer, it must be secreted in bodily fluids such as sweat, blood, or urine, so that large-scale screenings can be performed in a noninvasive fashion.

The current problem with using biomarkers for the early detection of cancer is the high rate of false-negative results. A false-negative result is a negative test result that should have been positive. In other words, many cases of cancer go undetected, which makes biomarkers unreliable. Some examples of protein biomarkers used in cancer detection are CA-125 for ovarian cancer and PSA for prostate cancer. Protein signatures may be more reliable than biomarkers to detect cancer cells. Proteomics is also being used to develop individualized treatment plans, which involves the prediction of whether or not an individual will respond to specific drugs and the side effects that the individual may have. Proteomics is also being used to predict the possibility of disease recurrence.
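The false-negative problem can be made concrete with a short calculation in Python. The counts below are hypothetical and are not taken from any real biomarker study; they only show how the false-negative rate relates to the number of cancers a screening test misses.

```python
def screening_rates(true_pos, false_neg, true_neg, false_pos):
    """Summarize a screening test from counts of correct and incorrect results."""
    with_disease = true_pos + false_neg
    without_disease = true_neg + false_pos
    return {
        # Fraction of people with cancer that the test detects.
        "sensitivity": true_pos / with_disease,
        # Fraction of people with cancer that the test misses.
        "false_negative_rate": false_neg / with_disease,
        # Fraction of people without cancer that the test correctly clears.
        "specificity": true_neg / without_disease,
    }


# Hypothetical screening of 1,000 people, 100 of whom have cancer.
rates = screening_rates(true_pos=60, false_neg=40, true_neg=855, false_pos=45)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

With these made-up numbers the test misses 40 of 100 cancers, a false-negative rate of 0.40, even though it correctly clears 95 percent of the people who do not have the disease.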

Figure 7. This machine is preparing to do a proteomic pattern analysis to identify specific cancers so that an accurate cancer prognosis can be made.

Credit: Dorie Hightower, NCI, NIH

The National Cancer Institute has developed programs to improve the detection and treatment of cancer. The Clinical Proteomic Technologies for Cancer and the Early Detection Research Network are efforts to identify protein signatures specific to different types of cancers. The Biomedical Proteomics Program is designed to identify protein signatures and design effective therapies for cancer patients.

Section summary

Genome mapping is similar to solving a big, complicated puzzle with pieces of information coming from laboratories all over the world. Genetic maps provide an outline for the location of genes within a genome, and they estimate the distance between genes and genetic markers on the basis of the recombination frequency during meiosis. Physical maps provide detailed information about the physical distance between the genes. The most detailed information is available through sequence mapping. Information from all mapping and sequencing sources is combined to study an entire genome.

Whole genome sequencing is the latest available resource to treat genetic diseases. Some doctors are using whole genome sequencing to save lives. Genomics has many industrial applications including biofuel development, agriculture, pharmaceuticals, and pollution control.

Imagination is the only barrier to the applicability of genomics. Genomics is being applied to most fields of biology; it can be used for personalized medicine, prediction of disease risks at an individual level, the study of drug interactions before clinical trials are conducted, and the study of microorganisms in the environment as opposed to the laboratory. It is also being applied to the generation of new biofuels, genealogical assessment using mitochondria, advances in forensic science, and improvements in agriculture.

Proteomics is the study of the entire set of proteins expressed by a given type of cell under certain environmental conditions. In a multicellular organism, different cell types will have different proteomes, and these will vary with changes in the environment. Unlike a genome, a proteome is dynamic and under constant flux, which makes it more complicated and more useful than the knowledge of genomes alone.