Biology LibreTexts

3: Isolating and analyzing genes

Recombinant DNA, Polymerase Chain Reaction and Applications to Eukaryotic Gene Structure and Function

The first two chapters covered many important aspects of genes, such as how they function in inheritance, how they code for protein (in general terms) and their chemical nature. All this was learned without having a single gene purified. A full understanding of a gene, or the entire set of genes in a genome, requires that they be isolated and then studied intensively. Once a gene is “in hand”, in principle one can determine both its biochemical structure and its function(s) in an organism. One of the goals of biochemistry and molecular genetics is to assign particular functions to individual or composite structures. This chapter covers some of the techniques commonly used to isolate genes and illustrates some of the analyses that can be done on isolated genes.

Methods to purify some abundant proteins were developed early in the 20th century, and some of the experiments on the fine structure of the gene (colinearity of gene and protein for trpA and tryptophan synthase) used microbial genetics and protein sequencing. However, methods to isolate genes were not developed until the 1960’s, and they were applicable to only a few genes.

All this changed in the late 1970’s with the development of recombinant DNA technology, or molecular cloning. This technique enabled researchers to isolate any gene from any organism from which one could isolate intact DNA (or RNA). The full potential to provide access to all genes of organisms is now being realized as full genomes are sequenced. One of the by-products of the intense investigation of individual DNA molecules after the advent of recombinant DNA was a procedure to isolate any DNA for which one knows the sequence. This technique, called the polymerase chain reaction (PCR), is far easier than traditional molecular cloning methods, and it has become a staple of many laboratories in the life sciences. After covering the basic techniques in recombinant DNA technology and PCR, their application to studies of eukaryotic gene structure and function will be discussed.

Like many advances in molecular genetics, recombinant DNA technology has its roots in bacterial genetics.

Transducing phage 

The first genes isolated were bacterial genes that could be picked up by bacteriophage. By isolating these hybrid bacteriophage, the DNA for the bacterial gene could be recovered in a highly enriched form. This is the basic principle behind recombinant DNA technology.

Some bacteriophage will integrate into a bacterial chromosome and reside in a dormant state (Fig. 3.1). The integrated phage DNA is called a prophage, and the bacterium is now a lysogen. Phage that do this are lysogenic. Induction of the lysogen will result in excision of the prophage and multiplication to produce many progeny, i.e. the phage enters a lytic phase in which the bacteria are broken open and destroyed. The nomenclature is descriptive. The bacteria carrying the prophage show no obvious signs of the phage (except immunity to superinfection with the same phage, covered later in Part Four), but when induced (e.g. by stress or UV radiation) they will generate a lytic state, hence they are called lysogens. Induced lysogens make phage from the prophage that was integrated. Phage that always multiply when they infect a cell are called lytic.

Excision of a prophage from a lysogen is not always precise. Usually only the phage DNA is cut out of the bacterial chromosome, but occasionally some adjacent host DNA is included with the excised phage DNA and encapsidated in the progeny. These transducing phage are usually biologically inactive because the piece of the bacterial chromosome replaces part of the phage chromosome; they can be propagated in the presence of helper phage that provide the missing genes when co-infected into the same bacteria. When DNA from the transducing phage is inserted into the newly infected cell, the bacterial genes can recombine into the host chromosome, thereby bringing in new alleles or even new genes and genetically altering the infected cell. This process is called transduction.

 

Figure 3.1. Transfer of bacterial genes by transduction: A lac+ transducing phage can convert a lac‑ strain to lac+ by infection (and subsequent crossing over).

Note that the transducing phage are carrying one or a small number of bacterial genes. This is a way of isolating the genes. The bacterial gene in the transducing phage has been separated from the other 4000 bacterial genes (in E. coli). By isolating large numbers of the transducing phage, the phage DNA, including the bacterial genes, can be obtained in large quantities for biochemical investigation. One can isolate µg or mg quantities of a single DNA molecule, which allows for precise structural determination and detailed investigation.

A generalized transducing phage can integrate at many different locations on the bacterial chromosome. Imprecise excision from any of those locations generates a particular transducing phage, carrying a short section of the bacterial genome adjacent to the integration site. Thus a generalized transducing phage such as P1 can pick up many different parts of the E. coli genome.

A specialized transducing phage integrates into only one or very few sites in the host genome. Hence it can carry only a few specific bacterial genes, e.g., λ lac (Fig. 3.2).

 

 

Figure 3.2. An example of a λ transducing phage carrying part of the lac operon.

 

This process of isolating a particular bacterial gene on a transducing phage is mimicked in recombinant DNA technology, in which a gene or genome fragment from any organism is isolated on a recombinant phage or plasmid.

 

 

 

Overview of Recombinant DNA Technology

Recombinant DNA technology utilizes the power of microbiological selection and screening procedures to allow investigators to isolate a gene that represents as little as 1 part in a million of the genetic material in an organism. The DNA from the organism of interest is divided into small pieces that are then placed into individual cells (usually bacterial). These can then be separated as individual colonies on plates, and they can be screened rapidly to find the gene of interest. This process is called molecular cloning.

Joining DNA in vitro to form recombinant molecules

Restriction endonucleases cut at defined sequences of (usually) 4 or 6 bp. This allows the DNA of interest to be cut at specific locations. The physiological function of restriction endonucleases is to serve as part of a system that protects bacteria from invasion by viruses or other organisms. (See Chapter 7.)

Table 3.1. List of restriction endonucleases and their cleavage sites. A ' means that the nuclease cuts between these 2 nucleotides to generate a 3' hydroxyl and a 5' phosphate.

Enzyme    Site        Enzyme    Site
AluI      AG'CT       NotI      GC'GGCCGC
BamHI     G'GATCC     PstI      CTGCA'G
BglII     A'GATCT     PvuII     CAG'CTG
EcoRI     G'AATTC     SalI      G'TCGAC
HaeIII    GG'CC       Sau3AI    'GATC
HhaI      GCG'C       SmaI      CCC'GGG
HincII    GTY'RAC     SpeI      A'CTAGT
HindIII   A'AGCTT     TaqI      T'CGA
HinfI     G'ANTC      XbaI      T'CTAGA
HpaII     C'CGG       XhoI      C'TCGAG
KpnI      GGTAC'C     XmaI      C'CCGGG
MboI      'GATC

 

N = A,G,C or T

R = A or G

Y = C or T

S = G or C

W = A or T
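
As a rough illustration of how the cleavage sites in Table 3.1 and the ambiguity codes above can be applied, the following Python sketch searches a sequence for a recognition site and cuts the top strand at the marked position. The function names and the test sequence are hypothetical; only one strand of a linear molecule is modeled, and only a few of the enzymes from the table are included.

```python
import re

# Ambiguity codes from the legend above, written as regex character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "Y": "[CT]", "S": "[GC]", "W": "[AT]"}

# A few entries copied from Table 3.1; the apostrophe marks the cut position.
ENZYMES = {"EcoRI": "G'AATTC", "BamHI": "G'GATCC", "HaeIII": "GG'CC",
           "HincII": "GTY'RAC", "PstI": "CTGCA'G"}

def find_cut_positions(seq, site_with_cut):
    """Return 0-based indices of the first base after each top-strand cut."""
    offset = site_with_cut.index("'")
    site = site_with_cut.replace("'", "")
    pattern = "".join(IUPAC[base] for base in site)
    # A lookahead is used so that overlapping sites are also reported.
    return [m.start() + offset
            for m in re.finditer(f"(?=({pattern}))", seq.upper())]

def digest(seq, site_with_cut):
    """Split a linear sequence into top-strand fragments at each cut."""
    bounds = [0] + find_cut_positions(seq, site_with_cut) + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

dna = "TTGAATTCAAGTCAACGGCCTTGAATTCAA"  # hypothetical test sequence
print(digest(dna, ENZYMES["EcoRI"]))
```

With this test sequence, EcoRI cuts twice and yields three fragments, whereas HincII and HaeIII each cut once.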

 

a.      Sticky ends

 (1)   Since the recognition sequences for restriction endonucleases are pseudopalindromes, an off-center cleavage in the recognition site will generate either a 5' overhang or a 3' overhang with self-complementary (or "sticky") ends.

 

 e.g. 5' overhang:   EcoRI   G'AATTC
                     BamHI   G'GATCC

      3' overhang:   PstI    CTGCA'G

 

(2)    When the ends of the restriction fragments are complementary,

 

 e.g. for EcoRI      5'‑‑‑G   AATTC‑‑‑3'

   3'‑‑‑CTTAA  G‑‑‑5'

 

the ends can anneal to each other. Any two fragments, regardless of their origin (animal, plant, fungal, bacterial) can be joined in vitro to form recombinant molecules (Fig. 3.3).

 

Figure 3.3.

 

b.     Blunt ends

(1)    The restriction endonuclease cleaves in the center of the pseudopalindromic recognition site to generate blunt (or flush) ends.

(2)    E.g.  HaeIII   GG'CC
             HincII   GTY'RAC
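
The distinction between 5' overhangs, 3' overhangs and blunt ends follows directly from where the cut falls within the symmetric recognition site. The sketch below is one way to express that rule in Python; the function names are hypothetical, and it assumes an unambiguous, fully symmetric site written in the notation of Table 3.1 (sites containing N, R or Y would need extra handling).

```python
def end_type(site_with_cut):
    """Classify the ends left by a symmetric site written like G'AATTC.

    Because the site reads the same on both strands, the bottom strand is cut
    at the mirror-image position. The bases between the two cuts form the
    single-stranded overhang.
    """
    cut = site_with_cut.index("'")
    site = site_with_cut.replace("'", "")
    mirror = len(site) - cut
    if cut == mirror:
        return ("blunt", "")
    if cut < mirror:
        return ("5' overhang", site[cut:mirror])
    return ("3' overhang", site[mirror:cut])

def compatible(site_a, site_b):
    """True if ends made by the two enzymes can be joined directly:
    identical overhangs of the same polarity, or two blunt ends."""
    return end_type(site_a) == end_type(site_b)

print(end_type("G'AATTC"))                # EcoRI
print(end_type("CTGCA'G"))                # PstI
print(end_type("GG'CC"))                  # HaeIII
print(compatible("G'GATCC", "A'GATCT"))   # BamHI and BglII
```

Note that BamHI, BglII, Sau3AI and MboI all leave the same GATC overhang, so fragments made by any of them can anneal to each other even though the full recognition sites differ.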

 

T4 DNA ligase is used to tie together fragments of DNA (Fig. 3.4). Note that the annealed "sticky" ends of restriction fragments have nicks (usually 4 bp apart). Nicks are breaks in the phosphodiester backbone, but all nucleotides are present. Gaps, in contrast, are regions of one strand that are missing a string of nucleotides.

T4 DNA ligase uses ATP as the source of an adenylyl group that is attached to the 5' phosphate at the nick; the resulting AMP is a good leaving group when the adjacent 3' OH attacks. (See Chapter 5 on Replication.)

At high concentration of DNA ends and of ligase, the enzyme can also ligate together blunt‑ended DNA fragments. Thus any two blunt‑ended fragments can be ligated together. Note: Any fragment with a 5' overhang can be readily converted to a blunt‑ended molecule by fill‑in synthesis catalyzed by a DNA polymerase (often the Klenow fragment of DNA polymerase I). Then it can be ligated to another blunt‑ended fragment.

 

Figure 3.4

Linkers are short duplex oligonucleotides that contain a restriction endonuclease cleavage site. They can be ligated onto any blunt‑ended molecule, thereby generating a new restriction cleavage site on the ends of the molecule. Ligation of a linker on a restriction fragment followed by cleavage with the restriction endonuclease is one of several ways to generate an end that is easy to ligate to another DNA fragment.

Annealing of homopolymer tails is another way to join two different DNA molecules.

The enzyme terminal deoxynucleotidyl transferase will catalyze the addition of a string of nucleotides to the 3' end of a DNA fragment. Thus by incubating each DNA fragment with the appropriate dNTP and terminal deoxynucleotidyl transferase, one can add complementary homopolymers to the ends of the DNAs that one wants to combine. E.g., one can add a string of G's to the 3' ends of one fragment and a string of C's to the 3' ends of the other fragment. Now the two fragments will join together via the homopolymer tails.

 

 

 

Figure 3.5. Use of linkers (left) and homopolymer tails (right) to make recombinant DNA molecules.

Introduction of recombinant DNA into cell and replication: Vectors

Vectors used to move DNA between species, or from the lab bench into a living cell, must meet three requirements (Fig. 3.6).

  1. They must be autonomously replicating DNA molecules in the host cell. The most common vectors are designed for replicating in bacteria or yeast, but there are vectors for plants, animals and other species.
  2. They must contain a selectable marker so cells containing the recombinant DNA can be distinguished from those that do not. An example is drug resistance in bacteria.
  3. They must have an insertion site to accommodate foreign DNA. This is usually a unique restriction cleavage site in a nonessential region of the vector DNA. Later generation vectors have a set of about 15 or more unique restriction cleavage sites.

 

Figure 3.6. Summary of vectors for molecular cloning

Plasmid vectors

Plasmids are autonomously replicating circular DNA molecules found in bacteria. They have their own origin of replication, and they replicate independently of the origins on the "host" chromosome. Replication is usually dependent on host functions, such as DNA polymerases, but regulation of plasmid replication is distinct from that of the host chromosome. Plasmids, such as the sex-factor F, can be very large (94 kb), but others can be small (2-4 kb). Plasmids do not encode a function essential to the bacterium, which distinguishes them from chromosomes.

Plasmids can be present in a single copy, such as F, or in multiple copies, like those used as most cloning vectors, such as pBR322, pUC, and pBluescript.

In nature, plasmids carry some useful function, such as transfer (F) or antibiotic resistance. This is what keeps the plasmids in a population. In the absence of selection, plasmids are lost from bacteria.

The antibiotic resistance genes on plasmids are often carried within, or are derived from, transposons, a type of transposable element. These are DNA segments that are capable of "jumping" or moving to new locations (see Chapter 9).

 

A plasmid that was widely used in many recombinant DNA projects is pBR322 (Fig. 3.7). It replicates from an origin derived from a colicin-resistance plasmid (ColE1). This origin allows a fairly high copy number, about 100 copies of the plasmid per cell. Plasmid pBR322 carries two antibiotic resistance genes, each derived from different transposons. These transposons were initially found in R-factors, which are larger plasmids that confer antibiotic resistance.

 

 

Figure 3.7. Features of plasmid pBR322. The gene conferring resistance to ampicillin (ApR) can be interrupted by insertion of a DNA fragment into the PstI site, and the gene conferring resistance to tetracycline (TcR) can be interrupted by insertion of a DNA fragment into the BamHI site. Replication is controlled by the ColE1 origin.

Use of the TcR and ApR genes allows for easy screening for recombinants carrying inserts of foreign DNA. For instance, insertion of a restriction fragment in the BamHI site of the TcR gene inactivates that gene. One can still select for ApR colonies, and then screen to see which ones have lost TcR.

 

Question 3.1. What effects on drug resistance are seen when you use the EcoRI or PstI sites in pBR322 for inserting foreign DNA?

 

A generation of vectors developed after pBR322 is designed for even more efficient screening for recombinant plasmids, i.e. those that have foreign DNA inserted. The pUC plasmids (named for plasmid universal cloning) and plasmids derived from them use a rapid screen for inactivation of the β-galactosidase gene to identify recombinants (Fig. 3.8).

One can screen for production of functional β-galactosidase in a cell by using the chromogenic substrate X-gal (a halogenated indolyl β-galactoside). When cleaved by β-galactosidase, the halogenated indolyl compound is liberated and forms a blue precipitate. The pUC vector has the β-galactosidase gene {actually only part of it, but enough to form a functional enzyme with the rest of the gene that is encoded either on the E. coli chromosome or an F' factor}. When the vector is introduced into E. coli, the colonies are blue on plates containing X-gal.

The multiple cloning sites (unique restriction sites) are in the β-galactosidase gene (lacZ). When a restriction fragment is introduced into one or more of these sites, the β-galactosidase activity is lost by this insertional mutation. Thus cells containing recombinant plasmids form white (not blue) colonies on plates containing X-gal.

The replication origin is a modified ColE1 origin of replication that has been mutated to eliminate a negative control region. Hence the copy number is very high (several hundred or more plasmid molecules per cell), and one obtains a very high yield of plasmid DNA from cultures of transformed bacteria. The plasmid has ApR as a selectable marker.
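
The combination of selection (ApR) and screening (blue versus white) can be summarized as a small decision rule. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes that transformed cells are plated on medium containing both ampicillin and X-gal and that any insert in the multiple cloning sites fully inactivates β-galactosidase.

```python
def colony_on_amp_xgal(has_plasmid, has_insert=False):
    """Predict the colony phenotype on plates with ampicillin and X-gal."""
    if not has_plasmid:
        # No ApR gene, so the cell does not survive ampicillin selection.
        return "no colony"
    if has_insert:
        # The insert disrupts lacZ, so X-gal is not cleaved.
        return "white"
    # Intact lacZ fragment: functional enzyme cleaves X-gal.
    return "blue"

for plasmid, insert in [(False, False), (True, False), (True, True)]:
    print(plasmid, insert, colony_on_amp_xgal(plasmid, insert))
```

In practice the white colonies are the ones picked for further analysis, since they are the candidates for carrying recombinant plasmids.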


 

 

 

Figure 3.8. pUC-type vectors

 

Introduction of a recombinant DNA molecule into a host cell

 

Introduction into CaCl2 treated E. coli: transformation

E. coli does not have a natural system for taking up DNA, but when treated with CaCl2, the cells will take up the added DNA (Fig. 3.9). The recombinant vectors will give a new phenotype to the cells (usually drug resistance), so this process can be considered DNA-mediated transformation. An average efficiency is about $10^6$ transformants per µg of DNA, although some more elaborate transformation procedures can give up to about $10^8$ transformants per µg of DNA.
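
Transformation efficiencies are quoted per microgram of DNA, so the expected number of colonies scales with the amount of DNA actually used. The following Python sketch works through that arithmetic; the function names, the 10 ng input and the target of 200 colonies per plate are hypothetical values chosen for illustration, and the calculation assumes that the efficiency stays constant over the range of DNA used.

```python
def expected_transformants(efficiency_per_ug, dna_ng):
    """Colonies expected from dna_ng nanograms of DNA at a given efficiency."""
    return efficiency_per_ug * dna_ng / 1000.0

def fraction_to_plate(expected, target_colonies):
    """Fraction of the transformation mix to spread for about target_colonies."""
    return min(1.0, target_colonies / expected)

colonies = expected_transformants(1e6, 10)   # 10 ng at 10^6 per microgram
print(colonies)
print(fraction_to_plate(colonies, 200))
```

Plating only a fraction of the mix (or diluting it) is what keeps the colonies far enough apart to be picked individually.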

Figure 3.9. DNA-mediated transformation of E. coli.

Usually one will transform with a mixture of recombinant vector molecules, most of which carry a different restriction fragment. Each transformed E. coli cell will pick up only one plasmid molecule, so the complex mixture of plasmids in the ligation mix has been separated into a population of transformed bacteria (Fig. 3.9). The bacterial cells are then plated at a sufficiently low density that individual colonies can be identified. Each colony (or transformant) carries a single plasmid, so as one screens the colonies, one is actually screening through individual DNA molecules. A colony is a visible group of bacterial cells on a plate, all of which are derived from a single bacterial cell. A group of identical cells derived from a single cell is called a clone. Since each clone carries a single type of recombinant DNA molecule, the process is called molecular cloning.

Phage vectors for more efficient introduction of DNA into bacteria

Phage vectors such as those derived from bacteriophage λ can carry larger inserts and can be introduced into bacteria more efficiently. λ phage has a duplex DNA genome of about 50 kb. The internal 20 kb can be replaced with foreign DNA and still retain the lytic functions. Hence restriction fragments up to 20 kb can replace the λ sequences, allowing larger genomic DNA fragments to be cloned (Fig. 3.10).

Figure 3.10.  Lambda vectors for cloning.

Recombinant bacteriophage can be introduced into E. coli by infection. DNA that has the cohesive ends of λ can be packaged in vitro into infective phage particles. Being in a viral particle brings the efficiency of infection reliably over $10^8$ plaque forming units per µg of recombinant DNA.

Some other bacteriophage vectors for cloning are derived from the virus M13. One can obtain single-stranded DNA from M13 vectors and recombinants. M13 is a virus with a genome of single-stranded DNA. It has a nonessential region into which foreign genes can be inserted. It has been modified to carry a gene for β-galactosidase as a way to screen for recombinants. Introduction of recombinant M13 DNA into E. coli will lead to an infection of the host, and the progeny viral particles will contain single-stranded DNA. The replicative form is duplex, allowing one to cleave with restriction enzymes and insert foreign DNA.

Some vectors are hybrids between plasmids and single-strand phage; these are called phagemids. One example is pBluescript. Phagemids are plasmids (with the modified, high-copy number ColE1 origin) that also have an M13 origin of replication. Infection of transformed bacteria (containing the phagemid) with a helper virus (e.g. derived from M13) will cause the M13 origin to be activated, and progeny viruses carrying single-stranded copies of the phagemid can be obtained. Hence one can easily obtain either double- or single-stranded forms of these plasmids. {The "blue" comes from the blue-white screening for recombinants that can be done when the multiple cloning sites are in the β-galactosidase gene. The "script" refers to the ability to make RNA copies of either strand in vitro with phage RNA polymerases.}

Vectors designed to carry larger inserts

Fragments even larger than those carried in λ vectors are useful for studies of longer segments of chromosomes or whole genomes. Several vectors have been designed for cloning these very large fragments, 50 to 400 kb.

Cosmids are plasmids that have the cohesive ends of λ phage. They can be packaged in vitro into infective phage particles to give a more efficient delivery of the DNA into the cells. They can carry about 35 to 45 kb inserts (Fig. 3.6).

Yeast artificial chromosomes (YACs) are yeast vectors with centromeres and telomeres. They can carry about 200 kb or larger fragments (in principle up to 1000 kb = 1 Mb). Thus very large fragments of DNA can be cloned in yeast (Fig. 3.11). In practice, chimeric clones with fragments from different regions of the genome are obtained fairly often, and some of the inserts are unstable.

Figure 3.11

Vectors derived from bacteriophage P1 can carry fragments of about 100 kb. Fragments in a similar size range are also cloned into bacterial artificial chromosomes (BACs), which are derived from the F-factor (Fig. 3.12). These have a lower copy number (like F) but they are stable and relatively easy to work with in the laboratory. BACs have become one of the most frequently used vectors for large inserts in genome projects.

Figure 3.12.

Shuttle vectors for testing functions of isolated genes

Shuttle vectors can replicate in two different organisms, e.g. bacteria and yeast, or mammalian cells and bacteria. They have the appropriate origins of replication. Hence one can, e.g. clone a gene in bacteria, maybe modify it or mutate it in bacteria, and test its function by introducing it into yeast or animal cells.

 

 

Polymerase Chain Reaction, or PCR

 

The polymerase chain reaction, or PCR, is now one of the most commonly used methods for obtaining a particular segment of DNA or RNA. It is rapid and extremely sensitive. By amplifying a designated segment of DNA, it provides a means to isolate that particular DNA segment or gene. This method requires knowledge of the nucleotide sequence at the ends of the region that you wish to amplify. Once that is known, one can make large quantities of that region starting with minuscule amounts of material, such as the DNA within a single human hair. With the availability of almost complete or complete sequences of genomes from many species, the range of genes to which it can be applied is enormous. The applications of PCR are numerous, from diagnostics to forensics to isolation of genes to studies of their expression.

The power of PCR lies in the exponential increase in the amount of DNA that results from repeated cycles of DNA synthesis from primers that flank a given region, one primer designed to direct synthesis complementary to the top strand, the other designed to direct synthesis complementary to the bottom strand (Fig. 3.13). When this is done repeatedly, there is roughly a 2-fold increase in the amount of synthesized DNA in each cycle. Thus it is possible to generate a million-fold increase in the amount of DNA from the amplified region with a sufficient number of cycles. This exponential increase in abundance is similar to a chemical chain reaction, hence it is called the polymerase chain reaction.
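
To make the role of the two flanking primers concrete, the Python sketch below locates the segment that a primer pair would amplify from a known top-strand sequence. It is a simplified model under stated assumptions: the function names and sequences are hypothetical, primers must match exactly, and only the first matching site for each primer is considered.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def pcr_product(top_strand, forward_primer, reverse_primer):
    """Return the top-strand sequence of the amplified region, or None.

    forward_primer has the same sequence as the top strand (it anneals to the
    bottom strand). reverse_primer anneals to the top strand, so its reverse
    complement appears in the top strand downstream of the forward primer.
    """
    start = top_strand.find(forward_primer)
    if start == -1:
        return None
    downstream_site = reverse_complement(reverse_primer)
    end = top_strand.find(downstream_site, start + len(forward_primer))
    if end == -1:
        return None
    return top_strand[start:end + len(downstream_site)]

template = "AAAAATGCCGTATTTTTTTTGGCATCAGCCCC"   # hypothetical template
product = pcr_product(template, "ATGCCGTA", "CTGATGCC")
print(product)
```

The ends of the returned product are defined by the primers, which is why sequence outside the primer sites (the flanking A and C runs here) is absent from it.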

 

 

Figure 3.13. Polymerase Chain Reaction (PCR)

The events in the polymerase chain reaction are examined in more detail in Fig. 3.14. The several panels show what happens in each cycle. Each cycle consists of a denaturation step at a temperature higher than the melting temperature of the duplex DNA (e.g. 95 °C), then an annealing step at a temperature below the melting temperature for the primer-template (e.g. 55 °C), followed by extension of the primer by DNA polymerase using dNTPs provided in the reaction. This is done at the temperature optimum for the DNA polymerase (e.g. 70 °C for a thermostable polymerase). Thermocyclers are commercially available for carrying out many cycles quickly and reliably.

The template supplied for the reaction is the only one available in the first cycle, and it is still a major template in the second cycle. At the end of the second cycle, a product is made whose ends are defined by the primers. This is the desired product, and it serves as the major template for the remaining cycles. The initial template is still present and can be used, but it does not undergo the exponential expansion observed for the desired product.

If n is the number of cycles, the amount of desired product is approximately $2^{n-1} - 2$ times the amount of input DNA (between the primers). Thus in 21 cycles, one can achieve a million-fold increase in the amount of that DNA (assuming all cycles are completely efficient). A sample with 0.1 pg of the segment of DNA between the primers can be amplified to 0.1 µg in 21 cycles, in theory. In practice, roughly 25-35 cycles are done in many PCR assays.
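
The arithmetic behind the million-fold figure can be checked directly. The Python sketch below uses the approximation stated above, that the desired product is about 2^(n-1) - 2 times the input after n cycles; the function names are hypothetical, and every cycle is assumed to be completely efficient, which real reactions are not.

```python
def fold_amplification(n_cycles):
    """Approximate fold increase of the desired product: 2**(n-1) - 2."""
    return max(0, 2 ** (n_cycles - 1) - 2)

def cycles_needed(fold):
    """Smallest number of cycles whose fold amplification reaches `fold`."""
    n = 1
    while fold_amplification(n) < fold:
        n += 1
    return n

print(fold_amplification(21))                # fold increase after 21 cycles
print(cycles_needed(1_000_000))              # cycles for a million-fold increase
print(0.1 * fold_amplification(21) / 1e6)    # 0.1 pg input, result in micrograms
```

Because efficiency drops as primers and dNTPs are consumed, the extra cycles run in practice do not continue to double the product.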

The ease of doing PCR was greatly increased by the discovery of DNA polymerases that are stable at high temperatures. These have been isolated from bacteria that grow in hot springs, such as Thermus aquaticus, which is found in the hot springs of Yellowstone National Park. The Taq polymerase from this bacterium will retain activity even at the high temperatures needed for melting the templates, and it is active at a temperature between the melting and annealing temperatures. This particular polymerase is rather error-prone, and other thermostable polymerases have been discovered that are more accurate.

 

 

 

 

Figure 3.14. Steps in the polymerase chain reaction.

cDNA clones are copies of mRNAs

Construction of cDNA clones involves the synthesis of complementary DNA from mRNA and then inserting a duplex copy of that into a cloning vector, followed by transformation of bacteria (Fig. 3.15).

a.      First strand synthesis:

 

First, one anneals an oligo dT primer onto the 3' polyA tail of a population of mRNAs. Then reverse transcriptase will begin DNA synthesis at the primer, using dNTPs supplied in the reaction, and copy the mRNA into complementary DNA, abbreviated cDNA.

 

The mRNA is degraded by the RNase H activity associated with reverse transcriptase and by subsequent treatment with alkali. 

 

b.      Second strand synthesis:

 

For the primer to make the second strand of DNA (equivalent in sequence to the original mRNA), one can utilize a transient hairpin at the end of the cDNA. (The basis for its formation is not certain.) In other schemes, one generates a primer binding site and uses a primer directed to that site; one way to do this is by homopolymer tailing of the cDNA followed by use of a complementary primer. Random primers can also be used for second strand synthesis, although this precludes the generation of a full-length cDNA (i.e. a copy of the entire mRNA). However, it is rare to generate duplex copies of the entire mRNA by any means.

 

DNA polymerase (e.g. Klenow polymerase) is used to synthesize the second strand, complementary to the cDNA. The product is duplex cDNA.
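
The sequence relationships between the mRNA, the first strand and the second strand can be written out explicitly. This Python sketch is a simplification with hypothetical function names and a made-up mRNA: it assumes full-length copying and ignores the primer, hairpin and tailing details discussed above.

```python
def first_strand_cdna(mrna):
    """DNA complementary to the mRNA, written 5' to 3' (reverse complement)."""
    pairs = {"A": "T", "U": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(mrna))

def second_strand(cdna):
    """DNA complementary to the first strand, written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(cdna))

mrna = "AUGGCCUAAAAAAAA"          # hypothetical mRNA ending in a short polyA tail
cdna = first_strand_cdna(mrna)
print(cdna)                        # begins with the T's copied from the polyA tail
print(second_strand(cdna))         # same sequence as the mRNA, with T in place of U
```

The run of T's at the 5' end of the first strand corresponds to the oligo dT primer annealed to the polyA tail.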

 

If the hairpin was used to prime second strand synthesis, it must be opened by a single‑strand specific nuclease such as S1.

 

c.      Insertion of the duplex cDNA into a cloning vector:

 

One method is to use terminal deoxynucleotidyl transferase to add a homopolymer such as poly-dC to the ends of the duplex cDNA and a complementary homopolymer such as poly-dG to the vector.

 

An alternative approach is to use linkers; these can be employed such that a linker carrying a cleavage site for one restriction endonuclease is on the 5' end of the duplex cDNA and a linker carrying a cleavage site for a different restriction endonuclease is on the 3' end. (In this context, 5’ and 3’ refer to the nontemplate, or "top" strand.) This allows "forced" cloning into the vector, and one has initial information about orientation, based on proximity to one cleavage site or the other.

 

The cDNA and vector are joined at the ends, using DNA ligase, to form recombinant cDNA plasmids (or phage).

 

d.      The ligated cDNA plasmids are then transformed into E. coli. The resulting set of transformants is a library of cDNA clones.

 

 

Figure 3.15. Making cDNA clones

 

Screening methods for cDNA clones

 

a.      Brute force examination of individual cDNA plasmids.

 

If the mRNA is highly abundant in a given tissue, then many of the cDNA clones will be copies of that mRNA. One can examine DNA from individual clones and test for characteristic restriction cleavage patterns or a particular sequence. This was a common approach for screening cDNAs in the early days of recombinant DNA technology.

Starting in the mid-1990’s, cooperative efforts from corporations (such as Merck) and publicly funded genome centers (such as at Washington University) have generated the sequence of individual clones from large cDNA libraries from many tissues from human, mouse, and rat. Other consortia have sequenced cDNA libraries from other species. Each sequence is called an “expressed sequence tag” or EST. These are now a major source of partially or fully characterized cDNA clones. Hundreds of thousands of ESTs are available, and they contain at least part of the DNA sequence from many, if not most, human genes. The web site for NCBI (http://www.ncbi.nlm.nih.gov) is an excellent resource for examining the ESTs.

 

b.      Hybridization with a gene‑specific probe.

 

If the sequence of the desired cDNA is known, or if the sequence from homologs from related species is known, one can use synthetic oligonucleotides (or other source of the diagnostic sequence) as a radiolabeled hybridization probe to identify the cDNA of interest.

If the amino acid sequence has been determined for all or even just parts of the protein product of the gene of interest, then one can chemically synthesize oligonucleotides based on the genetic code for those amino acids. The oligonucleotides need to be at least 18 nucleotides long (so that they will anneal to specific sites in the genome), and because the genetic code is degenerate (more than one codon per amino acid; discussed in Part Two), they have to be degenerate as well. The oligonucleotides can be used directly as hybridization probes, although it is becoming more common to amplify the region between two oligonucleotides using the polymerase chain reaction, and to use that amplification product as a labeled probe.
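
Because an 18-nucleotide probe corresponds to six amino acids, one practical question is which stretch of six residues requires the smallest mixture of oligonucleotides. The Python sketch below counts the codons for each amino acid in the standard genetic code and finds the least degenerate window; the function names and the peptide are hypothetical, and the count treats every codon combination as a separate oligonucleotide.

```python
from math import prod

# Number of codons for each amino acid in the standard genetic code.
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

def degeneracy(peptide):
    """Number of distinct DNA sequences that could encode the peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide)

def least_degenerate_window(peptide, n_residues=6):
    """Return (start index, window, degeneracy) for the best window."""
    best = min(range(len(peptide) - n_residues + 1),
               key=lambda i: degeneracy(peptide[i:i + n_residues]))
    window = peptide[best:best + n_residues]
    return best, window, degeneracy(window)

print(least_degenerate_window("ALSMWKDMFGR"))   # hypothetical peptide
```

Windows rich in methionine and tryptophan (one codon each) are favored, while leucine, serine and arginine (six codons each) quickly inflate the size of the mixture.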

The process of hybridization screening is illustrated schematically in Fig. 3.16. The colonies of bacteria, each with a single cDNA plasmid, are transferred to a solid substrate (such as a nylon or nitrocellulose membrane) and lysed, and the released DNA is immobilized onto the membrane. Hybridization of this membrane (with the DNA attached) to a specific probe allows one to screen through thousands of colonies in a single experiment.

 

 

Figure 3.16 Hybridization Screening

 

 

c.      Express the cDNA, i.e. make the protein product encoded by the mRNA, and screen for that protein product (Fig. 3.17). This is often done in bacteria by constructing the clones in a vector that has an active E. coli promoter (for transcription) and efficient translation signals upstream from the site at which the cDNAs were inserted. The transformed bacterial cells will express the encoded protein, and one tries to identify it. One can also screen for expression in yeast, plant or mammalian cells. The expression vector has to contain gene-regulatory signals (such as promoters and enhancers, see Part Three) that allow expression of the desired gene in the appropriate cell.

 

 

 

Figure 3.17. Screening for an Expressed Gene Product

 

 

(1) One can use specific antisera to detect the desired colony expressing the gene of interest.

 

(2) One can use a labeled ligand that will bind to the protein expressed from the cDNA on the cell surface. For example, cDNAs for receptors can be expressed in an appropriate cell (usually mammalian cells in culture) and identified by the newly acquired ability to bind a labeled hormone (such as growth hormone or erythropoietin).

 

(3) One can screen by complementation of a known mutation in the host. E.g. a cDNA for the human homolog to yeast p34cdc2 was isolated by its ability to complement a yeast mutant that had lost the function of this key regulator of progress through the cell cycle.

 

(4) Expression cloning can be done in mammalian cells, as long as one can screen or select for a new function generated by the expression. Use of this method to isolate the receptor for the glycoprotein hormone erythropoietin is illustrated in Fig. 3.18.

 

 

Figure 3.18. Expression screening in eukaryotic cells.

 

 

d. Differential analysis

 

Often one is interested in finding all the genes (or their mRNAs) that are expressed uniquely in some differentiated or induced state of cells. Two classic examples are (i) identifying the genes whose products regulate the determination process that causes a multipotential mouse cell line (like 10T1/2 cells) to differentiate into muscle cells, and (ii) using the fact that the T-cell receptor is expressed only in T-lymphocytes, but not in their sister lineage B-lymphocytes, to help isolate cDNA clones for that mRNA. Both of these projects used subtractive hybridization to highly enrich for the cDNA clones of interest.

 

In this technique, the cDNA from the differentiating or induced cell of interest is hybridized to mRNA from a related cell line that has not undergone the key differentiation step. This allows one to remove mRNA-cDNA duplexes that contain the cDNAs for all the genes expressed in common between the two types of cells. The resulting single-stranded cDNAs are enriched for those involved in the process under study.
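The logic of the subtraction can be sketched as a set difference. This is only a conceptual illustration: the gene lists are hypothetical, and real hybridization is never complete, so the actual result is an enrichment and not a clean list.

```python
def subtract(induced_cdnas, uninduced_mrnas):
    """Return cDNAs from the induced cell that have no matching mRNA
    in the uninduced cell (i.e. those left single-stranded)."""
    common = set(uninduced_mrnas)
    return [gene for gene in induced_cdnas if gene not in common]

# Hypothetical expression lists, for illustration only.
muscle_cdnas = ["actin", "GAPDH", "MyoD", "tubulin"]
precursor_mrnas = ["actin", "GAPDH", "tubulin"]

enriched = subtract(muscle_cdnas, precursor_mrnas)
print(enriched)
```

The cDNAs shared by both cell types drop out as duplexes, and what remains is the candidate pool to clone and screen.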

 

The subtractive hybridization scheme used in isolation of the muscle determination gene MyoD is illustrated in Fig. 3.19.

 

A conceptually equivalent strategy, using PCR (see next section) rather than cDNA cloning, is differential display of PCR products from cells that differ by some process (e.g. differentiation, induction, growth arrest versus stimulation, etc.). In this technique, one uses several sets of PCR primers annealed to cDNA copied from the mRNA of the two types of cells that are being compared. The sets of primers are empirically designed to allow many regions of cDNA to be amplified. The amplification products are resolved (or displayed) on polyacrylamide gels, and the products specific to the cell type of interest are isolated and used to screen cDNA libraries. A related PCR-based technique is representational difference analysis.

 

 

 

 

Figure 3.19. Differential screening to find cDNAs of mRNAs expressed only in certain cell-types.

 

 

The advent of sequencing all or a very large number of genes from various organisms (e.g. E. coli, yeast, Drosophila, humans) has allowed the development of high-density microchip arrays of DNA from each gene. One can hybridize RNA from cells or tissues of interest, isolated under various metabolic conditions, to identify all (known) genes expressed. Even more useful are assays for genes whose expression changes during a shift in cell metabolism (cell cycle, heat shock, hormonal induction, etc.) or as a result of mutation of some other gene (e.g. a gene encoding a transcription factor of interest). This powerful new technology is being used more and more to examine global effects on gene expression.

 

For a description (and movie) of the Affymetrix GeneChip, go to

http://www.affymetrix.com/technology/index.html

 

 

 

Genomic DNA clones

 

Clones of genomic DNA, containing individual fragments of chromosomal DNA, are needed for many purposes. Some examples include:

 

§      to obtain detailed structures of genes,

§      to identify regulatory regions, i.e. DNA sequences needed for correct expression of the gene,

§      to map and analyze alterations to the genome, e.g. to isolate genes that when mutated cause a hereditary disease,

§      to direct alterations in the genome, e.g. by homologous recombination to replace a wild-type allele with a mutant one (to test function of the gene in mouse) or vice versa (to cure a hereditary disease, perhaps eventually in humans).

 

 

Construction of libraries of genomic DNA fragments in cloning vectors

 

Genomic DNA is digested with restriction enzymes (Fig. 3.20). The more frequently an enzyme cuts (the shorter the recognition sequence), the smaller the average size of DNA fragments. Some enzymes, such as NotI (8 bp recognition sequence), cut very infrequently and can be used to generate very large fragments. Alternatively, one can do a partial digest (not all sites are cleaved) with a particular enzyme and isolate the products that are in the desired size range (e.g. 20 kb). A particularly clever way to do this is to digest partially with Sau3AI or MboI (both cut at 'GATC) and ligate these fragments into a vector cut with BamHI (cuts at G'GATCC); the fragments and the vector then have the same sequence in the overhang (or sticky end). In this process one uses vectors that can accommodate large DNA fragments, such as λ phage vectors, cosmids, YACs or P1 vectors.
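The relationship between recognition-site length and average fragment size can be estimated with a short calculation. The sketch below assumes a random sequence with equal frequencies of the four bases, which is an idealization; real genomes deviate from it, so the numbers are rough guides only. The function name is illustrative.

```python
def expected_fragment_size(site_length, base_freq=0.25):
    """Average spacing (bp) between recognition sites, assuming a random
    sequence in which each base occurs with frequency base_freq."""
    return round(1 / base_freq ** site_length)

for enzyme, site_length in [("Sau3AI (GATC)", 4),
                            ("BamHI (GGATCC)", 6),
                            ("NotI (8 bp site)", 8)]:
    print(enzyme, expected_fragment_size(site_length), "bp")

# The Sau3AI/MboI site lies inside the BamHI site, which is why the
# overhangs produced by the two digests are compatible.
print("GATC" in "GGATCC")
```

Under this assumption a 4 bp site occurs about every 256 bp, a 6 bp site about every 4 kb, and an 8 bp site about every 65 kb, which is why a complete Sau3AI digest would be far too small for a genomic library and a partial digest is used instead.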

 

 

 

 

 

 

Figure 3.20. Construction of a library of genomic DNA

 

 

Screening methods for genomic DNA clones

 

One method is to use complementation of a mutation in the host to select or screen for the desired gene. This works just like the situation for cDNA clones described above, and it requires that the cloned fragments be expressed in the host cell.

 

Far more common is to screen by hybridization with gene‑specific probes (Fig. 3.21). Frequently the cDNA clone is found first, and the genomic clone is then isolated by hybridization screening (using the cDNA clone as a probe) against a library of genomic DNA fragments.

 

 

Figure 3.21. Screening a library of genomic DNA

 

 

 

 

Eukaryotic gene structure

 

Much can be learned about any gene after it has been isolated by recombinant DNA techniques. The structure of coding and noncoding regions, the DNA sequence, and more can be deduced. This is true for bacterial and viral genes, as well as eukaryotic cellular genes. The next sections of this chapter will focus on analysis of eukaryotic genes, showing the power of examining purified copies of genes.

 

Split genes and introns

 

Precursors to mRNA longer than mRNA

 

Initial indications of a complex structure to eukaryotic genes came from analysis of nuclear RNAs during the 1970’s. The precursors to messenger RNA, or pre-mRNAs, were found to be surprisingly long, considerably larger than the average mRNA size (Fig. 3.22).

 

Figure 3.22.

 

Denaturing sucrose gradients (with high concentration of formamide, e.g. >50%) separate RNAs on the basis of size. Analysis of nuclear RNA showed that the average size was much larger than the average size of cytoplasmic RNA.

 

Labeled RNA could be "chased" from the nucleus to the cytoplasm ‑ i.e. nuclear RNA was a precursor to mRNA and other cytoplasmic RNAs. Was the extra RNA at the ends? or in the middle of the pre‑mRNA?

 

More precisely, one could examine specific RNAs by hybridizing fractions from the denaturing sucrose gradients to labeled copies of, e.g., globin mRNA. The hybridizing RNA from the nucleus included a species of about 11S (as well as mature 8S message), whereas the hybridizing cytoplasmic RNA was about 8S. Thus the nuclear RNA encoding globin is larger than the cytoplasmic mRNA.

 

 

Visualization of mRNA-DNA heteroduplexes revealed extra sequences internal to the mRNA-coding segments

 

R-loops are hybrids between RNA and DNA that can be visualized in the EM, under conditions where DNA‑RNA duplexes are favored over DNA‑DNA duplexes (Fig. 3.23). For a simple gene structure, one sees a continuous RNA‑DNA duplex (smooth, slowly curving) and a displaced single strand of DNA (thinner, with many more turns and curves; single-stranded DNA is not as rigid as double-stranded nucleic acid, either duplex DNA or RNA-DNA).

 

Figure 3.23.

 

 

EM pictures of duplexes between purified adenovirus mRNAs and the genomic DNA showed extensions at both the 3' (poly A) and 5' ends, which are encoded elsewhere on the genome. All late mRNAs have the same sequence at the 5' end; this is derived from the tripartite leader. R‑loops between late mRNAs and adenovirus DNA fragments including the major late promoter showed duplexes with the leader segments, separated by loops of duplex DNA (Fig. 3.23, bottom panel). The RNA-DNA hybrids identify regions of DNA that encode RNA. The surprising result is that RNA-coding portions of a gene are separated by loops of duplex DNA in the R-loop analysis. Examples of R-loops in genes with introns are shown in Fig. 3.24.

These data showed that the adenovirus RNAs are encoded in different segments of the viral genome; i.e. the genes are split. The portion of a gene that encodes mRNA was termed an exon. The part of a gene that does not code for sequences in the mature mRNA is called an intron. These observations led to the Nobel Prize for Phil Sharp and Rich Roberts. Louise Chow and Sue Berget were also key players in the discovery of introns.

 

 

Figure 3.24. R-loops between clones of rabbit beta-like globin genes (now called HBE and HBG) and mRNA from rabbit embryonic erythroid cells. A photograph from the electron microscope is shown at the top of each panel, and an interpretive drawing is included below it. The displaced nontemplate strand of DNA forms partial or complete duplexes with the template strand in the large intron. A small intron is also visible in panel C. Panel G shows the two genes together on one large clone.

 

Interruptions in cellular genes were discovered subsequently, in the late 1970's, in globin genes, immunoglobulin genes and others. We now realize that most genes in complex eukaryotes are split by multiple introns.

Exons are more conserved than introns (in most cases), since alterations in protein-coding regions that alter or decrease function are selected against, whereas many sequences in introns can be altered without affecting the function of the gene product. Important sequences in introns (such as splice junctions, the branch point, and occasionally enhancers) are covered in some detail in Part Three.

 

 

Differences in restriction maps between cDNA and genomic clones reveal introns

 

Restriction maps based on copies of the mRNA (cDNA) were different from those in genomic DNA ‑ the genes were cleaved by some restriction endonucleases that the cDNAs were not, and some restriction sites were further apart in the genomic DNA. These observations were explained by the presence of intervening sequences or introns (Fig. 3.25).

 

 

Figure 3.25.

 

 

The experimental procedures to do this involve making a restriction map of the clones of genomic DNA, and then identifying the regions that encode mRNA by hybridization of labeled cDNA probes to the restriction digests. Cloned genomic DNA is digested with appropriate restriction endonucleases, separated by size on an agarose gel, and then transferred onto a nylon or nitrocellulose solid support. This Southern blot (see Chapter 2) is then hybridized with a labeled probe specific to the cDNA (composed only of exons). The pattern of labeled fragments on the resulting autoradiogram shows the fragments that contain exons. Alignment of these with the restriction map of the gene gives an approximation of the position of the exons.

The blot-hybridization approach can be combined with a PCR (polymerase chain reaction) analysis for higher resolution. Primers are synthesized that will anneal to adjacent exons. The difference in size of the PCR amplification product between genomic DNA and cDNA is the size of the intron. The PCR product can be cloned and sequenced for more detailed information, e.g. to precisely define the exon/intron junctions.
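The size comparison described above amounts to a simple subtraction. A minimal sketch follows; the product sizes are hypothetical values chosen for illustration, and it assumes the same primer pair is used on both templates with a single intron between the two exons.

```python
def intron_size(genomic_product_bp, cdna_product_bp):
    """Estimate intron length from PCR products made with the same primer
    pair (one primer in each of two adjacent exons)."""
    if genomic_product_bp < cdna_product_bp:
        raise ValueError("genomic product should not be shorter than the cDNA product")
    return genomic_product_bp - cdna_product_bp

# Hypothetical gel measurements, for illustration only.
estimated = intron_size(genomic_product_bp=850, cdna_product_bp=250)
print(estimated, "bp of intron between the two exons")
```

Sizes read from a gel are approximate, so sequencing the cloned product is still needed to define the junctions exactly.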

 

Subsequently, the nucleotide sequence of exonic regions and preferably the entire gene is determined. The presence of introns was confirmed and their locations defined precisely in DNA sequences of isolated clones of the genes.

 

Types of exons

 

Eukaryotic genes are a combination of introns and exons. However, not all exons do the same thing (Fig. 3.26). In particular, the protein-coding regions of genes are a subset of the sequences in exons. Exons include both the untranslated regions and the protein-coding, translated regions. Introns are the segments of genes that are present in the primary transcript (or precursor RNA) but are removed by splicing in the production of mature RNA. Methods used to detect coding regions will not find all exons.

 

 

Figure 3.26. Types of exons

 

Multiple, large introns can make some eukaryotic genes very large

 

Eukaryotic genes can be split into many (>60), sometimes very small exons (e.g. <60 bp, coding for <20 amino acids), separated by very large introns (as large as >100 kb), resulting in some enormous genes (>500 kb). E.g. the DMD gene (which when mutated can cause Duchenne's muscular dystrophy) is almost 1 Mb, about 1/4 the size of the E. coli chromosome!

The average size of genes from more complex organisms is considerably larger than that of genes from simpler ones, but the average size of mRNA is about the same, reflecting the presence of more and larger introns in the more complex organisms.

 

tRNA and rRNA genes also contain introns

 

Finding exons in long genomic sequences using computer programs

 

Far more exons and introns have been discovered (or more accurately, predicted) through the analysis of genomic DNA sequences than could ever be discovered by direct experimentation. The different types of exons, the enormous length of introns, and other factors have complicated the task of finding reliable diagnostic signatures for exons in genomic sequences. However, considerable progress has been made and continues in current research. Some of the commonly used approaches are summarized in Fig. 3.27.
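One simple signal that can point to a protein-coding exon is a stretch of sequence with no stop codons in a given reading frame. The sketch below is an illustration of that idea only, not the method summarized in Fig. 3.27: it scans the three forward reading frames, ignores the reverse strand and splice signals, and its function name and threshold are assumptions. As noted earlier, untranslated exons would not be found this way.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq, min_codons=3):
    """Return (frame, start, end) for stop-free stretches on the forward
    strand. Positions are 0-based and end is exclusive."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i))
                start = i + 3  # resume scanning after the stop codon
        # Stretch running to the end of the sequence in this frame.
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            found.append((frame, start, end))
    return found

# Short made-up sequence: ATG GCC AAA TAG in frame 0.
orfs = open_reading_frames("ATGGCCAAATAGCC")
print(orfs)
```

In practice the minimum length would be set much higher, and this signal would be combined with others, since short stop-free stretches occur by chance in every frame.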

 

 

Figure 3.27. Introns in the β-globin gene can be reliably identified computat

 

 

Introns are removed by splicing RNA precursors

 

 

Figure 3.28. Introns are removed from pre-mRNA to generate mRNA.

 

 

 

Alternative splicing generates more than one polypeptide from the same gene

 

 

Figure 3.29.

 

Some segments of RNA may be included in some mature mRNAs (as exons) but excluded from other spliced products. The alternative products may be made in different tissues or at different developmental stages ‑ i.e. alternative splicing can be regulated.
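The idea that one precursor can yield different mature mRNAs can be sketched by joining different subsets of exons. The exon names and sequences below are invented for illustration and do not correspond to any real gene.

```python
def splice(exons, included):
    """Join the named exons, in the order given, into a mature mRNA."""
    return "".join(exons[name] for name in included)

# Hypothetical exon sequences, for illustration only.
exons = {"E1": "AUGGCU", "E2": "GAACUG", "E3": "UUCUAA"}

full_length = splice(exons, ["E1", "E2", "E3"])
exon2_skipped = splice(exons, ["E1", "E3"])
print(full_length)
print(exon2_skipped)
```

Here the second product lacks the segment E2, so the two mRNAs would encode polypeptides that differ in that region.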

 

 

 

 

 

Split genes may enhance the rate of evolution

Many exons encode a unit very close to a protein domain, e.g. the exons of leghemoglobin, or the variable and constant regions of immunoglobulins, or domains (e.g. "kringle") in EGF precursor that are also found in part of the LDL receptor. The exon organization tends to be well conserved in highly divergent species. Introns tend to occur between those portions of genes that encode structural domains of proteins.

Duplication of the exons encoding structural domains and subsequent recombination can lead to more rapid evolution of a new protein, essentially using the parts from earlier evolved genes. This is analogous to building a house from prefabricated parts (preassembled walls, roof joists, etc.), as opposed to one nail and one board at a time.

However, the relationship between exons and structural domains of proteins is not exact, and some exon‑intron boundaries vary (a little) in genes for different species. A different model holds that the introns are transposable elements (some certainly are ‑ see later). They can insert anywhere in a gene, but they are least disruptive at domain boundaries, and these latter insertions are more likely to be fixed in a population than insertions into the middle of a region encoding a domain. So the result after long years of evolution is that introns tend to lie between the regions coding for domains, but in this model the gene was originally intact, not assembled from discrete exons.

Multigene families and gene clusters

Many eukaryotic genes are found in multiple copies. Some of them are developmentally regulated, such as HOX gene clusters and globin gene clusters.

 

 

Figure 3.30.

 

A multigene family contains multiple genes of similar sequence encoding similar proteins; e.g. globin genes (Fig. 3.30). Globin genes are expressed at different times of development. The order of developmental expression is the same as their order along the chromosome, e.g. the ε-globin gene is expressed in early embryonic red cells, the γ-globin gene is expressed at a high level in fetal red cells, and the β-globin gene is expressed in red cells after birth. As we will see later, this correlates with their distance from a dominant control element at the 5' end of the cluster, the Locus Control Region.

 

The order of HOX genes is also aligned with their spatial expression in the embryo. This is another example of alignment between chromosomal position and regulation of expression.

 

Other multi‑gene families include those encoding histones, immunoglobulins, actins, cyclins, cyclin‑dependent protein kinases, and rRNAs. Some of these families are linked in gene clusters, but others are dispersed around the genome. Having multiple copies of genes may be more the rule than the exception in eukaryotic genomes.

 

Experimental techniques that reveal multigene families include the following.

 

Purification and analysis of a particular kind of protein, e.g. hemoglobins, immunoglobulins, and many enzymes, may reveal heterogeneity. Further purification (via chromatography and electrophoresis) and sequencing can show that the observed heterogeneity is a result of related but not identical proteins, and one deduces that these similar proteins are encoded by multiple genes with similar sequences, i.e. a multigene family.

 

Analysis of the clones obtained by screening a library of cloned genomic DNA may reveal multiple related sequences, each with a distinctive restriction map. In many cases these are clones of different, related genes that comprise a multigene family (Fig. 3.31).

 

Southern blot‑hybridization of restriction‑cleaved genomic DNA can reveal multiple copies of genes, simply as multiple bands on the hybridized blot. Although the number of fragments generated from total genomic DNA is too many to resolve on a gel, after transfer to a membrane, particular fragments can be visualized by hybridization with a specific probe. The number of hybridizing fragments is roughly correlated with the number of copies of related genes. Some genes are cleaved by the restriction enzyme, producing multiple bands, but some fragments can have multiple genes. A true measure of the number of related genes comes from more detailed restriction mapping or sequencing.
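The reasoning behind counting bands can be sketched with an in silico digest. In this illustration the "genome" is a short made-up string, the probe must match exactly (real hybridization tolerates some mismatch), and the function names are assumptions. The enzyme modeled is EcoRI, which cuts G'AATTC.

```python
def digest(seq, site, cut_offset):
    """Cut seq at every occurrence of site; cut_offset is the position of
    the cut within the site (1 for EcoRI, G'AATTC)."""
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def hybridizing_fragment_sizes(seq, site, cut_offset, probe):
    """Sizes of the fragments that contain the probe sequence (the 'bands')."""
    return sorted(len(f) for f in digest(seq, site, cut_offset) if probe in f)

probe = "ACGTACGT"  # made-up gene-specific sequence
genome = "TTTT" + probe + "TT" + "GAATTC" + "CC" + probe + "CCCC" + "GAATTC" + "GGGG"

bands = hybridizing_fragment_sizes(genome, "GAATTC", 1, probe)
print(bands)
```

Two hybridizing fragments appear because the probe sequence occurs twice, on either side of a restriction site. As the text notes, a site inside a gene or two genes on one fragment would make the band count differ from the gene count.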

 

Figure 3.31. Blot-hybridization analysis of clones of genomic DNA and genomic DNA showing that multiple copies of genes are present. A set of overlapping clones containing rabbit genomic DNA were digested and run on an agarose gel (panel A), blotted onto a membrane and hybridized with a radiolabeled probe that detected embryonic hemoglobin genes, and exposed to X-ray film. The resulting autoradiogram is shown in panel B. Panel C shows the results of a blot-hybridization analysis of rabbit total genomic DNA, using the same probe. Many of the same bands are seen as in the cloned DNA, confirming the existence of multiple hybridizing fragments. Mapping the fragments showed that they represented separate genes.

Keeping multigene families homogeneous

Sometimes multiple copies of genes are maintained as virtually identical over the course of evolution: e.g. rRNA genes, histone genes, α‑globin genes (in primates). In these cases, the multiple copies are coevolving (concerted evolution).

Species    Gene copies    Sequence differences

H:     A | A | A      among human genes: 1%
                          between human and chimp: 5%
Chimp:     A | A | A      among chimp genes: 1%
                          between chimp and monkey: 10%
Monkey:    A | A | A      among monkey genes: 1%

 

Since all three primates have 3 A genes, we infer that the common ancestor had 3 genes (the duplications preceded the speciation events). If in the time since human and chimp diverged, the A genes have diverged 5%, why haven't the A genes in human (e.g.) also diverged 5% from each other? They have been apart even longer than the human and chimp chromosomes carrying them! The A genes within a species are "talking to each other", or co‑evolving or evolving in concert.
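The percentages in this comparison are simply the fraction of aligned positions that differ. A minimal sketch of that calculation follows; the 20-nucleotide sequences are invented so that one difference corresponds to 5%, and the function assumes the sequences are already aligned without gaps.

```python
def percent_difference(seq_a, seq_b):
    """Percentage of positions that differ between two aligned sequences
    of equal length (no gaps)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100 * mismatches / len(seq_a)

# Made-up sequences, for illustration only.
human_copy_1 = "ACGTACGTACGTACGTACGT"
human_copy_2 = "ACGTACGTACGTACGTACGT"
chimp_copy_1 = "ACGTACGTACTTACGTACGT"

within_human = percent_difference(human_copy_1, human_copy_2)
human_vs_chimp = percent_difference(human_copy_1, chimp_copy_1)
print(within_human, human_vs_chimp)
```

Concerted evolution shows up as the pattern in the comparison above: differences between copies within a species are smaller than differences between species, even though the copies duplicated before the species diverged.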

Sequence homogeneity in a multigene family can arise because of recent gene amplification (Fig. 3.32, part 1). In this case the genes have not been separate from each other long enough to accumulate variation in their sequences. Other multigene families have existed for a long time, but maintain sequence homogeneity despite ample opportunity for divergence. Two mechanisms have been seen that maintain similarity. The first is multiple rounds of unequal crossing over. As illustrated in Fig. 3.32, part 2, the expansions and contractions of repeated genes can result in a new variant predominating in the gene cluster. The other method for maintaining homogeneity is gene conversion between homologs. When a new mutation arises, it can be removed by conversion with the unmutated allele, or the mutation can be passed on to the other allele. Either way, the sequences of the two alleles become the same.

Sometimes the products of the gene duplications, or duplicative transpositions, accumulate mutations so they are no longer functional. These remnants of once‑active genes are called pseudogenes.

 

Figure 3.32.

Functional analysis of isolated genes

 

Gene expression

"Northern blots" or RNA blot‑hybridization

In the reverse of Southern blot‑hybridizations, one can separate RNAs by size on a denaturing agarose gel, and transfer them to nylon or other appropriate solid support. Labeled DNA can then be used to visualize the corresponding mRNA (Fig. 3.33). Ed Southern initially used labeled rRNA to find the complementary regions in immobilized, digested DNA, so this "reverse" of Southern blot-hybridizations, i.e. using a labeled DNA probe to hybridize to immobilized RNA, is often referred to as "Northern" blot‑hybridizations.

One can hybridize a labeled DNA clone to a panel of RNA samples from a wide variety of tissues to determine in what tissues a particular cloned gene is expressed (top panel of Fig. 3.33). More precisely, this technique reveals the tissues in which the gene is transcribed into stable RNA. The results allow one to determine the tissue specificity of expression, e.g. a gene may only be expressed in liver, or only in erythroid cells (e.g. the β-globin gene). This helps give some general idea of the possible function of the gene, since it should reflect the function of that tissue. Other genes are expressed in almost all cells or tissue types (such as GAPDH); these are referred to as housekeeping genes. They are involved in functions common to all cells, such as basic energy metabolism, cell structure, etc. The relative amounts of RNA in the different lanes can be directly compared to see, e.g., which tissues express the gene most abundantly.

One can hybridize a labeled DNA clone to a panel of RNA samples from progressive stages of development to determine the developmental stage at which a particular cloned gene is expressed as RNA (bottom panel of Fig. 3.33). For instance, a gene product may be required for determination decisions early in development, and only be expressed in early embryos.

Once the DNA sequence of the gene of interest is known, and its intron-exon structure determined, highly sensitive RT-PCR assays can be designed (Fig. 3.34). The RNA from the cell or tissue of interest is copied into cDNA using reverse transcriptase and dNTPs, and then primers are annealed for PCR. Ideally, the primers are in different exons so that the product of amplifying the cDNA will be smaller than the product of amplifying the genomic DNA.

 

Figure 3.33.

 

 

 

Figure 3.34. Reverse transcription-PCR (RT-PCR) assay for mRNA.

 

In situ hybridizations / immunochemistry

In complementary approaches, the labeled DNA can be hybridized in situ to thin sections of a tissue or embryo or other specimen, and the resulting pattern of grains visualized along the specimen in the microscope (Fig. 3.35). Also, antibody probes against the protein product can be used to localize it in the specimen. This gives a more detailed picture of the pattern of expression, with resolution to the particular cells that are expressing the gene. The RNA blot-hybridization techniques described above look at the RNA in all the cells from a tissue, and do not provide the level of resolution to single cells.

 

 

Figure 3.35.

Microarrays

As large numbers of sequenced mRNAs and genes become available, technology has been developed to look at expression of very large numbers of genes simultaneously. DNA sequences specific for each gene in a bacterium or yeast can be spotted in a high-density array with 400 or more spots. Some technologies use many more spots, with multiple sequences per gene. Microarrays, or “gene chips”, are available for many species, some with tens of thousands of different sequences or “probes.” RNA from different tissues can be converted to cDNA with a distinctive fluorescent label, and then hybridized to the gene chip. Differences in level of expression can be measured. Thus global changes in gene expression can now be measured.
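Differences in expression between two samples are commonly summarized as a ratio of signal intensities per gene, often on a log2 scale. The sketch below is one simple way to do this and is not taken from the source; the intensity values are invented, HSP70 stands in for a hypothetical heat-inducible gene, and the pseudocount (added to avoid division by zero) is an assumption.

```python
import math

def log2_ratios(treated, control, pseudocount=1.0):
    """Per-gene log2 ratio of treated to control signal, for genes present
    in both samples. The pseudocount avoids division by zero."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in treated
        if gene in control
    }

# Invented fluorescence intensities, for illustration only.
heat_shocked = {"HSP70": 799.0, "GAPDH": 499.0}
untreated = {"HSP70": 99.0, "GAPDH": 499.0}

ratios = log2_ratios(heat_shocked, untreated)
print(ratios)
```

A value of 0 means no change, and each unit is a two-fold difference, so the hypothetical heat-inducible gene here is up about eight-fold while the housekeeping gene is unchanged.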

 

 

Figure 3.36. Hybridization of RNA to high density microarrays of gene sequences, or “gene chips”.

Database searches

An increasingly powerful approach is to determine candidates for the function of your gene by searching the databases with the sequence, looking for matches to known proteins and genes. These matches provide clues as to protein function.
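The idea of ranking database entries by similarity to a query can be sketched with a toy comparison of shared short words (k-mers). This is only an illustration: real search tools such as BLAST use more sophisticated alignment and scoring, and the entry names and protein fragments below are invented.

```python
def kmers(seq, k=3):
    """Set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def rank_matches(query, database, k=3):
    """Rank database entries by the fraction of query k-mers they share.
    The query must be at least k residues long."""
    query_words = kmers(query, k)
    scores = {
        name: len(query_words & kmers(seq, k)) / len(query_words)
        for name, seq in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented database of protein fragments, for illustration only.
database = {
    "kinase_like": "MGSNKSKPKDASQRRRS",
    "globin_like": "MVHLTPEEKSAVTALWGKV",
}

hits = rank_matches("VHLTPEEKSAVTAL", database)
print(hits)
```

The best-scoring entry suggests a candidate function for the query, which still has to be tested experimentally.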

The power of this approach increases as the number of sequences deposited in databases expands. Sequences of many genes are already known. The sequenced genes from more complex organisms, such as plants and animals, tend to be the ones more easily isolated using the techniques discussed in recombinant DNA technology. However, the sequences of genes expressed at a low level are starting to accumulate in the databases.

One remarkable advance in the past few years is the increasing number of organisms whose entire genome has been sequenced. About 10 bacterial genomes have been sequenced, and the number increases every few months. Genomic sequences for two eukaryotes are now available. That of the yeast Saccharomyces cerevisiae has been known for a few years, and the genome of the nematode Caenorhabditis elegans was completed in 1998. These sequences are being analyzed intensively, and a very high fraction of all the genes in each genome can be reliably detected using computational tools (one part of bioinformatics). It has become clear that many of the enzymes used in basic metabolism, regulation of the cell cycle, cellular signaling cascades, etc. are highly conserved across a broad phylogenetic spectrum. Thus it is common to find significant sequence matches in the genomes of model organisms when they are queried by the sequence of a previously unknown gene, e.g. from humans or mouse. The function already established for that gene in worms or yeast is a highly reliable guide to the function of the homologous gene in humans. The worm C. elegans is multicellular, and the fate of each of its cells during development has been mapped. Thus it is possible that many functions involved in cellular interactions and cell-cell signaling will be conserved in this species, thus expanding the list of potential targets for a search in the databases.

This potential is being realized as working draft sequences of the human and mouse genomes are being analyzed. Within these data is a good approximation of sequences from virtually all human and mouse genes. Random clones have been partially sequenced from libraries of cDNAs from various human tissues, normalized to remove much of the products of abundant mRNAs and thus increase the frequency of products of rare mRNAs. These sequences from the ends of the cDNA clones are called expressed sequence tags, or ESTs. The name is derived from the fact that since they are in cDNA libraries, they are obviously expressed at the level of mRNA, and some are used as tags in generating high-resolution maps of human chromosomes. Hundreds of thousands of these have now been sequenced in collaborative efforts between pharmaceutical companies, other companies and universities. The database dbEST records all those in the public domain, and it is a strong complement to the databases recording all known sequences of genes. Many different parts of the same, or highly related, cDNAs are recorded as separate entries in dbEST. Projects are underway to group all the sequences from the same (or highly related) gene into a unified sequence. One example is the UniGene project at NCBI. The number of entries grows continually; in the summer of 1998 there were about 50,000 entries, each representing about one gene. The number is higher now. Current estimates of the number of human genes are around 30,000, so it is possible that some UniGene clusters represent only parts of genes, and some genes match more than one cluster.

Very efficient search engines have been designed for handling queries to these databases, and several are freely available over the World Wide Web. One of the most popular and useful sites for this and related activities is maintained by the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). Their Entrez browser provides integrated access to sequence, mapping and some functional information, PubMed provides access to abstracts of papers in journals in the National Library of Medicine, and the BLAST server allows rapid searches through various sequence databases. dbEST and the Unigene collection are maintained here, many genome maps are available, and three-dimensional structures of proteins and nucleic acids are available. 

Make the protein product and analyze it

It is often possible to express the gene and make the encoded protein in large amounts. The protein can be purified and assayed for various enzymatic or other activities. Hypotheses for such activities may come from database searches.

 

Directed mutation

The previously described approaches give some idea about gene function, but they do not firmly establish those functions. Indeed, this is a modern problem of trying to assign a function to an isolated gene. Several “reverse genetic” approaches can now be taken to tackle this problem. The most powerful approach to determining the physiological role(s) of a gene product is to mutate the gene in an appropriate organism and search for an altered phenotype.

The easiest experiment to do, but sometimes most difficult to interpret, is a gain of function assay. In this case, one forces expression of the gene in a transgenic organism, which often already has a wild type copy of the gene. One can look for a phenotype resulting from over-expression in tissues where it is normally expressed, or ectopic expression in tissues where it is normally silent.

In some organisms, it is possible to engineer a loss of function of the gene. The most effective method is to use homologous recombination to replace the wild type gene with one engineered to have no function. This knock-out mutation will prevent expression of the endogenous gene and one can see the effects on the whole organism. Unfortunately, the efficiency of homologous recombination is low in many organisms and cell lines, so this is not always feasible. Other methods for knocking out expression are being developed, although the mechanism for their effect (when successful) is still being studied. In some cases, one can block expression of the endogenous gene by forcing production of antisense RNA. Another method that is effective in some, but currently not all organisms, is the use of double-stranded, interfering RNA (RNAi). Duplex RNAs less than 30 nucleotide pairs long from the gene of interest can prevent expression of genes in worms, flies, and plants. Some success in mammals was recently reported.

Another way to generate a loss-of-function phenotype is to express dominant negative alleles of the gene. These mutant alleles encode stable proteins that form an aberrant structure that prevents functioning of the endogenous protein. This usually requires some protein-protein interaction (e.g. homodimers or heterodimers).

Localization on a genetic map

Sometimes the gene you have isolated maps to a region on a chromosome with a known function. Of course, many genes are probably located in that region, so it is critical to show that a candidate gene really is the one that when mutated causes an altered phenotype. This can be done by showing that a wild type copy of the candidate gene will restore a normal phenotype to the mutant. If a marker is known to be very tightly linked to the candidate gene, one can test whether this marker is always in linkage disequilibrium with the determinant of the mutant phenotype, i.e. in a large number of crosses, the marker for the candidate gene and the mutant phenotype are never separated by recombination.
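The test for tight linkage comes down to counting how often the marker and the phenotype are separated among the progeny scored. A minimal sketch of that count follows; the progeny numbers are hypothetical, and the function name is illustrative.

```python
def recombination_fraction(recombinants, total_progeny):
    """Fraction of scored progeny in which the marker and the mutant
    phenotype were separated by recombination."""
    if total_progeny <= 0 or not 0 <= recombinants <= total_progeny:
        raise ValueError("counts must satisfy 0 <= recombinants <= total_progeny, total > 0")
    return recombinants / total_progeny

# Hypothetical crosses, for illustration only.
candidate_marker = recombination_fraction(recombinants=0, total_progeny=500)
distant_marker = recombination_fraction(recombinants=25, total_progeny=500)
print(candidate_marker, distant_marker)
```

A fraction of zero in a large sample is consistent with the candidate gene being the determinant of the phenotype, but it does not prove it; any recombinants at all argue against the candidate.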

The mapping is often done with gene‑specific probes for in situ hybridizations to mitotic chromosomes. One then aligns the hybridization pattern with the chromosome banding patterns to map the isolated gene. Another method is to hybridize to a panel of DNAs from hybrid cells that contain only part of the chromosomal complement of the genome of interest. This is particularly powerful with radiation hybrid panels.