5.3: Saccharomyces genome project provided the reference sequence
The completion of the S. cerevisiae genome project (Goffeau et al., 1996) represented a milestone in yeast genetics. S. cerevisiae had been an important genetic model for over 50 years, but associating genes with phenotypes was a slow process. In classical genetics, researchers generate collections of mutants and then map the genes responsible for mutant phenotypes by monitoring the segregation of traits during meiosis. Traits that are inherited together more than 50% of the time are assigned to the same linkage group, because they are located on the same chromosome. (Recall Mendel’s law of independent assortment: unlinked traits are inherited together only about 50% of the time.) The more frequently two traits are inherited together, the closer they are on a chromosome and the less likely they are to be separated by recombination during meiosis.
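The 50% rule described above can be illustrated with a little arithmetic. The following is a minimal sketch, not a real mapping procedure: the function names and the spore counts are hypothetical, and an actual linkage analysis would also apply a statistical test before assigning traits to a linkage group.

```python
def cosegregation_fraction(parental, recombinant):
    """Fraction of meiotic progeny in which two traits were inherited together."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny were scored")
    return parental / total


def recombination_frequency(parental, recombinant):
    """Fraction of progeny in which the two traits were separated."""
    return 1.0 - cosegregation_fraction(parental, recombinant)


def appear_linked(parental, recombinant):
    """Apply the simple rule from the text: inherited together > 50% of the time."""
    return cosegregation_fraction(parental, recombinant) > 0.5


# Hypothetical counts (parental, recombinant) for three pairs of traits.
crosses = {
    "tightly linked": (190, 10),
    "loosely linked": (130, 70),
    "unlinked": (100, 100),
}

for label, (parental, recombinant) in crosses.items():
    together = cosegregation_fraction(parental, recombinant)
    print(f"{label}: together {together:.0%}, linked by the 50% rule: "
          f"{appear_linked(parental, recombinant)}")
```

In this sketch, a smaller recombination frequency corresponds to a shorter distance between the two genes on the chromosome.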
Prior to the genome project, yeast geneticists had identified hundreds of linkage groups, which were gradually assembled into genetic maps of 16 chromosomes. The genetic maps contained approximately 1000 known genes, including several genes involved in Met biosynthesis. Over several decades, yeast geneticists had isolated over a hundred mutants that were unable to synthesize Met, and these mutants had been placed into 21 complementation groups, the functional equivalents of genes (Masselot & DeRobichon-Szulmajster, 1975). However, the exact chromosomal locations of most MET genes were unknown. The figure on the right shows the positions of MET and CYS genes that were mapped to yeast chromosomes by classical genetic methods (Cherry et al., 1997). By the time the genome project began, researchers were also using recombinant DNA technology to identify genes that were deficient in mutant strains, so partial sequence information was available for many chromosomal regions. This sequence information proved to be invaluable in the interpretation of the genome project data.
The S. cerevisiae genome was the first eukaryotic genome to be completely sequenced. The success of the S. cerevisiae genome project can be attributed to extensive collaboration within the yeast research community. Over 600 researchers in 92 laboratories contributed sequence data that were compiled to generate a highly accurate genome sequence of S. cerevisiae strain 288C (Goffeau et al., 1996). A single yeast strain was chosen for DNA sequencing, because S. cerevisiae naturally accumulates mutations and laboratory strains can begin to diverge from one another as they are propagated in the lab (Mortimer, 2000). The deletion strains that we are using in this class (Winzeler et al., 1999) are derived from strain 288C.
The ~12 million base pair (Mbp) DNA sequence provides the definitive physical map of the 16 yeast chromosomes. Computational analysis of the sequence predicted ~6000 open reading frames (ORFs), each representing a potential gene. The number of ORFs was considerably greater than the number of genes that had been previously mapped with genetic methods. Many ORFs were identified by their similarity to genes that had been studied in other organisms, while close to half of the ORFs were completely novel. (Over time, additional ORFs have been identified. Today, the number of dubious or uncharacterized S. cerevisiae ORFs is close to 1500.) The S. cerevisiae genome sequence generally confirmed the gene order predicted by the earlier genetic maps, but provided more accurate spacing for the distances separating individual yeast genes.
Chromosome map of the S. cerevisiae genome.
S. cerevisiae has 16 chromosomes that were originally identified by genetic linkage and subsequently confirmed by DNA sequencing. Chromosome numbers were assigned in the order that they were identified by classical linkage analysis.
(First, read the coordinate information on the following page.)
The S. cerevisiae genome contains two genes, SAM1 and SAM2, encoding enzymes that catalyze the conversion of Met to the high energy methyl donor, S-adenosylmethionine. The two genes arose from a gene duplication and remain almost identical to one another. The systematic name for SAM1 is YLR180W, and the systematic name for SAM2 is YDR502C. Use the coordinate information below to determine the chromosomal locations of SAM1 and SAM2. Place the two genes on the map above. Draw arrows that indicate the direction of transcription for both genes.
The genome project data provided the organizing structure for the Saccharomyces Genome Database (SGD). The SGD systematically assigned accession numbers to ORFs, based on their location and orientation on yeast chromosomes. The systematic name for each ORF has 7 characters. Each begins with a “Y” for yeast, followed by a letter for the chromosome number (A for chromosome I through P for chromosome XVI) and a letter for the chromosome arm (L for left or R for right), followed by a 3-digit ORF number counting away from the centromere. The last letter in the locus name indicates whether transcription occurs on the Watson (W) or Crick (C) strand of the DNA.
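The naming scheme above is regular enough to decode mechanically. The sketch below is illustrative only: the function name parse_systematic_name and the dictionary keys are hypothetical, and it handles only the standard 7-character form described here (chromosome letters A through P stand for chromosomes 1 through 16).

```python
import re

# Y + chromosome letter (A-P) + arm (L/R) + 3-digit ORF number + strand (W/C)
_SYSTEMATIC_NAME = re.compile(r"^Y([A-P])([LR])(\d{3})([WC])$")


def parse_systematic_name(name):
    """Split a 7-character systematic ORF name into its component parts."""
    match = _SYSTEMATIC_NAME.match(name.upper())
    if match is None:
        raise ValueError(f"not a 7-character systematic name: {name!r}")
    chrom_letter, arm, number, strand = match.groups()
    return {
        "chromosome": ord(chrom_letter) - ord("A") + 1,  # A -> 1, ..., P -> 16
        "arm": "left" if arm == "L" else "right",
        "orf_number": int(number),  # counted outward from the centromere
        "strand": "Watson" if strand == "W" else "Crick",
    }


print(parse_systematic_name("YAL001C"))
# {'chromosome': 1, 'arm': 'left', 'orf_number': 1, 'strand': 'Crick'}
```

Reading YAL001C by hand gives the same result: chromosome I, left arm, first ORF from the centromere, transcribed from the Crick strand.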
The figure on the opposite page outlines the process used by the genome project to decode and annotate the S. cerevisiae sequence. The complete sequences of the 16 yeast chromosomes laid end-to-end are considered the reference genome for S. cerevisiae. The genome sequence was submitted to NCBI’s GenBank, where curators assigned an NC____ accession number to each of the 16 chromosome sequences, indicating that the sequences are non-redundant chromosome sequences. Potential protein-coding sequences were identified with an ORF-finding algorithm that looks for sequences that begin with an ATG initiation codon and terminate with a stop codon in the same reading frame. ORF-finding programs rely on the fact that stop codons are underrepresented in protein-coding sequences. Because 3 of the 64 possible codons are stop codons, one would expect a stop codon to occur by chance about once in every 21 codons (64/3 ≈ 21) of a random sequence. Most proteins, however, contain 100 amino acids or more, so long stretches without a stop codon are unlikely to arise by chance. ORF-finders are also able to identify and exclude introns from the ORF. Each potential ORF identified in the project was assigned an NM______ accession number, consistent with a transcript sequence, or potential mRNA sequence.
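The logic of an ORF finder can be sketched in a few lines. This is a simplified illustration, not the algorithm used by the genome project: it scans all six reading frames for an ATG followed by an in-frame stop codon, does not handle introns, and treats the supplied strand as the Watson strand. The function names and the 100-codon default cutoff are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_codons_between_stops():
    """With 3 stop codons out of 64, random sequence hits a stop about every 21 codons."""
    return 64 / len(STOP_CODONS)


def chance_of_stop_free_run(n_codons):
    """Probability that n random codons in a row contain no stop codon."""
    return (61 / 64) ** n_codons


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    start and end are 0-based positions on the strand that was scanned, so
    Crick-strand coordinates refer to the reverse complement. min_codons
    counts codons from the ATG up to, but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("W", seq), ("C", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG in this frame opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # resume the search after the stop codon
    return orfs


print(round(expected_codons_between_stops(), 1))
print(chance_of_stop_free_run(100) < 0.01)
print(find_orfs("ATG" + "GCT" * 5 + "TAA", min_codons=3))
```

The length cutoff is what makes the approach useful: a random run of 100 codons escapes all three stop codons less than 1% of the time, so long ORFs are much more likely to be real genes than short ones.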
Computational methods were used to predict the amino acid sequences of the proteins encoded by the transcripts, and the translated sequences were assigned NP_______ accession numbers. (In fact, the vast majority of protein sequences in NCBI’s Protein database have been derived by automated translation of DNA sequences, because chemical sequencing of proteins is a much more laborious task than DNA sequencing.) The functions of most proteins predicted by the genome project have still not been experimentally validated. Your experiments this semester will contribute some of the missing experimental validation, when you transform met deletion mutants with plasmids carrying MET genes.
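As a rough illustration of the automated translation mentioned in this section, the sketch below translates an intron-free coding sequence with the standard genetic code. It is not the pipeline used by NCBI or the genome project. The names CODON_TABLE and translate are hypothetical, and the function assumes the input contains only A, C, G, and T (or U).

```python
from itertools import product

# Standard genetic code, with codons listed in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    orf = orf.upper().replace("U", "T")  # accept mRNA as well as DNA
    protein = []
    for i in range(0, len(orf) - 2, 3):
        amino_acid = CODON_TABLE[orf[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCTTGGTAA"))  # MAW
```

A predicted protein sequence produced this way is only as reliable as the ORF prediction behind it, which is why experimental validation still matters.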