The ultimate source of information in databases is the research community, whose members submit their experimental data to primary databases. Primary databases ask investigators for basic information about their submission. A record that meets the standards of the database is accepted and assigned a unique accession number that will remain permanently associated with the record. Each database has its own system of accession numbers, making it possible to identify the source of a record from its accession number. Once a record is accepted into a primary database, professional curators take over. Curators are professional scientists who add value to a record by providing links between records in different databases. Curators also organize the information in novel ways to generate derivative databases. Derivative databases, such as organism databases, are often designed to fit the needs of particular research communities. The Saccharomyces Genome Database (www.yeastgenome.org), for example, links information about genes to information about the proteins encoded by the genes and genetic experiments that explore gene function. In this course, we will be using both primary and derivative databases.
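Because each database uses its own accession format, a simple pattern match can often suggest where a record came from. The sketch below is illustrative only: the patterns are simplified, the function name guess_source is hypothetical, and the sample accessions in the usage lines are placeholders rather than real records.

```python
import re

# Simplified, illustrative patterns. Real accession formats have more variants,
# and some strings are ambiguous (a four-digit number could be a PDB ID or a PMID).
PATTERNS = [
    # RefSeq: two letters, an underscore, digits, optional version (NM_, NP_, NC_, ...)
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    # PDB: four characters, the first of which is a digit from 1 to 9
    ("PDB", re.compile(r"^[1-9][A-Za-z0-9]{3}$")),
    # GenBank nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits
    ("GenBank nucleotide",
     re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(\.\d+)?$")),
    # PubMed: a plain integer (PMID)
    ("PubMed (PMID)", re.compile(r"^\d+$")),
]


def guess_source(accession):
    """Return the name of the first database whose pattern matches, else 'unknown'."""
    acc = accession.strip()
    for name, pattern in PATTERNS:
        if pattern.match(acc):
            return name
    return "unknown"


if __name__ == "__main__":
    # Placeholder accessions chosen only to exercise each pattern.
    for acc in ["NM_000000.1", "U00000", "12345678", "1ABC"]:
        print(acc, "->", guess_source(acc))
```

The patterns are checked in order, so an ambiguous string is reported under the first database that matches; a real tool would confirm the guess by looking the record up.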
The figure on the following page summarizes information flow from the bench to databases. The information in databases originates in experiments. When researchers complete an experiment, they analyze their data and compile the results for communication to the research community. These communications may take several forms.
PubMed indexes publications in the biomedical sciences
Researchers will usually write a paper for publication in a scientific journal. Reviewers at the journal judge whether the results are accurate and represent a novel finding that will advance the field. Papers that pass this peer review are accepted by the journal, which then publishes the results in print and/or online form. As part of the publication process, biomedical journals automatically submit the article citation and abstract to PubMed, a literature database maintained by NCBI. PubMed entries are assigned a PMID accession number. PMID numbers are assigned sequentially and the numbers have grown quite large. PubMed currently contains over 23 million records! PubMed users can restrict their searches by title, author, journal, publication year, publication type (such as reviews), and more. The usability of PubMed continues to grow. Users are able to save citations to a clipboard, save their searches, and arrange for RSS feeds when new search results enter PubMed. Students in the biomedical sciences need to become proficient in using PubMed. You can access PubMed at pubmed.gov or through the BC Library’s database portal. An advantage of using the library’s portal is that you will be able to use the library’s powerful “Find It” button to access the actual articles.
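Field-restricted searches can be written with PubMed's bracketed field tags. The following is a minimal sketch, assuming the short tags [ti], [au], [ta], [dp], and [pt] for title, author, journal, publication date, and publication type. The function names are hypothetical, and the second function only builds a URL for NCBI's E-utilities ESearch service (a programmatic interface not discussed in the text) without contacting it.

```python
from urllib.parse import urlencode

# Base address of NCBI's ESearch service (part of E-utilities).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(title=None, author=None, journal=None,
                      year=None, reviews_only=False):
    """Combine optional restrictions into one PubMed query string."""
    parts = []
    if title:
        parts.append(f"{title}[ti]")      # words in the article title
    if author:
        parts.append(f"{author}[au]")     # author name
    if journal:
        parts.append(f"{journal}[ta]")    # journal title or abbreviation
    if year:
        parts.append(f"{year}[dp]")       # date of publication
    if reviews_only:
        parts.append("Review[pt]")        # publication type
    return " AND ".join(parts)


def build_esearch_url(term, retmax=20):
    """Return an ESearch URL for the query; no network request is made here."""
    query = urlencode({"db": "pubmed", "term": term, "retmax": retmax})
    return f"{ESEARCH_URL}?{query}"


if __name__ == "__main__":
    # Hypothetical search: reviews from 2010 with "methionine" in the title.
    term = build_pubmed_term(title="methionine", year=2010, reviews_only=True)
    print(term)
    print(build_esearch_url(term))
```

The query string produced by build_pubmed_term can also be pasted directly into the PubMed search box.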
Information flow from experiments to databases. Researchers analyze their data and prepare manuscripts for publication. Journal citations are submitted automatically to PubMed. Researchers also submit data to more specialized, interconnected databases.
Investigators submit experimental data to specialized research databases
Depending on the experiment, researchers will submit their data to a number of different databases. Consider the hypothetical example of a researcher who has isolated a novel variant of a MET gene from a wild strain of S. cerevisiae with a sophisticated genetic screen. The researcher has sequenced the gene, cloned the gene into a bacterial overexpression plasmid, and crystallized the overexpressed protein, which possesses unique regulatory properties. The researcher is preparing a manuscript on the experiments. In preparation for the manuscript submission (reviewers of the manuscript will want to see the accession numbers), the researcher plans to submit data to three different databases: a nucleotide database, a structure database and an organism database.
If our researcher is working at an institution in the U.S., he or she will probably submit the nucleotide sequence to NCBI’s GenBank, a subdivision of the larger Nucleotide database. GenBank was founded in 1982, when DNA sequencing methods had just been developed and individual investigators were manually sequencing one gene at a time. The rate of GenBank submissions has increased in pace with advances in DNA sequencing technologies. Today, GenBank accepts computationally generated submissions from large sequencing projects as well as submissions from individual investigators. GenBank currently contains over 300 million sequence records, including whole genomes, individual genes, transcripts, plasmids, and more. Not surprisingly, there is considerable redundancy in GenBank records. To eliminate this redundancy, NCBI curators constructed the derivative RefSeq database. RefSeq considers the whole genome sequences produced by sequencing projects to be the reference sequences for an organism. RefSeq currently contains nonredundant records for genome, transcript, and protein sequences from over 36,000 organisms.
The researcher in our hypothetical example will also want to submit the atomic coordinates and structural models for the crystallized protein to the Protein Data Bank (PDB). The PDB is part of an international consortium that accepts data for proteins and nucleic acids. The vast majority of PDB records have been obtained by X-ray diffraction, although the database also accepts models obtained with nuclear magnetic resonance (NMR), electron microscopy, and other techniques. The number of entries in the PDB database is orders of magnitude smaller than the number of predicted proteins in GenBank, reflecting the difficulties inherent in determining structures of macromolecules. PDB offers tools for visualizing macromolecules in three dimensions, allowing investigators to probe amino acid interactions that are important to protein function.
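To give a sense of what atomic coordinates look like in practice, here is a minimal sketch that reads two ATOM records and measures the distance between the atoms. It assumes the legacy fixed-column PDB text format. The two records are made-up values for illustration, and the function names parse_atom and distance are hypothetical.

```python
import math

# Two made-up ATOM records in the legacy fixed-column PDB text format.
# Columns (1-based): 13-16 atom name, 18-20 residue name, 22 chain,
# 23-26 residue number, 31-38 x, 39-46 y, 47-54 z (coordinates in angstroms).
SAMPLE = [
    "ATOM      1  N   MET A   1       0.000   0.000   0.000",
    "ATOM      2  CA  MET A   1       3.000   4.000   0.000",
]


def parse_atom(line):
    """Extract a few fields from one ATOM record using fixed column positions."""
    return {
        "name": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in angstroms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])


if __name__ == "__main__":
    atoms = [parse_atom(line) for line in SAMPLE]
    print(atoms[0]["name"], atoms[1]["name"], distance(atoms[0], atoms[1]))
```

Distances like this one, computed between atoms of different residues, are the raw material for judging whether two amino acids are close enough to interact.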
Finally, our researcher will want to submit data about the new mutant’s phenotype and information about its regulation to the Saccharomyces Genome Database (SGD). The SGD serves as a central resource for the S. cerevisiae research community, which now includes you. The SGD is only one of many organism-specific databases. Similar databases exist for other model organisms such as the fruit fly Drosophila, the plant Arabidopsis thaliana, zebrafish and more. In addition to providing information, these specialized databases also facilitate research by providing links to important resources such as strain collections and plasmids.