4.3 Sequencing the Human Genome
In humans, each cell normally contains 23 pairs of chromosomes, for a total of 46. Twenty-two of these pairs, called autosomes, and look the same in both males and females. The 23rd pair, the sex chromosomes, differ between males and females. Females have two copies of the X chromosome, while males have one X and one Y chromosome (Figure 4.18). Each species has a unique chromosomal complement. For example, chickens have 39 pairs of chromosomes (78 total) with 38 autosomal pairs and one pair of sex chromosomes (Z and W). In this species, ZW chickens are female and ZZ chickens are male.
The total length of the human genome is over 3 billion base pairs. The total length of the human genome is over 3 billion base pairs. The genome also includes the mitochondrial DNA (Figure 4.12).
Figure 4.18 DNA Karyotype. The 22 autosomes are numbered by size. The other two chromosomes, X and Y, are the sex chromosomes. This picture of the human chromosomes lined up in pairs is called a karyotype. Karyotypes are prepared using standardized staining procedures that reveal characteristic structural features for each chromosome, usually from white blood cells.
Image by: U.S. National Library of Medicine
The first human genome sequences were published in nearly complete draft form in February 2001 by the Human Genome Project and Celera Corporation. Completion of the Human Genome Project's sequencing effort was announced in 2004 with the publication of a draft genome sequence. Researchers working on the human genome project deciphered the human genome in three major ways: determining the order, or "sequence," of all the bases in our genome's DNA; making maps that show the locations of genes for major sections of all our chromosomes; and producing what are called linkage maps, through which inherited traits (such as those for genetic disease) can be tracked over generations.
Prior to the acquisition of the full genome sequence, estimates of the number of human genes ranged from 50,000 to 140,000 (with occasional vagueness about whether these estimates included non-protein coding genes). As genome sequence quality and the methods for identifying protein-coding genes improved, the count of recognized protein-coding genes dropped to 19,000-20,000. However, a fuller understanding of the role played by genes expressing regulatory RNAs that do not encode proteins has raised the total number of genes to at least 46,831, plus another 2300 micro-RNA genes. By 2012, functional DNA elements that encode neither RNA nor proteins have also been noted. Protein-coding sequences account for only a very small fraction of the genome (approximately 1.5%), and the rest is associated with non-coding RNA genes, regulatory DNA sequences, long interspersed nucleotide elements (LINEs), short interspersed nucleotide elements (SINEs), introns, and sequences for which as yet no function has been determined.
Recall that a gene is defined as a sequence of nucleotides in DNA or RNA that codes for a molecule that has a function. During gene expression, the DNA is first copied into RNA. The RNA can be directly functional or be the intermediate template for a protein that performs a function. Gene structure is the organisation of specialized sequence elements within a gene (Figure 4.19). Genes contain the information necessary for living cells to survive and reproduce. The processes of transcription which leads to the production of the RNA from the DNA template, and translation which produces protein from the messenger RNA (mRNA) sequence are controlled by specific sequence elements or regions within the gene. Every gene, therefore, requires multiple sequence elements to be functional. This includes the sequence that actually encodes the functional protein or ncRNA, as well as multiple regulatory sequence regions. These regions may be as short as a few base pairs, up to many thousands of base pairs long.
Much of gene structure is broadly similar between eukaryotes and prokaryotes. These common elements largely result from the shared ancestry of cellular life in organisms with roughly 3.8 billion years of evolution. Key differences in gene structure between eukaryotes and prokaryotes reflect their divergent transcription and translation machinery. Understanding gene structure is the foundation of understanding gene annotation, expression, and function.
Figure 4.19. The Process of Eukaryotic Gene Expression. Upper blue panel shows the structural elements common to eukaryotic genes. The process of gene transcription produces a messenger RNA (mRNA) molecule that must be modified post-translationally, gray panel, to remove the non-coding intron sequences and add the 5'-CAP and Poly-A-Tail sections. The mature mRNA is transported from the nucleus to the cytoplasm where it is translated by the ribosome into the protein sequence, red panel.
Image from Wikipedia
The structures of both eukaryotic and prokaryotic genes involve several nested sequence elements. Each element has a specific function in the multi-step process of gene expression. The sequences and lengths of these elements vary, but the same general functions are present in most genes. Although DNA is a double-stranded molecule, typically only one of the strands encodes information that the RNA polymerase reads to produce protein-coding mRNA or non-coding RNA. This 'sense' or 'coding' strand, runs in the 5' to 3' direction where the numbers refer to the carbon atoms of the backbone's ribose sugar. The open reading frame (ORF) of a gene is therefore usually represented as an arrow indicating the direction in which the sense strand is read.
Regulatory sequences are located at the extremities of genes. These sequence regions can either be next to the transcribed region (the promoter) or separated by many kilobases (enhancers and silencers). The promoter is located at the 5' end of the gene and is composed of a core promoter sequence and a proximal promoter sequence. The core promoter marks the start site for transcription by binding RNA polymerase and other proteins necessary for copying DNA to RNA. The proximal promoter region binds transcription factors that modify the affinity of the core promoter for RNA polymerase. Genes may be regulated by multiple enhancer and silencer sequences that further modify the activity of promoters by binding activator or repressor proteins. Enhancers and silencers may be distantly located from the gene, many thousands of base pairs away. The binding of different transcription factors, therefore, regulates the rate of transcription initiation at different times and in different cells.
Regulatory elements can overlap one another, with a section of DNA able to interact with many competing activators and repressors as well as RNA polymerase. For example, some repressor proteins can bind to the core promoter to prevent polymerase binding. For genes with multiple regulatory sequences, the rate of transcription is the product of all of the elements combined. Binding of activators and repressors to multiple regulatory sequences has a cooperative effect on transcription initiation.
An additional layer of regulation occurs for protein coding genes after the mRNA has been processed to prepare it for translation to protein. Only the region between the start and stop codons encodes the final protein product. The flanking untranslated regions (UTRs) contain further regulatory sequences. The 3' UTR contains a terminator sequence, which marks the endpoint for transcription and releases the RNA polymerase. The 5’ UTR binds the ribosome, which translates the protein-coding region into a string of amino acids that fold to form the final protein product. In the case of genes for non-coding RNAs the RNA is not translated but instead folds to be directly functional.
The structure of eukaryotic genes includes features not found in prokaryotes. Most of these relate to post-transcriptional modification of pre-mRNAs to produce mature mRNA ready for translation into protein. Eukaryotic genes typically have more regulatory elements to control gene expression compared to prokaryotes. This is particularly true in multicellular eukaryotes, where gene expression varies widely among different tissues.
A key feature of the structure of eukaryotic genes is that their transcripts are typically subdivided into exon and intron regions. Exon regions are the coding portion of the mRNA and are retained in the final mature mRNA molecule, while intron regions are non-coding and are spliced out (excised) during post-transcriptional processing. Indeed, the intron regions of a gene can be considerably longer than the exon regions. Once spliced together, the exons form a single continuous protein-coding region, and the splice boundaries are not detectable. Eukaryotic post-transcriptional processing also adds a 5' cap to the start of the mRNA and a poly-adenosine tail (poly-A-tail) to the end of the mRNA. These additions stabilize the mRNA and direct its transport from the nucleus to the cytoplasm.
The overall organization of prokaryotic genes is markedly different from that of the eukaryotes. The most obvious difference is that prokaryotic ORFs are often grouped into a structure that is called a polycistronic operon under the control of a shared set of regulatory sequences (Figure 4.20). These ORFs are all transcribed onto the same mRNA and so are co-regulated and often serve related functions. Each ORF typically has its own ribosome binding site (RBS) so that ribosomes simultaneously translates the different ORFs on the same mRNA. Some operons also display translational coupling, where the translation rates of multiple ORFs within an operon are linked. This can occur when the ribosome remains attached at the end of an ORF and simply translocates along to the next without the need for a new RBS. Translational coupling is also observed when translation of an ORF affects the accessibility of the next RBS through changes in RNA secondary structure. Having multiple ORFs on a single mRNA is only possible in prokaryotes because their transcription and translation take place at the same time and in the same subcellular location.
Figure 4.20 The Process of Prokaryotic Gene Expression. Upper blue panel displays the organization of a typical prokaryotic polycistronic operon, wherein mulitiple genes are regulated by common regulatory elements and are transcribed as a single mRNA. Unlike eukaryotic systems, there is little to no post-translational modification of the resulting mRNA and protein translation, red panel, often ensues before transcription is complete.
Image from Wikipedia
The operator sequence next to the promoter is the main regulatory element in prokaryotes. Repressor proteins bound to the operator sequence physically obstructs the RNA polymerase enzyme, preventing transcription. Riboswitches are another important regulatory sequence commonly present in prokaryotic UTRs. These sequences switch between alternative secondary structures in the RNA depending on the concentration of key metabolites. The secondary structures then either block or reveal important sequence regions such as ribosomal binding sites. Introns are extremely rare in prokaryotes and therefore do not play a significant role in prokaryotic gene regulation.
Overall, the packaging, unpackaging, replication and transcription of DNA is a highly dynamic process that is constantly being moderated by signals and cues from the environment. The following video created by Drew Berry for WEHI.tv provides one of the most dynamic views of the major packaging and processing of DNA within the cell. In later chapters we will explore the processes of DNA replication and transcription in greater detail.
[video width="1280" height="720" mp4="https://wou.edu/chemistry/files/2020...ation-ever.mp4"][/video]
Video 4.1 DNA animations by wehi.tv for Science-Art exhibition. By far one of the best animations depicting DNA packaging, replication and transcription.