# 4.7: Comparative Genome Analysis

## Paralogous Genes

• Genes that are similar because of descent from a common ancestor are homologous.
• Homologous genes that have diverged after speciation are orthologous.
• Homologous genes that have diverged after duplication are paralogous.

One can identify paralogous groups of genes encoding proteins of similar but not identical function within a species, e.g., the ABC transporters, with 80 members in E. coli.

Core proteomes vary little in size

Proteome: all the proteins encoded in a genome

To calculate the core proteome, count each group of paralogous proteins only once. The result is the number of distinct protein families in each organism.

| Species | Number of genes | Core proteome |
|---|---|---|
| Haemophilus | 1709 | 1425 |
| Yeast | 6241 | 4383 |
| Worm | 18424 | 9453 |
| Fly | 13601 | 8065 |

Figure 4.22. Little change in core proteome size in eukaryotes
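The counting rule for the core proteome can be illustrated with a short sketch. The gene and family names in `toy_genome` are hypothetical and chosen only for illustration; the gene counts and core proteome sizes are the ones tabulated above.

```python
# Toy illustration: the core proteome is the number of distinct protein
# families, so each group of paralogs is counted only once.
# The gene and family names below are hypothetical.

def core_proteome_size(gene_to_family):
    """Count distinct families among the values of a gene -> family mapping."""
    return len(set(gene_to_family.values()))

toy_genome = {
    "abcA": "ABC transporter",
    "abcB": "ABC transporter",  # paralog of abcA, same family
    "abcC": "ABC transporter",  # another paralog
    "polA": "DNA polymerase",
    "rpoB": "RNA polymerase",
}
toy_core = core_proteome_size(toy_genome)  # 5 genes collapse to 3 families

# Gene counts and core proteome sizes from the table in the text.
table = {
    "Haemophilus": (1709, 1425),
    "Yeast": (6241, 4383),
    "Worm": (18424, 9453),
    "Fly": (13601, 8065),
}

# Fraction of each gene set that remains after collapsing paralogs.
core_fraction = {sp: core / genes for sp, (genes, core) in table.items()}

for species, fraction in core_fraction.items():
    print(f"{species}: core proteome is {fraction:.0%} of the gene count")
```

The fraction falls as gene number rises, which is another way of saying that much of the increase in gene number comes from paralogs rather than from new protein families.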

Core proteomes are conserved

• Many of the proteins in the core proteomes are shared among eukaryotes
• 30% of fly genes have orthologs in worm
• 20% of fly genes have orthologs in both worm and yeast
• 50% of fly genes have likely orthologs in mammals

The function of proteins in flies (and worms and yeast) provides strong indicators of function in humans. Flies have orthologs to 177 of the 289 human disease genes.

Figure 4.23. Functional categories in eukaryotic proteomes

Figure 4.24. Distribution of the homologues of the predicted human proteins

## Conserved Segments in the Human and Mouse Genomes

Figure 4.25. Regions of human chromosomes homologous to regions of mouse chromosomes (indicated by the colors). For example, virtually all of human chromosome 20 is homologous to a region on mouse chromosome 2, and almost all of human chromosome 17 is homologous to a region on mouse chromosome 11. More commonly, segments of a given human chromosome are homologous to different mouse chromosomes. Chromosomes from mouse have more rearrangements relative to humans than do chromosomes from many other mammals, but the homologous relationships are still readily apparent.

## CHROMOSOMES and CHROMATIN

Chromosomes are the cytological package for genes. Genomes are much longer than the cellular compartment they occupy:

| Compartment | Dimensions | Length of DNA |
|---|---|---|
| Phage T4 | $0.065 \times 0.10\, \mu m$ | $55\, \mu m = 170\, kb$ |
| E. coli | $1.7 \times 0.65\, \mu m$ | $1.3\, mm = 4.6 \times 10^3\, kb$ |
| Nucleus (human) | $6\, \mu m$ diameter | $1.8\, m = 6 \times 10^6\, kb$ |

Definition: Packing ratio

$\text{Packing ratio} = \dfrac{\text{length of DNA}}{\text{length of the unit that contains it}}.$

The smallest human chromosome contains about

$46 \times 10^6\, bp = 14,000\, \mu m = 1.4\,cm \,DNA.$

When condensed for mitosis, this chromosome is about 2 μm long. The packing ratio is therefore about 7000!
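The arithmetic behind this packing ratio can be written out as a short sketch. The function name `packing_ratio` is introduced here for illustration. The only requirement is that both lengths are in the same unit, here 1.4 cm of DNA expressed as 14,000 micrometers and a condensed chromosome of about 2 micrometers.

```python
def packing_ratio(dna_length, container_length):
    """Length of DNA divided by the length of the unit that contains it.

    Both arguments must be expressed in the same unit of length.
    """
    if container_length <= 0:
        raise ValueError("container_length must be positive")
    return dna_length / container_length

# Smallest human chromosome: 1.4 cm of DNA is 14,000 micrometers.
dna_um = 14_000
# Length of the same chromosome when condensed for mitosis, in micrometers.
mitotic_chromosome_um = 2

ratio = packing_ratio(dna_um, mitotic_chromosome_um)
print(f"Packing ratio: about {ratio:.0f}")
```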

## Loops, matrix and the chromosome scaffold

When DNA is released from mitotic chromosomes by removing most of the proteins, long loops of DNA are seen, emanating from a central scaffold that resembles the remnants of the chromosome.

Figure 4.26: EM analysis of intact nuclei shows network of fibers called a matrix.

Biochemical preparations using salt and detergent to remove proteins and nuclease to remove most of the DNA leave a "matrix" or "scaffold" preparation. Similar DNA sequences are found in these preparations; these sequences are called matrix attachment regions (MARs) or scaffold attachment regions (SARs). They tend to be A+T rich and have sites for cleavage by topoisomerase II. Topoisomerase II is one of the major components of the matrix preparation, but the composition of the matrix is still in need of further study.

Since it is attached at the base to the matrix, each loop is a separate topological domain and can accumulate supercoils of DNA.

From the measured sizes of loops, and calculations based on the amount of nicking required to relax DNA within the loops, we estimate that the average size of these loops is about 100 kb (85 kb based on nicking frequency for relaxation).

Some evidence suggests that replication and possibly some transcriptional control may be exerted at the bases of the loops.

### Interphase chromatin and mitotic chromosomes

During interphase, i.e. between mitotic divisions, the highly condensed mitotic chromosomes spread out through the nucleus to form chromatin. Interphase chromatin is not very densely packed in most of the nucleus (euchromatin). In some regions it is very densely packed, comparable to a mitotic chromosome (heterochromatin).

Both interphase chromatin and mitotic chromosomes are made of a 30 nm fiber. The mitotic chromosome is much more coiled than interphase chromatin.

### Most transcription occurs in euchromatin.

• Constitutive heterochromatin = nonexpressed regions that are condensed (compact) in all cells (e.g. centromeric simple repeats)
• Facultative heterochromatin = inactive in only some cell lineages, active in others.

One example of facultative heterochromatin is the inactive X chromosome in female mammals. The choice of which X chromosome to inactivate is random in various cell lineages, leading to mosaic phenotypes for some X-linked traits. For instance, one genetic determinant of coat color in cats is X-linked, and the patchy coloration on calico cats results from this random inactivation of one of the X chromosomes, leading to the lack of expression of this determinant in some but not all hair cells.

### Cytologically visible bands in chromosomes

G bands and R bands in mammalian mitotic chromosomes (Figure 4.27)

Giemsa‑dark (G) bands tend to be A+T rich, with a large number of L1 repeats.

Giemsa‑light bands tend to be more G+C rich, with very few L1 repeats and many Alu repeats.

(R bands are about the same as Giemsa-light bands. They are visualized by a different preparative procedure, so that the "reverse" of the Giemsa-stained image is seen.)

T bands are adjacent to telomeres, do not stain with Giemsa, and are extremely G+C rich, with lots of genes and myriad Alu repeats.

The functional significance of these bands is still under active investigation.

One can localize a gene to a particular region of a chromosome by in situ hybridization with a radioactive or, now more commonly, fluorescent probe for the gene. The region of hybridization is determined by simultaneously viewing the stained banding pattern and the hybridization pattern. Many spreads of mitotic chromosomes are viewed and scored, and the gene is localized to the chromosomal region with a significantly greater incidence of hybridization signal than that seen on the rest of the chromosomes.

Another common method of mapping the location of genes is by hybridization to DNA isolated from a panel of somatic cell hybrids, each hybrid cell carrying a small subset of, e.g., human chromosomes on a hamster background. Some hybrid cells carry broken human chromosomes, which allows even more precise localization (see Figure 1.8.2, "J-1 series").

### Polytene chromosomes are visible in several Drosophila tissues

These contain many copies of the chromosomes, side by side in register. Thus most chromosomal regions are highly amplified in these tissues. Chromosomal stains reveal a characteristic banding pattern, which is the basis for the cytological map. The cytological map (of polytene bands) combined with the genetic map gives a cytogenetic map, which is a wonderful guide to the Drosophila genome. One can localize a gene to a particular region by in situ hybridization (in fact, the technique was invented using Drosophila polytene chromosomes).

### Multiple genes per band on mammalian chromosomes

Figure 4.27 gives a view of human chromosome 11 at several different levels of resolution. The region 11p15 has many genes of interest, including genes whose products regulate cell growth (HRAS), determination and differentiation of muscle cells (MYOD), carbohydrate metabolism (INS), and mineral metabolism (PTH). The β-globin gene (HBB) and its closely linked relatives are also in this region. A higher resolution view of 11p15, based on a compilation of genetic and physical mapping (Cytogenetics and Cell Genetics, 1995), is shown next to the classic ideogram (banding pattern). This is on a scale of millions of base pairs, and one can start to get a feel for gene density in this region. Interestingly, it varies quite a lot, with the gene-dense sub-bands near the telomeres; these may correspond to the T-bands discussed above. Other genes appear to be more widely separated. For instance, each of the β-like globin genes is separated by about 5 to 8 kb from each other (see the map of the YAC, or yeast artificial chromosome, carrying the β-like globin genes), and this gene cluster is about 1000 kb (i.e. 1 Mb) from the nearest genes on the map. However, further mapping will likely find many other genes in this region. Now even more information is available at the web sites mentioned earlier.

Figure 4.27.

The relationship between recombination distances and physical distances varies substantially among organisms. In human, one centiMorgan (or cM) corresponds to roughly 1 Mb, whereas in yeast 1 cM corresponds to about 2 kb, and this value varies at least 10-fold along the different yeast chromosomes. This is a result of the different frequencies of recombination along the chromosomes.
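As a rough illustration of these conversion factors, the sketch below turns a genetic distance into an approximate physical distance. The dictionary `KB_PER_CM` and the function name are introduced here for illustration, and the values are only the genome-wide approximations quoted above; since the local value can vary at least 10-fold, the output should be read as an order-of-magnitude estimate.

```python
# Approximate physical distance per centiMorgan, as quoted in the text.
# These are rough averages, not constants: local recombination frequency
# can shift the true value by an order of magnitude.
KB_PER_CM = {
    "human": 1000.0,  # about 1 Mb per cM
    "yeast": 2.0,     # about 2 kb per cM
}

def physical_distance_kb(genetic_distance_cm, organism):
    """Estimate physical distance (kb) from a genetic distance (cM)."""
    return genetic_distance_cm * KB_PER_CM[organism]

# The same 5 cM interval corresponds to very different amounts of DNA.
human_kb = physical_distance_kb(5, "human")
yeast_kb = physical_distance_kb(5, "yeast")
fold_difference = KB_PER_CM["human"] / KB_PER_CM["yeast"]

print(f"5 cM in human: about {human_kb:.0f} kb")
print(f"5 cM in yeast: about {yeast_kb:.0f} kb")
print(f"Human/yeast difference: about {fold_difference:.0f}-fold")
```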

### Specialized regions of chromosomes

Centromere: region responsible for segregation of chromosomes at mitosis and meiosis. The centromere is a constricted region, usually toward the center of the chromosome (although it can be located at the end, as with mouse chromosomes). It contains a kinetochore, a fibrous region to which microtubules attach as they pull the chromosome to one pole of the dividing cell. DNA sequences in this region are highly repeated simple sequences (in Drosophila, the unit of the repeat is about 25 bp long, repeated hundreds of times). Specific proteins are found at the centromere, and are now intensely investigated.

Telomere: forms the ends of the linear DNA molecule that makes up the chromosome. The telomeres are composed of thousands of repeats of CCCTAA in human. Variants of this sequence are found in the telomeres in other species. Telomeres are formed by telomerase; this enzyme catalyzes the synthesis of more ends at each round of replication to stabilize linear molecules.

## The Principal Proteins in Chromatin are Histones

Composition of chromatin: Various biochemical methods are available to isolate chromatin from nuclei. Chemical analysis of chromatin reveals proteins and DNA, with the most abundant proteins being the histones. A complex set of less abundant proteins is referred to as the nonhistone chromosomal proteins.

The histones and DNA are present in equal masses.

Mass Ratio DNA: histones: nonhistone proteins: RNA = 1: 1: 1: 0.1
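To see what this ratio means as a share of total chromatin mass, the parts can be normalized to fractions. This is a small sketch using only the ratio given above; the variable names are introduced here for illustration.

```python
# Mass ratio of chromatin components, taken from the text.
mass_ratio = {
    "DNA": 1.0,
    "histones": 1.0,
    "nonhistone proteins": 1.0,
    "RNA": 0.1,
}

# Normalize so that the components sum to 1.
total_parts = sum(mass_ratio.values())
mass_fraction = {name: parts / total_parts for name, parts in mass_ratio.items()}

for name, fraction in mass_fraction.items():
    print(f"{name}: {fraction:.1%} of chromatin mass")
```

On this accounting DNA makes up only about a third of the mass of chromatin, and protein about two thirds.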

Histones are small, basic (positively charged), highly conserved proteins. They bind to each other to form specific complexes, around which DNA wraps to form nucleosomes. The nucleosomes are the fundamental repeating unit of chromatin.

There are 5 histones, 4 in the core of the nucleosome and one outside the core.

• H3, H4: Arg rich, most conserved sequence (core histones)
• H2A, H2B: slightly Lys rich, fairly conserved (core histones)
• H1: very Lys rich, most variable in sequence between species

X-ray diffraction studies of histone complexes and the nucleosome core have provided detailed insight into how histones interact with each other and with DNA in this fundamental entity of chromatin structure.

Key reference: "Crystal structure of the nucleosome core particle at 2.8 Å resolution" by Luger, K., Mader, A., Richmond, R.K., Sargent, D.F. & Richmond, T.J., Nature 389: 251-260 (1997).

### Histone Interactions via the Histone fold

The core histones have a highly positively charged amino-terminal tail, and most of the rest of the protein forms an α-helical domain. Each core histone has at least 3 α-helices.

Figure 4.28

The α-helical domain forms a characteristic histone fold, in which the shorter α1 and α3 helices are perpendicular to the longer α2 helix. The α-helices are separated by two loops, L1 and L2. The histone fold is the dimerization domain between pairs of histones, mediating the formation of crescent-shaped heterodimers H3-H4 and H2A-H2B. The histone-fold motifs of the partners in a pair are antiparallel, so that the L1 loop of one is adjacent to the L2 loop of the other.

Figure 4.29

A structure very similar to the histone fold has now been seen in other nuclear proteins, such as some subunits of TFIID, a key component in the general transcription machinery of eukaryotes. It also serves as a dimerization domain for these proteins.

Two H3-H4 heterodimers bind together to form a tetramer.

## Nucleosomes are the Subunits of the Chromatin Fiber

The most extended chromatin fiber is about 10 nm in diameter. It is composed of a series of histone-DNA complexes called nucleosomes.

Principal lines of evidence for this conclusion are:

1. Observations of this 10 nm fiber in the electron microscope showed a series of bodies that looked like beads on a string. We now recognize the beads as the nucleosomal cores and the string as the linker between them.
2. Digestion of DNA in chromatin or nuclei with micrococcal nuclease releases a series of products that contain DNA of discrete lengths. When the DNA from the products of micrococcal nuclease digestion was run on an agarose gel, it was found to be a series of fragments of 200 bp, 400 bp, 600 bp, 800 bp, etc., i.e. integral multiples of 200 bp. This showed that cleavage by this nuclease, which has very little sequence specificity, was restricted to discrete regions in chromatin. Those regions of cleavage are the linkers.
3. Physical studies, including both neutron diffraction and electron diffraction data on fibers and most recently X-ray diffraction of crystals, have provided more detailed structural information.

The nucleosomal core is composed of an octamer of histones with 146 bp of duplex DNA wrapped around it in 1.65 very tight turns. The octamer of histones is actually a tetramer (H3)₂(H4)₂ at the central axis, flanked by two H2A-H2B dimers (one at each end of the core).

Figure 4.30. Schematic views of the nucleosomal core

The 10 nm fiber is composed of a string of nucleosomal cores joined by linker DNA. The length of the linker DNA varies among tissues within an organism and between species, but a common value is about 60 bp. The nucleosome is the core plus the linker, and thus contains about 200 bp of DNA.
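The relationship between the core, the linker, and the micrococcal nuclease ladder can be sketched in a few lines. The function names are introduced here for illustration, and the 60 bp linker is only the "common value" quoted above, so the repeat length will differ between tissues and species.

```python
CORE_BP = 146    # DNA wrapped around the histone octamer
LINKER_BP = 60   # a common value from the text; varies among tissues and species

# One nucleosome = core plus linker, i.e. roughly 200 bp.
repeat_bp = CORE_BP + LINKER_BP

def mnase_ladder(repeat_length_bp, max_nucleosomes):
    """Fragment sizes expected when cleavage is restricted to linkers."""
    return [n * repeat_length_bp for n in range(1, max_nucleosomes + 1)]

def nucleosome_count(fragment_bp, repeat_length_bp=200):
    """Nearest whole number of nucleosomes for an observed fragment size."""
    return round(fragment_bp / repeat_length_bp)

ladder = mnase_ladder(200, 4)
print(f"Repeat length: {repeat_bp} bp (about 200 bp)")
print(f"Expected ladder: {ladder}")
print(f"A 610 bp fragment holds about {nucleosome_count(610)} nucleosomes")
```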

Figure 4.31. A string of nucleosomes

Detailed structure of the nucleosomal core.

### Path of the DNA and tight packing

The 146 bp of DNA is wrapped around the histone octamer in 1.65 turns of a flat, left-handed toroidal superhelix. Thus 14 turns or "twists" of the DNA are in the 1.65 superhelical turns, presenting 14 major and 14 minor grooves to the histone octamer. Pancreatic DNase I will cleave DNA on the surface of the core about every 10 bp, when each twist of the DNA is exposed on the surface.

The DNA superhelix has an average radius of 41.8 Å and a pitch of 23.9 Å. This is a very tight wrapping of the DNA around the histones in the core - note that the duplex DNA on one turn is only a few Å from the DNA on the next turn! The DNA is not uniformly bent in this superhelix. As the DNA wraps around the histones, the major and then minor grooves are compressed, but not in a uniform manner for all twists of the DNA. G+C rich DNA favors the major groove compression, whereas A+T rich DNA favors the minor groove compression. This is an important feature in translational positioning of nucleosomes and could also affect the affinity of different DNAs for histones in nucleosomes.
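The numbers quoted for the path of the DNA can be checked against each other with some simple geometry. In this sketch the radius, pitch, base-pair count, and turn counts come from the text; the duplex diameter of about 20 Å is an assumption added here (a typical value for B-form DNA), used only to estimate the gap between adjacent turns.

```python
import math

CORE_BP = 146               # base pairs in the nucleosomal core
DNA_TWISTS = 14             # turns of the double helix in the core
SUPERHELICAL_TURNS = 1.65   # turns of the superhelix around the octamer
RADIUS_A = 41.8             # average superhelix radius, in angstroms
PITCH_A = 23.9              # superhelix pitch, in angstroms
DUPLEX_DIAMETER_A = 20.0    # assumption: approximate diameter of B-form DNA

# Base pairs per twist of the double helix; compare with the ~10 bp
# spacing of DNase I cleavage sites.
bp_per_twist = CORE_BP / DNA_TWISTS

# Base pairs needed to go once around the octamer.
bp_per_superhelical_turn = CORE_BP / SUPERHELICAL_TURNS

# Path length of one superhelical turn: a helix of radius r and pitch p
# has length sqrt((2*pi*r)^2 + p^2) per turn.
turn_path_A = math.hypot(2 * math.pi * RADIUS_A, PITCH_A)

# Adjacent turns are one pitch apart (center to center), so the space
# left between duplex surfaces is roughly pitch minus duplex diameter.
gap_A = PITCH_A - DUPLEX_DIAMETER_A

print(f"bp per twist: {bp_per_twist:.1f}")
print(f"bp per superhelical turn: {bp_per_superhelical_turn:.1f}")
print(f"path length of one superhelical turn: {turn_path_A:.1f} angstroms")
print(f"gap between adjacent turns: about {gap_A:.1f} angstroms")
```

Under the assumed duplex diameter, the gap comes out to about 4 Å, consistent with the statement that adjacent turns are only a few Å apart.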

The DNA phosphates have high mobility when not contacting histones; the DNA phosphates facing the solvent are much more mobile than is seen with other protein-DNA complexes.

Figure 4.32. A cross-sectional view of the nucleosome core showing histone heterodimers and contacts with DNA. This image corresponds to the proteins and DNA in about one half of the nucleosome.

The left-handed toroidal supercoil of DNA in a nucleosomal core is the equivalent of a right-handed interwound, hence negative, supercoil. Thus the DNA in nucleosomes is effectively underwound.

Figure 4.33.

### Histones in the nucleosome core particle

The protein octamer is composed of four dimers (2 H2A-H2B pairs and 2 H3-H4 pairs) that interact through the "histone fold". The two H3-H4 pairs interact through a 4-helix bundle formed between the two H3 proteins to make the (H3)₂(H4)₂ tetramer. Each H2A-H2B pair interacts with the (H3)₂(H4)₂ tetramer through a second 4-helix bundle between H2B and H4 histone folds.

The histone-fold regions of the (H3)₂(H4)₂ tetramer bind to the center of the DNA, covering a total of about 6 twists of the DNA, or 3 twists of DNA per H3-H4 dimer. Those of the H2A-H2B dimers cover a comparable amount of DNA, 3 twists per dimer. Additional helical regions extend from the histone fold regions and are an integral part of the core protein within the confines of the DNA superhelix.

### Histone-DNA interactions in the core particle.

The histone-fold domains of the heterodimers (H3-H4 and H2A-H2B) each bind 2.5 turns of DNA double helix, generating a 140° bend. The interaction with DNA occurs at two types of sites:

1. The L1 plus L2 loops at the narrowly tapered ends of each heterodimer form a similar DNA binding site for each histone pair. The L1-L2 loops interact with DNA at each end of the 2.5 turns of DNA.
2. The α1 helices of each partner in a pair form the convex surface in the center of the DNA binding site. The principal interactions are H-bonds between amino acids and the phosphate backbone of the DNA (there is little sequence specificity to histone-DNA binding). However, there are some exceptions, such as a hydrophobic contact between H3 Leu65 and the 5-methyl group of thymine. An Arg side chain from a histone fold enters the minor groove at 10 of the 14 times it faces the histone octamer. The other 4 occurrences have Arg side chains from tail regions penetrating the minor groove.

### Histone Tails

The histone N- and C-terminal tails make up about 28% of the mass of the core histone proteins, and are seen over about 1/3 of their total length in the electron density map, i.e. that much of their length is relatively immobile in the structure.

The tails of H3 and H2B pass through channels in the DNA superhelix created by 2 juxtaposed minor grooves. One H4 tail segment makes a strong interparticle connection, perhaps relevant to the higher-order structure of nucleosomes.

The most N-terminal regions of the histone tails are not highly ordered in the X-ray crystal structure. These regions extend out from the nucleosome core and hence could be involved in interparticle interactions. The sites for acetylation and de-acetylation of specific lysines are in these segments of the tails that protrude from the core. Post-translational modifications such as acetylation have been implicated in "chromatin remodeling" to allow or aid transcription factor binding. It seems likely that these modifications are affecting interactions between nucleosomal cores, but not changing the structure of the core particle.

• Some excellent resources are available on the World Wide Web for visualizing and further investigating chromatin structure and its involvement in nuclear processes.
• Dmitry Pruss maintains a site with many good images, including a dynamic, step-by-step view of the nucleosomal core beginning with the histone fold domains and ending with a complete core, with DNA. http://www.average.org/~pruss/nucleosome.html
• Another good site is from J.R. Bone: http://rampages.onramp.net/~jrbone/chrom.html

### Higher order chromatin structure

1. The 10 nm fiber composed of nucleosomal cores and spacers is folded into higher order structures for much of the DNA in chromatin. In fact, the 10 nm fiber with the beads-on-a-string appearance in the electron microscope was prepared at very low salt concentrations and is free of histone H1.
2. In the presence of H1 and at more physiological salt concentrations, chromatin forms a 30 nm fiber. The exact structure of this fiber remains a point of considerable debate, and one cannot rule out the possibility of multiple structures in this fiber.
3. One reasonable model is that the 10 nm fiber coils around itself to generate a solenoid that is 30 nm in diameter, with 6 nucleosomes per turn of the solenoid.

Histone H1 binds to the outer surface of the nucleosomal core, interacting at the points of DNA entry and exit. H1 molecules can be cross-linked to each other with chemical reagents, indicating that the H1 proteins also interact with each other. Interactions between H1 proteins, each bound to a nucleosomal core, may be one of the forces driving the formation of the 30 nm fiber.

Figure 4.34. Model for one turn of the solenoid in the 30 nm fiber.

4. Each level of chromatin structure produces a more compact arrangement of the DNA. This can be described in terms of a packing ratio, which is the length of the DNA in an extended state divided by the length of the DNA in the more compact state.

For the 10 nm fiber, the packing ratio is about 7, i.e. there are 7 μm of DNA per μm of chromatin fiber. The packing ratio in the core is higher (see problems), but this does not include the additional, less compacted DNA in the spacer. In the 30 nm fiber, the packing ratio is about 40, i.e. there are 40 μm of DNA per μm of chromatin fiber.

5. The 30 nm fiber is probably the basic constituent of both interphase chromatin and mitotic chromosomes. It can be compacted further by additional coils and loops. One of the key issues in gene regulation is the nature of the chromatin fiber in transcriptionally active euchromatin. Is it the 10 nm fiber? The 30 nm fiber? Some modification of the latter? Or even some higher order structure? These are topics for current research.
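The packing ratios of about 7 and about 40 quoted above for the 10 nm and 30 nm fibers can be roughly reproduced with a back-of-the-envelope sketch. Several inputs are assumptions made here rather than values from the text: a rise of 0.34 nm per base pair for extended B-form DNA, each nucleosome occupying about 10 nm along the 10 nm fiber, and one turn of the solenoid advancing about 10 nm along the 30 nm fiber. The 200 bp per nucleosome and 6 nucleosomes per turn are taken from the text.

```python
NM_PER_BP = 0.34              # assumption: rise per base pair of B-form DNA
BP_PER_NUCLEOSOME = 200       # core plus linker, from the text
NUCLEOSOME_SPACING_NM = 10    # assumption: fiber length occupied per nucleosome
NUCLEOSOMES_PER_TURN = 6      # solenoid model, from the text
SOLENOID_PITCH_NM = 10        # assumption: fiber length advanced per solenoid turn

# Extended length of the DNA in one nucleosome.
dna_per_nucleosome_nm = BP_PER_NUCLEOSOME * NM_PER_BP

# 10 nm fiber: extended DNA length per unit length of fiber.
ratio_10nm = dna_per_nucleosome_nm / NUCLEOSOME_SPACING_NM

# 30 nm fiber: six nucleosomes' worth of DNA per turn of the solenoid.
ratio_30nm = NUCLEOSOMES_PER_TURN * dna_per_nucleosome_nm / SOLENOID_PITCH_NM

print(f"DNA per nucleosome: {dna_per_nucleosome_nm:.0f} nm")
print(f"Packing ratio, 10 nm fiber: about {ratio_10nm:.1f}")
print(f"Packing ratio, 30 nm fiber: about {ratio_30nm:.1f}")
```

Because the 30 nm estimate scales directly with the assumed pitch of the solenoid, it should be taken as showing only that the solenoid model is compatible with a packing ratio near 40.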

## Contributors

• Ross C. Hardison, T. Ming Chu Professor of Biochemistry and Molecular Biology (The Pennsylvania State University)