1.5: Gene Annotation


Rohan Mehta
Elmhurst University via Consortium of Academic and Research Libraries in Illinois (CARLI)


What is gene annotation?

Once you successfully obtain a DNA sequence, one of the things you might want to find out is "what is this sequence?" The process of figuring out what a particular sequence of DNA actually is is called annotation. When we're trying to see if something is a protein-coding gene, we perform gene annotation.

There are two kinds of annotation. In structural annotation, we divide the potential gene into regions that tell us something about the gene's structure (such as where the start and stop codons are, where transcription factors bind, etc.). In functional annotation, we try to figure out what the gene does (that is, what the protein encoded by the gene does). We'll start with structural annotation.

Structural Annotation

Finding start and stop codons

The most basic thing you can do with a DNA sequence is to find any potential start and stop codons, and then try to determine potential open reading frames (ORFs). As a reminder, a start codon is a three-letter sequence that marks where translation starts, and a stop codon is a three-letter sequence that marks where translation stops. (Codons are read from the mRNA, where T appears as U; here we write them as they appear in the DNA.) In almost all eukaryotes and many bacteria and archaea, the start codon is ATG. The stop codons are TAA, TAG, and TGA.

Fun Fact: Different genetic codes

Not all organisms use the same genetic code that we do! The NCBI has a list of all known genetic codes here. Most of the differences are not in the start and stop codons. Almost all eukaryotes (organisms with a membrane-bound nucleus like us) use the same start codon, although there are a couple of possible rare alternatives. Archaea are a little bit more flexible, but they share common transcriptional machinery with us so they'll be more similar to us than they are to Bacteria. Bacteria can vary their start codon much more freely. Thus, when performing annotation on prokaryotic genomes, you need to account for the fact that there are multiple possible start codons and that some are more likely to be used than others.

Historical Note

The three stop codons have fun names to distinguish them! They are called amber (TAG), ochre (TAA), and opal (TGA). The first one to be discovered was called amber after the English translation of the last name of a then-graduate student, Harris Bernstein, who did most of the experimental work.[1] The next two were named after other colored minerals in keeping with the theme. This article from Scientific American has a nice historical background of the work that isolated these codons, written by the people who did the early work on it:

Edgar, R. S., and R. H. Epstein. “The Genetics of a Bacterial Virus.” Scientific American, vol. 212, no. 2, 1965, pp. 70–79. JSTOR, https://www.jstor.org/stable/24931779. Accessed 4 Sept. 2024.

The general method for finding a small sequence in a bigger sequence where there is some possibility of variation in the small sequence is by using a position-specific scoring matrix (PSSM), also called a position-weight matrix (PWM) or position-specific weight matrix (PSWM). We will try to find all the start and stop codons in an example DNA sequence using this method in a few different ways. Now, most of the time you can just eyeball a sequence to find potential start and stop codons, or literally just search for the specific sets of three letters, but this is a good place to start learning about the more general method.

Start Codon (ATG)

Let's start with this sequence:

TATGCGTTGATTCGTCTAAC

Exercise \(\PageIndex{1}\)

Eyeball the above sequence to find the positions of possible start and stop codons.

Answer: The start codon is at position 2, and there are two possible stop codons at positions 8 and 17.
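The "just search for the three letters" approach can be sketched in a few lines of Python. This is a minimal illustration, not a full annotation tool: the helper name `find_codons` is made up for this example, it only looks at the strand as written, and it reports matches in every reading frame using 1-based positions.

```python
def find_codons(seq, codons):
    """Return (1-based position, codon) for every exact match, in any frame."""
    hits = []
    for i in range(len(seq) - 2):
        triplet = seq[i:i + 3]
        if triplet in codons:
            hits.append((i + 1, triplet))
    return hits


seq = "TATGCGTTGATTCGTCTAAC"
starts = find_codons(seq, {"ATG"})
stops = find_codons(seq, {"TAA", "TAG", "TGA"})

print("start codons:", starts)
print("stop codons: ", stops)
```

The search finds the same start codon and the same two stop codons as the eyeballed answer.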

Now let's construct a PSSM and do this formally. The way we construct a matrix like this is we take all the possible sequences that we're looking for and see how frequently each base comes up at each position in the sequence. For the start codon, this is easy. It's A 100% of the time, then T 100% of the time, then G 100% of the time. We represent this in matrix form by:

Base	Position 1	Position 2	Position 3
A	1	0	0
C	0	0	0
G	0	0	1
T	0	1	0
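The counting procedure just described can be written as a short Python sketch. The function name `build_pssm` and the dictionary-of-lists layout are choices made for this illustration only; the function assumes all example sequences have the same length and contain only A, C, G, and T.

```python
BASES = "ACGT"


def build_pssm(examples):
    """Column-wise base frequencies for a list of equal-length sequences."""
    length = len(examples[0])
    counts = {base: [0] * length for base in BASES}
    for example in examples:
        for i, base in enumerate(example):
            counts[base][i] += 1
    # Convert counts to frequencies by dividing by the number of examples.
    return {base: [c / len(examples) for c in counts[base]] for base in BASES}


start_pssm = build_pssm(["ATG"])
for base in BASES:
    print(base, start_pssm[base])
```

With the single example ATG, every column has one base at frequency 1 and the rest at 0, which is the matrix shown in the table.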

Exercise \(\PageIndex{2}\)

Create a PSSM for the stop codon TAA

Answer

Base	Position 1	Position 2	Position 3
A	0	1	1
C	0	0	0
G	0	0	0
T	1	0	0

Note

Suppose I tried to create a PSSM for "the stop codon" and came up with something like this:

Base	Position 1	Position 2	Position 3
A	0	0.67	0.67
C	0	0	0
G	0	0.33	0.33
T	1	0	0

There are multiple reasons why this is incorrect. The first is that I treated all three stop codons equally, when in reality some are more common than others. The most important reason, however, is that we know what the stop codons are exactly, and there's no reason to make this probabilistic. Any time we see either a TGA, TAG, or TAA, that's always a stop codon, and nothing else is. PSSMs assume independence of each position, and in the case of these stop codons this is not true.

Exercise \(\PageIndex{3}\)

There is one three-letter code that has positive probability using this table but is not a stop codon. Which one is that? Why does the independence assumption of PSSMs lead us to do something wrong here?

Answer: TGG. It would have a lower probability than the actual stop codons, but it still shows up as possible using this matrix. This is because positions 2 and 3 are not actually independent of each other in our stop codons. The first position is always a T, but the next two can either be A or G. However! Only the combinations AG, AA, and GA are possible; GG is not! Thus, the two positions are not independent of each other, and making a PSSM like this is incorrect.
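To see the independence problem concretely, the following sketch enumerates all 64 triplets and keeps those with positive probability under the pooled table from the note. The values are copied from that table (rounded to two decimals), and `pssm_probability` is an illustrative helper name, not part of any library.

```python
from itertools import product

# Pooled "stop codon" PSSM from the note (values rounded to two decimals).
pooled = {
    "A": [0.0, 0.67, 0.67],
    "C": [0.0, 0.0, 0.0],
    "G": [0.0, 0.33, 0.33],
    "T": [1.0, 0.0, 0.0],
}


def pssm_probability(triplet, pssm):
    """Multiply the per-position probabilities, treating positions as independent."""
    p = 1.0
    for i, base in enumerate(triplet):
        p *= pssm[base][i]
    return p


possible = {}
for bases in product("ACGT", repeat=3):
    triplet = "".join(bases)
    p = pssm_probability(triplet, pooled)
    if p > 0:
        possible[triplet] = p

for triplet, p in sorted(possible.items()):
    print(triplet, round(p, 4))
```

Four triplets come out with positive probability: the three real stop codons plus TGG, which is not a stop codon.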

To use a PSSM, we pick a potential starting site and use the matrix to get probabilities for each letter in the potential sequence. Let's start with the first position (the triplet being tested is shown in brackets):

[TAT]GCGTTGATTCGTCTAAC

The probability of T in the first position is 0, A in the second position is 0, and T in the third position is 0. So the overall probability that this triplet is a start codon is

\[ 0 \times 0 \times 0 = 0\]

If we then try, say, position 2, we get (triplet in brackets)

T[ATG]CGTTGATTCGTCTAAC

The probability of A in the first position is 1, T in the second position is 1, and G in the third position is 1. So the overall probability that this triplet is a start codon is

\[ 1 \times 1 \times 1 = 1\]

Repeating this for all the positions yields all 0s except for position 2. We have found 1 start codon (at position 2) using this method. Admittedly, this method is overkill for finding start and stop codons; generally we just search for all exact matches of the three letters. We will use PSSMs to find much more difficult-to-detect sequences later in this chapter.
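The sliding-window procedure can be sketched as follows. The `scan` function is a hypothetical helper written for this example; it multiplies the per-position probabilities for every window of the PSSM's width and returns one probability per starting position.

```python
# Start-codon PSSM from the table earlier in this section.
START_PSSM = {
    "A": [1, 0, 0],
    "C": [0, 0, 0],
    "G": [0, 0, 1],
    "T": [0, 1, 0],
}


def scan(seq, pssm):
    """Return the PSSM probability of the window starting at each position."""
    width = len(pssm["A"])
    scores = []
    for i in range(len(seq) - width + 1):
        p = 1.0
        for j in range(width):
            p *= pssm[seq[i + j]][j]
        scores.append(p)
    return scores


scores = scan("TATGCGTTGATTCGTCTAAC", START_PSSM)
hits = [i + 1 for i, p in enumerate(scores) if p > 0]  # 1-based positions
print(hits)
```

The only window with a nonzero probability starts at position 2.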

Open Reading Frames

The sequence from a start codon to a stop codon is called an open reading frame or ORF; this is the part that is eventually translated into a protein. In our example, there are two possible ORFs (shown in brackets):

T[ATGCGTTGA]TTCGTCTAAC

and

T[ATGCGTTGATTCGTCTAA]C

In prokaryotes, there are no introns, so the DNA sequence from the start codon through the stop codon must be a multiple of three nucleotides long. In eukaryotes, we can have introns, so we cannot eliminate any potential ORFs based on length like that. But! In this case we can eliminate the second possibility for a simple reason: it has a stop codon in the middle! An ORF cannot have an in-frame stop codon in the middle; translation would never get through the whole sequence.
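The in-frame rule can be turned into a small ORF finder. This is a sketch under simplifying assumptions: it only reads the strand as written, it assumes no introns (prokaryote-style), and `find_orfs` is a name invented for this example. From each ATG it steps forward three bases at a time and stops at the first in-frame stop codon.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end, orf) tuples with 1-based inclusive coordinates."""
    orfs = []
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != "ATG":
            continue
        # Walk codon by codon, staying in the frame set by the start codon.
        for j in range(i + 3, len(seq) - 2, 3):
            if seq[j:j + 3] in STOPS:
                orfs.append((i + 1, j + 3, seq[i:j + 3]))
                break
    return orfs


orfs = find_orfs("TATGCGTTGATTCGTCTAAC")
print(orfs)
```

Because the walk ends at the first in-frame stop, only the shorter candidate is reported; the longer one is never produced, since it would have to read through the TGA.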

Activity 1 uses an online tool to explore finding ORFs in short sequence data. R activity 1 provides an example of annotating a full eukaryotic gene.

Regulatory Sequences

Let's look at an example of how to use a PSSM to detect a specific kind of regulatory sequence. The Kozak consensus sequence is a sequence around the start codon that is thought to optimize the initiation of translation in eukaryotes. More information on Marilyn Kozak can be found in the HISTORICAL BOX. In her original paper in 1987, Marilyn Kozak tabulated the nucleotide distributions around the start codons of 699 vertebrate mRNA sequences. Here is the PSSM, adapted from that paper:

Base	P1	P2	P3	P4	P5	P6	P7	P8	P9	P10	P11	P12	P13	P14	P15	P16
A	0.23	0.26	0.25	0.23	0.19	0.23	0.17	0.18	0.25	0.61	0.27	0.15	1	0	0	0.23
C	0.35	0.35	0.35	0.26	0.39	0.37	0.19	0.39	0.53	0.02	0.49	0.55	0	0	0	0.16
G	0.23	0.21	0.22	0.33	0.23	0.20	0.44	0.23	0.15	0.36	0.13	0.21	0	0	1	0.46
T	0.19	0.18	0.18	0.18	0.19	0.20	0.20	0.20	0.07	0.01	0.11	0.09	0	1	0	0.15

Taking the most probable base at each position gives the consensus sequence: CCC GCC GCC ACC [ATG] G, where the start codon is in brackets.

Exercise \(\PageIndex{4}\)

Use the PSSM to determine which of the following two sequences is more likely to be a translation initiation site:

CTC GAG GCC AGC [ATG] G

AAA GCT GCT ACC [ATG] G

Answer: The first one has total probability \[0.35 \times 0.18 \times 0.35 \times 0.33 \times 0.19 \times 0.20 \times 0.44 \times 0.39 \times 0.53 \times 0.61 \times 0.13 \times 0.55 \times 0.46 = 5.045370 \times 10^{-7}\] The second one has total probability \[0.23 \times 0.26 \times 0.25 \times 0.33 \times 0.39 \times 0.20 \times 0.44 \times 0.39 \times 0.07 \times 0.61 \times 0.49 \times 0.55 \times 0.46 = 3.495518 \times 10^{-7}\] so the first one is more likely to be a translation initiation site. (Positions 13-15 are the start codon itself and each contribute a factor of 1, so they are left out of the products.)
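The calculation in Exercise 4 can be reproduced with a few lines of Python. This is a sketch: the `KOZAK` dictionary is transcribed from the table above, `site_probability` is an illustrative helper name, and all 16 positions are scored for both sequences, including position 16 just after the start codon.

```python
# Kozak PSSM transcribed from the table above (positions P1..P16).
KOZAK = {
    "A": [0.23, 0.26, 0.25, 0.23, 0.19, 0.23, 0.17, 0.18, 0.25, 0.61, 0.27, 0.15, 1, 0, 0, 0.23],
    "C": [0.35, 0.35, 0.35, 0.26, 0.39, 0.37, 0.19, 0.39, 0.53, 0.02, 0.49, 0.55, 0, 0, 0, 0.16],
    "G": [0.23, 0.21, 0.22, 0.33, 0.23, 0.20, 0.44, 0.23, 0.15, 0.36, 0.13, 0.21, 0, 0, 1, 0.46],
    "T": [0.19, 0.18, 0.18, 0.18, 0.19, 0.20, 0.20, 0.20, 0.07, 0.01, 0.11, 0.09, 0, 1, 0, 0.15],
}


def site_probability(site, pssm):
    """Multiply per-position probabilities; spaces and brackets are ignored."""
    bases = [ch for ch in site if ch in "ACGT"]
    p = 1.0
    for i, base in enumerate(bases):
        p *= pssm[base][i]
    return p


first = site_probability("CTC GAG GCC AGC [ATG] G", KOZAK)
second = site_probability("AAA GCT GCT ACC [ATG] G", KOZAK)
consensus = "".join(max("ACGT", key=lambda b: KOZAK[b][i]) for i in range(16))

print(f"first:  {first:.6e}")
print(f"second: {second:.6e}")
print("consensus:", consensus)
```

The first sequence scores higher than the second, and picking the most probable base in each column recovers the consensus sequence.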

Other kinds of regulatory sequences include TATA boxes, Pribnow boxes, Shine-Dalgarno sequences (for ribosome binding sites), and others.

Note: What if there's a zero probability?

In the Kozak consensus sequence example above, positions 13-15 assign zero probability to some bases. In this case, this is because we know that start codons must have a specific sequence. But what if there is in fact potential variation in the sequences that we just haven't observed in samples yet? If we set those to zero, we could miss out on detecting something correctly with a different variant than what we have already observed. This might be unlikely, but surely it isn't impossible!

The way we generally get around this problem is to use "pseudocounts" instead of just counting the bases that show up at a particular position. So, instead of counting \(B_i\), the number of times base \(B\) appears at position \(i\), and dividing by \(N\), the number of samples, to get the frequency of that base at that position, we do the following:

Take \(\sqrt{N}\) and multiply it by \(f_B\), the overall frequency of base \(B\). Add that number to \(B_i\), and divide that sum by \(N+\sqrt{N}\). That's your new frequency: \[\frac{B_i + \sqrt{N}\, f_B}{N + \sqrt{N}}\] This may look weird, but it comes from Bayesian statistics.
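As a quick numerical illustration of the pseudocount rule, here is a minimal Python sketch. The function name, the sample size of 100, the counts, and the uniform background frequency of 0.25 are all assumptions chosen for the example, not values from Kozak's data.

```python
import math


def pseudocount_frequency(count, n_samples, background):
    """(B_i + sqrt(N) * background) / (N + sqrt(N))."""
    weight = math.sqrt(n_samples)
    return (count + weight * background) / (n_samples + weight)


# Hypothetical column: N = 100 samples, with T never observed at this position.
counts = {"A": 61, "C": 2, "G": 37, "T": 0}
n = sum(counts.values())
adjusted = {base: pseudocount_frequency(c, n, 0.25) for base, c in counts.items()}

for base, freq in adjusted.items():
    print(base, round(freq, 4))
```

The base that was never observed now gets a small positive frequency instead of zero, and as long as the background frequencies sum to 1 the adjusted frequencies still sum to 1.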

In addition to using pseudocounts, you might have noticed that the probabilities in exercise 4 are very small. Instead of multiplying all those probabilities together, we typically take the negative log of each probability and then add the results. A larger product of probabilities corresponds to a smaller sum of negative logs.
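The log trick can be checked directly. In this sketch, `neg_log_score` is an illustrative helper and the probabilities are just the first few factors from Exercise 4; the point is that summing negative logs carries the same information as the product.

```python
import math


def neg_log_score(probabilities):
    """Sum of -log(p); smaller totals correspond to larger products."""
    return sum(-math.log(p) for p in probabilities)


probs = [0.35, 0.18, 0.35, 0.33]
score = neg_log_score(probs)
product = math.prod(probs)

print("negative log score:", score)
print("recovered product: ", math.exp(-score))
print("direct product:    ", product)
```

Note that `math.log(0)` raises an error, which is one more practical reason to apply pseudocounts before switching to log scores.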

Note: State-of-the-art

The techniques we go over in this chapter are fairly basic. Current structural annotation software typically relies on Hidden Markov Models (HMMs), which are beyond the scope of this textbook. For now, this tutorial is a good place to start.

Introns and Exons

Exons (aka the protein-coding regions of a gene) are typically bookended by specific sequences. An exon can start right after the start codon, or it can start at a 3' "acceptor" site after an intron. This acceptor site typically has the form CAG/G, where nucleotides upstream of the slash (which is the cut site) are in the intron and nucleotides downstream of the cut site are part of the exon. The most highly-conserved part of this pattern is the "AG" before the cut site, which almost never changes.

Exons can end either just before a stop codon or at a 5' "donor" site, which typically has the sequence GG/GT (or GG/GU for RNA). In this case, nucleotides upstream of the slash are part of the exon, and nucleotides downstream are part of an intron. The most highly-conserved part of this pattern is the "GT" after the cut site, which almost never changes.

Because these distinctive sequences are so short, it's not very useful to make a PSSM for them. Most annotation techniques today use Hidden Markov Models (HMMs), which model whether a region is an exon or an intron in a much more sophisticated way.
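A small experiment shows why such short signals are not enough on their own. The sequence below is made up for illustration (a tiny exon, an intron beginning with GT and ending with AG, then a second exon), and `find_all` is a hypothetical helper. The sketch pairs every GT with every later AG to list the candidate introns.

```python
def find_all(seq, motif):
    """Return 0-based start indices of every occurrence of motif in seq."""
    return [i for i in range(len(seq) - len(motif) + 1) if seq[i:i + len(motif)] == motif]


# Hypothetical toy gene: exon "ATGGCGG", intron "GTAAGTCTTTCCAG", exon "GCTTAA".
true_intron = "GTAAGTCTTTCCAG"
seq = "ATGGCGG" + true_intron + "GCTTAA"

donors = find_all(seq, "GT")     # candidate starts of an intron
acceptors = find_all(seq, "AG")  # candidate ends of an intron

# Every GT followed later by an AG is a candidate intron.
candidates = [seq[d:a + 2] for d in donors for a in acceptors if a > d]
for c in candidates:
    print(c)
```

Even in this 27-base toy sequence, the GT and AG dinucleotides produce three candidate introns, only one of which is the real one. In a real genome the number of chance matches is far larger, which is why more sophisticated models are needed.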

Exercise \(\PageIndex{5}\)

What is an alternative way to find out if something is an exon without even necessarily looking at the sequence specifically? (Hint: what does an exon do that an intron doesn't?)

Answer: Introns are spliced out of mRNA and the exons are kept in. So, just sequence post-splicing mRNA and align it to the DNA. The exons will match the mRNA, and the introns won't match anything! This technique is a common way to verify if something is an exon. It cannot, however, be used to say that something is not an exon, because many genes are turned off or downregulated in different tissues or at different times and don't actually produce mRNA all the time.

Other Types of Sequences

There are many parts of a genome that are not protein-coding. For instance, some parts of the genome code for tRNA and rRNA. These sequences are very highly conserved and their patterns are recognizable enough that we have specific tools to identify them that work extremely well. So, typically, when you want to annotate a genome, one of the first things you do is run one of these tools on the genome to identify tRNA- and rRNA-coding DNA, and then ignore those for the rest of the annotation process.

Bacteria, like everything else, are constantly under attack from viruses. These viruses (called "bacteriophages") often leave pieces of their genome inside the bacterial genome, and these sequences also have specific patterns that we can look for. (We can also, like with tRNA and rRNA sequences, compare them to a relatively small database of known sequences). So when annotating bacterial genomes, an early step is often to look for these "prophage" genes.

Eukaryotes have a lot of repetitive DNA. Some of this DNA, called short tandem repeats (STRs) or "microsatellites", consists of short sequences repeated over and over. Because of their simple structure, these are relatively easy to identify and then "mask" (or ignore) for the rest of the annotation process. Other kinds of repetitive DNA, where the repeats are not all lined up next to each other (such as those resulting from transposon activity), are more complex. For these, we usually use the same strategy as for the other types of sequences described here: we compare against the set of known transposon sequences, allowing for some variation.
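As a rough illustration of how simple STR masking can be, here is a sketch using a regular expression. The function name `mask_strs` and its parameters (`max_unit`, `min_repeats`) are made up for this example, and the thresholds are arbitrary; dedicated repeat-masking software is considerably more careful. Each repeat is replaced by a run of N characters of the same length, so coordinates in the rest of the sequence are unchanged.

```python
import re


def mask_strs(seq, max_unit=6, min_repeats=4):
    """Replace tandem repeats of short units with N, preserving sequence length."""
    # A unit of 1..max_unit bases followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq)


seq = "GATT" + "AC" * 5 + "GGT"
masked = mask_strs(seq)
print(seq)
print(masked)
```

The five copies of AC are masked, while the flanking sequence is left untouched.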
