Activity 1-3 - Genetic evolution and Identifying Homologs
- Define what BLAST and BLASTp are and why they are used in evolutionary biology.
- Explain the difference between homologous and analogous sequences.
- Describe how protein conservation can reveal functional and evolutionary insights.
- Recognize the strengths of BLASTp in detecting distant evolutionary relationships.
Students should read:
- The introductory text on BLAST and BLASTp provided in the chapter.
- The example comparing jaw bones and ear bones in vertebrates.
- Definitions of homologs vs. analogs.
- Overview of how BLASTp works (breaking sequences into words, E-values, identity %).
- BLAST (Basic Local Alignment Search Tool): A bioinformatics tool that finds regions of similarity between biological sequences.
- BLASTp: A version of BLAST that compares protein sequences, useful for finding distant evolutionary relationships.
- Homologs: Genes or proteins that share a common ancestor and often a similar function.
- Analogs: Genes or proteins with similar function but different evolutionary origins (convergent evolution).
- E-value (Expect Value): A statistical measure that estimates how many matches with a score this good or better you would expect to find by chance in a database of this size. Smaller = more significant.
- Identity %: The percentage of amino acids that exactly match between two protein sequences.
- Query Coverage: The percentage of your protein that aligns with a match in the database.
- Functional Domain: A conserved region in a protein that is critical for its biological function.
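To make the Identity % and Query Coverage definitions concrete, here is a minimal Python sketch for a single pairwise alignment. The two short sequences and the 12-residue query length are invented for illustration, and BLAST's own coverage figure can combine several aligned regions, so treat this as an approximation of what the report shows.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percent of alignment columns where both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(
        1
        for q, s in zip(aligned_query, aligned_subject)
        if q == s and q != "-"  # a gap ("-") never counts as a match
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(aligned_query: str, query_length: int) -> float:
    """Percent of the full query that takes part in the alignment."""
    aligned_residues = sum(1 for q in aligned_query if q != "-")
    return 100.0 * aligned_residues / query_length


# Hypothetical 12-residue query of which only the first 10 residues aligned.
aligned_query = "MKTAYIAKQR"
aligned_subject = "MKTAYLAKHR"

identity = percent_identity(aligned_query, aligned_subject)
coverage = query_coverage(aligned_query, query_length=12)
print(f"identity = {identity:.1f}%, coverage = {coverage:.1f}%")
```

Here 8 of the 10 aligned positions match, and 10 of the 12 query residues are aligned, so a hit can have high identity over a region that covers only part of the query.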
Exploring Protein Homology Using BLASTp
What is BLAST?
BLAST (Basic Local Alignment Search Tool) is a powerful bioinformatics tool used to compare your sequence (DNA or protein) against a vast database of sequences hosted by NCBI. It helps identify similar sequences. When the similarity is statistically significant, the sequences are likely homologs, meaning they share a common evolutionary origin. There are different types of BLAST:
- BLASTn compares nucleotide sequences (DNA or RNA).
- BLASTp compares protein sequences (amino acids).
In this lab, we use BLASTp. Why? Because protein sequences generally change more slowly than the DNA that encodes them, BLASTp is better at recognizing distant evolutionary relationships. There are two reasons. First, the genetic code is redundant: a mutation can change a DNA codon while the new codon still codes for the identical amino acid (for example, GAA → GAG, both glutamic acid). The nucleotide sequences differ, but the proteins are the same. Second, amino acids have different chemical properties, and BLASTp scores substitutions between chemically similar amino acids (such as glutamic acid → aspartic acid, both acidic) more favorably than substitutions between dissimilar ones. So even when the sequence changes, BLASTp can recognize that the protein still “looks” similar functionally. BLASTn compares nucleotides only and gives no credit for either kind of similarity.
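The codon argument can be checked with a few lines of Python. This is a small sketch using a deliberately partial codon table (only the codons needed here) and three invented 12-nucleotide sequences: one original, one with synonymous changes only, and one with a single glutamic acid to aspartic acid change.

```python
# Partial standard codon table: only the codons used in this example.
CODON_TABLE = {
    "ATG": "M",
    "GAA": "E", "GAG": "E",   # glutamic acid
    "GAT": "D", "GAC": "D",   # aspartic acid
    "AAA": "K", "AAG": "K",   # lysine
    "CTG": "L", "CTC": "L",   # leucine
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


original = "ATGGAAAAACTG"      # M E K L
synonymous = "ATGGAGAAGCTC"    # same protein, three nucleotide changes
conservative = "ATGGATAAACTG"  # E -> D, one nucleotide change

for name, dna in [("synonymous", synonymous), ("conservative", conservative)]:
    print(
        name,
        "| DNA mismatches:", mismatches(original, dna),
        "| protein:", translate(dna),
        "| protein mismatches:", mismatches(translate(original), translate(dna)),
    )
```

The synonymous variant differs from the original at three nucleotide positions yet encodes the same protein. The conservative variant changes one residue, E to D, a pair that the BLOSUM62 scoring matrix used by BLASTp gives a positive score.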
Here’s an interesting example of homology and evolution: In mammals, two of the three tiny bones in your middle ear (the malleus and incus) evolved from bones that formed the jaw joint in reptile-like ancestors. Over time, they shifted function and became part of the hearing apparatus in mammals. So if you compare the proteins involved in forming middle ear bones in humans and jawbones in reptiles, you might find homologous proteins: same evolutionary origin, different anatomical roles. This is a great example of how protein conservation gives us clues about evolutionary history.
- Homologs: Sequences that are similar because they come from a common ancestor. (e.g., human and chimp hemoglobin).
- Analogs: Sequences or structures that serve similar functions but evolved independently—a process called convergent evolution. (e.g., bat wings and insect wings).
How BLASTp Works
When you paste your protein into BLASTp, the search proceeds in four steps:
- BLAST breaks the sequence into short segments (called "words").
- It scans the database looking for matching words.
- Once a match is found, it extends the alignment on both sides.
- It scores the quality of the alignment and reports Identity (% of amino acids that match exactly) and an E-value (the number of equally good matches expected by chance alone; the smaller, the better).
BLAST then ranks and displays the best matches (candidate homologs) in a user-friendly way: a colorful graphic (where red = best hit), a description table with identity %, E-value, and coverage, and the actual sequence alignments.
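The word-matching and extension idea can be illustrated with a toy Python sketch. It is a simplification and not how NCBI BLAST is implemented: it assumes exact word matches with a word length of 3, a flat score of +2 per match and -1 per mismatch, and ungapped extension that stops once the score falls more than `x_drop` below the best seen. Real BLASTp also accepts similar (not just identical) words scored with a substitution matrix and allows gaps. The function names and the two sequences are invented for illustration.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=3, match=2, mismatch=-1, x_drop=3):
    """Ungapped extension of one seed; returns (score, q_start, q_end, s_start), q_end exclusive."""
    score = best = match * word_size
    # Extend to the right of the seed.
    q_end = i + word_size
    qi, sj = i + word_size, j + word_size
    while qi < len(query) and sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if score > best:
            best, q_end = score, qi
        elif best - score > x_drop:
            break
    # Extend to the left of the seed, starting from the best score so far.
    score = best
    q_start = i
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        score += match if query[qi] == subject[sj] else mismatch
        if score > best:
            best, q_start = score, qi
        elif best - score > x_drop:
            break
        qi -= 1
        sj -= 1
    s_start = j - (i - q_start)
    return best, q_start, q_end, s_start


query = "ACDEFGHIK"
subject = "WWACDEYGHIKWW"

seeds = find_seeds(query, subject)
best_hsp = max(extend_seed(query, subject, i, j) for i, j in seeds)
score, q_start, q_end, s_start = best_hsp
print("seeds:", seeds)
print("score:", score)
print(query[q_start:q_end])
print(subject[s_start:s_start + (q_end - q_start)])
```

In this example every seed extends into the same nine-residue alignment, which spans the single F/Y mismatch because the surrounding matches outweigh its penalty.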
Lab Protocol
- Find Your Protein
  - Go to NCBI Protein Database (https://www.ncbi.nlm.nih.gov/protein)
  - Search for your protein of interest (e.g., "sonic hedgehog").
  - Click on the result and look for the GenBank accession number (e.g., NP_000257.2, BAA33523.2, etc).
- Launch BLASTp
  - On the right side of the page, under “Analyze this sequence,” click "Run BLAST".
  - This sends your protein into the BLASTp query form automatically.
- Set Up Your Search
  - Confirm that BLASTp is selected (top left).
  - Under Organism, search for one species at a time (e.g., Arabidopsis thaliana, E. coli, etc.).
  - Click "BLAST" at the bottom to begin.
- Review Your Results. You’ll see three main sections:
  - Graphic Summary: Shows where the best hits align. Red bars mean highly similar regions.
    - The Graphic Summary is a visual snapshot showing how well other proteins in the database match your query protein. The position of each bar shows where in your protein the match occurred, and the color of the bar indicates how strong that match is. Colors represent alignment score ranges: red bars indicate very strong matches (high similarity), orange or pink bars show moderate matches, green or blue bars reflect lower similarity, and black or gray bars represent weak or non-significant matches. In NCBI's color key, red corresponds to alignment scores of 200 or more and black to scores below 40. For example, if you see a red bar for a chimpanzee protein and a green bar for a fruit fly protein, it suggests that the chimpanzee version is highly similar to your query protein (likely the same function), while the fruit fly version is more distantly related, possibly sharing only one conserved domain.
  - Descriptions Table: Includes Identity %, E-value (smaller is better), and Query Coverage.
    - The Descriptions Table gives you a detailed breakdown of each top “hit,” that is, each protein that had some level of similarity to your query. The Max Score tells you how well the sequences align overall, with higher scores indicating better matches. Query Coverage shows the percentage of your protein that aligned with the match; a higher coverage means a longer stretch of similarity. The E-value (or Expect value) is a critical measure: it estimates how many matches this good you would expect to see by random chance in a database of this size. The lower the E-value, the more significant the match is; values close to 0.0 suggest that the sequences are almost certainly homologous. Finally, the Percent Identity tells you how similar the actual amino acid sequences are; values above 90% usually indicate the protein is nearly identical and probably has the same function in different organisms. For instance, if your human protein aligns with a mouse protein with 98% identity, 100% query coverage, and an E-value of 0.0, that means the two proteins are almost identical and are likely performing the same function in both species.
  - Alignments: Detailed comparison of your protein to others.
    - The Alignments section lets you dig into the exact amino acid-by-amino acid comparison between your protein and the matching protein. It shows whether the amino acids are exactly the same (identical residues) or biochemically similar (conservative substitutions), and highlights regions that match. This is where you can see not only that proteins are similar but also which parts are similar, which is key to understanding whether functional domains are conserved. This detailed view helps you ask deeper biological questions: for example, if only the middle part of the protein aligns across species, maybe that region is a conserved domain important for binding DNA or ATP. If the alignment only occurs in one part of the protein, it might not be the entire protein that’s conserved, just a functional part. This is common with proteins that share functional domains but are otherwise unrelated.
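The E-value in the Descriptions Table is tied to the alignment's bit score by the standard BLAST statistics relation E = m × n × 2^(-S'), where m is the query length, n is the total length of the database, and S' is the bit score. The sketch below applies that relation under stated assumptions: the query and database lengths are hypothetical, and BLAST itself uses edge-corrected "effective" lengths, so its reported values will differ somewhat.

```python
import math


def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits scoring at least `bit_score` (E = m * n * 2^-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_probability(e_value: float) -> float:
    """Probability of at least one chance hit, assuming chance hits are Poisson distributed."""
    return -math.expm1(-e_value)


# Hypothetical search space: a 300-residue query against 1e9 database residues.
QUERY_LENGTH = 300
DATABASE_LENGTH = 1_000_000_000

for bits in (30, 40, 60, 100):
    e = expect_value(bits, QUERY_LENGTH, DATABASE_LENGTH)
    p = chance_probability(e)
    print(f"bit score {bits:>3}: E = {e:.2e}, P(at least one chance hit) = {p:.2e}")
```

Each additional bit of score halves the E-value, and the E-value grows with the size of the database searched. For small values, E and the probability of a chance hit are nearly equal, which is why the E-value is often loosely read as a probability.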
Record Your Observations
Use the table below to document your results from different clades (groups of organisms). The number in parentheses after each clade is its NCBI Taxonomy ID, which you can use to select that group in the Organism field. For each species, note the bar color, % identity, and E-value. Then, hypothesize the function of the protein in that organism. (You should look up the function of your protein in these organisms. If the function is unknown, you can come up with your own hypothesis.)
Clade / Organism | Bar Color | Identity % | E-value | Hypothesized Function |
---|---|---|---|---|
Primates (9443) | | | | |
Marsupials (9263) | | | | |
Monotremes (9255) | | | | |
Birds (8782) | | | | |
Lizards (8504) | | | | |
Amphibians (8292) | | | | |
Fishes (117569) | | | | |
Fruit Flies (7211) | | | | |
Sea Urchins (7625) | | | | |
Sponges (6040) | | | | |
Arabidopsis (3702) | | | | |
Yeast (4932) | | | | |
E. coli (562) | | | | |
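Because the table lists each clade with its NCBI Taxonomy ID, the thirteen searches can also be prepared as Entrez organism filters of the form `txid9443[Organism]`. The Python sketch below only builds those filter strings. The `run_blastp` helper is a hypothetical wrapper that is defined but never called here; it assumes Biopython is installed and that its `NCBIWWW.qblast` function accepts an accession as the query, and it needs network access.

```python
# Clade names and taxonomy IDs copied from the observation table above.
CLADES = {
    "Primates": 9443,
    "Marsupials": 9263,
    "Monotremes": 9255,
    "Birds": 8782,
    "Lizards": 8504,
    "Amphibians": 8292,
    "Fishes": 117569,
    "Fruit Flies": 7211,
    "Sea Urchins": 7625,
    "Sponges": 6040,
    "Arabidopsis": 3702,
    "Yeast": 4932,
    "E. coli": 562,
}


def entrez_organism_filter(taxid: int) -> str:
    """Build an Entrez query that restricts results to one taxon and its descendants."""
    return f"txid{taxid}[Organism]"


def run_blastp(query: str, taxid: int, database: str = "nr"):
    """Hypothetical helper: submit one taxon-restricted BLASTp search to NCBI.

    Assumes Biopython is installed and the machine has network access.
    """
    from Bio.Blast import NCBIWWW  # imported lazily so the rest runs without Biopython

    return NCBIWWW.qblast(
        "blastp", database, query, entrez_query=entrez_organism_filter(taxid)
    )


filters = {clade: entrez_organism_filter(taxid) for clade, taxid in CLADES.items()}
for clade, entrez_query in filters.items():
    print(f"{clade:<12} {entrez_query}")

# Example call (not executed here): run_blastp("NP_000257.2", CLADES["Primates"])
```

Running one search per clade keeps strong vertebrate hits from crowding weaker matches in distant groups out of the results list. Searches submitted this way go to NCBI's shared servers, so they should be spaced out rather than sent all at once.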
Reflection Questions
- What organisms had the most conserved (similar) version of your protein?
- How do the protein’s functions vary between simpler organisms (like bacteria) and more complex ones (like mammals)?
- Can you identify a trend in protein conservation across evolution?
- What does this say about the importance of your protein?
Example BLASTp Results Table
Query protein: Human MYH7 (Myosin heavy chain 7)
Function in humans: Part of the motor protein complex for cardiac and skeletal muscle contraction.
Clade / Organism | Bar Color | Identity % | E-value | Hypothesized Function |
---|---|---|---|---|
Primates (9443) | Red | 99% | 0.0 | Muscle contraction in heart and limbs |
Marsupials (9263) | Red | 96% | 0.0 | Skeletal and cardiac muscle contraction |
Monotremes (9255) | Red | 94% | 0.0 | Same – contractile protein in heart and skeletal muscle |
Birds (8782) | Red | 85% | 2e-100 | Flight and leg muscle contraction |
Lizards (8504) | Red | 83% | 5e-90 | Muscle movement (locomotion, tail movement) |
Amphibians (8292) | Orange | 78% | 3e-80 | Swimming muscle function; limb movement |
Fishes (117569) | Orange | 70% | 2e-60 | Swimming muscle movement; tail fin muscle control |
Fruit Flies (7211) | Green | 35% | 0.003 | Muscle contraction in wings and legs; partial homolog |
Sea Urchins (7625) | Green | 32% | 0.4 | Tube foot movement (using actomyosin system) |
Sponges (6040) | Gray | 18% | 10 | No true muscles; possible ancient cytoskeletal role |
Arabidopsis (3702) (Plant) | Gray | 20% | 8 | Cytoplasmic streaming (actin-myosin-like transport system) |
Yeast (4932) | Gray | 22% | 6 | Organelle transport, cell division via actin-myosin system |
E. coli (562) | Black | 5% | >100 | No homolog found; prokaryotes lack myosin-like proteins |
Example Observations:
- High similarity in vertebrates, suggesting MYH7’s role is critical in muscle function and has been conserved.
- Fruit flies and sea urchins have partial matches—they may use similar motor proteins for movement.
- Plants and fungi don’t have muscles, but they still use actin-myosin-like systems for internal transport.
- Bacteria (like E. coli) show no significant match, which makes sense because prokaryotes lack myosin-like motor proteins.
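The same reading of the example table can be done in a few lines of Python. The rows below are copied from the example MYH7 table. The cutoff of 1e-5 for calling a hit significant and the three category labels are assumptions made for this sketch, not thresholds set by the lab, and infinity stands in for the ">100" entry.

```python
# (clade, percent identity, E-value) copied from the example MYH7 table above.
EXAMPLE_HITS = [
    ("Primates", 99, 0.0),
    ("Marsupials", 96, 0.0),
    ("Monotremes", 94, 0.0),
    ("Birds", 85, 2e-100),
    ("Lizards", 83, 5e-90),
    ("Amphibians", 78, 3e-80),
    ("Fishes", 70, 2e-60),
    ("Fruit Flies", 35, 0.003),
    ("Sea Urchins", 32, 0.4),
    ("Sponges", 18, 10.0),
    ("Arabidopsis", 20, 8.0),
    ("Yeast", 22, 6.0),
    ("E. coli", 5, float("inf")),  # the table lists ">100"; infinity is a stand-in
]

E_VALUE_CUTOFF = 1e-5  # assumed threshold for this sketch


def classify(e_value: float, cutoff: float = E_VALUE_CUTOFF) -> str:
    """Sort a hit into one of three illustrative categories by its E-value."""
    if e_value <= cutoff:
        return "significant"
    if e_value < 1.0:
        return "borderline"
    return "not significant"


labels = {clade: classify(e_value) for clade, _, e_value in EXAMPLE_HITS}
for clade, identity, e_value in EXAMPLE_HITS:
    print(f"{clade:<12} {identity:>3}%  E = {e_value:<8g} {labels[clade]}")
```

With this cutoff the seven vertebrate clades come out as significant, fruit flies and sea urchins as borderline, and the remaining four as not significant, which mirrors the observations listed above.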
- Use BLASTp to analyze a protein’s conservation across different species.
- Interpret E-values, bar colors, and identity % to assess evolutionary relatedness.
- Hypothesize how conserved proteins may function differently in different organisms.
- Construct a hypothetical evolutionary timeline for a protein’s role across clades.
- Which species had the most conserved version of your protein? Why might this be?
- How did the protein’s hypothesized function change across evolutionary time?
- Were there any unexpected matches? What might that say about functional conservation?
- What might be the evolutionary advantage of retaining this protein function?
- Which of the following best describes a homologous protein?
  - A) Same function, different structure, no evolutionary link
  - B) Similar sequence and common ancestry (Correct Answer)
  - C) Completely different function and origin
  - D) Protein found only in prokaryotes
- What does a red bar in the BLASTp graphic summary indicate?
  - A) Low similarity, possible analog
  - B) High identity and strong alignment (Correct Answer)
  - C) Only the protein's N-terminal matched
  - D) Sequence was not found