BLASTP gives a pairwise alignment of sequences that is very useful for identifying homologs. Multiple sequence alignments compare a larger number of sequences simultaneously. By comparing a larger number of sequences over a wider evolutionary range, multiple sequence alignments allow researchers to identify regions of a protein that are most highly conserved, and therefore, more likely to be important for the function of a protein. In this exercise, we will study conservation of protein sequences in a number of model organisms that are widely used in genetic studies. The genomes for model organisms have been sequenced, and techniques for genetic analysis are well-developed. In addition, database and clone resources are available to support research with model organisms. The organisms below have been selected because they represent important branches of evolution and because they are potential candidates for future research in this course.
Escherichia coli strain K-12 (gram negative; K-12 is the standard laboratory strain)
Bacillus subtilis strain 168 (gram positive reference strain)
Saccharomyces cerevisiae - needs to be included in trees and alignments!
Schizosaccharomyces pombe
Arabidopsis thaliana - thale cress; model organism for flowering plants
Caenorhabditis elegans - nematode model organism used in developmental studies
Mus musculus - laboratory mouse
Collect the sequences and BLAST data
The first step in a multiple sequence alignment is to collect the sequence data and analyze the BLASTP data that compare each sequence with the S. cerevisiae sequence. We will be using the reference sequences for the organisms, which have accession numbers beginning with NP_. Since you already know how to find NP_ records and use BLASTP, we will take some shortcuts to find the remaining numbers and BLASTP statistics. For the eukaryotic sequences, we will use BLASTP data that are already available in NCBI’s Homologene database (Sayers et al., 2012). The accession numbers for the bacterial species will be available on Canvas and in the lab.
Access Homologene at: http://www.ncbi.nlm.nih.gov/homologene
Click on Release Statistics to see the species that have been included in the BLASTP searches. Enter the name of your gene into the search box. This brings up the various Homologene groups that have a gene with that name. If the search brings you to a page that lists more than one Homologene group, click on the Homologene group that contains the S. cerevisiae gene.
Record the accession number for the Homologene group:
The top line of a Homologene record provides the accession number and summarizes the taxonomic distribution of homologs in eukaryotes (“Gene conserved in _________”). A narrowly conserved protein might only be found in the Ascomycota, while a widely distributed protein would be found throughout the Eukaryota.
What phylogenetic divisions have homologs of your gene?
The left column of each Homologene record has links to comprehensive gene summaries prepared by NCBI curators. The right column has links to the NP_ records and a graphic showing conserved domains in the homologs. (Domains are denoted with different colors.)
How many domains are found in the S. cerevisiae protein? Are the domains equally well-conserved between species?
Record the NP_ numbers for homologs of your S. cerevisiae Metp protein in S. pombe, A. thaliana, C. elegans and M. musculus. Add the NP_ numbers for E. coli and B. subtilis homologs from the posted data sheet. (Some bacterial records may have XP_ or ZP_ prefixes, because the proteins have not been studied experimentally.) If you have fewer than five entries, e.g. because the protein is narrowly restricted to the Ascomycota, add two additional species from the Homologene group that contains your
Does the S. pombe ortholog of your MET gene have a different name? You will need this information later in this chapter.
Next, perform a pairwise BLASTP alignment for each sequence against the S. cerevisiae sequence. Collecting BLASTP data is easy with Homologene: use the grey box in the lower part of the page to set up each BLASTP comparison. Record the total score, % coverage and E-value for each match.
In the next step, you will prepare a multiple sequence alignment using the sequence information in the NP___ records. Using the BLASTP data, it may be possible to exclude some sequences from further study. The best matches will have high total scores and % coverage (fraction of the two proteins that are aligned) and low E-values. For the rest of this assignment, exclude sequences where the total score is less than 100 and E-values are greater than 1E-10.
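The exclusion rule above can be sketched in a few lines of Python. This is only an illustration: the `BlastHit` record, the function names and the numbers in the example are hypothetical placeholders, not data from Homologene. The sketch follows the wording of the rule literally (a sequence is excluded when its total score is below 100 and its E-value is above 1E-10); if your instructor intends the stricter reading, replace `and` with `or`.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One pairwise BLASTP result recorded against the S. cerevisiae sequence."""
    species: str
    total_score: float
    coverage_pct: float
    e_value: float


def should_exclude(hit, min_score=100.0, max_e_value=1e-10):
    """Literal reading of the rule: low total score AND high E-value."""
    return hit.total_score < min_score and hit.e_value > max_e_value


def keep_hits(hits):
    """Return only the hits that pass the exclusion rule."""
    return [hit for hit in hits if not should_exclude(hit)]


# Made-up numbers, used only to show how the filter behaves.
example_hits = [
    BlastHit("S. pombe", total_score=450.0, coverage_pct=95.0, e_value=1e-150),
    BlastHit("E. coli", total_score=42.0, coverage_pct=30.0, e_value=0.003),
]

for hit in keep_hits(example_hits):
    print(hit.species, hit.total_score, hit.e_value)
```

Recording the three statistics for every species in one table like this makes it easier to justify which sequences you carry forward into the alignment.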
Prepare the multiple sequence alignment.
We will use the Phylogeny suite of programs to construct a multiple sequence alignment and phylogenetic tree. Phylogeny describes itself as providing “Robust Phylogenetic Analysis for the Non-Specialist.” You will be working with material at two different sites, so you need two operational browser pages. One browser tab should remain at NCBI, where you will retrieve records. Direct the other browser page to http://www.phylogeny.fr
- Under the Phylogeny Analysis tab, select One Click. After you enter the data, your sequences will be automatically brought through multiple alignment and phylogenetic tree building algorithms. The advanced option on this page would allow you to adjust the parameters associated with each program. We will let Phylogeny make these decisions for us!
- Enter the protein sequence in FASTA format. To obtain a FASTA file, enter the NP_ number into the search box of the NCBI Protein Database. (Alternatively, you can click through to the NP_ record from the Homologene summary page.) The first sequence in your analysis should be the S. cerevisiae protein. Click the FASTA link at the upper left side of the NP_ record. Copy the title line, beginning with >, and the entire amino acid sequence. Paste the FASTA sequence DIRECTLY into the Phylogeny text box. Repeat this step with each of the sequences that you would like to compare.
- Edit the title lines of the FASTA files to include ONLY the species name. (You will see why later!) Each FASTA title line must begin with a > symbol (bird-beak) and end with a hard return. These characters provide the punctuation for the computer. DO NOT use a text editor or word processor to edit the FASTA files, since these introduce hidden punctuation that interferes with the phylogenetic analysis.
- When you are finished, enter your email address (this is useful if you want to come back to your analysis in the next few days) and click the Submit button. Your results will be posted on a web page.
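If you prefer to prepare the title lines with a script rather than by hand, the idea can be sketched as follows. This is a minimal sketch, not part of the Phylogeny workflow: it assumes that each NCBI title line ends with the organism name in square brackets, it keeps only the genus and species words, and it joins them with an underscore (an assumption made here to avoid spaces in the names). The function name and the example record are hypothetical.

```python
import re


def retitle_fasta(fasta_text):
    """Replace each FASTA title line with '>Genus_species'.

    Assumes the title line ends with '[Organism name ...]'.
    Sequence lines are passed through with surrounding whitespace removed.
    """
    output_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = re.search(r"\[([^\[\]]+)\]\s*$", line)
            if match is None:
                raise ValueError("No [organism] found in title line: " + line)
            # Keep only the first two words (genus and species),
            # dropping strain names.
            words = match.group(1).split()[:2]
            output_lines.append(">" + "_".join(words))
        elif line.strip():
            output_lines.append(line.strip())
    # Every line, including the last, ends with a hard return.
    return "\n".join(output_lines) + "\n"


# Made-up record: the accession number and sequence are placeholders.
example = (
    ">NP_000000.0 hypothetical protein [Saccharomyces cerevisiae S288C]\n"
    "MKTAYIAKQR\n"
    "QISFVKSHFS\n"
)
print(retitle_fasta(example), end="")
```

The output is plain text with simple line endings, which avoids the hidden punctuation that word processors can introduce.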
Export and print the multiple sequence alignment
- Click on the Alignment tab to view the multiple sequence alignment.
- Under outputs, ask for the alignment in ClustalW format. The ClustalW alignment appears on a new web page. In the bottom line of each cluster, an asterisk marks a position where the amino acid is invariant, and a colon marks a position where the amino acids are conserved.
- Right-click on the page and download the Clustal alignment with a new filename that makes sense to you. The page will download as a text file that you will open in Word or a text editor.
- Open the file in a word processor. Adjust the font size and page breaks so that sequences are properly aligned and all members of a cluster fit on the same page. Choose a non-proportional font such as Courier so that the amino acids line up properly.
- Print the file and check that the format is correct! Turn it in with the Phylogeny assignment.
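To get a quick numerical summary of conservation from the downloaded file, the asterisks and colons in the bottom line of each cluster can be counted. The sketch below is one possible way to do this and rests on assumptions about the layout: the first line of the file is the CLUSTAL header, clusters are separated by blank lines, sequence lines start with the sequence name, and the conservation line starts with spaces. The function name and the toy alignment are hypothetical.

```python
def conservation_summary(clustal_text):
    """Count invariant (*) and conserved (:) positions in a Clustal alignment."""
    counts = {"*": 0, ":": 0}
    seq_start = None  # column where the residues begin in the current cluster
    seq_len = 0       # number of alignment columns in the current cluster
    for line in clustal_text.splitlines()[1:]:  # skip the CLUSTAL header line
        if not line.strip():
            seq_start = None  # a blank line ends the current cluster
            continue
        if not line[0].isspace():
            # Sequence line: name, whitespace, aligned residues.
            name, residues = line.split()[:2]
            seq_start = line.index(residues, len(name))
            seq_len = len(residues)
        elif seq_start is not None:
            # Conservation line: read the marks under the residue columns.
            for mark in line[seq_start:seq_start + seq_len]:
                if mark in counts:
                    counts[mark] += 1
    return counts


# Toy alignment with made-up sequences, built so the columns line up.
toy_alignment = "\n".join([
    "CLUSTAL W multiple sequence alignment",
    "",
    "S_cerevisiae".ljust(16) + "MKTAY",
    "S_pombe".ljust(16) + "MKSAF",
    " " * 16 + "**:*",
    "",
    "S_cerevisiae".ljust(16) + "GG",
    "S_pombe".ljust(16) + "GA",
    " " * 16 + "*",
    "",
])
print(conservation_summary(toy_alignment))
```

Comparing these counts with the domain graphic in the Homologene record is one way to check whether the most conserved positions fall within the annotated domains.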
Construct a phylogenetic tree.
- Click the Tree Rendering tab to access your phylogenetic tree.
- You may use the editing tools to alter the appearance of your tree. Pay particular attention to the legends in the “leaves” of the tree, which should have the species names.
- Download the file in a format of your choice. Print the file and turn it in with the phylogeny assignment.
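If you choose to download the tree as text in Newick format, a short script can confirm that every species you submitted appears as a leaf, including S. cerevisiae. This is a rough sketch under that assumption; the function name and the example tree are hypothetical, and the regular expression only handles simple, unquoted leaf names.

```python
import re


def newick_leaves(newick_text):
    """Return leaf names from a simple Newick string.

    A leaf name is taken to be any label that directly follows '(' or ','.
    Labels that follow ')' (internal nodes, support values) are ignored.
    """
    names = re.findall(r"[(,]\s*([^(),:;]+)", newick_text)
    return [name.strip() for name in names]


# Made-up tree with placeholder branch lengths.
example_tree = "((S_cerevisiae:0.1,S_pombe:0.2)0.95:0.05,M_musculus:0.3);"
leaves = newick_leaves(example_tree)
print(leaves)

if "S_cerevisiae" not in leaves:
    print("Warning: S. cerevisiae is missing from the tree")
```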