9.7: The BLASTP algorithm
In BLASTP, the query sequence is broken into all possible 3-letter words using a moving window. A numerical score is calculated for each word by adding up the BLOSUM62 matrix values for its amino acids. Words with a score of 12 or more, i.e. words containing more highly conserved amino acids, are collected into the initial BLASTP search set. BLASTP next broadens the search set by adding synonyms that differ from the words at one position. Only synonyms with scores above a threshold value are added to the search set. NCBI BLASTP uses a default threshold of 10 for synonyms, but this can be adjusted by the user.
Using this search set, BLAST rapidly scans a database and identifies protein sequences that contain two or more words/synonyms from the search set. These sequences are set aside for the next phase of the BLASTP process, where the short matches serve as seeds for extended alignments in both directions from the original match. BLAST keeps a running raw score as it extends the matches. Each new amino acid either increases or decreases the raw score. Penalties are assigned for mismatches and for gaps in the alignment. In the NCBI default settings, opening a gap brings an initial penalty of 11, which increases by 1 for each missing amino acid. Once the score falls below a set level, the extension stops. Raw scores are then converted into bit scores, which normalize for the scoring matrix and gap penalties used in the search so that scores from different searches can be compared. The size of the database search space enters later, when the E-value is calculated.
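The seeding step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not NCBI's implementation: the function names `query_words`, `score`, `seed_words` and `synonyms` are made up for this example, only the BLOSUM62 rows for the residues of the example query EAGLES are included, and a synonym is accepted when its score against the original word is at or above the threshold.

```python
# Column order for the BLOSUM62 rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the residues that occur in the example query EAGLES.
# A full implementation would carry all 20 rows.
BLOSUM62_ROWS = {
    "E": [-1, 0, 0, 2, -4, 2, 5, -2, 0, -3, -3, 1, -2, -3, -1, 0, -1, -3, -2, -2],
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
    "L": [-1, -2, -3, -4, -1, -2, -3, -4, -3, 2, 4, -2, 2, 0, -3, -2, -1, -2, -1, 1],
    "S": [1, -1, 1, 0, -1, 0, 0, 0, -1, -2, -2, 0, -1, -2, -1, 4, 1, -3, -2, -2],
}


def query_words(query, k=3):
    """Slide a window of width k along the query and return every word."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def score(query_word, candidate):
    """Sum the BLOSUM62 values for each aligned pair of residues."""
    return sum(
        BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)]
        for q, c in zip(query_word, candidate)
    )


def seed_words(query, min_score=12):
    """Keep the query words whose self-score reaches min_score."""
    return [w for w in query_words(query) if score(w, w) >= min_score]


def synonyms(word, threshold=10):
    """Words differing from `word` at one position that still score
    at or above `threshold` against it (assumed inclusive cutoff)."""
    found = []
    for pos in range(len(word)):
        for aa in AMINO_ACIDS:
            if aa == word[pos]:
                continue
            candidate = word[:pos] + aa + word[pos + 1:]
            if score(word, candidate) >= threshold:
                found.append(candidate)
    return found


if __name__ == "__main__":
    for w in seed_words("EAGLES"):
        print(w, score(w, w), synonyms(w))
```

For EAGLES the words are EAG, AGL, GLE and LES, with self-scores of 15, 14, 15 and 13, so all four pass the cutoff of 12. A substitution such as DAG (score 12 against EAG) is kept as a synonym, while EAA (score 9) is not.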
Overview of the BLASTP process.
The query sequence EAGLES is broken into three-letter words or synonyms that are used as a search set against records in a protein or translated nucleotide database. See the text for additional details.
The output data from BLASTP includes a table with the bit score for each alignment as well as its E-value, or “expect score”. The E-value indicates the number of alignments with that bit score or higher that would be expected to occur solely by chance in the search space. Alignments with the highest bit scores (and lowest E-values) are listed at the top of the table. For perfect or nearly perfect matches, the E-value is reported as zero: there is essentially no possibility that the match occurs randomly. The E-value takes into account both the length of the match and the size of the database that was surveyed. A longer alignment generally earns a higher bit score and is therefore less likely to occur strictly by chance, whereas a larger database search space raises the number of chance alignments expected at any given score.
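The relationship between raw score, bit score and E-value can be made concrete with the standard Karlin-Altschul formulas, S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'). The sketch below is illustrative only: the function names are hypothetical, the values of λ and K are assumed figures of the kind quoted for BLOSUM62 with gap costs of 11 and 1, and the length corrections that BLAST applies to the query and database are ignored.

```python
import math

# Assumed statistical parameters for illustration; real values depend on
# the scoring matrix and gap penalties and are reported in the BLAST output.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw score so it no longer depends on the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Number of chance alignments expected with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


if __name__ == "__main__":
    bits = bit_score(100)
    for db_size in (1_000_000, 100_000_000):
        print(db_size, expect_value(bits, 250, db_size))
```

In this model each additional bit halves the E-value, and doubling the size of the database doubles it, which is why the same alignment can look more or less significant depending on the database searched.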
In some cases, the alignment may not extend along the entire length of the protein, or there may be gaps between aligned regions of the sequences. “Max score” is the bit score for the aligned region with the highest score. “Total score” adds the bit scores for all aligned regions. When the alignment consists of a single aligned region, the total and max scores are the same. The “Query cover” refers to the fraction of the query sequence where the alignment score is above the threshold value. BLASTP also reports the percentage of aligned amino acids that are identical in the two sequences as “Ident”.
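The summary columns can be illustrated with a short calculation over the aligned regions of one database hit. This is a sketch under assumptions, not BLASTP's own code: the `AlignedRegion` record and the `summarize` function are hypothetical, query cover is taken to be the share of query positions that fall inside at least one aligned region, and the identity is computed for the highest-scoring region only.

```python
from dataclasses import dataclass


@dataclass
class AlignedRegion:
    """One aligned region between the query and a database sequence."""
    bit_score: float
    query_start: int    # 1-based, inclusive
    query_end: int      # 1-based, inclusive
    identities: int     # aligned positions with identical amino acids
    align_length: int   # number of aligned positions, gaps included


def summarize(regions, query_length):
    """Return max score, total score, query cover (%) and identity (%)."""
    best = max(regions, key=lambda r: r.bit_score)
    covered = set()
    for r in regions:
        # Overlapping regions are counted once because positions are a set.
        covered.update(range(r.query_start, r.query_end + 1))
    return {
        "max_score": best.bit_score,
        "total_score": sum(r.bit_score for r in regions),
        "query_cover": 100.0 * len(covered) / query_length,
        "ident": 100.0 * best.identities / best.align_length,
    }


if __name__ == "__main__":
    hit = [
        AlignedRegion(120.0, 1, 100, 80, 100),
        AlignedRegion(45.5, 151, 200, 25, 50),
    ]
    print(summarize(hit, query_length=250))
```

With these made-up regions, the max score is 120.0, the total score is 165.5, the query cover is 60% (150 of 250 positions) and the identity of the best region is 80%.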