The results obtained in a BLASTP search depend on the scoring matrix used to assign numerical values to different words. A variety of BLOSUM (BLOcks SUbstitution Matrix) matrices are available; the best choice depends on whether the sequences being compared are more or less divergent. BLOSUM62 is the default scoring matrix for BLASTP. It was developed by analyzing the frequencies of amino acid substitutions in blocks, which are conserved, ungapped alignments of related proteins. Within each block, sequences that were at least 62% identical were clustered and counted as a single sequence, so that closely related sequences would not dominate the counts. Investigators then computationally determined the frequencies of all amino acid substitutions observed in these conserved blocks and used these data to construct the BLOSUM62 scoring matrix. The BLOSUM62 score for a particular substitution is a log-odds score that compares the observed frequency of the substitution in related proteins with the frequency expected by chance. For a substitution of amino acid i for amino acid j, the score is expressed as:
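$$S_{ij} = \frac{1}{\lambda}\,\log\left(\frac{P_{ij}}{q_i\,q_j}\right)$$

where $P_{ij}$ is the frequency of the substitution in homologous proteins, and $q_i$ and $q_j$ are the frequencies of amino acids i and j in the database. The term $1/\lambda$ is a scaling factor; the scaled scores are rounded to give the integer values in the matrix.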
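The log-odds calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build the published matrix: the function name and the input frequencies are hypothetical, and the default scaling factor assumes scores reported in half-bit units (λ = ln 2 / 2), the convention usually quoted for BLOSUM62.

```python
import math

# Assumed scaling: lambda = ln(2) / 2, so scores come out in half-bit units.
HALF_BIT_LAMBDA = math.log(2) / 2


def log_odds_score(p_ij, q_i, q_j, lam=HALF_BIT_LAMBDA):
    """Return the rounded log-odds score (1/lam) * ln(p_ij / (q_i * q_j)).

    p_ij : observed frequency of the i/j pair in related proteins
    q_i  : background frequency of amino acid i
    q_j  : background frequency of amino acid j
    """
    odds = p_ij / (q_i * q_j)              # observed vs. chance expectation
    return round((1 / lam) * math.log(odds))


# Hypothetical frequencies, chosen only to make the arithmetic easy to follow.
# A pair seen 8 times more often than chance gives a positive score.
print(log_odds_score(0.02, 0.05, 0.05))      # prints 6
# A pair seen half as often as chance gives a negative score.
print(log_odds_score(0.00125, 0.05, 0.05))   # prints -2
```

A positive score therefore means the pair occurs more often in related proteins than chance predicts, and a negative score means it occurs less often.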
The BLOSUM62 matrix on the following page is consistent with strong evolutionary pressure to conserve protein function. As expected, each amino acid scores highest when it is aligned with itself. Overall, positive scores (shaded) are less common than negative scores, suggesting that most substitutions negatively affect protein function. The most highly conserved amino acids are tryptophan, cysteine and histidine, which have the highest self-substitution scores (11, 9 and 8, respectively). Interestingly, these amino acids have unique chemistries and often play important structural or catalytic roles in proteins.
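The ranking of the most conserved amino acids can be checked directly from the diagonal of the matrix. The sketch below hard-codes the 20 self-substitution scores of the standard BLOSUM62 matrix and sorts them; the dictionary and function names are our own and are used only for illustration.

```python
# Diagonal (self-substitution) scores of the standard BLOSUM62 matrix,
# keyed by one-letter amino acid code.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9,
    "Q": 5, "E": 5, "G": 6, "H": 8, "I": 4,
    "L": 4, "K": 5, "M": 5, "F": 6, "P": 7,
    "S": 4, "T": 5, "W": 11, "Y": 7, "V": 4,
}


def most_conserved(diagonal, n=3):
    """Return the n (residue, score) pairs with the highest diagonal scores."""
    ranked = sorted(diagonal.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


print(most_conserved(BLOSUM62_DIAGONAL))
# prints [('W', 11), ('C', 9), ('H', 8)]
```

Tryptophan (W), cysteine (C) and histidine (H) head the list, in agreement with the matrix.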