# 16.2: Classification--Bayesian Techniques

- Page ID
- 41009

Consider the problem of identifying mitochondrial proteins. If we look at the human genome, how do we determine which proteins are involved in mitochondrial processes, or more generally which proteins are targeted to the mitochondria1?^{1} This is particularly useful because if we know the mitochondrial proteins, we can study how these proteins mediate disease processes and metabolic functions. The classification method we will look considers 7 features for all human proteins:

1. targeting signal

2. protein domains

3. co-expression

4. mass spectrometry

5. sequence homology

6. induction

7. motifs

Our overall approach will be to determine how these features are distributed for both mitochondrial and non-mitochondrial proteins. Then, given a new protein, we can apply probabilistic analysis to these seven features to decide which class it most likely falls into.

## Single Features and Bayes Rule

Let’s just focus on one feature at first. We must first assume that there is a class dependent distribution for the features. We must first derive this distribution from real data. The second thing we need is the a priori chance of drawing a sample of particular class before looking at the data. The chance of getting a particular class is simply the relative size of the class. Once we have these probabilities, we can use Bayes rule to get the probability a sample is in a particular class given the data(this is called the posterior). We have forward generative probabilities, and use Bayes rules to perform the backwards inference. Note that it is not enough to just consider the probability the feature was drawn from each class dependent distribution, because if we knew a priori that one class(say class A) is much more common than the other, then it should take overwhelming evidence that the feature was drawn from class B’s distribution for us to believe the feature was indeed from class B. The correct way to find what we need based on both evidence and prior knowledge is to use Bayes Rule:

\[ P( Class \mid feature )=\left(\frac{P(\text { feature } \mid \text { Class }) P(\text { Class })}{P(\text { feature })}\right) \nonumber \]

• Posterior : P(Class|feature)

• Prior : P(Class)

• Likelihood : P(feature|Class)

This formula gives us exactly the connection we need to flip known feature probabilities into class proba- bilities for our classifying algorithm. It lets us integrate both the likelihood we derive from our observations and our prior knowledge about how common something is. In the case of mtDNA, for example, we can estimate that mitochondrial DNA makes up something like 1500/21000 (i.e. less than 10%) of the human genome. Therefore, applying Bayes rule, our classifier should only classify a gene as mitochondrial if there is a very strong likelihood based on the observed features, since the prior probability that any gene is mitochondrial is so low.

With this rule, we can now form a maximum likelihood rule for predicting an objects class based on an observed feature. We want to choose the class that has the highest probability given the observed feature, so we will choose Class1 instead of Class2 if:

\[ \left(\frac{P(\text { feature } \mid \text { Class } 1) P(\text { Class } 1)}{P(\text { feature })}\right)>\left(\frac{P(\text { feature } \mid \text { Class } 2) P(\text { Class } 2)}{P(\text { feature })}\right) \nonumber \]

Notice that P(feature) appears on both sides, so we can cancel that out entirely, and simply choose the class with the highest value of P(feature|Class)P(Class).

Another way of looking at this is as a discriminant function: By rearranging the formulas above and taking the logarithm, we should select Class1 instead of Class2 precisely when

\[\log \left(\frac{P(\mathrm{X} \mid \mathrm{Class} 1) P(\text { Class } 1)}{P(\mathrm{X} \mid \text { Class } 2) P(\text { Class } 2)}\right)>0 \nonumber \]

In this case the use of logarithms provide distinct advantages:

1. Numerical stability

2. Easier math (its easier to add the expanded terms than multiply them)

3. Monotonically increasing discriminators.

This discriminant function does not capture the penalties associated with misclassification (in other words, is one classification more detrimental than other). In this case, we are essentially minimizing the number of misclassifications we make overall, but not assigning penalties to individual misclassifications. From examples discussed in class and in the problem set - if we are trying to classify a patient as having cancer or not, it could be argued that it is far more harmful to misclassify a patient as being healthy if they have cancer than to misclassify a patient as having cancer if they are healthy. In the first case, the patient will not be treated and would be more likely to die, whereas the second mistake involves emotional grief but no greater chance of loss of life. To formalize the penalty of misclassification we define something called a loss function,Lkf , which assigns a loss to the misclassification of an object as class j when the true class is class k (a specific example of a loss function was seen in Problem Set 2).

## Collecting Data

The preceding tells us how to handle predictions if we already know the exact probabilities corresponding to each class. If we want to classify mitochondrial proteins based on feature X, we still need ways of determining the probabilities P(mito), P(not mito), P(X|mito) and P(X|not mito). To do this, we need a training set: a set of data that is already classified that our algorithm can use to learn the distributions corresponding to each class. A **high-quality training set** (one that is both large and unbiased) is the most important part of any classifier. An important question at this point is, how much data do we need about known genes in order to build a good classifier for unknown genes? This is a hard question whose answer is not fully known. However, there are some simple methods that can give us a good estimate: when we have a fixed set of training data, we can keep a holdout set that we dont use for our algorithm, and instead use those (known) data points to test the accuracy of our algorithm when we try to classify them. By trying different sizes of training versus holdout set, we can check the accuracy curve of our algorithm. Generally speaking, we have enough training data when we see the accuracy curve flatten out as we increase the amount of training data (this indicates that additional data is likely to give only a slight marginal improvement). The holdout set is also called the test set, because it allows us to test the generalization power of our classifier.

Supposing we have already collected our training data, however, how should we model P(X|Class)? There are many possibilities. One is to use the same approach we did with clustering in the last lecture and model the feature as a Gaussian then we can follow the maximum likelihood principle to find the best center and variance. The one used in the mitochondrial study is a simple density estimate: for each feature, divide the range of possibilities into a set of bins (say, five bins per feature). Then we use the given data to estimate the probability of a feature falling into each bin for a given class. The principle behind this is again maximum likelihood, but for a multinomial distribution rather than a Gaussian. We may choose to discretize a otherwise continuous distribution because estimating a continuous distribution can be complex.

There is one issue with this strategy: what if one of the bins has zero samples in it? A probability of zero will override everything else in our formulas, so that instead of thinking this bin is merely unlikely, our classifier will believe it is impossible. There are many possible solutions, but the one taken here is to apply the Laplace Correction: add some small amount (say, one element) into each bin, to draw probability estimates slightly towards uniform and account for the fact that (in most cases) none of the bins are truly impossible. Another way to avoid having to apply the correction is to choose bins that are not too small so that bins will not have zero samples in them in practice. If you have many many points, you can have more bins, but run the risk of overfitting your training data.

## Estimating Priors

We now have a method for approximating the feature distribution for a given class, but we still need to know the relative probability of the classes themselves. There are three general approaches:

- Estimate the priors by counting the relative frequency of each class in the training data. This is prone to bias, however, since data available is often skewed disproportionately towards less common classes (since those are often targeted for special study). If we have a high-quality (representative) sample for our training data, however, this works very well.
- Estimate from expert knowledge—there may be previous estimates obtained by other methods inde- pendent of our training data, which we can then use as a first approximation in our own predictions. In other words, you might ask experts what the percentage of mitochondrial proteins are.
- Assume all classes are equally likely we would typically do this if we have no information at all about the true frequencies. This is effectively what we do when we use the maximum likelihood principle: our clustering algorithm was essentially using Bayesian analysis under the assumption that all priors are equal. This is actually a strong assumption, but when you have no other data, this is the best you can do.

For classifying mitochondrial DNA, we use method (2), since some estimates on the proportions of mtDNA were already known. But there is an complication there are more than 1 features.

## Multiple features and Naive Bayes

In classifying mitochondrial DNA, we were looking at 7 features and not just one. In order to use the preceding methods with multiple features, we would need not just one bin for each individual feature range, but one for each combination of features if we look at two features with five ranges each, thats already 25 bins. All seven features gives us almost 80,000 bins and we can expect that most of those bins will be empty simply because we dont have enough training data to fill them all. This would cause problems because zeroes cause infinite changes in the probabilities of being in one class. Clearly this approach wont scale well as we add more features, so we need to estimate combined probabilities in a better way.

The solution we will use is to assume the features are independent, that is, that once we know the class, the probability distribution of any feature is unaffected by the values of the other features. This is the Nave Bayes Assumption, and it is almost always false, but it is often used anyway for the combined reasons that it is very easy to manipulate mathematically and it is often close enough to the truth that it gives a reasonable approximation. (Note that this assumption does not say that all features are independent: if we look at the overall model, there can be strong connections between different features, but the assumption says that those connections are divided by the different classes, and that within each individual class there are no further dependencies.) Also, if you know that some features are coupled, you could learn the joint distribution in only some pairs of the features.

Once we assume independence, the probability of combined features is simply the product of the individual probabilities associated with each feature. So we now have:

\[ P\left(f_{1}, f_{2}, K, f_{N} \mid\right. Class )=P\left(f_{1} \mid\right. Class ) P\left(f_{2} \mid\right. Class ) K P\left(f_{N} \mid\right. Class ) \nonumber \]

Where f1 represents feature 1. Similarly, the discriminant function can be changed to the multiplication

of the prior probabilities:

\[ G\left(f_{1}, f_{2}, K, f_{N}\right)=\log \left(\frac{\Pi P\left(f_{1} \mid \text { Class } 1\right) P(\text { Class } 1)}{\Pi P\left(f_{1} \mid \text { Class } 2\right) P(\text { Class } 2)}\right) \nonumber \]

## Testing a classifier

A classifier should always be tested on data not contained in its training set. We can imagine in the worst case an algorithm that just memorized its training data and behaved randomly on anything else a classifier that did this would perform perfectly on its training data, but that indicates nothing about its real performance on new inputs. This is why its important to use a test, or holdout, set as mentioned earlier. However, a simple error rate doesnt encapsulate all of the possible consequences of an error. For a simple binary classifier (an object is either in or not in a single target class), there are the following for types of errors:

1. True positive (TP)

2. True negative (TN)

3. False positive (FP)

4. False negative (FN)

The frequency of these errors can be encapsulated in performance metrics of a classifier which are defined as,

- Sensitivity what fraction of objects that are in a class are correctly labeled as that class? That is, what fraction have true positive results? High sensitivity means that elements of a class are very likely to be labeled as that class. Low sensitivity means there are too many false negatives.
- Specificity what fraction of objects not in a class are correctly labeled as not being in that class? That is, what fraction have true negative results? High specificity means that elements labeled as belonging to a class are very likely to actually belong to it. Low specificity means there are too many false positives.

In most algorithms there is a **tradeoff between sensitivity and specificity**. For example, we can reach a sensitivity of 100% by labeling everything as belonging to the target class, but we will have a specificity of 0%, so this is not useful. Generally, most algorithms have some probability cutoff they use to decide whether to label an object as belonging to a class (for example, our discriminant function above). Raising that threshold increases the specificity but decreases the sensitivity, and decreasing the threshold does the reverse. The MAESTRO algorithm for classifying mitochondrial proteins (described in this lecture) achieves 99% specificity and 71% sensitivity.

## MAESTRO Mitochondrial Protein Classification

They find a class dependent distribution for each feature by creating several bins and evaluating the pro- portion of mitochondrial and non mitochondrial proteins in each bin. This lets you evaluate the **usefulness of each feature** in classification. You end up with a bunch of medium strength classifiers, but when you combine them together, you hopefully end up with a stronger classifier. Calvo et al. [1] sought to construct high-quality predictions of human proteins localized to the mitochondrion by generating and integrating data sets that provide complementary clues about mitochondrial localization. Specifically, for each human gene product p, they assign a score s_{i}(p), using each of the following seven genome-scale data sets targeting signal score, protein domain score, cis-motif score, yeast homology score, ancestry score, coexpression score, and induction score (details of each of the meaning and content of each of these data sets can be found in the manuscript). Each of these scores s_{1} − S_{7} can be used individually as a weak genome-wide predictor of mitochondrial localization. Each methods performance was assessed using large gold standard curated training sets - 654 mitochondrial proteins T_{mito} maintained by the MitoP2 database1 and 2,847 nonmitochondrial proteins T mito annotated to localize to other cellular compartments. To improve prediction accuracy, the authors integrated these eight approaches using a nave Bayes classifier that was implemented as a program called MAESTRO. So we can take several weak classifiers, and combine them to get a stronger classifier.

When MAESTRO was applied across the human proteome, 1451 proteins were predicted as mitochondrial proteins and 450 novel proteins predictions were made. As mentioned in the previous section The MAESTRO algorithm achieves a 99% specificity and a 71% sensitivity for the classification of mitochondrial proteins, suggesting that even with the assumption of feature independence, Nave Bayes classification techniques can prove extremely powerful for large-scale (i.e. genome-wide) scale classification.

^{1} Mitochondria is the energy producing machinery of cell. Very early in life, the mitochondria was engulfed by the predecessor to modern day eukaryotes, and now, we have different compartments in our cells. So the mitochonria has its own genome, but it is very depleted from its own ancestral genome - only about 11 genes remain. But there are hundreds are genes that make the mitochondria work, and these proteins are encoded by genes transcribed in the nucleus, and then transported to the mitochondria. So the goal is to figure out which proteins encoded in the genome are targeted to the mitochondria. This is important because there are many diseases associated with the mitochonria, such as aging.