# 16.3: Classification Support Vector Machines


The previous section looked at using probabilistic (or generative) models for classification; this section looks at using discriminative techniques. In essence, can we run our data through a function to determine its structure? Such discriminative techniques avoid the inherent cost of generative models, which may require more information than is actually necessary.

Support vector machine techniques essentially involve drawing a vector that is perpendicular to the line (hyperplane) separating the training data. The approach is to use the training data to obtain a separating hyperplane such that the two classes of data lie on different sides of it. There are, in general, many hyperplanes that can separate the data, so we want to draw the hyperplane that separates the data the most; that is, we wish to choose the hyperplane that maximizes the distance from the hyperplane to the nearest data point. In other words, the SVM is a maximum margin classifier. You can think of the hyperplane as being surrounded by margins of equal size on each side, with no data points inside the margin on either side. We want to draw the hyperplane that allows us to draw the largest margin. Note that once the separating hyperplane and margin are determined, some data points will lie right on the boundary of the margin. These are the data points that keep us from expanding the margin any further, and thus they determine the hyperplane and margin; such points are called the support vectors. If we add new data points outside the margin, or remove points that are not support vectors, the maximum margin we can achieve with any hyperplane does not change.

Suppose that the vector perpendicular to the hyperplane is w, and that the hyperplane lies at distance \(\frac{b}{|w|}\) from the origin. Then a point x is classified as being in the positive class if w · x is greater than b, and in the negative class otherwise. It can be shown that the optimal w, that is, the normal of the hyperplane that achieves the maximum margin, can actually be written as a linear combination of the data vectors, \(\sum_i a_i x_i\). Then, to classify a new data point x, we take the dot product of w with x to arrive at a scalar. Notice that this scalar, \(\sum_i a_i (x_i \cdot x)\), depends only on the dot products between x and the training vectors \(x_i\). Furthermore, it can be shown that finding the maximum-margin hyperplane for a set of training points amounts to solving an optimization problem (a quadratic program) whose objective function depends only on the dot products of the training points with each other. This is good news because it tells us that the complexity of solving that optimization problem is independent of the dimension of the data points: if we precompute the pairwise dot products of the training vectors, the dimensionality of the data makes no difference to the running time.
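
The dual-form classification rule above can be sketched in a few lines of code. The support vectors, coefficients a_i, and threshold b below are made-up toy values (chosen so that w = Σ a_i x_i = (1, 0) and b = 0, i.e. the separating hyperplane is the vertical axis), not the output of an actual optimizer:

```python
# Sketch: classifying with an SVM in dual form. The support vectors,
# dual coefficients a_i (with class labels absorbed), and threshold b
# are hypothetical toy values, not fitted ones.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

support_vectors = [(1.0, 0.0), (-1.0, 0.0)]
alphas = [0.5, -0.5]   # so that w = sum_i a_i * x_i = (1, 0)
b = 0.0

def decision(x):
    # sum_i a_i * (x_i . x) -- depends only on dot products with training points
    return sum(a * dot(sv, x) for a, sv in zip(alphas, support_vectors))

def classify(x):
    return +1 if decision(x) > b else -1

print(classify((2.0, 3.0)))   # point right of the vertical axis -> 1
print(classify((-1.0, 5.0)))  # point left of the vertical axis -> -1
```

Note that `decision` never needs the coordinates of w itself, only dot products with the training vectors, which is what makes the kernel substitution in the next section possible.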

## Kernels

We saw that SVMs depend only on the dot products of the vectors. So, if we call our transformation φ(v), then for two vectors we only care about the value of φ(v_{1}) · φ(v_{2}). The trick to using kernels is to realize that for certain transformations φ, there exists a function K(v_{1}, v_{2}) such that:

K(v_{1}, v_{2}) = φ(v_{1}) · φ(v_{2})

In the above relation, the right-hand side is a dot product of vectors in a very high-dimensional space, but the left-hand side is a function of two vectors in the original, lower-dimensional space. In our previous example of mapping x → (x, x^{2}), we get

K(x_{1}, x_{2}) = (x_{1}, x^{2}_{1}) · (x_{2}, x^{2}_{2}) = x_{1}x_{2} + (x_{1}x_{2})^{2}

Note that we did not actually apply the transformation φ: we can do all our calculations in the lower-dimensional space, yet get all the power of using a higher dimension.
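
This identity is easy to check numerically. The sketch below compares the kernel K(x_{1}, x_{2}) = x_{1}x_{2} + (x_{1}x_{2})^{2} against the explicit dot product of the transformed points φ(x) = (x, x^{2}) for a few sample values:

```python
# Verify that K(x1, x2) = x1*x2 + (x1*x2)**2 equals the dot product of
# the explicitly transformed points phi(x) = (x, x**2).

def phi(x):
    return (x, x ** 2)

def K(x1, x2):
    return x1 * x2 + (x1 * x2) ** 2

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (4.0, -4.0)]:
    assert abs(K(x1, x2) - dot(phi(x1), phi(x2))) < 1e-9

print(K(1.0, 2.0))  # 1*2 + (1*2)^2 = 6.0
```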

Example kernels are the following:

- Linear kernel: K(v_{1}, v_{2}) = v_{1} · v_{2}, which represents the trivial mapping φ(x) = x.
- Polynomial kernel: K(v_{1}, v_{2}) = (1 + v_{1} · v_{2})^{n}, which was used in the previous example with n = 2.
- Radial basis kernel: K(v_{1}, v_{2}) = exp(−β|v_{1} − v_{2}|^{2}). This transformation actually maps a point v_{1} to a function (which can be thought of as a point in Hilbert space) in an infinite-dimensional space. So what we are actually doing is transforming our training set into functions, and combining them to get a decision boundary. The functions are Gaussians centered at the input points.
- Sigmoid kernel: K(v_{1}, v_{2}) = tanh[β(v_{1}^{T}v_{2} + r)]. Sigmoid kernels have been popular for use in SVMs due to their origin in neural networks (e.g., sigmoid kernel functions are equivalent to two-layer perceptron neural networks). It has been pointed out in previous work (Vapnik 1995) that the kernel matrix may not be positive semi-definite for certain values of the parameters β and r. The sigmoid kernel has nevertheless been used in practical applications [2].
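
The four kernels above translate directly into code. The sketch below writes each one from its formula, representing vectors as tuples; the parameter values (β, r, n) are free choices, not defaults from any particular library:

```python
import math

# The four example kernels, written directly from their formulas.
# beta, r, and n are free parameters chosen by the user.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear_kernel(v1, v2):
    return dot(v1, v2)

def polynomial_kernel(v1, v2, n=2):
    return (1 + dot(v1, v2)) ** n

def rbf_kernel(v1, v2, beta=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-beta * sq_dist)

def sigmoid_kernel(v1, v2, beta=1.0, r=0.0):
    return math.tanh(beta * dot(v1, v2) + r)

v = (1.0, 2.0)
print(rbf_kernel(v, v))                    # identical points -> exp(0) = 1.0
print(polynomial_kernel((1.0,), (2.0,)))   # (1 + 1*2)^2 = 9
```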

Here is a specific example of a kernel function. Consider the two classes of one-dimensional data:

{−5, −4, −3, 3, 4, 5} and {−2, −1, 0, 1, 2}

This data is clearly not linearly separable, and the best separation boundary we can find might be x > −2.5. Now consider applying the transformation x → (x, x^{2}). The data can now be written as new pairs,

{−5, −4, −3, 3, 4, 5} → {(−5, 25), (−4, 16), (−3, 9), (3, 9), (4, 16), (5, 25)}

and

{−2, −1, 0, 1, 2} → {(−2, 4), (−1, 1), (0, 0), (1, 1), (2, 4)}

This data is separable by the rule y > 6.5, and in general, the more dimensions we transform the data into, the more separable it becomes.
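
The worked example above can be checked directly: after the transformation, every point in the first class sits above y = 6.5 and every point in the second class sits below it.

```python
# Check on the worked example: after phi(x) = (x, x**2), the rule y > 6.5
# separates the two one-dimensional classes.

outer = [-5, -4, -3, 3, 4, 5]   # first class
inner = [-2, -1, 0, 1, 2]       # second class

transformed_outer = [(x, x ** 2) for x in outer]
transformed_inner = [(x, x ** 2) for x in inner]

assert all(y > 6.5 for _, y in transformed_outer)  # smallest is y = 9
assert all(y < 6.5 for _, y in transformed_inner)  # largest is y = 4
```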

An alternate way of thinking of this problem is to transform the classifier back into the original low-dimensional space. In this particular example, we would get the rule x^{2} < 6.5, which splits the number line at two points. In general, the higher the dimensionality of the space we transform to, the more complicated a classifier we get when we transform back to the original space.

One of the caveats of transforming the input data using a kernel is the risk of overfitting (or over-classifying) the data. More generally, the SVM may generate so many feature vector dimensions that it does not generalize well to other data. To avoid overfitting, cross-validation is typically used to evaluate the fit provided by each parameter set tried during the grid or pattern search process. In the radial basis kernel, for instance, you can essentially increase the value of β until each point is within its own classification region (thereby defeating the classification process altogether). SVMs generally resist overfitting because they maximize the margin between data points.
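
The effect of an extreme β on the radial basis kernel can be seen numerically: for a large enough β, the kernel value between any two distinct points collapses toward zero, so each training point effectively only "sees" itself. The points and β values below are illustrative choices:

```python
import math

# How a large beta makes the radial basis kernel degenerate: two distinct
# training points, distance 1 apart, become nearly invisible to each other.

def rbf_kernel(v1, v2, beta):
    return math.exp(-beta * sum((a - b) ** 2 for a, b in zip(v1, v2)))

x1, x2 = (0.0,), (1.0,)

# Moderate beta: the points still influence each other.
print(rbf_kernel(x1, x2, beta=1.0))    # exp(-1), about 0.37

# Very large beta: the kernel is ~0 between distinct points and 1 only at
# the point itself, so every point gets its own tiny classification region.
print(rbf_kernel(x1, x2, beta=100.0))  # exp(-100), effectively 0
```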

When using difficult-to-separate training sets, SVMs can incorporate a cost parameter C to allow some flexibility in separating the categories. This parameter controls the trade-off between allowing training errors and forcing rigid margins. It thereby creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well.
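
One common way to write this trade-off (an assumption here, since the text does not give the formula) is the soft-margin objective (1/2)|w|^{2} + C Σ max(0, 1 − y_{i}(w · x_{i} − b)), where the second term sums the hinge losses of margin violations. The sketch below evaluates that objective on a fixed, hand-picked hyperplane for toy 1D data containing one outlier, showing how a larger C makes the single violating point dominate the cost:

```python
# Sketch of the soft-margin objective (1/2)|w|^2 + C * sum(hinge losses)
# for 1D data. The hyperplane (w=1, b=0) and data are hypothetical toy
# values; a real SVM would optimize w and b rather than fix them.

def soft_margin_objective(w, b, data, C):
    hinge = sum(max(0.0, 1.0 - y * (w * x - b)) for x, y in data)
    return 0.5 * w ** 2 + C * hinge

# (x, label) pairs; the point (-0.5, +1) sits on the wrong side of the margin.
data = [(2, +1), (3, +1), (-2, -1), (-3, -1), (-0.5, +1)]

low_C = soft_margin_objective(1.0, 0.0, data, C=0.1)    # outlier tolerated
high_C = soft_margin_objective(1.0, 0.0, data, C=10.0)  # outlier dominates
print(low_C, high_C)  # 0.65 vs 15.5
```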

Can we use just any function as our kernel? The answer is provided by Mercer's condition, which gives us an analytical criterion for choosing an acceptable kernel. Mercer's condition states that a kernel K(x, y) is a valid kernel if and only if the following holds: for any g(x) such that \( \int g(x)^{2} dx \) is finite, we have:

\[\iint K(x, y)\, g(x)\, g(y)\, dx\, dy \geq 0 \nonumber \] [3]
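
A finite-dimensional consequence of Mercer's condition is that the kernel (Gram) matrix of any set of points must be positive semi-definite. For a 2×2 symmetric matrix, that holds exactly when its trace and determinant are both non-negative, which gives a quick numeric sanity check (sketched here for the radial basis kernel on two arbitrary points):

```python
import math

# Finite-dimensional analogue of Mercer's condition: the Gram matrix of a
# valid kernel is positive semi-definite. A symmetric 2x2 matrix is PSD
# iff trace >= 0 and determinant >= 0.

def rbf_kernel(x, y, beta=1.0):
    return math.exp(-beta * (x - y) ** 2)

x1, x2 = 0.0, 1.0
K = [[rbf_kernel(x1, x1), rbf_kernel(x1, x2)],
     [rbf_kernel(x2, x1), rbf_kernel(x2, x2)]]

trace = K[0][0] + K[1][1]
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
print(trace >= 0 and det >= 0)  # True: this Gram matrix is PSD
```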

In all, we have defined SVM discriminators and shown how to perform classification with appropriate kernel mapping functions, which allow us to perform computations in a lower dimension while still capturing all the information available in higher dimensions. The next section describes the application of SVMs to the classification of tumors for cancer diagnostics.