# SVM Classification
## Overview
Support Vector Machines (SVMs) classify data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM is generally the one with the largest margin between the two classes. The margin (often represented by epsilon) is determined by the support vectors, the data points closest to the hyperplane, which lie on the lines parallel to the hyperplane at epsilon distance. The margin is therefore the maximal width, parallel to the hyperplane, that contains no interior data points.
Today, support vector machines are used to model complex, real-world problems such as text and image classification, handwriting recognition, and bioinformatics (or biosequence) analysis.
Note: when the data space has n dimensions, the hyperplane has n-1 dimensions. For a trivial example, pick any point on a line: the chosen point divides the line into two parts. Similarly, pick any line in a two-dimensional plane: the chosen line divides the plane into two parts.
### Inseparable Data
Though the supplied data is often assumed to be separable by a hyperplane, there are many cases where it is not. For example, consider a circle in a two-dimensional space, with dots along the perimeter in alternating colors (red, blue). There is no hyperplane (a line, in two dimensions) that can separate the dots.
Now, consider another (simpler) inseparable example: a one-dimensional line with red dots clustered near the origin and blue dots farther out on either side. In one dimension the hyperplane is a single point, and no single point separates the two groups. However, if the data is projected into a two-dimensional space with f: x -> (x, x^2), a separating hyperplane (an ordinary line) can be found. The projected (higher-dimensional) space can therefore be used by the support vector machine to determine a separating hyperplane. The function (f) responsible for the transformation is called the kernel function.
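As a minimal sketch of this idea (the data points below are made up, with one class near the origin and the other farther out), the projection makes a linear separation possible:

```python
import numpy as np
from sklearn import svm

# Hypothetical 1-D data: class 0 lies near the origin, class 1 farther out.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.0, 3.5])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# No single point (a 0-D hyperplane) separates the classes on the line,
# but the projection f: x -> (x, x^2) makes them linearly separable.
X_projected = np.column_stack((x, x ** 2))

clf = svm.SVC(kernel='linear')
clf.fit(X_projected, y)
print(clf.predict(X_projected))  # matches y: the projected data is separable
```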
### Kernel Functions
Support vector machines generally implement kernel functions to transform a given vector space into a higher dimension. This process (as discussed earlier) resolves cases of inseparable data. The scikit-learn library always defines a default kernel; for example, the `SVC` class implements the `rbf` kernel by default.
Note: this wiki assumes the use of the `SVC` class. However, scikit-learn provides additional classes, including `NuSVC` and `LinearSVC`.
Though each scikit-learn class defines a default kernel, a different kernel can always be specified explicitly. For example, the following kernels are supported within the library:
- linear: `svc = svm.SVC(kernel='linear')`
- polynomial: requires a polynomial degree, `svc = svm.SVC(kernel='poly', degree=3)`
- radial basis function (rbf): default for the `SVC` class, `svc = svm.SVC(kernel='rbf')`
- sigmoid: a function having an "S"-shaped curve, `svc = svm.SVC(kernel='sigmoid')`
- custom: a custom kernel can be defined either via a Python function, or by precomputing the Gram matrix
In general, each of the above kernels accepts additional keyword arguments. But in most cases, since the `SVC` class defines default values, the additional arguments can be omitted.
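As a rough illustration (the parameter values below are arbitrary, not recommendations), the extra keyword arguments could be passed explicitly like this:

```python
from sklearn import svm

# Explicit keyword arguments; the values here are arbitrary, for illustration only.
svc_rbf = svm.SVC(kernel='rbf', C=1.0, gamma=0.1)       # gamma controls the width of the rbf kernel
svc_poly = svm.SVC(kernel='poly', degree=3, coef0=1.0)  # degree and coef0 apply to the 'poly' kernel
svc_sig = svm.SVC(kernel='sigmoid', coef0=0.0)          # coef0 also applies to the 'sigmoid' kernel
```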
Note: a kernel simply transforms the vector space, while the SVM determines the hyperplane by finding the support vectors with the largest epsilon distance to the hyperplane.
Note: implementing a kernel other than the linear kernel is commonly referred to as the kernel trick.
#### Custom Kernels
Python function used for a custom SVM kernel:
```python
import numpy as np
from sklearn import svm

def my_kernel(x, y):
    # a simple linear kernel: the dot product of the two sample matrices
    return np.dot(x, y.T)

# X and Y stand in for the training samples and labels (example data shown here)
X = np.array([[0, 0], [1, 1]])
Y = [0, 1]

clf = svm.SVC(kernel=my_kernel)
clf.fit(X, Y)
```
Note: `y.T` gives the transpose of `y`, and is required to define the above linear kernel.
Gram matrix used for a custom SVM kernel:
```python
import numpy as np
from sklearn import svm

X = np.array([[0, 0], [1, 1]])
y = [0, 1]
clf = svm.SVC(kernel='precomputed')

# linear kernel computation
gram = np.dot(X, X.T)
clf.fit(gram, y)
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
#     gamma=0.0, kernel='precomputed', max_iter=-1, probability=False,
#     random_state=None, shrinking=True, tol=0.001, verbose=False)

# predict on training examples
clf.predict(gram)
# array([0, 1])
```
Note: when using the Gram matrix, remember to set `kernel='precomputed'`, and pass the Gram matrix instead of X in the fit method.
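Note also that predicting on new samples with a precomputed kernel requires a kernel matrix between the test samples and the training samples. A minimal sketch, with made-up test points:

```python
import numpy as np
from sklearn import svm

X_train = np.array([[0, 0], [1, 1]])
y_train = [0, 1]
X_test = np.array([[0.1, 0.2], [0.9, 0.8]])  # made-up test points

clf = svm.SVC(kernel='precomputed')
clf.fit(np.dot(X_train, X_train.T), y_train)  # Gram matrix of the training data

# kernel values between test and training samples: shape (n_test, n_train)
gram_test = np.dot(X_test, X_train.T)
print(clf.predict(gram_test))  # expected: array([0, 1])
```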
#### Kernel Suggestion
Sometimes, choosing a kernel is not trivial. For these cases, the scikit-learn library provides a grid search (`GridSearchCV`), which essentially performs kernel suggestion. However, the implementation requires the candidate ranges (i.e. kernels, C, gamma, degree) to be explicitly defined, so that each combination can be tested and a suggestion provided.
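A rough sketch of such a grid search (the parameter ranges below are arbitrary, and `GridSearchCV` is imported from `sklearn.model_selection` in recent scikit-learn versions):

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

# a small sample dataset, used here only for demonstration
iris = datasets.load_iris()

# candidate kernels and parameter ranges (chosen arbitrarily for illustration)
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]},
    {'kernel': ['poly'], 'C': [0.1, 1, 10], 'degree': [2, 3]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_)  # the suggested kernel and parameters
```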
### Classification
Generally, SVMs were originally designed for binary classification. However, there are two general approaches to multi-class classification: the first combines the results of multiple binary classifiers, while the second treats the overall problem as a single optimization problem. Commonly encountered strategies include:
- One Against One (OAO)
- One Against All (OAA)
- Crammer-Singer
In general, the scikit-learn library implements multi-class classification as a series of binary classifiers. Specifically, the one-against-one approach is used by the SVC and NuSVC classes. The LinearSVC class, on the other hand, can implement either the one-against-all strategy or the Crammer-Singer strategy (which solves a single optimization problem).
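For reference, a brief sketch of how these strategies are selected in scikit-learn (class and parameter names only; data and fitting are omitted):

```python
from sklearn import svm

# SVC and NuSVC train one-against-one classifiers internally;
# decision_function_shape only changes the shape of the decision function output.
ovo_clf = svm.SVC(decision_function_shape='ovo')

# LinearSVC defaults to one-vs-rest ('ovr'), but can instead solve the
# single Crammer-Singer optimization problem.
ovr_clf = svm.LinearSVC(multi_class='ovr')
cs_clf = svm.LinearSVC(multi_class='crammer_singer')
```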
#### Classifiers
Classifiers are used to predict the class of a sample from a trained data set. As discussed earlier, an SVM is a binary classifier and can only distinguish between two classes. However, multi-class classification is simply a series of binary classifiers. Depending on the approach (i.e. OAO, OAA, Crammer-Singer), the number of classifiers varies.
Specifically, the OAO approach trains `n_class * (n_class - 1) / 2` classifiers, where each one is trained on data from two classes. The OAA approach, however, trains the same number of classifiers as there are classes, except in the trivial binary case (two classes), where a single classifier suffices.
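As a quick illustration of the one-against-one count (with a made-up four-class data set):

```python
import numpy as np
from sklearn import svm

# made-up data set with four classes (two samples per class)
X = np.array([[0, 0], [0, 1], [1, 1], [1, 2], [2, 2], [2, 3], [3, 3], [3, 4]])
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])

clf = svm.SVC(decision_function_shape='ovo').fit(X, y)

# one-against-one trains n_class * (n_class - 1) / 2 = 4 * 3 / 2 = 6 pairwise
# classifiers, which is reflected in the width of the decision function
print(clf.decision_function(X).shape[1])  # 6
```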