This blog post is dedicated to people who have read about support vector machines but never understood why or how they work. In fact, the literature on support vector machines is heavy on terminology and concepts that are quite removed from what is actually going on in high-dimensional data. [If you don’t know what support vector machines are, this post will probably not be of interest to you.]
Now here is a simple explanation of what is actually going on. The view I hold, with which some people may of course disagree, is that support vector machines are nothing but a hybrid between nearest neighbor and weighted linear regression.
Warning: this article may contain technical inaccuracies. Use this information at your own judgment. You can use the feedback system on the blog to notify me of any inaccuracies.
A simple learning algorithm
The starting point is a simple learning algorithm called weighted nearest neighbor. This algorithm does not have a training phase.
At test time: look up the nearest neighbors of the query. Let those be triples (ni, yi, wi) where ni is neighbor i, wi = dot(ni, query), and yi is the label of ni. Output: sum over i of wi*yi.
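As a minimal sketch (assuming the data is a matrix X with one unit-normalized row per training point), the algorithm is just:

```python
import numpy as np

def weighted_nn_predict(X, y, query):
    """Weighted nearest-neighbor prediction: every training point
    votes with weight equal to its dot product with the query."""
    w = X @ query          # similarity of each training point to the query
    return float(w @ y)    # weighted sum of the labels

# toy data: two unit-norm points per class
X = np.array([[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0], [-0.8, 0.6]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(weighted_nn_predict(X, y, np.array([0.9, 0.1])))  # positive => class +1
```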
There are a few problems with this algorithm.
The first problem is that the original features may have to be transformed or weighted somehow to increase the accuracy.
The second problem is that computing the nearest neighbors takes a long time, because we have to search through the whole training dataset.
Let us see how to solve the efficiency problem first. It turns out that solving it gives us a clue about how to solve the accuracy problem.
Typically yi is either +1 or -1 but this is not important now.
We also assume that we return all the neighbors and weigh them by their dot products. Let y = [y1, y2, …, yn]’ be the vector of the labels.
All the dot products between the query and all the training points are given by the row vector query’*X’ (of dimension 1 by n).
The final prediction (i.e. the weighted sum) is just:
- also note that it’s best to normalize the data point vectors
y’(query’X’)’ = y’X query [just to check the dimensions: (1 x n) x (n x m) x (m x 1) = 1 x 1]
Now to get efficiency one needs to precompute y’X which is a (1 x m) vector.
At test time we then get to do the prediction in time m, because we only compute the dot product between z = y’X and the query.
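A quick numerical check of the precomputation trick, on made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 10
X = rng.normal(size=(n, m))
y = rng.choice([-1.0, 1.0], size=n)

z = y @ X  # precompute once at training time: a length-m vector

query = rng.normal(size=m)
slow = float(y @ (X @ query))  # n*m work: all dot products, then the weighted sum
fast = float(z @ query)        # m work: a single dot product
assert np.isclose(slow, fast)
```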
Now we want to address the second problem, which is improving the accuracy of the prediction by considering not the original X but a weighted version of X. What does b1*x[,1] + b2*x[,2] + … mean?
b1, b2, … define a line onto which we project all the data. In high dimensions, if b1, b2, … were random we’d get this:
Why do we want to do that?
So we replace X by X*beta, which reduces our dataset to one dimension. We have to apply the same transformation to the query. And then we get the prediction to be:
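In matrix form, with beta an (m x 1) direction, the prediction becomes y’(X*beta)(beta’*query). A small sketch, where beta is just a random stand-in for whatever direction a solver would return:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5
X = rng.normal(size=(n, m))
y = rng.choice([-1.0, 1.0], size=n)
beta = rng.normal(size=m)    # stand-in for a learned projection direction
query = rng.normal(size=m)

proj_X = X @ beta            # project all data points onto the line
proj_q = beta @ query        # apply the same transformation to the query
pred = float((y @ proj_X) * proj_q)

# same quantity with the brackets rearranged: y' X (beta beta') query
pred2 = float(y @ X @ np.outer(beta, beta) @ query)
assert np.isclose(pred, pred2)
```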
[Supervised vs. unsupervised: draw the two histograms of the projected classes; maybe they overlap. Make the case for a huge beta’*beta.]
What is beta’*beta?
This is just the squared norm of the coefficient vector. If we take a trick out of the playbook of linear regression (ridge regression), beta is (X’X + lambda*I)^(-1) X’y.
So somehow we started from weighted nearest neighbor and ended up at linear regression.
Now this does not mean we should accept this solution as the solution to the SVM problem. All I am saying is that if we take the beta out of an SVM solver and plug it into this nearest-neighbor formulation, we are going to get the SVM solution.
We can also look at the prediction, y’(X*beta)(beta’*query), in another way, by rearranging the brackets.
If you take the beta out of the SVM and look at (X*beta), you get an (n x 1) vector. Then y’ times an (n x 1) vector is just a dot product. So X*beta is just a set of weights on the training points.
Let’s take a look at our picture of projection:
[draw picture] The weights are just the projections of the points onto the projection line. If you take some points from both ends of the line, this will be meaningful, and this is what the SVM does. Those are just the support points.
Now some people misleadingly believe that you need only two (or just a few) points to be the support vectors. Here is why this is very wrong. First, even though we represent the projection line as a one-dimensional object, our points live in an m-dimensional space. So with only two points you are not going to be able to estimate m coefficients. You need at least m points, but even this is not enough: m points suffice from a linear-algebra perspective, but not from a statistics perspective. In statistics we have the so-called variance, which is especially high in large dimensions.
Also note that even if you know the projection vector, there are (m - 1) more dimensions. So even if I told you the optimal coefficients (which define the optimal decision boundary) and asked you to give me some points from which I can extract the same decision boundary, you’d need quite a few points. Selecting the points in order to recover a decision boundary from them is called active learning. The insight you need in order to solve active learning is called optimal design of experiments, a somewhat obscure corner of statistics. [Combined with some other insights about high-dimensional geometry, this gives us that we need to explore about 100*log(n) candidate splitting projection lines from random projections, and for each line we probably do another log(n) at each corner, so on the order of 100*log(n)^2 in total.]
– Aside: how many labeled points are required to train a learning algorithm? It’s about the geometry of the points. You need more unlabeled points to figure out the geometry.
So, the SVM penalty tells us exactly that: we should remove some of the points.
In either case the number of so-called support points can be much smaller than the dataset, just because of this high-dimensional argument:
[Draw a picture: show how many points are in the middle (not too many, otherwise we get a huge error). Then look at the restriction: points to the left of the splitting line versus points to the right; choose the furthest points.]
Aside: if you have a non-linear decision boundary (as in logistic regression), you probably want to project onto the gradient direction instead.
[when we truncate we get a spherical cap: points still in m dim. space]
So, without even solving the SVM formulation we can get an approximation of the support points. Wow!
Now notice that when we know the SVM coefficients, we can easily get good support points from the above intuition:
compute X’W (i.e. the gradient representation), and pick the points with large norms that are also diverse. [How to do that efficiently? X’W => SVD => random projection on the SVD-reduced data, then choose the two furthest points.]
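A minimal sketch of this selection step, using a simpler stand-in for the SVD plus random-projection idea: project onto the coefficient direction and keep the extreme points on both sides (here `w` is assumed to be a learned coefficient vector):

```python
import numpy as np

def approximate_support_points(X, w, k=2):
    """Project the data onto the direction w and return the indices of
    the k most extreme points on each side of the projection line."""
    scores = X @ w
    order = np.argsort(scores)
    return np.concatenate([order[:k], order[-k:]])

# toy example: points spread along the first axis, w along that axis
X = np.array([[-3.0, 0.1], [-1.0, 0.2], [0.1, 0.0], [1.0, -0.2], [3.0, 0.3]])
w = np.array([1.0, 0.0])
idx = approximate_support_points(X, w, k=1)  # the two most extreme points
```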
Notice that logistic regression does that iteratively:
We can do the same algorithm for the SVM. The SVM just makes a hard choice about whether to include a point, while logistic regression makes a soft choice.
Here is the algorithm: take the points (y, x) and fit a weighted linear regression. [Even better, a sparse linear regression, because it will likely result in larger separation, but this is harder.]
Let w be the coefficients from the regression. Compute the gradient of the data points. This gradient represents the projection of each point onto a projection line. [You can work out the equation of the actual projection line.] Project the whole dataset and remove the points in the middle.
Repeat the same using the truncated dataset for estimation, but the full dataset for projection. This may have an advantage over logistic regression in that it avoids weighting some points (i.e. outliers) too heavily. This is one finding people have reported as beneficial for the SVM.
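A rough sketch of this loop, under assumptions of my own: ridge-regularized least squares stands in for the weighted regression, and the keep fraction is fixed:

```python
import numpy as np

def trimmed_linear_fit(X, y, keep_frac=0.5, iters=3, ridge=1e-3):
    """Fit a ridge-regularized linear regression, project ALL points
    onto the fitted direction, drop the points in the middle of the
    projection, and refit on the remainder."""
    n, m = X.shape
    idx = np.arange(n)
    w = np.zeros(m)
    for _ in range(iters):
        Xs, ys = X[idx], y[idx]
        w = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(m), Xs.T @ ys)
        scores = X @ w                         # project the full dataset
        k = max(int(keep_frac * n), m)         # keep at least m points
        idx = np.argsort(np.abs(scores))[-k:]  # drop the middle points
    return w

# separable toy data: labels given by the sign of a linear function
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5]))
w = trimmed_linear_fit(X, y)
accuracy = float(np.mean(np.sign(X @ w) == y))
```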
So, this is big deal 1: the truncation (thresholding) of the predictions. [This is what makes it different from linear and logistic regression.]
Going to high dimensions is not a problem and is sometimes desirable
Now, we used the dot product to compute the similarity. We can probably do a bit better. There is a trick that even people without a formal machine learning background come up with: weight the closest nearest neighbors more heavily. One of the simplest weighting methods is thresholding: just drop the neighbors whose similarity is below a threshold.
Now if you do that in our formulation, you lose the efficiency, because we can no longer apply the linear algebra trick. Notice that the thresholding operation is essentially a non-linear operation. Non-linear operations are powerful but make optimization problems difficult to solve.
Now here comes some motivation for kernels. Instead of thresholding, consider exponentiation. If w is the original dot product, we consider exp(w). Notice that w = dot(query, x) = 1 - 0.5*dist(query, x)^2 [because we assumed the data points are normalized]. Then exp(w) is, ignoring a constant, related to exp(-dist(…)^2). This is the so-called Gaussian (RBF) kernel. Now what the SVM tells you is that this has a nice property: in the new space we can again apply our linear algebra trick. One very simple way to do that is the so-called kernel expansion via randomized functions. Essentially, we write our non-linear function as a sum of a sufficiently large number of simple functions of linear projections. So this is an operational insight that gives one route to solving the SVM with the Gaussian kernel.
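A sketch of the random-feature idea for the Gaussian kernel, in the spirit of Rahimi and Recht's random Fourier features (the parameters D and gamma here are free choices of the sketch, not anything from the text):

```python
import numpy as np

def random_fourier_features(X, D=2000, gamma=0.5, seed=0):
    """Explicit D-dimensional feature map whose dot products
    approximate the Gaussian kernel exp(-gamma * ||a - b||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
Z = random_fourier_features(A)
approx = Z @ Z.T  # approximate kernel matrix: back to plain linear algebra
exact = np.exp(-0.5 * np.square(A[:, None, :] - A[None, :, :]).sum(-1))
max_err = float(np.abs(approx - exact).max())
```

With the kernel replaced by an explicit feature map, the earlier precomputation trick (z = y’X, now in the feature space) applies unchanged.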
I’ll probably return to that, but now I want to see what exponentiation means from the data’s point of view.
Exponentiation makes similar points more similar and dissimilar points less similar. In statistical terms, this means that in kernel space we have emphasized some structure (close points) and de-emphasized other structure (distant points). This is good and is a property exploited by other algorithms, a notable example being t-SNE for visualization. t-SNE is similar to kernels in this respect, but it also puts the data in a low-dimensional space, while SVM kernels accomplish their task by putting the data in a high-dimensional space.
In high dimensional spaces all the distances become very large. So, the claim is that a better separation between the classes can be achieved more easily.
Can we somehow verify this? It turns out that yes, we can. Here is how.
Take the matrix XX’ and compare it to the kernel matrix K. Both matrices are n by n and give the similarities between all pairs of points. Now, we expect the SVM to work if it assigns similar labels to similar points. So we can compute something like pairwise error = sign(yi*yj)*K(i,j). Positive entries are “correct” while negative entries are “incorrect”. It’s true that in the end the SVM will average a bunch of these, so it will clean up some noise, but if a large fraction of them were wrong, we wouldn’t have much of a chance. So we want to achieve a separation between the correct entries and the noise. Typically the correct entries will form a cluster (this is why machine learning works), while the incorrect ones are spread everywhere. This is just an n-dimensional space where each data point is represented by its agreement with its neighbors, and we expect this cluster to be linearly separable from the noise because it has structure. Because we emphasized the structure, we expect the prediction problem to be solved better.
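This check is easy to run. Here is a hypothetical `pairwise_agreement` helper using the linear kernel XX’ (a real check would plug in the actual kernel matrix):

```python
import numpy as np

def pairwise_agreement(X, y):
    """Fraction of point pairs where the similarity sign matches the
    label agreement: sign(y_i * y_j) * K(i, j) > 0 counts as correct."""
    K = X @ X.T                        # linear-kernel similarities
    A = np.outer(y, y) * K             # positive entries are correct pairs
    iu = np.triu_indices_from(A, k=1)  # each pair once, no diagonal
    return float((A[iu] > 0).mean())

# two well-separated clusters of normalized points: all pairs agree
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.9, -0.1]])
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.array([1.0, 1.0, -1.0, -1.0])
print(pairwise_agreement(X, y))  # 1.0
```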
In fact, the underlying space of K(·,·) does not matter. In practice you can reduce the dimension to something like log(n) using random projections or related methods. What matters is how the distances between the points change.
The quadratic kernel emphasizes close points too much. So use a linear combination of the quadratic and linear kernels. Or do another trick: first run an SVD on the data to decorrelate it, then run a few versions, in one of which the magnitudes of the features are squared. [This might give a better result; verify? But we needed to decorrelate the points first.]
Now notice that another way to get the same effect is just to solve linear regression with transformed features. [We can do that too: expand (cosine(a,b) + cosine(a,b)^2) = … If we make the components of a and b independent, then x[,i]*x[,j] will be zero on average. See if there are other useful non-linear transformations.]
– Now this is big deal two: emphasizing and de-emphasizing the distances between points. It’s not so much about putting the points in a fancy new space. The key is to change the distances so that the signal-to-noise ratio increases.
Now because of the dimension (it is log(n)) we need at least a few times log(n) points to get support vectors.
||a||^2 + ||b||^2 - 2*dot(a,b) = dist(a,b)^2
2 - dist(a,b)^2 = 2*dot(a,b) [for normalized points]
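For unit-normalized vectors, these identities are easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
a /= np.linalg.norm(a)  # normalize to unit length
b = rng.normal(size=5)
b /= np.linalg.norm(b)

dist2 = float(np.sum((a - b) ** 2))  # = ||a||^2 + ||b||^2 - 2 a.b
dot = float(a @ b)
assert np.isclose(dist2, 2 - 2 * dot)  # since ||a|| = ||b|| = 1
```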
- I don’t know whether the data will be separable in high dimensions, but because this is equivalent to weighted nearest neighbors, I can play with that formulation directly.
From low dimension to high dimension and back.
From low dim to high dim => changes the distance distribution [improves the signal-to-noise ratio]. From high dim to low dim => efficiency; we need fewer support points (easier to find).
A generalization of the problem to more classes: instead of yi = +1/-1, make yi a vector, e.g. [0.9, 0.1] or [1.0, -1.0]. This generalizes nearest neighbors, and if we stick to linearity we know how to solve those problems as well.
So we need a Taylor series expansion plus log(n) random projections, and we do the learning in this space.