Wednesday, 15 March 2017

Support Vector Machine (Theory)



What does it do? A support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a classification task similar to C4.5, except that SVM doesn't use decision trees at all.

What is a hyperplane? A hyperplane is a function like the equation for a line, y = mx + b.
In fact, for a simple classification task with just 2 features, the hyperplane can be a line.

SVM can perform a trick to project our data into higher dimensions. Once the data is projected into higher dimensions, SVM figures out the best hyperplane that separates it into the 2 classes.

Example: The simplest example is a bunch of positive samples and negative samples (balls) on a table. If the samples are not too mixed together, we can take a stick and, without moving the balls, separate the two groups with the stick.
[Figure: positive and negative samples on a table, separated by a stick]
When a new sample is added to the table, by knowing which side of the stick the sample is on, we can predict whether it is positive or negative. The samples represent data points, and the positive and negative samples represent the 2 classes. The stick represents the hyperplane, which in this case is a line. SVM figures out the function for the hyperplane.
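As a quick illustration of the idea so far, here is a minimal sketch that fits a linear SVM on a toy 2-feature dataset and reads off the learned hyperplane. It assumes NumPy and scikit-learn are available (the post itself names no library), and the data points are made up:

import numpy as np
from sklearn.svm import SVC

# Positive and negative samples on the "table" (2 features each).
X = np.array([[2.0, 2.0], [3.0, 3.0], [3.5, 2.5],   # positive samples
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])  # negative samples
y = np.array([1, 1, 1, -1, -1, -1])

model = SVC(kernel="linear", C=1e6)  # a very large C approximates a hard margin
model.fit(X, y)

w, b = model.coef_[0], model.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))

# A new sample is classified by which side of the "stick" it falls on.
print(model.predict([[2.5, 2.5], [0.5, 0.5]]))  # [ 1 -1]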
There are many possible ways to draw a hyperplane, though. Consider the following diagrams.
[Figure: diagrams A, B, and C, each showing a different possible hyperplane]
Figures A, B, and C show some of the hyperplanes that could be drawn, but we need to select the optimal one. The hyperplane must be drawn so that the widest possible street separates the positive samples from the negative samples; hence this is called the widest street approach.
Margins are often associated with SVM. The margin is the distance between the hyperplane and the closest data points from each respective class. In the sample-and-table example, the distance between the stick and the closest positive and negative samples is the margin.
The margin is also referred to as the widest street.
The key is: SVM attempts to maximize the margin, so that the hyperplane is just as far from the positive samples as from the negative samples. In this way, it decreases the chance of misclassification.


[Figure: the hyperplane equidistant from the closest positive and negative samples]
Using the positive sample, negative sample, and table example, the hyperplane is equidistant from a positive sample and a negative sample. These samples, or data points, are called support vectors because they support the hyperplane; hence the name support vector machine.
So we have to make a decision rule that uses the decision boundary to find out whether a sample falls on the left side or the right side of the street (STEP 1). Then we need to maximize the margin in order to find the optimal hyperplane (STEP 2). For this we use a vector W̅ which is perpendicular to the median line pq of the street and whose length is not yet known.

[Figure: the street with median line pq, the normal vector W̅, and an unknown sample vector U̅]
STEP 1:
Let U be an unknown sample for which we need to find out whether it belongs to the right side or the left side of the street. We draw a vector U̅ to it, as shown in the figure above.

W̅ · U̅ gives the projection of U̅ onto W̅. The bigger the projection, the greater the tendency of the sample to be on the positive (right) side of the street; likewise, the smaller the projection, the greater the tendency of the sample to be on the negative (left) side. We use a constant C to express this relationship mathematically:

W̅ · U̅ ≥ C, where C = −b


If this condition is satisfied, the sample lies on the right side of the median, i.e., it may be positive (on the right side of the street). The same equation can be written as
W̅ · U̅ + b ≥ 0

Decision Rule
W̅ · U̅ + b ≥ 0. If this condition is satisfied, the sample is positive; otherwise the sample is negative.
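Here is a tiny sketch of this decision rule in code; W̅, b, and the samples are illustrative assumptions, not values from the post:

import numpy as np

w = np.array([1.0, 1.0])   # decision vector, perpendicular to the median
b = -4.0

def classify(u):
    # Decision rule: positive if w.u + b >= 0, negative otherwise.
    return "positive" if np.dot(w, u) + b >= 0 else "negative"

print(classify(np.array([3.0, 2.5])))  # positive: w.u + b = 1.5
print(classify(np.array([1.0, 1.0])))  # negative: w.u + b = -2.0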
The problem with this rule is that we don't know the values of W̅ or b. W̅ must be perpendicular to the median, and there are many vectors perpendicular to the median, of various lengths.

If we put enough constraints on the problem, we can calculate W̅ and b.
Consistent with the decision rule, we require:
W̅ · X̅₊ + b ≥ 1 ---- Equation 1
If this condition holds for a particular sample, the sample falls on the right side of the street (it is a positive sample).
W̅ · X̅₋ + b ≤ −1 ---- Equation 2
If this condition holds for a particular sample, the sample falls on the left side of the street (it is a negative sample).


For mathematical convenience, we combine the above two equations using a variable Yᵢ:
Yᵢ = +1 for positive samples
Yᵢ = −1 for negative samples

Now multiplying equations 1 and 2 by Yᵢ, we get:
[ W̅ · X̅₊ + b ≥ 1 ] × Yᵢ, with Yᵢ = +1  →  Yᵢ (W̅ · X̅ᵢ + b) ≥ 1  →  Yᵢ (W̅ · X̅ᵢ + b) − 1 ≥ 0
[ W̅ · X̅₋ + b ≤ −1 ] × Yᵢ, with Yᵢ = −1 (multiplying by a negative flips the inequality)  →  Yᵢ (W̅ · X̅ᵢ + b) ≥ 1  →  Yᵢ (W̅ · X̅ᵢ + b) − 1 ≥ 0
Both equations reduce to the same expression, which can hold in one of two forms depending on the sample. Either
Yᵢ (W̅ · X̅ᵢ + b) − 1 ≥ 0
meaning that, depending on the value of Yᵢ, the sample falls on the right side or the left side of the street, or, with equality,
Yᵢ (W̅ · X̅ᵢ + b) − 1 = 0
meaning that the sample lies exactly on an edge (gutter) of the street, i.e., on the margin.
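A small numeric sketch of the combined constraint, checking Yᵢ (W̅ · X̅ᵢ + b) − 1 ≥ 0 for every sample (W̅, b, and the data are again illustrative assumptions):

import numpy as np

w, b = np.array([1.0, 1.0]), -4.0
X = np.array([[3.0, 2.0], [3.0, 3.0], [1.0, 2.0], [1.0, 1.0]])
y = np.array([1, 1, -1, -1])

slack = y * (X @ w + b) - 1
print(slack)                  # all >= 0: every sample is on its correct side
print(np.isclose(slack, 0))   # True marks samples lying exactly on a gutter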


STEP 2:
SVM attempts to maximize the margin, so that the hyperplane is just as far from the positive samples as from the negative samples. In this way, it decreases the chance of misclassification.

Consider the following diagram in order to work out the width of the street.

[Figure: the width of the street measured along the unit normal W̅/|W̅|, between a positive support vector X̅₊ and a negative support vector X̅₋]
Consider a positive sample X̅₊ and a negative sample X̅₋ lying on the gutters of the street. We know that W̅ is normal to the median. Hence:
Width of the street = (X̅₊ − X̅₋) · (W̅/|W̅|), where W̅/|W̅| is the unit vector along W̅.
From Yᵢ (W̅ · X̅ᵢ + b) − 1 = 0 we get W̅ · X̅₊ = 1 − b and W̅ · X̅₋ = −1 − b, so the above equation becomes:
WIDTH = (X̅₊ − X̅₋) · (W̅/|W̅|)
      = (1 − b + 1 + b) / |W̅|
      = 2 / |W̅|
Hence WIDTH = 2/|W̅|.
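A quick numeric check of this result; W̅, b, and the two gutter samples below are illustrative assumptions:

import numpy as np

w, b = np.array([1.0, 1.0]), -4.0
x_pos = np.array([3.0, 2.0])   # on the positive gutter: w.x + b = +1
x_neg = np.array([1.0, 2.0])   # on the negative gutter: w.x + b = -1

unit = w / np.linalg.norm(w)           # unit vector along w
print((x_pos - x_neg) @ unit)          # 1.4142...
print(2.0 / np.linalg.norm(w))         # 1.4142..., i.e. 2/|w|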
So we need to maximize the width, that is, maximize 2/|W̅|. This is the same as minimizing |W̅|, and, since the minimizer does not change, it is mathematically convenient to instead minimize half the square of the decision vector's magnitude, i.e. ½ |W̅|².

So in order to minimize ½ |W̅|² (and thereby maximize the width of the street) subject to the constraints, we use Lagrange multipliers. Lagrange multipliers let us maximize or minimize a function without having to think about each constraint separately: the constraints are folded into a single expression. Introducing a multiplier αᵢ for each constraint, we form the Lagrangian L:
L = ½ |W̅|² − Σᵢ αᵢ [ Yᵢ (W̅ · X̅ᵢ + b) − 1 ]
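As a sanity check, the same constrained problem (minimize ½ |W̅|² subject to Yᵢ (W̅ · X̅ᵢ + b) − 1 ≥ 0) can be handed to a generic numerical solver. A sketch assuming SciPy and made-up toy data:

import numpy as np
from scipy.optimize import minimize

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                      # v = [w1, w2, b]
    return 0.5 * np.dot(v[:2], v[:2])

constraints = [{"type": "ineq",
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
print("w =", res.x[:2], "b =", res.x[2])  # roughly w = [0.5, 0.5], b = -2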
In order to find the decision vector, differentiate the Lagrangian L with respect to W̅ and set the result to zero:
∂L/∂W̅ = W̅ − Σᵢ αᵢ Yᵢ X̅ᵢ = 0  →  W̅ = Σᵢ αᵢ Yᵢ X̅ᵢ
This implies that the decision vector is a weighted sum of the sample vectors (the weights αᵢ are non-zero only for the support vectors).
Differentiating the same Lagrangian with respect to b:
∂L/∂b = −Σᵢ αᵢ Yᵢ = 0  →  Σᵢ αᵢ Yᵢ = 0

Substituting W̅ = Σᵢ αᵢ Yᵢ X̅ᵢ back into L gives the expression to be maximized:
L = Σᵢ αᵢ − ½ Σᵢ Σⱼ αᵢ αⱼ Yᵢ Yⱼ (X̅ᵢ · X̅ⱼ)
What this maximisation expression depicts:
  1. The maximisation depends on the sample vectors.
  2. The optimisation depends only on the dot products of pairs of samples.

The decision rule also depends only on the dot products of the sample vectors with the unknown sample. This can be shown by substituting W̅ = Σᵢ αᵢ Yᵢ X̅ᵢ into the decision rule:

Σᵢ αᵢ Yᵢ (X̅ᵢ · U̅) + b ≥ 0
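A sketch verifying both facts on a fitted model, assuming scikit-learn (whose dual_coef_ attribute stores the products αᵢYᵢ for the support vectors):

import numpy as np
from sklearn.svm import SVC

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([1, 1, -1, -1])
model = SVC(kernel="linear", C=1e6).fit(X, y)

# W = sum_i alpha_i * Y_i * X_i, rebuilt from the support vectors alone.
w = model.dual_coef_[0] @ model.support_vectors_
print(w, model.coef_[0])                     # the two should match

# Decision rule using only dot products of support vectors with the unknown u.
u = np.array([2.5, 2.5])
score = model.dual_coef_[0] @ (model.support_vectors_ @ u) + model.intercept_[0]
print(score, model.decision_function([u]))   # same value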
If the positive samples and negative samples are mixed together, a straight line will not work. In that scenario, quickly lift up the table, throwing the samples into the air. While the samples are in the air, and thrown up in just the right way, use a large sheet of paper to divide them. Lifting up the table is the equivalent of mapping our data into higher dimensions; this is called a transformation. In this case, we go from the 2-dimensional table surface to the 3-dimensional samples in the air.

A kernel gives us a nice way to operate in these higher dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a plane rather than a line. SVM does its thing, maps the data into a higher dimension, and then finds the hyperplane that separates the classes.

Let's call this new transformation Φ(X̅).
Notice that we never used the vector X̅ by itself anywhere in the above derivation; it always appeared inside a dot product with some other vector.
So a kernel provides a function that computes that dot product directly, without ever computing Φ explicitly:
K(X̅, Z̅) = Φ(X̅) · Φ(Z̅)
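A tiny sketch of this for the polynomial kernel K(X̅, Z̅) = (X̅ · Z̅)². For 2-dimensional inputs the corresponding transformation (a standard one, assumed here for illustration) is Φ(x) = (x₁², √2·x₁x₂, x₂²), and the kernel computes the 3-dimensional dot product without ever building Φ explicitly:

import numpy as np

def phi(v):
    # Explicit map from 2 dimensions into 3.
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def K(x, z):
    # Kernel: the same dot product, computed in the original 2-D space.
    return np.dot(x, z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(K(x, z))          # 121.0
print(phi(x) @ phi(z))  # 121.0, identical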


