# Support Vector Machine#

## Assumptions#

• Binary Classification y = {-1, 1}

• Data is linearly Separable

## Concept# • Defining a linear classifier \begin{align*} h(x) = sign(w^T x + b) \end{align*}

• binary classification with labels -1 and +1

• Typically data is linearly separable and there exist a plane with dimension n - 1 (where n is dimenstion of column space). This plane is defined by Maximum Margin Hyperplane (Hyperplane is a subspace whose dimension is one less than its ambient space)

• This mentioned Hyperplane is denoted by $$H$$ or $$\mathcal{H}$$, is defined by $$\vec{w}$$

• As perceptron also creates a hyperplane between the classes to linearly separate them. but it is needed to create a hyperplane that is optimum.

• In above diagram we can see 3 hyperplanes. hyperplane 2 and 3 also separates the classes, but the test data point is wrongly classified with hyperplane 2’s scenario. In essence, explaining that plane 2 and 3 are not generalized or optimum solutions.

### Maximum Margin Hyperplane#

• The plane maximizing the distance to the closest point from both classes

• Hyperplane is good if it maximizes the margin (which is minimum distance from both classes)

## Margin# Hyperplane is defined by $$\vec{w}$$ as

\begin{align} \mathcal{H} &= \{x | w^T x + b = 0\} \end{align}

Lets consider there is a point x, where $$\vec{d}$$ is vector from Hyperplance $$H$$ to x representing minimum length.
Here $$x_p$$ is projection of x onto Hyperplane $$H$$

\begin{align} \vec{x_p} &= \vec{x} - \vec{d} \\ \because x_p \in H \text{, point on the hyperplane } \therefore w^T x_p + b &= 0 \\ w^T ( \vec{x} - \vec{d} ) + b &= 0\\ \\ \because \text{ d is rescaled vector of w and parallel to it } \therefore \vec{d} &= \alpha \vec{x} \\ \\ w^T ( \vec{x} - \alpha \vec{w} ) + b &= 0\\ \Rightarrow \alpha &= \frac{w^T x + b}{w^T w}\\ \\ \vec{d} &= \frac{w^T x + b}{w^T x} . \vec{x}\\ ||d||_2 &= \sqrt{d^T d} = \sqrt{\alpha^2 w^T w}\\ &= \alpha \sqrt{w^T w} = \frac{w^T x + b}{w^T x} \sqrt{w^T w}\\ ||d||_2 &= \frac{w^T x + b}{\sqrt{w^T w}}\\ ||d||_2 &= \frac{w^T x + b}{||w||_2}\\ \\ \end{align}

Let $$\gamma$$ be the minimum distance from the hyperplane and closest point from both classes

\begin{align} \text{margin } \gamma(w, b) = {min\atop{x \in D}} \frac{w^T x + b}{||w||_2} \end{align}

## Algorithm#

Now according to the definition of maximum margin classifier

\begin{align} {max\atop{w, b}} \gamma(w, b) \end{align}

But there is a problem with this definition as if we just want to increase the margin it can be anywhere (like $$\infty$$ )

This means we need to constraint the hyperplane to be between the classes or the plane of data (necessarily)

Now

\begin{align} {max\atop{w, b}} \gamma(w,b) && \text{ s.t. } && \forall i : y_i (w^T x_i + b) \geq 0 \end{align}

(maximizing the margin to closest points from both classes, such that for all i/samples the classification is correct)

we added constraint to make sure the hyperplane is between the classes, as we know $$y \in \{-1, 1\}$$

if $$w^T x + b = -1$$ and correctly classified then $$-1 \times -1 = 1$$ and for label 1 as well.

if we move forward with this, then

\begin{align} {max\atop{w, b}} \big[ {min\atop{x \in D}} \frac{|w^T x + b|}{w^ T w} \big] && \mid && \forall i : y_i (w^T x_i + b) \geq 0 \\ \\ \underbrace{{max\atop{w, b}} \frac{1}{w^T w} \big[ {min\atop{x \in D}} |w^T x + b| \big]}_{\text{maximizing the margin}} && \mid && \forall i : \underbrace{y_i (w^T x_i + b) \geq 0}_{\text{correct classifications}} \end{align}

### Special Trick# \begin{align} \text{Hyplane defined as } H &= \{ x : w^T x + b = 0\} \\ \\ {max\atop{w, b}} \frac{1}{w^T w} \big[ {min\atop{x \in D}} |w^T x + b| \big] && \mid && \forall i : y_i (w^T x_i + b) \geq 0 \end{align}

• Hyperplane is scale invariant

• If we rescale the value of w and b to any positive value (above diagram) then the result is going to be the same (means there is no unique solution)

• Hence we can fix values of w and b such that

\begin{align} {\min\atop{x \in D}} | w^T x + b | = 1 \end{align}

\begin{align} \\ \therefore {min\atop{w, b}} w^T w && \mid \forall i: y_i(w^T x_i + b) \geq 0 && && {min\atop{x \in D}} | w^T x + b | = 1 \\ \\ & \Updownarrow \\ \\ {min\atop{w, b}} w^T w && \mid \forall i: y_i(w^T x_i + b) \geq 1 \end{align}

### SVM with soft contraints#

• what if maximum margin or perfectly separating hyperplane doesn’t exist (ie there will always be one data point misclassified by the algorithm)

• We’ll introduce a slack variable

\begin{align*} {min \atop{w,b}} w^T w + c \sum_{i=1}^{n} \xi_i && \forall i : y_i(w^T x_i + b) \geq 1 - \xi_i \\ && \forall i : \xi_i \geq 0 \end{align*}

\begin{align} \xi_i \begin{cases} 1 - y_i (w^T x_i + b) && \text{if } y_i(w^T x_i + b) \lt 1 (\text{misclassifications, 1 x -1 and -1 x 1})\\ \\ 0 && \text{if } y_i(w^T x_i + b) \geq 1 (\text{correctly classified results}) \end{cases} \end{align}

This becomes

\begin{align} {min\atop{w,b}} \underbrace{w^T w}_{l2 regularizer} &+ c \underbrace{\sum \max(1 - y_i (w^T x_i + b), 0)}_{loss}\\ \\ \text{let } c &= \frac{1}{\lambda} \\ \\ L(w, x, b) &= {min \atop{w, b}} \lambda w^T w + \sum \max(1 - y_i (w^T x_i + b), 0) \\ \\ \Rightarrow & {min \atop{w, b}} \frac{1}{n} \sum_{i=1}^{n} \underbrace{l(h_w(x_i), y_i)}_{loss} + \lambda \underbrace{\gamma{(w)}}_{regularizer} \end{align}

Now this is very similar to logistic regression

\begin{align} L(w, x, b) &= {min \atop{w, b}} \lambda w^T w + \sum \max(1 - y_i (w^T x_i + b), 0) \\ \\ L(w, x, b) &= {min \atop{w, b}} \lambda || w ||^2 + \sum \max(1 - y_i (w.x_i + b), 0) \\ \\ \frac{\delta L}{\delta w} &= \begin{cases} 2 \lambda w - y_i x_i & \text{if } y_i(w^T x_i + b) \lt 1 \\ \\ 2 \lambda w & \text{if } y_i(w^T x_i + b) \ge 1 \end{cases} \\ \\ \frac{\delta L}{\delta b} &= - y_i \end{align}

\begin{align} \text{run until convergence } \big\{ \\ \\ w := w - \alpha \frac{\delta L}{\delta w} \\ b := b - \alpha \frac{\delta L}{\delta b} \\ \\ \big\} \end{align}

### Data Prep#

:

from sklearn.datasets import make_classification
from sklearn import metrics

X, y = make_classification(n_samples=300, n_features=2, n_redundant=0, n_classes=2, shift=2,
class_sep=3, n_clusters_per_class=1, random_state=42)

y = np.where(y == 0, 1, -1) # to change classes from 0,1 to -1,1

fig = plt.figure(figsize=(7,7))
sns.scatterplot(x=X[:,0], y=X[:,1], hue=y, style=y, ax=ax, palette='deep')
plt.show() ### Implementation#

:

# np.random.seed(3)

class SVM:
def __init__(self, n_iter = 100, lambda_param = 0.01, learning_rate = 0.001):
self.n_iter = n_iter
self.lambda_param = lambda_param
self.learning_rate = learning_rate
self.w = None
self.b = None
self.l_loss = None
self.l_margin = None

@property
def margin(self):
# return 1 / np.linalg.norm(self.w, ord=2)
return 1 / np.sqrt(np.sum(np.square(model.w)))

@property
def loss(self):
return self.l_loss

def fit(self, X, y):
n_samples, n_features = X.shape

self.w = np.zeros(n_features) #np.random.rand(n_features)
self.b = 0
self.l_loss = []
self.l_margin = []

for _ in range(self.n_iter):
for i in range(n_samples):
if (y[i] * ((self.w @ X[i].T) + self.b)) >= 1:
self.w = self.w - ( self.learning_rate * (2 * self.lambda_param * self.w))
else:
self.w = self.w - ( self.learning_rate * (( 2 * self.lambda_param * self.w ) - (y[i] * X[i])))
self.b = self.b + ( self.learning_rate * y[i] )

self.l_loss.append(metrics.hinge_loss(y, self.predict(X)))
self.l_margin.append(self.margin)

def predict(self, X):
return np.sign((self.w @ X.T) + self.b)

def plot_boundary(self, X, ax=None):
# np.random.seed(10)

y_pred = self.predict(X)
margin = self.margin

if ax is None:
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
ax.set_title("Dicision Boundary - Hyperplane")

grid_samples = 100

xx, yy = np.meshgrid(
np.linspace(X[:, 0].min() - 1 , X[:, 0].max() + 1 , grid_samples),
np.linspace(X[:, 1].min() - 1 , X[:, 1].max() + 1 , grid_samples)
)
mesh = np.c_[xx.ravel(), yy.ravel()]
mesh_pred = self.predict(mesh)
zz = mesh_pred.reshape(xx.shape)

ax.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.2)
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, style=y, ax=ax, palette='deep')

#         # get the separating hyperplane
#         a = -self.w/self.w
#         xx = np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, grid_samples)
#         hyperplane = a * xx - (self.b) / self.w

#         hyperplane_margin_up = hyperplane + a * margin
#         hyperplane_margin_down = hyperplane - a * margin

#         # Plot margin lines
#         ax.plot(xx, hyperplane, 'k-', lw=1)
#         ax.plot(xx, hyperplane_margin_up, 'r--', lw=1)
#         ax.plot(xx, hyperplane_margin_down, 'b--', lw=1)

ax.legend()

def plot_loss(self, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
ax.set_title("Loss over iterations")

ax.plot(self.loss)

def plot_margin(self, ax=None):
if ax is None:
fig, ax = plt.subplots(1, 1, figsize=(7, 7))
ax.set_title("Margin over iterations")

ax.plot(self.l_margin)

def plot(self, X):
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

self.plot_boundary(X, ax=ax)
ax.set_title("Dicision Boundary - Hyperplane")

self.plot_loss(ax=ax)
ax.set_title("Loss over iterations")

self.plot_margin(ax=ax)
ax.set_title("Margin over iterations")

:

model = SVM(n_iter=2, lambda_param=0.001, learning_rate=0.01)
model.fit(X, y)
model.plot(X) Observations -

• Hyplane is perfectly separating the classes.

• Loss over iterations is decreasing

• Margin over iterations is decreasing

:

fig, ax = plt.subplots(2, 3, figsize=(15, 10))

ax_ = ax.ravel()

for idx, c_ in enumerate([0.001, 0.01, 0.1, 1, 10, 100]):
model = SVM(n_iter=2, lambda_param=c_, learning_rate=0.001)
model.fit(X, y)
model.plot_boundary(X, ax=ax_[idx])
ax_[idx].set_title(f"Lambda param: {c_}")

plt.tight_layout()
plt.show() 