Logistic Regression#
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).
Given \(x\), find \(\hat{y} = p(y=1 \mid x)\)
where \(0 \leq \hat{y} \leq 1\) and \(x \in R^{n_x}\)
Parameters : \(w \in R^{n_x}\), \(b \in R\)
Output : \(\hat{y} = \sigma(w^T x + b)\)
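As a quick illustration of this forward pass, here is a minimal self-contained sketch; the values of `x`, `w`, and `b` below are made up for illustration, not taken from the data used later in the notebook.

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # toy input, x ∈ R^{n_x} with n_x = 3
w = np.array([0.8, 0.1, -0.4])   # toy weights
b = 0.2                          # toy bias

z = w @ x + b                    # wᵀx + b
y_hat = 1 / (1 + np.exp(-z))     # ŷ = σ(wᵀx + b)
print(y_hat)                     # ≈ 0.33, interpreted as p(y=1 | x)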
[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
import seaborn as sns
[2]:
def linear_model_format_X(X):
    # prepend a column of ones to X so that theta_0 acts as the intercept term
    if len(X.shape) == 1:
        X = X.copy().reshape(-1, 1)
    return np.hstack(tup=(np.ones(shape=(X.shape[0], 1)), X))
Cost Function#
The cross-entropy loss function#
We need a loss function that expresses, for an observation x, how close the classifier output
\(\hat{y} = σ(w^T x+b)\)
is to the correct output (y, which is 0 or 1). We’ll call this:
\(L(\hat{y},y)\) = How much \(\hat{y}\) differs from the true \(y\)
We do this via a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters w, b that maximize the log probability of the true y labels in the training data given the observations x. The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
\(J(\theta)=\frac{1}{m} {\sum_{i=1}^{m} cost(h_{\theta}(x^{(i)}),y^{(i)})}\)
With the non-linear sigmoid hypothesis, a squared-error cost would be non-convex and could have many local minima, so we use the following cost instead:
\(cost(h_{\theta}(x), y) = \begin{cases} -\log(h_{\theta}(x)) & \text{if } y=1\\ -\log(1 - h_{\theta}(x)) & \text{if } y=0\\ \end{cases}\)
\(cost(h_{\theta}(x), y) = -y \log(h_{\theta}(x)) - (1 - y)\log(1 - h_{\theta}(x))\)
\(J(\theta) = -\frac{1}{m}{\sum_{i=1}^{m} \big[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \big] }\)
This is the negative log-likelihood, so minimizing \(J(\theta)\) corresponds to maximum likelihood estimation.
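One way to see this, using the same \(h_{\theta}\) notation as above: for a single example, the Bernoulli likelihood of the label is

\begin{align} p(y \mid x; \theta) &= h_{\theta}(x)^{y} \, (1 - h_{\theta}(x))^{1-y}\\ \log p(y \mid x; \theta) &= y \log h_{\theta}(x) + (1 - y) \log(1 - h_{\theta}(x)) \end{align}

so maximizing the log-likelihood over the training set is the same as minimizing \(J(\theta)\).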
[3]:
def calculate_entropy_cost(y_pred, y):
    # cross-entropy cost: J(θ) = -(1/m) Σ [ y log(ŷ) + (1 - y) log(1 - ŷ) ]
    m = y_pred.shape[0]
    part_1 = y * np.log(y_pred)
    part_2 = (1 - y) * np.log(1 - y_pred)
    cost = (-1 / m) * np.sum(part_1 + part_2)
    return cost
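As a quick sanity check (a snippet added here, with hand-picked toy values rather than the dataset used later): confident correct predictions should give a small cost, and confident wrong ones a large cost.

y_true = np.array([[1], [0]])
good_pred = np.array([[0.9], [0.1]])               # confident and correct
bad_pred  = np.array([[0.1], [0.9]])               # confident and wrong
print(calculate_entropy_cost(good_pred, y_true))   # ≈ 0.105
print(calculate_entropy_cost(bad_pred, y_true))    # ≈ 2.303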
Sigmoid Function#
\(\sigma(z)= \frac{1}{1 + e^{-z}}\)
If \(z\) is a large positive number, \(\sigma(z) \approx \frac{1}{1 + 0} \approx 1\), since \(e^{-z} \approx 0\).
If \(z\) is a large negative number, \(\sigma(z) = \frac{1}{1 + e^{-z}} \approx 0\), since \(e^{-z}\) is a very big number.
In terms of the hypothesis, \(h_{\theta}(x) = \frac{1}{1 + e^{- \theta^T x}}\).
[4]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
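A quick numerical check of the limiting behaviour described above (arbitrary input values):

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ≈ [4.5e-05, 0.5, 0.99995]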
[5]:
data = np.linspace(-10,10,100)
plt.plot(data, sigmoid(data),'k.-',alpha=0.6)
plt.grid()
Gradient Descent Algorithm#
minimize \(J(\theta) \rightarrow\) get new theta \(\theta \rightarrow\) calculate \(h_{\theta}(x) = \frac{1}{1 + e^{-\theta^T x}} \rightarrow\) minimize \(J(\theta)\)
\begin{align} \text{repeat until convergence \{}\\ \theta_j &:= \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\\ \text{\}}\\ \end{align}
\(\frac {\partial}{\partial\theta_j} \bigg[-\frac{1}{m}{\sum_{i=1}^{m} \big[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \big]} \bigg]\)
Math for Gradient Descent#
\begin{align} L &= -\frac{1}{m}{\sum_{i=1}^{m} \big[ y^{(i)} \log\hat{y} + (1 - y^{(i)}) \log(1 - \hat{y} ) \big]}\\ \hat{y} &= \frac{1}{1 + e^{- h_{\theta}(x)}}\\ h_{\theta}(x) &= \theta^T x\\ \\ \hat{y} & \text{ depends on } \theta\\ y & \text{ doesn't depend on } \theta \end{align}
The task is to calculate
\(\frac{\partial L }{\partial \theta} = -\frac{1}{m} \frac{\partial}{\partial \theta} \big( y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \big)\)
\begin{align} \frac{\partial \hat{y}}{\partial \theta} &= \frac{\partial}{\partial \theta} \big( \frac{1}{1 + e^{-h_{\theta}(x)}} \big)\\ &= - \frac{1}{(1 + e^{-h_{\theta}(x)})^2} \frac{\partial}{\partial \theta} (1 + e^{-h_{\theta}(x)})\\ &= \frac{e^{-h_{\theta}(x)} x}{(1 + e^{-h_{\theta}(x)})^2}\\ &= \hat{y}^2 e^{-h_{\theta}(x)} x \\ \\ \frac{\partial L }{\partial \theta} =& -\frac{1}{m} \frac{\partial}{\partial \theta} \big( y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})} \big)\\ =& -\frac{1}{m} \bigg[ \frac{\partial}{\partial \theta} ( y \log{\hat{y}}) + \frac{\partial}{\partial \theta}((1 - y) \log{(1 - \hat{y})}) \bigg]\\ =& -\frac{1}{m} \bigg[ \frac{y}{\hat{y}} \frac{\partial}{\partial \theta} \hat{y} + \frac{(1 - y)}{(1 - \hat{y})} \frac{\partial}{\partial \theta}(1 - \hat{y}) \bigg]\\ =& -\frac{1}{m} \bigg[ \frac{y}{\hat{y}} \frac{\partial}{\partial \theta} \hat{y} - \frac{(1 - y)}{(1 - \hat{y})} \frac{\partial}{\partial \theta} \hat{y} \bigg]\\ =& -\frac{1}{m} \bigg[ \frac{y}{\hat{y}} \hat{y}^2 e^{-h_{\theta}(x)} x - \frac{(1 - y)}{(1 - \hat{y})} \hat{y}^2 e^{-h_{\theta}(x)} x \bigg]\\ \\ \\ 1 - \hat{y} &= 1 - \frac{1}{1 + e^{-h_{\theta}(x)}}\\ &= \frac{1 + e^{-h_{\theta}(x)} - 1}{1 + e^{-h_{\theta}(x)}}\\ &= \hat{y} e^{-h_{\theta}(x)}\\ \\ \\ \frac{\partial L }{\partial \theta} =& -\frac{1}{m} \bigg[ \frac{y}{\hat{y}} \hat{y}^2 e^{-h_{\theta}(x)} x - \frac{(1 - y)}{\hat{y} e^{-h_{\theta}(x)}} \hat{y}^2 e^{-h_{\theta}(x)} x \bigg]\\ =& -\frac{1}{m} \bigg[ y \hat{y} e^{-h_{\theta}(x)} x - (1 - y)\hat{y} x \bigg]\\ =& -\frac{1}{m} \bigg[ y \hat{y} e^{-h_{\theta}(x)} x - \hat{y} x + y\hat{y} x \bigg]\\ =& -\frac{1}{m} \bigg[ y \hat{y} e^{-h_{\theta}(x)} - \hat{y} + y\hat{y} \bigg]x\\ =& -\frac{1}{m} \bigg[ y \hat{y}(1 + e^{-h_{\theta}(x)}) - \hat{y} \bigg]x\\ &\text{and since } \hat{y}(1 + e^{-h_{\theta}(x)}) = 1:\\ \frac{\partial L }{\partial \theta} =& -\frac{1}{m} \bigg[ y - \hat{y} \bigg]x \end{align}
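To double-check the final expression \(\frac{\partial L}{\partial \theta} = \frac{1}{m}(\hat{y} - y)x\) numerically, here is a small gradient-check sketch added for illustration; it reuses sigmoid and calculate_entropy_cost from above, and the data, shapes and variable names are made up.

# analytic gradient from the derivation: ∂J/∂θ = (1/m) Xᵀ (ŷ - y)
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
y_chk = rng.integers(0, 2, size=(50, 1))
theta_chk = rng.normal(size=(3, 1)) * 0.01

def grad_analytic(theta):
    y_hat = sigmoid(X_chk @ theta)
    return (X_chk.T @ (y_hat - y_chk)) / X_chk.shape[0]

def grad_numeric(theta, eps=1e-6):
    # central finite differences of the cross-entropy cost
    g = np.zeros_like(theta)
    for j in range(theta.shape[0]):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[j] += eps
        t_minus[j] -= eps
        g[j] = (calculate_entropy_cost(sigmoid(X_chk @ t_plus), y_chk)
                - calculate_entropy_cost(sigmoid(X_chk @ t_minus), y_chk)) / (2 * eps)
    return g

print(np.max(np.abs(grad_analytic(theta_chk) - grad_numeric(theta_chk))))  # should be very small (≈ 1e-9 or less)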
[6]:
class LogisticRegression:
    def __init__(self, alpha=0.01, iterations=10000):
        self.alpha = alpha
        self.iterations = iterations
        self._theta = None
        self._X = None
        self._y = None
        self._theta_history = None
        self._cost_history = None

    def _format_X_for_theta_0(self, X_i):
        X_i = X_i.copy()
        if len(X_i.shape) == 1:
            X_i = X_i.reshape(-1, 1)
        # prepend a column of ones for theta_0 unless it is already there
        if False in (X_i[..., 0] == 1):
            return np.hstack(tup=(np.ones(shape=(X_i.shape[0], 1)), X_i))
        else:
            return X_i

    @property
    def X(self):
        return self._X

    @property
    def y(self):
        return self._y

    @property
    def theta(self):
        return self._theta

    @property
    def theta_history(self):
        return self._theta_history

    @property
    def cost_history(self):
        return self._cost_history

    def predict(self, X):
        format_X = self._format_X_for_theta_0(X)
        if format_X.shape[1] == self._theta.shape[0]:
            y_pred = sigmoid(format_X @ self._theta)    # (m,1) = (m,n) * (n,1)
            return y_pred
        elif format_X.shape[1] == self._theta.shape[1]:
            y_pred = sigmoid(format_X @ self._theta.T)  # (m,1) = (m,n) * (n,1)
            return y_pred
        else:
            raise ValueError("Shape is not proper.")

    def train(self, X, y, verbose=True, method="BGD", theta_precision=0.001, batch_size=30, regularization=False, penalty=1.0):
        self._X = self._format_X_for_theta_0(X)
        self._y = y

        # number of features+1 because of theta_0
        self._n = self._X.shape[1]
        self._m = self._y.shape[0]

        self._theta_history = []
        self._cost_history = []

        if method == "BGD":  # batch gradient descent
            self._theta = np.random.rand(1, self._n) * theta_precision
            if verbose: print("random initial θ value :", self._theta)

            for iteration in range(self.iterations):
                # calculate y_pred
                y_pred = self.predict(self._X)

                # new θ to replace old θ
                new_theta = None

                # simultaneous update
                if regularization:
                    gradient = np.mean((y_pred - self._y) * self._X, axis=0)
                    new_theta_0 = self._theta[:, [0]] - (self.alpha * gradient[0])
                    new_theta_rest = self._theta[:, range(1, self._n)] * (1 - ((self.alpha * penalty) / self._m)) - (self.alpha * gradient[1:])
                    new_theta = np.hstack((new_theta_0, new_theta_rest))
                else:
                    gradient = np.mean((y_pred - self._y) * self._X, axis=0)
                    new_theta = self._theta - (self.alpha * gradient)

                if np.isnan(np.sum(new_theta)) or np.isinf(np.sum(new_theta)):
                    print("breaking. found inf or nan.")
                    break

                # override with new θ
                self._theta = new_theta

                # calculate cost to put in history
                cost = calculate_entropy_cost(y_pred=self.predict(X=self._X), y=self._y)
                self._cost_history.append(cost)

                # calculated theta in history
                self._theta_history.append(self._theta[0])

        elif method == "SGD":  # stochastic (mini-batch) gradient descent
            self._theta = np.random.rand(1, self._n) * theta_precision
            if verbose: print("random initial θ value :", self._theta)

            for iteration in range(self.iterations):
                # creating indices for batches
                indices = np.random.randint(0, self._m, size=batch_size)

                # creating batch for this iteration
                X_batch = np.take(self._X, indices, axis=0)
                y_batch = np.take(self._y, indices, axis=0)

                # calculate y_pred
                y_pred = self.predict(X_batch)

                # new θ to replace old θ
                new_theta = None

                # simultaneous update
                if regularization:
                    gradient = np.mean((y_pred - y_batch) * X_batch, axis=0)
                    new_theta_0 = self._theta[:, [0]] - (self.alpha * gradient[0])
                    new_theta_rest = self._theta[:, range(1, self._n)] * (1 - ((self.alpha * penalty) / self._m)) - (self.alpha * gradient[1:])
                    new_theta = np.hstack((new_theta_0, new_theta_rest))
                else:
                    gradient = np.mean((y_pred - y_batch) * X_batch, axis=0)
                    new_theta = self._theta - (self.alpha * gradient)

                if np.isnan(np.sum(new_theta)) or np.isinf(np.sum(new_theta)):
                    print("breaking. found inf or nan.")
                    break

                # override with new θ
                self._theta = new_theta

                # calculate cost to put in history
                cost = calculate_entropy_cost(y_pred=self.predict(X=X_batch), y=y_batch)
                self._cost_history.append(cost)

                # calculated theta in history
                self._theta_history.append(self._theta[0])

        else:
            print("No Method Defined.")
[7]:
from sklearn.datasets import make_classification
X,y = make_classification(n_samples=500,n_features=4)
y = y.reshape(-1,1)
X.shape, y.shape
[7]:
((500, 4), (500, 1))
[8]:
df = pd.DataFrame(X)
df['y'] = y
df.head()
[8]:
|   | 0 | 1 | 2 | 3 | y |
|---|---|---|---|---|---|
| 0 | -1.394862 | 1.068507 | -1.687573 | -1.411072 | 0 |
| 1 | 0.317359 | 1.075521 | 0.914416 | -1.940417 | 0 |
| 2 | -1.540031 | 1.312455 | -1.809805 | -1.785586 | 0 |
| 3 | -0.802099 | 1.197226 | -0.735972 | -1.810919 | 0 |
| 4 | -0.235792 | -0.109292 | -0.401901 | 0.258678 | 0 |
[9]:
%matplotlib inline
fig = plt.figure(figsize=(10,10))
ax = plt.axes(projection='3d')
ax.scatter(X[...,0],X[...,1],X[...,2],c=y)
plt.show()
[10]:
sns.pairplot(df,hue='y')
[10]:
<seaborn.axisgrid.PairGrid at 0x7f21a03bf9d0>
[11]:
from sklearn.metrics import confusion_matrix,accuracy_score
Logistic Regression without Regularization#
Batch Gradient Descent#
[31]:
logisitc_reg_model1 = LogisticRegression(alpha=0.1, iterations=2000)
logisitc_reg_model1.train(X=X, y=y, method="BGD", verbose=False)
y_pred = logisitc_reg_model1.predict(X)
theta = logisitc_reg_model1.theta
theta_history = logisitc_reg_model1.theta_history
cost_history = logisitc_reg_model1.cost_history
print("Fit theta :",theta)
print(f"""
Confusion Matrix :
{confusion_matrix(y,y_pred>0.5)}
Accuracy Score :
{accuracy_score(y,y_pred>0.5)}
""")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(2,2,1)
ax.set(
xlabel="iterations",
ylabel="loss :j(θ)"
)
ax.plot(cost_history)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,2)
ax.set(
xlabel="y_pred",
ylabel="y"
)
ax.scatter(y_pred,y)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,(3,4))
ax.set(
ylabel="loss: j(θ)",
xlabel="θ"
)
ax.plot(theta_history,cost_history)
ax.grid(alpha=0.5)
plt.show()
Fit theta : [[-0.21130795 0.16250085 -2.23739897 -0.65426128 3.7890939 ]]
Confusion Matrix :
[[243 9]
[ 6 242]]
Accuracy Score :
0.97
Stochastic Gradient Descent#
[22]:
logisitc_reg_model2 = LogisticRegression(alpha=0.1, iterations=2000)
logisitc_reg_model2.train(X=X, y=y, method="SGD",batch_size=200, verbose=False)
y_pred = logisitc_reg_model2.predict(X)
theta = logisitc_reg_model2.theta
theta_history = logisitc_reg_model2.theta_history
cost_history = logisitc_reg_model2.cost_history
print("Fit theta :",theta)
print(f"""
Confusion Matrix :
{confusion_matrix(y,y_pred>0.5)}
Accuracy Score :
{accuracy_score(y,y_pred>0.5)}
""")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(2,2,1)
ax.set(
xlabel="iterations",
ylabel="loss :j(θ)"
)
ax.plot(cost_history)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,2)
ax.set(
xlabel="y_pred",
ylabel="y"
)
ax.scatter(y_pred,y)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,(3,4))
ax.set(
ylabel="loss: j(θ)",
xlabel="θ"
)
ax.plot(theta_history,cost_history)
ax.grid(alpha=0.5)
plt.show()
Fit theta : [[-0.2082787 0.17501482 -2.23356835 -0.63291115 3.77958104]]
Confusion Matrix :
[[243 9]
[ 6 242]]
Accuracy Score :
0.97
Logistic Regression with Regularization#
Addressing overfitting:

- Reduce the number of features
  - manually select which features to keep
  - use a model selection algorithm
- Regularization
  - keep all the features but reduce the magnitude/values of the parameters \(\theta\)
  - works well when we have a lot of features and each contributes a bit to predicting \(y\)
\begin{align} \text{repeat until convergence \{}\\ \theta_0 &:= \theta_0 - \alpha \frac{1}{m} {\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_0^{(i)}}\\ \theta_j &:= \theta_j - \alpha \big[ \frac{1}{m} {\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}} + \frac{\lambda}{m} {\theta_j} \big]\\ \text{\}}\\ \end{align}
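Expanding the bracket, the update for \(\theta_j\) (\(j \geq 1\)) is \(\theta_j := \theta_j\big(1 - \alpha\frac{\lambda}{m}\big) - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_j^{(i)}\): the weights are shrunk slightly before the usual gradient step, while \(\theta_0\) is left unpenalized. This is what the regularization=True branch of the class above computes. As a standalone illustration, here is a minimal sketch of one such step; the function name and argument shapes are my own, not part of the class.

# one regularized gradient-descent step, assuming:
#   X_b     : (m, n) design matrix whose first column is all ones (bias)
#   y_b     : (m, 1) labels
#   theta_b : (1, n) current parameters
#   alpha, lam : learning rate and regularization strength λ
def regularized_step(theta_b, X_b, y_b, alpha, lam):
    m = X_b.shape[0]
    y_hat = sigmoid(X_b @ theta_b.T)                        # (m, 1) predictions
    gradient = np.mean((y_hat - y_b) * X_b, axis=0)         # (n,) averaged gradient
    new_theta = theta_b - alpha * gradient                  # plain gradient step
    new_theta[:, 1:] -= alpha * (lam / m) * theta_b[:, 1:]  # shrink all θ_j except θ_0
    return new_theta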
Stochastic Gradient Descent#
[29]:
logisitc_reg_model3 = LogisticRegression(alpha=0.1, iterations=5000)
logisitc_reg_model3.train(X=X, y=y, method="SGD",batch_size=300,regularization=True,penalty=1.0, verbose=False)
y_pred = logisitc_reg_model3.predict(X)
theta = logisitc_reg_model3.theta
theta_history = logisitc_reg_model3.theta_history
cost_history = logisitc_reg_model3.cost_history
print("Fit theta :",theta)
print(f"""
Confusion Matrix :
{confusion_matrix(y,y_pred>0.5)}
Accuracy Score :
{accuracy_score(y,y_pred>0.5)}
""")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(2,2,1)
ax.set(
xlabel="iterations",
ylabel="loss :j(θ)"
)
ax.plot(cost_history)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,2)
ax.set(
xlabel="y_pred",
ylabel="y"
)
ax.scatter(y_pred,y)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,(3,4))
ax.set(
ylabel="loss: j(θ)",
xlabel="θ"
)
ax.plot(theta_history,cost_history)
ax.grid(alpha=0.5)
plt.show()
Fit theta : [[-0.17386854 0.13890258 -1.98945975 -0.58958285 3.37052331]]
Confusion Matrix :
[[243 9]
[ 6 242]]
Accuracy Score :
0.97
Batch Gradient Descent#
[24]:
logisitc_reg_model4 = LogisticRegression(alpha=0.1, iterations=2000)
logisitc_reg_model4.train(X=X, y=y, method="BGD", regularization=True, penalty=2.0, verbose=False)
y_pred = logisitc_reg_model4.predict(X)
theta = logisitc_reg_model4.theta
theta_history = logisitc_reg_model4.theta_history
cost_history = logisitc_reg_model4.cost_history
print("Fit theta :",theta)
print(f"""
Confusion Matrix :
{confusion_matrix(y,y_pred>0.5)}
Accuracy Score :
{accuracy_score(y,y_pred>0.5)}
""")
fig = plt.figure(figsize=(10,10))
ax = fig.add_subplot(2,2,1)
ax.set(
xlabel="iterations",
ylabel="loss :j(θ)"
)
ax.plot(cost_history)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,2)
ax.set(
xlabel="y_pred",
ylabel="y"
)
ax.scatter(y_pred,y)
ax.grid(alpha=0.5)
ax = fig.add_subplot(2,2,(3,4))
ax.set(
ylabel="loss: j(θ)",
xlabel="θ"
)
ax.plot(theta_history,cost_history)
ax.grid(alpha=0.5)
plt.show()
Fit theta : [[-0.10691344 0.11932182 -1.76102026 -0.52738503 2.98468473]]
Confusion Matrix :
[[243 9]
[ 5 243]]
Accuracy Score :
0.972