# Logistic Regression from scratch using Python

This article explains how to perform binary classification with logistic regression implemented from scratch in Python.

What is Logistic Regression? Why is it used for classification?

Logistic regression is a statistical model for a dichotomous (binary) dependent variable. It passes a linear combination of the independent variables through the logistic (sigmoid) function, which outputs a probability between 0 and 1. Because that probability can be thresholded into one of two classes, logistic regression is mainly used for classification.

What is the Logistic or Sigmoid Function?

As per Wikipedia, “A sigmoid function is a mathematical function having a characteristic “S”-shaped curve or sigmoid curve.” The logistic sigmoid, $\sigma(z) = 1 / (1 + e^{-z})$, maps any real input to a value between 0 and 1 on a continuous scale.

Why do we need to use the cross-entropy cost function rather than mean squared error for logistic regression?

The cross-entropy cost function, also called log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. In linear regression, the mean squared error is a convex function of the parameters, so it has a single global minimum that any gradient-based optimization algorithm can reach. In logistic regression, combining the mean squared error with the logistic function gives a non-convex cost, which can have many local minima. The cross-entropy cost combined with the logistic function gives a convex curve with a single global minimum.
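A quick numerical check can make the convexity claim concrete. The sketch below is an illustration under simplifying assumptions, not a proof: it uses a single made-up training example (x = 1, y = 1) and one parameter `theta`, and looks at the sign of the second difference of each cost along an evenly spaced grid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hypothetical training example: feature x = 1, label y = 1.
x, y = 1.0, 1.0

# Sweep a single parameter theta over an evenly spaced grid.
thetas = np.linspace(-6.0, 6.0, 201)
h = sigmoid(thetas * x)

mse = (h - y) ** 2                                          # squared error
cross_entropy = -(y * np.log(h) + (1 - y) * np.log(1 - h))  # log loss

# On an even grid, the second difference has the same sign as the curvature.
mse_curvature = np.diff(mse, n=2)
ce_curvature = np.diff(cross_entropy, n=2)

print("MSE curvature changes sign:",
      bool((mse_curvature < 0).any() and (mse_curvature > 0).any()))
print("Cross-entropy curvature always positive:",
      bool((ce_curvature > 0).all()))
```

A cost whose curvature changes sign is not convex, which is the problem with squared error here. The log loss keeps a positive curvature over the whole grid.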

As the figures below show, the cross-entropy cost function can be explained as follows:

1) If the actual y = 1, the cost (loss) falls toward 0 as the predicted probability approaches 1, i.e. as the model predicts the correct outcome.

2) If the actual y = 0, the cost (loss) grows as the predicted probability approaches 1, i.e. as the model predicts the wrong outcome.
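The two cases can also be checked numerically. The sketch below is illustrative only: the probabilities in `h` are made-up values, and the per-sample losses are the usual log-loss terms, -log(h) when y = 1 and -log(1 - h) when y = 0.

```python
import numpy as np

# Hypothetical predicted probabilities h = P(y = 1).
h = np.array([0.01, 0.5, 0.99])

loss_if_y1 = -np.log(h)        # cost when the actual label is 1
loss_if_y0 = -np.log(1 - h)    # cost when the actual label is 0

for p, l1, l0 in zip(h, loss_if_y1, loss_if_y0):
    print(f"h = {p:.2f}   loss if y=1: {l1:.3f}   loss if y=0: {l0:.3f}")
```

As `h` grows, the loss for y = 1 shrinks while the loss for y = 0 grows, so a confident wrong prediction is penalized far more than a confident correct one.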

So if we join both curves below, the result is a convex function with a single global minimum, reached when the model predicts the correct outcome (0 or 1).

How to determine the number of model parameters?

1) The number of model parameters (theta) depends on the number of independent variables.

2) For example, to perform classification with a linear decision boundary and 2 independent variables, the model needs 3 parameters: one weight per variable plus an intercept.
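The parameter count can be read directly off the shape of the design matrix. The sketch below uses a small made-up dataset (the values in `X` are hypothetical) and assumes the intercept is handled by prepending a column of ones.

```python
import numpy as np

# Hypothetical data: 5 samples, 2 independent variables (X1, X2).
X = np.array([[0.1, 1.2],
              [0.4, 0.7],
              [1.5, 2.0],
              [2.2, 0.3],
              [3.1, 1.8]])

# Prepend a column of ones so theta_0 acts as the intercept.
X_design = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)

# One parameter per column: theta_0 (intercept), theta_1, theta_2.
theta = np.zeros(X_design.shape[1])

print(X_design.shape)  # (5, 3)
print(theta.shape)     # (3,)
```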

How to determine the decision boundary for logistic regression?

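The decision boundary is the set of points where the predicted probability equals the 0.5 threshold, that is, where the linear term is zero. With two independent variables:

$$\theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0 \quad\Longrightarrow\quad x_2 = -\frac{\theta_0 + \theta_1 x_1}{\theta_2}$$

Below is example Python code for binary classification using logistic regression.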

``````python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
``````

Function to create random data for classification

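``````python
def random():
    X1 = []
    X2 = []
    y = []

    # Fixed seed so the generated data is reproducible
    np.random.seed(1)

    # Class 0: 20 points with X2 drawn from [0, 100)
    for i in range(0, 20):
        X1.append(i)
        X2.append(np.random.randint(100))
        y.append(0)

    # Class 1: 30 points with X2 drawn from [80, 300)
    for i in range(20, 50):
        X1.append(i)
        X2.append(np.random.randint(80, 300))
        y.append(1)

    return X1, X2, y
``````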
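``````python
def standardize(data):
    # Convert to a float array first so list or integer input is handled safely
    data = np.asarray(data, dtype=float)
    # Zero mean, unit standard deviation
    return (data - np.mean(data)) / np.std(data)
``````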
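
As a quick illustration of what standardization does, the sketch below applies a non-mutating variant of `standardize` to a short made-up list (`raw` is hypothetical) and prints the resulting mean and standard deviation.

```python
import numpy as np

def standardize(data):
    # Non-mutating variant: returns a new float array
    data = np.asarray(data, dtype=float)
    return (data - data.mean()) / data.std()

raw = [10, 20, 30, 40, 50]   # hypothetical feature values
scaled = standardize(raw)

print(scaled)
print(scaled.mean(), scaled.std())
```

The scaled values have a mean of about 0 and a standard deviation of about 1, which keeps features of very different ranges (such as X1 and X2 here) on a comparable scale for gradient descent.
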
``````python
def plot(X):
    # Scatter plot of the two (standardized) features
    plt.scatter(X[:, 0], X[:, 1])
    plt.xlabel('X1', fontweight="bold", fontsize=15)
    plt.ylabel('X2', fontweight="bold", fontsize=15)
    plt.title("Scatter Data", fontweight="bold", fontsize=20)
    plt.show()
``````

Sigmoid Function used for Binary Classification

``````python
def sigmoid(X, theta):
    # Linear term z = X . theta, squashed into the range (0, 1)
    z = np.dot(X, theta.T)
    return 1.0 / (1 + np.exp(-z))
``````
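
To see the S-shaped behaviour in numbers, the sketch below evaluates the same `sigmoid` on a few made-up inputs. The one-column `X` and `theta = [1]` are hypothetical, chosen so that z is simply the feature value.

```python
import numpy as np

def sigmoid(X, theta):
    z = np.dot(X, theta.T)
    return 1.0 / (1 + np.exp(-z))

# Hypothetical design matrix with a single column; with theta = [1], z equals the feature value.
X = np.array([[-10.0], [-1.0], [0.0], [1.0], [10.0]])
theta = np.array([1.0])

probabilities = sigmoid(X, theta)
print(probabilities)
```

Large negative inputs map close to 0, large positive inputs close to 1, and z = 0 maps to exactly 0.5, which is why 0.5 is the natural classification threshold.
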

Cross-entropy cost function measures the performance of a classification model whose output is a probability value between 0 and 1. It is also called log loss.

``````python
def cost_function(h, y):
    # Mean cross-entropy (log loss) over all samples
    loss = ((-y * np.log(h)) - ((1 - y) * np.log(1 - h))).mean()
    return loss
``````
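
One practical caveat is that `np.log` breaks down if a predicted probability is exactly 0 or 1. The sketch below shows one common workaround, clipping the probabilities first. The name `stable_cost_function`, the `eps` value, and the sample arrays are assumptions made for illustration and are not part of the original code.

```python
import numpy as np

def stable_cost_function(h, y, eps=1e-15):
    # Hypothetical variant: clip probabilities away from 0 and 1 before taking logs.
    h = np.clip(h, eps, 1 - eps)
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

y = np.array([1, 0, 1, 0])
h_good = np.array([0.9, 0.1, 0.8, 0.2])     # mostly correct probabilities
h_bad = np.array([0.1, 0.9, 0.2, 0.8])      # mostly wrong probabilities
h_extreme = np.array([1.0, 0.0, 1.0, 0.0])  # exact 0/1 outputs: unclipped, 0 * log(0) gives nan

loss_good = stable_cost_function(h_good, y)
loss_bad = stable_cost_function(h_bad, y)
loss_extreme = stable_cost_function(h_extreme, y)

print(loss_good, loss_bad, loss_extreme)
```

The loss is small for the good predictions, large for the bad ones, and stays finite for the extreme case.
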

Gradient descent optimizes the model parameters (theta) by minimizing the log loss.

``````python
def gradient_descent(X, h, y):
    # Gradient of the mean log loss with respect to theta
    return np.dot(X.T, (h - y)) / y.shape[0]
``````
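
The gradient expression X^T (h - y) / m can be sanity-checked against a numerical derivative of the cost. The sketch below is a minimal check on made-up data: the random `X`, the labels `y`, and the starting `theta` are all hypothetical, and the function is named `gradient` here only to keep the example self-contained.

```python
import numpy as np

def sigmoid(X, theta):
    return 1.0 / (1 + np.exp(-np.dot(X, theta)))

def cost_function(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

def gradient(X, h, y):
    # Analytic gradient of the mean log loss with respect to theta
    return np.dot(X.T, (h - y)) / y.shape[0]

# Hypothetical data: 6 samples, an intercept column plus 2 features.
rng = np.random.default_rng(0)
X = np.concatenate((np.ones((6, 1)), rng.normal(size=(6, 2))), axis=1)
y = np.array([0, 1, 0, 1, 1, 0])
theta = np.array([0.1, -0.2, 0.3])

analytic = gradient(X, sigmoid(X, theta), y)

# Central finite differences: perturb one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost_function(sigmoid(X, theta + step), y)
                  - cost_function(sigmoid(X, theta - step), y)) / (2 * eps)

print(analytic)
print(numeric)
```

If the two printed vectors agree to several decimal places, the analytic gradient is consistent with the cost function it is meant to minimize.
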
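``````python
def update_loss(theta, learning_rate, gradient):
    # Standard gradient descent update: step against the gradient
    return theta - (learning_rate * gradient)
``````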
``````python
def predict(X, theta):
    threshold = 0.5
    outcome = []
    result = sigmoid(X, theta)
    # Assign class 1 when the predicted probability exceeds the threshold
    for i in range(X.shape[0]):
        if result[i] <= threshold:
            outcome.append(0)
        else:
            outcome.append(1)
    return outcome
``````
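
The same thresholding rule can also be written without an explicit loop. The sketch below is an alternative formulation, not the article's code: `predict_vectorized` is a hypothetical helper name and `probs` holds made-up sigmoid outputs.

```python
import numpy as np

def predict_vectorized(probabilities, threshold=0.5):
    # Hypothetical helper: 1 where the probability exceeds the threshold, else 0.
    return (np.asarray(probabilities) > threshold).astype(int)

probs = np.array([0.05, 0.5, 0.51, 0.97])   # hypothetical sigmoid outputs
labels = predict_vectorized(probs)
print(labels)                               # [0 0 1 1]
```

A probability of exactly 0.5 is assigned to class 0, matching the `<=` comparison used in `predict`.
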
``````python
def plot_cost_function(cost):
    # Loss recorded at each gradient descent iteration
    plt.plot(cost, label="loss")
    plt.xlabel('Iteration', fontweight="bold", fontsize=15)
    plt.ylabel('Loss', fontweight="bold", fontsize=15)
    plt.title("Cost Function", fontweight="bold", fontsize=20)
    plt.legend()
    plt.show()
``````
``````python
def plot_predict_classification(X, theta):
    # Column 0 of X is the intercept, so the features are columns 1 and 2
    plt.scatter(X[:, 1], X[:, 2])
    plt.xlabel('X1', fontweight="bold", fontsize=15)
    plt.ylabel('X2', fontweight="bold", fontsize=15)
    x = np.linspace(-1.5, 1.5, 50)
    # Boundary: theta_0 + theta_1 * x1 + theta_2 * x2 = 0, solved for x2
    y = -(theta[0] + theta[1] * x) / theta[2]
    plt.plot(x, y, color="red", label="Decision Boundary")
    plt.title("Decision Boundary for Logistic Regression", fontweight="bold", fontsize=20)
    plt.legend()
    plt.show()
``````
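
The line drawn by the plotting function is exactly where the model is undecided. The sketch below checks this numerically, using the theta values reported in the results of this article. The five `x1` grid points are an arbitrary choice for illustration.

```python
import numpy as np

# Parameters as reported in the results section of this article.
theta = np.array([1.731104110180229, 3.384426535937368, 2.841095441821299])

# Points on the boundary satisfy theta_0 + theta_1 * x1 + theta_2 * x2 = 0.
x1 = np.linspace(-1.5, 1.5, 5)
x2 = -(theta[0] + theta[1] * x1) / theta[2]

# Evaluate the model on those points, with the intercept column prepended.
X_boundary = np.column_stack((np.ones_like(x1), x1, x2))
probabilities = 1.0 / (1 + np.exp(-X_boundary @ theta))

print(probabilities)   # each value is 0.5 up to floating-point rounding
```

Points on one side of the line get a probability above 0.5 and are classified as 1. Points on the other side fall below 0.5 and are classified as 0.
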
``````python
if __name__ == "__main__":
    X1, X2, y = random()

    X1 = standardize(X1)
    X2 = standardize(X2)

    X = np.array(list(zip(X1, X2)))
    y = np.array(y)

    plot(X)

    # No. of samples
    m = X.shape[0]

    # No. of features
    n = X.shape[1]

    # No. of classes
    k = len(np.unique(y))

    # Prepend a column of ones for the intercept term
    intercept = np.ones((X.shape[0], 1))
    X = np.concatenate((intercept, X), axis=1)

    # Initialize theta with zeros, one parameter per column of X
    theta = np.zeros(X.shape[1])

    num_iter = 1000
    # Assumed value; the step size is not given in the text
    learning_rate = 0.1

    cost = []

    for i in range(num_iter):
        h = sigmoid(X, theta)
        cost.append(cost_function(h, y))
        # One gradient descent step on theta
        gradient = gradient_descent(X, h, y)
        theta = update_loss(theta, learning_rate, gradient)

    outcome = predict(X, theta)

    plot_cost_function(cost)

    print("theta_0 : {} , theta_1 : {}, theta_2 : {}".format(theta[0], theta[1], theta[2]))

    metric = confusion_matrix(y, outcome)
    print(metric)

    plot_predict_classification(X, theta)
``````

Calculated Model Parameters:

``````
theta_0 : 1.731104110180229 , theta_1 : 3.384426535937368, theta_2 : 2.841095441821299
``````

Confusion Matrix:

``````
[[20  0]
 [ 0 30]]
``````

### References

1. https://en.wikipedia.org/wiki/Logistic_regression
2. https://en.wikipedia.org/wiki/Sigmoid_function
3. https://en.wikipedia.org/wiki/Logistic_function