Linear Regression Machine Learning Algorithm from scratch

Let’s understand the basics of the Linear Regression algorithm

Did you know that when you implement a machine learning algorithm using a library like sklearn, you are calling sklearn's methods rather than implementing the algorithm from scratch?

In this article, I will be implementing a Linear Regression Machine Learning model without relying on Python’s easy-to-use sklearn library. This post aims to discuss the fundamental mathematics and statistics behind a Linear Regression model. I hope this will help us fully understand how Linear Regression works in the background.

Figure: Linear Regression from Scratch without sklearn

Note that this is one post in the series Machine Learning from Scratch. You may also like to read similar posts such as Gradient Descent from Scratch, Logistic Regression from Scratch, Decision Tree from Scratch, and Neural Network from Scratch.


Let us first discuss a few statistical concepts used in this post.

Mean: The mean, or average, of a data set is calculated by adding all the numbers in the data set and then dividing by the number of values.

Variance: Variance measures the spread of the numbers in a data set. It is the average of the squared distances between each number and the mean.

Covariance: Covariance is the measure of the directional relationship between two random variables. In other words, covariance measures how much two random variables vary together.
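For reference, these three quantities can also be written as formulas. The notation is shorthand I am adding here, not something defined elsewhere in this post: x_i and y_i are the paired values, n is the number of values, and a bar denotes a mean. These are the population versions, which divide by n.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
\operatorname{Var}(x) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
\operatorname{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```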

We are going to import NumPy and the pandas library.

import numpy as np
import pandas as pd

We will use pandas to load the CSV data into a data frame. The code in this post relies on two of its columns, YearsExperience and Salary.

df = pd.read_csv('Linear-Regression-Data.csv')

Figure: Dataset for Linear Regression From Scratch

Calculate the mean for Years of Experience and Salary:

mean_yoe = sum(df['YearsExperience']) / float(len(df['YearsExperience']))
mean_salary = sum(df['Salary']) / float(len(df['Salary']))

Figure: Calculating the mean of a pandas data frame column
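As a quick sanity check, the manual calculation should agree with pandas' built-in Series.mean(). The sketch below uses a small made-up data frame as a stand-in for the CSV (the numbers are mine, only the column names match), so treat it as an illustration rather than the article's data.

```python
import pandas as pd

# Made-up stand-in for the CSV; only the column names match the article
df = pd.DataFrame({
    'YearsExperience': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Salary': [30.0, 35.0, 45.0, 50.0, 65.0],
})

# Manual mean, computed the same way as in the article
manual_mean = sum(df['YearsExperience']) / float(len(df['YearsExperience']))

# pandas' built-in mean for comparison
builtin_mean = df['YearsExperience'].mean()

print(manual_mean, builtin_mean)
```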

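Define the variance function. Note that it returns the sum of squared deviations without dividing by the number of values. The covariance function below leaves out the same division, so the factor cancels when we compute the slope.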

def variance(values, mean):
    # Sum of squared deviations from the mean (not divided by n)
    return sum((val - mean) ** 2 for val in values)

Define the covariance function for Years of Experience and Salary:

def covariance(yearsexperience, mean_yoe, salary, mean_salary):
    # Sum of the products of paired deviations (not divided by n)
    total = 0.0
    for r in range(len(yearsexperience)):
        total += (yearsexperience[r] - mean_yoe) * (salary[r] - mean_salary)
    return total
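To see what these two sums look like on concrete numbers, here is a minimal sketch on five made-up data points. The data and the helper names (mean, sum_sq_dev, sum_cross_dev) are hypothetical and chosen only for illustration; they mirror the logic of the functions above using plain Python lists.

```python
# Made-up toy data, only for illustration (not the article's CSV)
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [30.0, 35.0, 45.0, 50.0, 65.0]

def mean(values):
    return sum(values) / len(values)

def sum_sq_dev(values, mean_value):
    # Same idea as variance() above: sum of squared deviations
    return sum((v - mean_value) ** 2 for v in values)

def sum_cross_dev(xs, mean_x, ys, mean_y):
    # Same idea as covariance() above: sum of products of paired deviations
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

mean_years = mean(years)      # 3.0
mean_sal = mean(salary)       # 45.0

ss_years = sum_sq_dev(years, mean_years)                     # 4 + 1 + 0 + 1 + 4 = 10.0
sc = sum_cross_dev(years, mean_years, salary, mean_sal)      # 30 + 10 + 0 + 5 + 40 = 85.0

print(ss_years, sc)
```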

Calculate the variance:

Let us calculate the variance of years of experience and salary columns.

variance_yoe, variance_salary = variance(df['YearsExperience'], mean_yoe), variance(df['Salary'], mean_salary)
variance_yoe , variance_salary

Calculate the covariance:

Let us calculate the covariance between years of experience and salary.

covariance_yoe_salary = covariance(df['YearsExperience'], mean_yoe, df['Salary'], mean_salary)

Figure: Calculating the variance and covariance of pandas data columns

As we know, the equation of a line is:

Y = mX + c

where Y is the dependent variable (here, salary),

X is the independent variable (here, years of experience),

m is the slope of the line, and c is the intercept (the constant term).

m = covariance_yoe_salary/ variance_yoe
c = mean_salary - m * mean_yoe

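The values of m and c above come from the least-squares formulas for a line with one input: m is the covariance of years of experience and salary divided by the variance of years of experience, and c is the mean salary minus m times the mean years of experience.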
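Written out, the two lines of code for m and c correspond to the standard least-squares estimates for a line with one input. The symbols are notation I am adding here: x_i is years of experience, y_i is salary, and a bar denotes a mean. Because the same 1/n factor would appear in both the covariance and the variance, leaving it out of both functions does not change m.

```latex
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},
\qquad
c = \bar{y} - m\,\bar{x}
```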

Figure: Finding the coefficients for Linear Regression

Finally, we can predict the salary of a person with 4 years of experience by substituting X = 4 into the fitted line, that is, by computing m * 4 + c.

Figure: Prediction using Linear Regression Model
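The prediction step can be sketched in a few lines. The function names fit_line and predict are hypothetical helpers I am introducing for illustration, and the data is made up rather than taken from the CSV, so the number printed here is not the article's result.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept c of the least-squares line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of deviations, as in the covariance() and variance() functions
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    m = cov / var
    c = mean_y - m * mean_x
    return m, c

def predict(x, m, c):
    """Evaluate the fitted line Y = mX + c at a single input x."""
    return m * x + c

# Made-up toy data, only for illustration
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [30.0, 35.0, 45.0, 50.0, 65.0]

m, c = fit_line(years, salary)
predicted = predict(4, m, c)
print(predicted)  # 53.5 for this toy data
```

With the m and c computed from the CSV earlier in the post, the equivalent expression is simply m * 4 + c.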

In this article, I built a Linear Regression model from scratch without using the sklearn library. If you compare it with sklearn’s implementation, it gives nearly the same result.
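One way to sanity-check a from-scratch fit without installing sklearn is to compare it against NumPy's np.polyfit, which also fits a least-squares line when the degree is 1. To be clear, this is a stand-in for the sklearn comparison and not sklearn itself, and the data below is made up.

```python
import numpy as np

# Made-up toy data, only for illustration
years = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([30.0, 35.0, 45.0, 50.0, 65.0])

# From-scratch coefficients: covariance / variance, then the intercept
dev_years = years - years.mean()
dev_salary = salary - salary.mean()
m_scratch = np.sum(dev_years * dev_salary) / np.sum(dev_years ** 2)
c_scratch = salary.mean() - m_scratch * years.mean()

# np.polyfit with degree 1 returns [slope, intercept]
m_np, c_np = np.polyfit(years, salary, 1)

print(np.isclose(m_scratch, m_np), np.isclose(c_scratch, c_np))
```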

You can find the code related to this article here

Additionally, you may like to watch how to implement Gradient Descent from Scratch in python.

Happy Coding !!

Data Scientist & Machine Learning Evangelist. I like to mess with data.