Let’s understand the basics of Linear Regression Algorithm
Did you know that when you implement a machine learning algorithm using a library like sklearn, you are calling sklearn's methods rather than implementing the algorithm from scratch?
In this article, I will be implementing a Linear Regression Machine Learning model without relying on Python’s easy-to-use sklearn library. This post aims to discuss the fundamental mathematics and statistics behind a Linear Regression model. I hope this will help us fully understand how Linear Regression works in the background.
Note that this is one of the posts in the series Machine Learning from Scratch. You may also like to read similar posts such as Gradient Descent from Scratch, Logistic Regression from Scratch, Decision Tree from Scratch, and Neural Network from Scratch.
Let us first discuss a few statistical concepts used in this post.
Mean: The mean, or average, of a data set is calculated by adding all the numbers in the data set and then dividing by the number of values in the data set.
Variance: Variance is a measure of the spread of the numbers in a data set. In other words, it measures how far the numbers in the set are from the mean, on average (as the average squared distance).
Covariance: Covariance is the measure of the directional relationship between two random variables. In other words, covariance measures how much two random variables vary together.
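To make these three definitions concrete, here is a minimal pure-Python sketch on a small made-up data set. The lists `xs` and `ys` and the helper names are illustrative assumptions, not part of the article's data, and the population form (dividing by the number of values) is assumed.

```python
# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def population_variance(values):
    """Average squared distance of each value from the mean."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def population_covariance(a, b):
    """Average product of the paired deviations from each mean."""
    mu_a, mu_b = mean(a), mean(b)
    return sum((p - mu_a) * (q - mu_b) for p, q in zip(a, b)) / len(a)


mean_x = mean(xs)                        # 3.0
var_x = population_variance(xs)          # 2.0
cov_xy = population_covariance(xs, ys)   # 1.2
print(mean_x, var_x, cov_xy)
```

The covariance is positive here because larger values in `xs` tend to be paired with larger values in `ys`.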
We are going to import NumPy and the pandas library.
import numpy as np
import pandas as pd
We will be using pandas to load the CSV data to a pandas data frame.
df = pd.read_csv('Linear-Regression-Data.csv')
Calculate the mean for Years of Experience and Salary:
mean_yoe = sum(df['YearsExperience']) / float(len(df['YearsExperience']))
mean_salary = sum(df['Salary']) / float(len(df['Salary']))
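Define the variance function. Note that it returns the sum of squared deviations without dividing by the number of values; this is fine here because the same factor is omitted from the covariance and cancels when the slope is computed.
def variance(values, mean):
    # Sum of squared deviations from the mean (not divided by n)
    return sum((val - mean) ** 2 for val in values)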
Define the covariance function for Years of Experience and Salary:
def covariance(yearsexperience, mean_yoe, salary, mean_salary):
    # Sum of the products of paired deviations (not divided by n)
    covariance = 0.0
    for r in range(len(yearsexperience)):
        covariance = covariance + (yearsexperience[r] - mean_yoe) * (salary[r] - mean_salary)
    return covariance
Calculate the variance:
Let us calculate the variance of years of experience and salary columns.
variance_yoe, variance_salary = variance(df['YearsExperience'], mean_yoe), variance(df['Salary'], mean_salary)
print(variance_yoe, variance_salary)
Calculate the covariance:
Let us calculate the covariance between years of experience and salary.
covariance_yoe_salary = covariance(df['YearsExperience'], mean_yoe, df['Salary'], mean_salary)
Recall that the equation of a straight line is:
Y = mX + c
where Y is the dependent variable (here, salary),
X is the independent variable (here, years of experience),
m is the slope of the line, and c is the constant (the intercept).
For the best-fit (least-squares) line, the slope m is the covariance of X and Y divided by the variance of X, and the constant c follows from the two means:
m = covariance_yoe_salary / variance_yoe
c = mean_salary - m * mean_yoe
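Written as equations, the two assignments above are the standard least-squares estimates for a line with one predictor. The notation is introduced here for illustration: $x_i$ is years of experience, $y_i$ is salary, bars denote means, and $n$ is the number of rows. It also shows why the `variance` and `covariance` helpers can skip the division by $n$: the factor appears in both the numerator and the denominator and cancels.

```latex
m = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}
  = \frac{\tfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\tfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
```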
Finally, we can predict the salary of a person with 4 years of experience by substituting X = 4 into the fitted line: Y = m * 4 + c.
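A minimal, self-contained sketch of that prediction step might look like this. Because the CSV file is not reproduced in this post, the `years` and `salary` lists are made-up numbers, and `fit_line` and `predict` are hypothetical helper names that repeat the slope and intercept formulas in pure Python.

```python
# Made-up sample data standing in for the YearsExperience and Salary columns.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [39000.0, 46000.0, 50000.0, 55000.0, 60000.0]


def fit_line(xs, ys):
    """Return (m, c) for the least-squares line Y = m*X + c."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of deviations; the 1/n factors cancel in the ratio.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    c = mean_y - m * mean_x
    return m, c


def predict(x, m, c):
    """Evaluate the fitted line at x."""
    return m * x + c


m, c = fit_line(years, salary)
predicted_salary = predict(4, m, c)
print(f"m={m}, c={c}, predicted salary for 4 years: {predicted_salary}")
```

On this toy data the fitted line has slope 5100 and intercept 34700, so 4 years of experience maps to a predicted salary of 55100. With the real CSV the numbers will differ.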
In this article, I built a Linear Regression model from scratch without using the sklearn library. If you compare it with sklearn’s implementation, you should get nearly the same result.
You can find the code related to this article here
Additionally, you may like to watch how to implement Gradient Descent from Scratch in python.
Happy Coding!!