Linear Regression using Python in 10 lines
Do you think that the salary of a person is linearly related to his/her years of experience?
Whether your answer is yes or no, I am sure you will want to confirm it… :)
So let's predict how your salary may increase as your years of experience increase :)
To start, let's find the salaries and years of experience of 20–30 people on the internet (there are many sites where people post their salaries anonymously).
Let's plot the points on a two-dimensional plane with salary on the Y-axis and years of experience on the X-axis. You may get data points like the red ones shown in the graph above.
Well, you might have noticed one point that is far away from all the others and does not seem to fit their pattern. It is called an outlier because it lies outside the pattern (the blue straight line here).
The outlier tells us that someone has a very good salary even though their years of experience are low (maybe a data scientist :) )
Since it appears that we can fit a straight line to the red dots (ignoring the outlier, of course), we will use linear regression in Python with scikit-learn to find the best-fit line.
The equation of a line is y = mx + c
(Do you think it's time to revise high school math, especially linear algebra? If yes… go for it.) Here x is years of experience and y is salary, while m (the slope) and c (the intercept) are two constants whose values we will calculate.
Then we can find the salary of any person… provided we know their years of experience… :)
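To make the equation concrete, suppose (with purely hypothetical numbers) we had calculated m = 9 and c = 26, with salary measured in thousands. Then predicting a salary is just plugging x into y = mx + c:

```python
# y = m*x + c with hypothetical constants (for illustration only)
m = 9    # slope: extra salary (in thousands) per year of experience
c = 26   # intercept: predicted salary (in thousands) at zero experience

def predict_salary(years):
    return m * years + c

print(predict_salary(5))  # → 71, i.e. 71 thousand for 5 years of experience
```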
Enough theory… let's do some coding. I know many of you are wondering where those 10 lines of Python code promised in the title are…
Here it is….
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load DataSet
dataset = pd.read_csv('data.csv')
# Separate the independent variable (x) and dependent variable (y)
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
# Split the data into train and test part
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.2, random_state = 0)
# Create the model and fit it to the training data
model = LinearRegression()
model.fit(xTrain, yTrain)
# Predict using test data
yPred = model.predict(xTest)
As promised, the above code is 10 lines. We could do it in 10 lines because the scikit-learn functions fit the best straight line to the data points and calculate the constants m and c for us under the hood.
But let me tell you, there is a lot more to do after we have created our model and tested it on the test data.
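For instance, once the model is fitted you can read off the constants m and c that scikit-learn calculated, and score how well the line explains the data. Here is a minimal sketch — it uses synthetic stand-in data (an assumption on my part, since you may not have data.csv handy):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for data.csv: salary ≈ 9000 * years + 26000, plus noise
rng = np.random.default_rng(0)
years = rng.uniform(1, 10, size=30).reshape(-1, 1)
salary = 9000 * years.ravel() + 26000 + rng.normal(0, 2000, size=30)

model = LinearRegression()
model.fit(years, salary)

# m and c, recovered from the fitted model
m = model.coef_[0]
c = model.intercept_
print(f"m (slope): {m:.1f}")
print(f"c (intercept): {c:.1f}")

# R^2 measures how well the line explains the data (1.0 is a perfect fit)
print(f"R^2: {r2_score(salary, model.predict(years)):.3f}")
```

Because the synthetic data was generated from a line, m and c should land close to 9000 and 26000, with an R² near 1.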
Keep exploring linear regression, and I will come up with a more detailed post soon.
Additionally, you may like to watch how to implement linear regression from scratch in Python without using sklearn.
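As a small teaser for the from-scratch approach: for a single feature, the least-squares constants can be computed directly, with m = covariance(x, y) / variance(x) and c = mean(y) − m·mean(x). A sketch with toy numbers (the data here is made up for illustration):

```python
# Least-squares fit of y = m*x + c, no sklearn needed
xs = [1, 2, 3, 4, 5]          # years of experience (toy data)
ys = [35, 45, 55, 65, 75]     # salary in thousands (toy data)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# m = covariance(x, y) / variance(x)
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
c = mean_y - m * mean_x

print(m, c)  # → 10.0 25.0 for this toy data
```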
You can download the data.csv used in this post from here
Happy coding and happy learning :)