Simple Linear Regression in Python


In this article, we will walk through Simple Linear Regression in Python, step by step.

If you would rather watch than read, there is a video covering the same material.

You can follow all the steps here or get the code on GitHub.

Let’s start coding step by step.

First, import the libraries and the dataset (created in the previous video).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #for visualization

dataset=pd.read_csv('op.csv')


# Column 2 is the feature and column 3 is the target
# (plotted later as Age and Salary)
X=dataset.iloc[:,2]
y=dataset.iloc[:,3]

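Replace the NaN values in the numerical columns with the mean of each column.

# Assign the result instead of using inplace=True on a column
# taken from the DataFrame, which can raise a pandas warning
X=X.fillna(X.mean())
y=y.fillna(y.mean())

The missing values are now replaced by the mean of the remaining values in each column.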
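
To see what mean imputation does on a small scale, here is a minimal sketch. The `ages` values are made up for illustration and are not taken from op.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry (not from op.csv)
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# Series.mean() skips NaN by default, so the mean is (20 + 30 + 40) / 3 = 30
filled = ages.fillna(ages.mean())

print(filled.tolist())
```

The missing entry takes the value 30.0, while the other entries stay unchanged.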

Now we split the data into a training set and a test set.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)
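
As a quick illustration of what this call does, the sketch below splits ten made-up samples. The names `X_demo` and `y_demo` are hypothetical. With test_size=0.2, two samples go to the test set, and a fixed random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, target is twice the feature
X_demo = np.arange(10)
y_demo = X_demo * 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# 80% of the rows go to training, 20% to testing
print(len(X_tr), len(X_te))

# Features and targets are shuffled together, so pairs stay aligned
print((y_tr == X_tr * 2).all())
```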

For visualization, we will need X_train and X_test in their original form (before scaling).

X_train1=X_train
X_test1=X_test

Whether to scale the features (columns) depends on the algorithm. Here we will scale. Note that we split the data first and scale afterwards: the scaler must learn its mean and standard deviation from the training set only, so that no information from the test set leaks into training.

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=X_train.values.reshape(-1,1)
X_test=X_test.values.reshape(-1,1)
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)

This is why we split the dataset first, then call fit_transform on the training data and only transform on the test set: the test set never influences the scaling parameters.
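
The following sketch, on made-up numbers, shows that the scaler keeps the statistics it learned from the training data and reuses them on the test data. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, already shaped as (n_samples, 1)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The stored mean is the training mean (20.0), not the mean of all the data
print(scaler.mean_[0])

# The test value is standardized with the training mean and std
manual = (test - train.mean()) / train.std()
print(np.allclose(test_scaled, manual))
```

Had the scaler been fitted on the full data, the stored mean would be 25.0, which means the test value would already have shaped the preprocessing.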

Preprocessing ends here for this dataset, and you could now apply any algorithm to it. We will use Simple Linear Regression.

from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(X_train,y_train)

# Predict the test set results

y_pred=regressor.predict(X_test)
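
LinearRegression fits the line by ordinary least squares. With a single feature, the slope and intercept have a simple closed form, sketched below with NumPy on made-up, noise-free data (`x` and `y` here are hypothetical, not the article's dataset). On the fitted model above, the corresponding values are available as regressor.coef_ and regressor.intercept_. Because the feature was standardized, that slope is per standard deviation of the feature.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

x_mean, y_mean = x.mean(), y.mean()

# slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# The fitted line always passes through the point (x_mean, y_mean)
intercept = y_mean - slope * x_mean

print(slope, intercept)
```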

# Plot the training data first

plt.scatter(X_train1,y_train,color='red')
plt.plot(X_train1,regressor.predict(X_train),color='blue')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

# The line fits the training data decently
# Now plot the test data against the same fitted line
plt.scatter(X_test1,y_test,color='red')
plt.plot(X_train1,regressor.predict(X_train),color='blue')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

from sklearn.metrics import r2_score
print(r2_score(y_test,y_pred))
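
For reference, the R² score is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up numbers (`y_true` and `y_hat` are hypothetical) computes it by hand:

```python
import numpy as np

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 7.0, 10.0])

# Residual sum of squares: 1 + 0 + 0 + 1 = 2
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares around the mean (6.0): 9 + 1 + 1 + 9 = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot
print(r2)
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of the true values.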

That’s it. Now you have learned to implement Simple Linear Regression in Python. You can build a similar model for predicting missing data as well.

Also read: Handling Missing Numerical Data Using SimpleImputer