
In this article, we will walk through Simple Linear Regression in Python, step by step.
If you prefer not to read, the same steps are covered in the video.
You can follow all the steps here, or get the complete code on GitHub.
Let’s start coding step by step.
Importing the libraries and the dataset (created in the last video).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # for visualization

dataset = pd.read_csv('op.csv')
X = dataset.iloc[:, 2]
y = dataset.iloc[:, 3]
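Before dealing with missing values, it can help to take a quick look at the data. A minimal, optional check (the Age and Salary column names are inferred from the plots later on):

print(dataset.head())        # first few rows
print(dataset.isna().sum())  # missing values per column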
Replacing NaN values in the numerical columns with the mean of the column.
X.fillna(X.mean(), inplace=True)
y.fillna(y.mean(), inplace=True)
The missing values are now replaced by the mean of the other values in the column.
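A quick optional check confirms that no NaN values remain:

print(X.isna().sum())  # should print 0
print(y.isna().sum())  # should print 0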
Next, we split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
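With test_size=0.2, roughly 20% of the rows go into the test set; a quick optional shape check confirms this:

print(X_train.shape, X_test.shape)  # about an 80/20 split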
For visualization we will need X_train and X_test in their original, unscaled form, so we keep copies.
# keep unscaled copies for plotting later
X_train1 = X_train
X_test1 = X_test
Depending on the algorithm, we choose whether or not to scale the features (columns); here we will scale. Note that we split the data first and only scale afterwards, so that no information from the test set leaks into the training process.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
This is why we split the dataset first, then call fit_transform on the training data and only transform on the test set: the scaler's parameters are learned from the training data alone, which prevents leakage from the test set.
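As a small sanity check, the scaler's statistics are estimated from the training data only and simply reused on the test data:

# mean_ and scale_ come from X_train alone and are reused for X_test
print(sc.mean_, sc.scale_)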
Preprocessing for this dataset ends here; from this point you could plug in any algorithm for your task. We are doing Simple Linear Regression.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# predicting test set results
y_pred = regressor.predict(X_test)

# plotting the training data
plt.scatter(X_train1, y_train, color='red')
plt.plot(X_train1, regressor.predict(X_train), color='blue')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()  # decently fit

# plotting the test data
plt.scatter(X_test1, y_test, color='red')
plt.plot(X_train1, regressor.predict(X_train), color='blue')  # same fitted line as above
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()

from sklearn.metrics import r2_score
print(r2_score(y_test, y_pred))
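Because this is simple linear regression with a single feature, the fitted model is just a straight line, salary ≈ intercept + slope × (scaled) age. You can read the learned parameters directly from the fitted regressor:

# fitted line: y = intercept_ + coef_[0] * x, where x is the scaled Age
print(regressor.intercept_, regressor.coef_[0])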
That’s it. You have now learned to implement Simple Linear Regression in Python. You can build a similar model to predict missing data as well, as sketched below.
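Here is a rough sketch of that idea (the column names 'Age' and 'Salary' are assumptions taken from the plot labels above): fit a regressor on the rows where the target value is known, then use its predictions to fill the rows where it is missing.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical example: impute missing 'Salary' values from 'Age'
# (assumes 'Age' itself has no missing values)
df = pd.read_csv('op.csv')
known = df[df['Salary'].notna()]
missing = df[df['Salary'].isna()]

imputer_model = LinearRegression()
imputer_model.fit(known[['Age']], known['Salary'])

# fill the missing salaries with the model's predictions
df.loc[df['Salary'].isna(), 'Salary'] = imputer_model.predict(missing[['Age']])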
Also read: Handling Missing Numerical Data Using SimpleImputer