Handling Missing Numerical Data Using SimpleImputer

handle missing numerical data using SimpleImputer

Read Time2 Minute, 18 Second

In this article, we will see how to handle missing numerical data using SimpleImputer. We will also see scaling, and data splitting.

If you don’t like to read, here’s the video on the same.

You can read all the steps or can get the code on GitHub.

Let’s start coding step by step.

Importing the dataset (created in last video) and libraries.

import pandas as pd
import numpy as np

dataset=pd.read_csv('op.csv')


X=dataset.iloc[:,:-1]
y=dataset.iloc[:,-1]

Replacing nan in numerical columns by mean value of the column values.

from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(X.iloc[:,2:4])
X.iloc[:,2:4]=imputer.transform(X.iloc[:,2:4])

Now the missing values are replaced by the mean of other values from the column.

Now we will do encoding.

As X has independent variables, we will OneHotEncode Country Column.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct=ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[1])],remainder='passthrough')
X=np.array(ct.fit_transform(X))

For france the values are 100, for Spain 001, and for Germany 010.

label encoding dependent variable as it has only two values yes and no.

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y=le.fit_transform(y)

Before this we can split the dataset into training and testing sets.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)

Depending on the algorithm, we should choose whether to scale the data features(columns) or not. Now we will scale. The reason is to prevent the data leakage from X_train to X_test. We splitted the data and then doing scaling.

#feature scaling
from sklearn.preprocessing import StandardScaler #there are multiple methods to scale such as Standard Scaling, Min Max scaling, etc
sc=StandardScaler()
X_train[:,3:]=sc.fit_transform(X_train[:,3:])
X_test[:,3:]=sc.transform(X_test[:,3:])

This is why we first splitted the dataset and then did fit_transform on training data and just transform it on a test set to prevent the leakage to the test set.

Now you can use any algorithm to do your task. Preprocessing ends here for this dataset.

from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

y_pred=classifier.predict(X_test)

That’s it. Now you have learned to handle missing numerical data using SimpleImputer. You can build a similar model for predicting missing data as well.

Also read: Handling Missing Data In Categorical Column Using Classifier (With Code)