Handling Missing Data In Categorical Column Using Classifier (With Code)

In this article, we will see how a classification model can predict a missing categorical value. We will use an SVM classifier to fill in the missing entries of a categorical column.

You can follow the steps below or get the code on GitHub.

Let’s start coding step by step.

Step 1: Importing the libraries

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

Step 2: Loading the dataset

dataset=pd.read_csv('C:\\Users\\shubh\\Desktop\\dp.csv')

Our dataset looks like this:

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
,50,83000,No
France,37,67000,Yes
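
Before going further, it can help to check where the missing values are. The snippet below is a minimal sketch that inlines the ten sample rows (so it runs without the CSV file) and counts the missing entries per column; the name CSV_TEXT is only an illustration and is not part of the original code.

```python
import io
import pandas as pd

# The ten sample rows shown above, inlined so no file on disk is needed.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Count the missing values in each column.
missing_per_column = dataset.isnull().sum()
print(missing_per_column)
```

Each of Country, Age, and Salary has exactly one missing entry, and each of them sits in a different row.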

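Step 3: Take the rows with a missing Country out of the dataset and keep them in a separate DataFrame. We also drop every row that contains a NaN from the original dataset, so the classifier is trained on complete rows only.

# Rows whose Country is missing; these are the rows we want to predict.
temp_test=dataset[dataset['Country'].isnull()]
# Keep only complete rows for training (rows missing Age or Salary are dropped too).
dataset=dataset.dropna()
# Features: Age, Salary, Purchased. Copy so later edits do not touch a slice of dataset.
X=dataset.iloc[:,1:4].copy()
y=dataset.iloc[:,0]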

Now X holds the Age, Salary, and Purchased columns, and y holds the dependent variable, Country.
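
To see what this split produces on the sample data, here is a small sketch that repeats the step on an inlined copy of the dataset. The names CSV_TEXT and train are assumptions made for the illustration (the article reuses the name dataset instead of train).

```python
import io
import pandas as pd

# Inlined copy of the sample data so the snippet is self-contained.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))

# Rows to predict: Country is missing.
temp_test = dataset[dataset['Country'].isnull()]

# Rows to train on: no missing value in any column.
train = dataset.dropna()
X = train.iloc[:, 1:4].copy()
y = train.iloc[:, 0]

print(len(dataset), len(temp_test), len(train))  # 10 1 7
```

Note that dropna() removes three rows, not one: the rows missing Age or Salary are also excluded from training.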

Step 4: Encode the non-numeric column (Purchased) using label encoding.

le=LabelEncoder()
# Replace the Yes/No strings with integer codes.
X['Purchased']=le.fit_transform(X['Purchased'])

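Step 5: Drop the empty Country column from temp_test and encode its non-numeric column with the encoder fitted in Step 4.

temp_test=temp_test.drop(columns='Country')
# Use transform (not fit_transform) so the codes match the ones learned from the training data.
temp_test['Purchased']=le.transform(temp_test['Purchased'])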
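
One detail worth checking with label encoding is that the integer codes depend on the values the encoder saw when it was fitted. The sketch below, using short hand-written lists rather than the DataFrame columns, shows an encoder fitted on the training values and reused on new values, and what can happen if a fresh encoder is fitted on the new values alone.

```python
from sklearn.preprocessing import LabelEncoder

# Purchased values of the seven complete rows, written out by hand.
train_values = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'Yes']

le = LabelEncoder()
train_codes = le.fit_transform(train_values)

# Classes are sorted, so 'No' maps to 0 and 'Yes' maps to 1.
print(list(le.classes_))

# Reusing the fitted encoder keeps the mapping consistent.
reused_codes = le.transform(['Yes'])

# A fresh encoder fitted only on the new values assigns codes from scratch,
# so 'Yes' becomes 0 here and no longer matches the training data.
refit_codes = LabelEncoder().fit_transform(['Yes'])

print(reused_codes, refit_codes)
```

On the sample dataset the single row to predict has Purchased equal to No, so both approaches happen to give 0, but that is a coincidence of this data.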

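Step 6: The RBF kernel we use with the SVM depends on the Euclidean distance between samples, so features on very different scales (Age versus Salary) should be standardized first. The scaler is fitted on the training data only and then applied to temp_test.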

sc = StandardScaler()
X = sc.fit_transform(X)
temp_test = sc.transform(temp_test)
print(X)
print(temp_test)
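
As a quick check of what the scaler does, the following sketch uses hand-written arrays that mirror the seven complete rows and the one row to predict (with Purchased already encoded as 0/1). The array names are assumptions for the illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age, Salary, Purchased (0 = No, 1 = Yes) for the seven complete rows.
X_train = np.array([
    [44, 72000, 0],
    [27, 48000, 1],
    [30, 54000, 0],
    [38, 61000, 0],
    [35, 58000, 1],
    [48, 79000, 1],
    [37, 67000, 1],
], dtype=float)

# The row whose Country is missing.
X_new = np.array([[50, 83000, 0]], dtype=float)

sc = StandardScaler()
X_scaled = sc.fit_transform(X_train)   # learn mean and std from the training rows
X_new_scaled = sc.transform(X_new)     # reuse the same mean and std

# After scaling, every training column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
print(X_new_scaled)
```

Without this step, Salary (in the tens of thousands) would dominate the distance computation and Age would barely matter.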

Step 7: After scaling the values, we train the SVM model.

classifier=SVC(kernel='rbf',random_state=0)
classifier.fit(X,y)

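Step 8: Predict the Country for the rows in temp_test and insert the predictions into the original dataset.

y_pred=classifier.predict(temp_test)
# Reload the original file, because dataset had its NaN rows dropped in Step 3.
dataset=pd.read_csv('C:\\Users\\shubh\\Desktop\\dp.csv')

# Fill only the rows where Country is missing.
dataset.loc[dataset.Country.isnull(),'Country']=y_pred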
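
The steps so far can also be written as one compact, self-contained sketch. This is not the article's exact code: it swaps in scikit-learn's make_pipeline to chain the scaler and the classifier, inlines the sample data instead of reading the CSV file, and uses illustrative names such as CSV_TEXT, features, and filled.

```python
import io
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

# Inlined copy of the sample data so the snippet is self-contained.
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
,50,83000,No
France,37,67000,Yes
"""

dataset = pd.read_csv(io.StringIO(CSV_TEXT))
features = ['Age', 'Salary', 'Purchased']

# Encode Purchased once on the whole column (it has no missing values here).
encoded = dataset.copy()
encoded['Purchased'] = LabelEncoder().fit_transform(encoded['Purchased'])

# Complete rows are used for training; rows without a Country get predicted.
train = encoded.dropna()
to_fill = encoded[encoded['Country'].isnull()]

# The pipeline scales the features and then fits the RBF-kernel SVM.
model = make_pipeline(StandardScaler(), SVC(kernel='rbf', random_state=0))
model.fit(train[features], train['Country'])

# Write the predictions back into a copy of the original data.
filled = dataset.copy()
filled.loc[to_fill.index, 'Country'] = model.predict(to_fill[features])
print(filled)
```

This sketch assumes that the rows to predict have no missing Age or Salary, which holds for the sample data but would need handling on a larger dataset.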

Step 9: If you want to save the file and work on it later, you can use the to_csv() function.

# index=False keeps the row index from being written as an extra first column.
dataset.to_csv('cat_processed.csv', index=False)
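
A point to keep in mind when saving: by default, to_csv() also writes the row index as an unnamed first column. The small sketch below uses a two-row DataFrame made up for the illustration and returns the CSV text as a string instead of writing a file, so the difference is easy to see.

```python
import pandas as pd

# A tiny illustrative DataFrame, not the full dataset.
df = pd.DataFrame({'Country': ['France', 'Spain'], 'Age': [44, 27]})

# Called without a path, to_csv() returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Compare the header lines of both variants.
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
print(header_with_index)
print(header_without_index)
```

Passing index=False is usually what you want when the file will be read back with read_csv() later.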

That’s it. You can now use an imputer to handle the missing numerical values, or build a similar model to predict them as well. We will cover imputers on this data in a separate article.
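
As a brief preview of the imputer route, here is a minimal sketch using scikit-learn's SimpleImputer with the mean strategy on the Age and Salary values of the sample data. The choice of the mean strategy and the array names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary from the sample data; np.nan marks the missing entries.
numeric = np.array([
    [44, 72000],
    [27, 48000],
    [30, 54000],
    [38, 61000],
    [40, np.nan],
    [35, 58000],
    [np.nan, 52000],
    [48, 79000],
    [50, 83000],
    [37, 67000],
], dtype=float)

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy='mean')
numeric_filled = imputer.fit_transform(numeric)
print(numeric_filled)
```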