# Handling Missing Data


Sklearn preprocessing libraries

Numerical Data – Imputer (`SimpleImputer` in current scikit-learn releases)

Replace the missing data (NaNs) with
1. Zero or least impactful data
2. Mean of the column
3. Median of the column
4. Most frequently occurring value
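The four options above can be sketched with scikit-learn's `SimpleImputer`, which replaced the older `Imputer` class. The toy array `X` and the dictionary names are made up for illustration, and "least impactful" is taken to mean a constant fill of zero here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing value.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [2.0, 50.0],
])

# One imputer per strategy from the list above.
imputers = {
    "constant": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "most_frequent": SimpleImputer(strategy="most_frequent"),
}

# Each imputer learns its fill value per column, then fills the NaNs.
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, result in filled.items():
    print(name)
    print(result)
```

Which strategy fits best depends on the data: the median is usually less sensitive to outliers than the mean, and the most frequent value also works for non-numeric columns.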

Categorical Data – LabelEncoder

Ex: Yes / No

Germany – 0
France – 1
India – 2

First, encode the categorical data as numerical data. But if the categories have no inherent order, it is not wise to simply assign each one a number.

In the above case, the model shouldn’t conclude that India is greater than France and France is greater than Germany.
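A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, using a made-up `countries` list. Note that `LabelEncoder` assigns codes in sorted order of the labels, so the numbers differ from the Germany/France/India mapping above, but the problem is the same: the codes look ordered even though the countries are not.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column of country names.
countries = ["Germany", "France", "India", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the labels in sorted order; a label's index is its code.
print(list(encoder.classes_))
print(list(codes))
```

scikit-learn documents `LabelEncoder` as a tool for target labels; for input features, `OrdinalEncoder` or `OneHotEncoder` is generally the better fit.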

But it does make sense to assign plain numbers when the categories imply an order, for example small, medium and large.

| Data | Encoding |
|--------|----------|
| Small | 0 |
| Medium | 1 |
| Large | 2 |
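One way to get exactly this mapping is scikit-learn's `OrdinalEncoder` with the category order passed explicitly. This is a sketch under that assumption; the `sizes` array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample column; the encoder expects a 2D array (rows x features).
sizes = np.array([["Medium"], ["Small"], ["Large"], ["Small"]])

# Passing the categories explicitly fixes Small < Medium < Large.
# Without it, the encoder would fall back to alphabetical order.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())
```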

How do we encode the data so that the numerical values don’t imply an order?

One-hot encoder

| Country | Germany | France | India |
|---------|---------|--------|-------|
| Germany | 1 | 0 | 0 |
| France | 0 | 1 | 0 |
| India | 0 | 0 | 1 |
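The one-hot layout above can be reproduced with scikit-learn's `OneHotEncoder`. In this sketch the `countries` array is made up, and the category order is passed explicitly only so the columns come out as Germany, France, India; by default the columns would be in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample column; the encoder expects a 2D array (rows x features).
countries = np.array([["Germany"], ["France"], ["India"]])

# Explicit category order controls the column order of the output.
encoder = OneHotEncoder(categories=[["Germany", "France", "India"]])

# The default output is a sparse matrix; convert it to a dense array to view it.
one_hot = encoder.fit_transform(countries).toarray()

print(one_hot)
```

Each country gets its own column, so no value is numerically larger than another and the model cannot read an order into the encoding.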
