Sklearn preprocessing libraries
Numerical Data – Imputer (SimpleImputer in scikit-learn 0.22 and later)
Replace the missing data (NaNs) with
1. Zero or least impactful data
2. Mean of the column
3. Median of the column
4. Most frequently occurring value
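The four strategies above can be sketched as follows. This is a minimal example, assuming a recent scikit-learn where the class is `sklearn.impute.SimpleImputer`; the array `X` is made-up illustration data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with one NaN each.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan],
              [3.0, 60.0]])

# 1. Replace NaNs with a constant (zero here).
zero_filled = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

# 2. Replace NaNs with the column mean.
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# 3. Replace NaNs with the column median.
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# 4. Replace NaNs with the most frequent value in the column.
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(mean_filled)
```

Each imputer computes its statistic per column, ignoring the NaNs, and fills the gaps with it.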
Categorical Data – LabelEncoder
Ex: Yes / No, or country names mapped to integers:
Germany – 0
France – 1
India – 2
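A minimal sketch of this kind of integer encoding with `LabelEncoder`, using a made-up list of countries. Note that `LabelEncoder` assigns codes in sorted order of the class names, so the actual integers differ from the illustrative mapping above (France gets 0, Germany 1, India 2).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column of country names.
countries = ["Germany", "France", "India", "France"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ holds the sorted unique labels; a label's index is its code.
print(list(encoder.classes_))
print(list(codes))

# The mapping is reversible.
print(list(encoder.inverse_transform(codes)))
```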
First encode the categorical data as numerical data. But if the categories have no inherent order, it is not wise to simply assign them integers.
In the above case, the model shouldn't infer that India is greater than France and France is greater than Germany.
But assigning plain numbers does make sense when the categories imply an order. For example: small, medium and large.
| Data | Encoding |
|---|---|
| Small | 0 |
| Medium | 1 |
| Large | 2 |
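For ordered categories like these, one possible approach is `OrdinalEncoder` with the order spelled out explicitly, so that the integers follow Small < Medium < Large rather than alphabetical order. This is a sketch with made-up sample data; the choice of `OrdinalEncoder` is an assumption, not something the notes above prescribe.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample column; OrdinalEncoder expects 2D input
# (one inner list per row).
sizes = [["Medium"], ["Small"], ["Large"], ["Small"]]

# Pass the categories in the intended order so that
# Small -> 0, Medium -> 1, Large -> 2.
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel())
```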
How to encode so that the numerical values don't convey any implicit order?
One-hot encoder
| Country | Germany | France | India |
|---|---|---|---|
| Germany | 1 | 0 | 0 |
| France | 0 | 1 | 0 |
| India | 0 | 0 | 1 |
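The table above can be reproduced with `OneHotEncoder`. In this sketch the column order is fixed through the `categories` argument to match the table (by default the columns would be in sorted order), and the sample rows are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample column, one row per country (2D input).
countries = [["Germany"], ["France"], ["India"]]

# Fix the column order to Germany, France, India.
encoder = OneHotEncoder(categories=[["Germany", "France", "India"]])

# fit_transform returns a sparse matrix by default; convert it
# to a dense array for display.
one_hot = encoder.fit_transform(countries).toarray()

print(one_hot)
```

Each row has a single 1 in the column of its own category, so no category appears larger or smaller than another.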