Summary of "Handling Missing Categorical Data | Simple Imputer | Most Frequent Imputation | Missing Category Imp"

Handling missing categorical data

Two primary strategies:
1. Most‑frequent (mode) imputation: replace missing entries with the most common category.
2. Missing‑category imputation: create a new category (e.g., “Missing”) and assign it to missing entries.

Main ideas

Use mode imputation when missingness is small and one category clearly dominates.
Use a dedicated “Missing” category when missingness is large or missingness may be informative (not missing at random).
The implementations use scikit‑learn’s SimpleImputer and pandas; distributions are compared before and after imputation (e.g., KDE plots of the target by category).

Detailed lessons, rules and rationale

Numerical vs categorical

Numerical features: common imputations are the mean or the median, typically when missingness is small; the median is often preferred when there are outliers.
Categorical features: mean/median do not apply — use mode (most frequent) or an explicit missing category.
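
The contrast above can be sketched on a toy frame. The column names and values here are hypothetical and chosen only to make the outlier effect visible; this is an illustration, not the code from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a numeric column with one outlier, and a categorical column.
df = pd.DataFrame({
    "area": [50.0, 60.0, np.nan, 55.0, 1000.0],
    "quality": ["TA", "Gd", "TA", np.nan, "TA"],
})

# Numeric: the median is barely affected by the 1000.0 outlier, while the mean is pulled up.
area_median = df["area"].median()
area_mean = df["area"].mean()
df["area_imputed"] = df["area"].fillna(area_median)

# Categorical: mean/median are undefined for labels, so fall back to the mode.
quality_mode = df["quality"].mode(dropna=True).iloc[0]
df["quality_imputed"] = df["quality"].fillna(quality_mode)

print(area_median, area_mean, quality_mode)
```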

Most‑frequent (mode) imputation

When to use:
- Missing proportion is low.
- One category clearly dominates the column.
Pros:
- Simple to implement and scale.
- Works well when missing values are few and the dominant category reasonably represents them.
Cons:
- Inflates the dominant category and changes the distribution.
- Can bias the model.
- Performs poorly if categories are balanced or missingness is large.
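
To make the "inflates the dominant category" point concrete, here is a small sketch on a made-up column with fairly balanced categories and 50% missing values. The labels and counts are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 3 "Gd", 2 "TA", and 5 missing values.
s = pd.Series(["Gd"] * 3 + ["TA"] * 2 + [np.nan] * 5, name="fireplace_quality")

# Category shares among the observed (non-missing) values.
before = s.value_counts(normalize=True)

# Mode imputation assigns every missing row to the most frequent label.
mode_label = s.mode().iloc[0]
after = s.fillna(mode_label).value_counts(normalize=True)

print(before.round(2).to_dict())
print(after.round(2).to_dict())
```

In this toy case the dominant label grows from 60% of observed values to 80% of all rows, which is the kind of distortion the cons above describe.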

Missing‑category imputation

What it is:
- Create an explicit new category (for example, “Missing”) and fill NA values with it.
When to use:
- Missingness is large.
- Missingness is likely informative (not missing at random).
Pros:
- Signals to the model that the value was missing.
- Simple to implement.
Cons:
- Adds a synthetic category that the model treats as a real category.
- May not always be meaningful and can still produce mediocre results depending on context.
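
The first con is easy to see after encoding. In this sketch (hypothetical data; one-hot encoding via pandas is an assumption, since the source does not say which encoder was used), the “Missing” label ends up as an ordinary feature column alongside the real categories.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
s = pd.Series(["Gd", np.nan, "TA", np.nan], name="fireplace_quality")

# Fill missing entries with an explicit label, then one-hot encode.
filled = s.fillna("Missing")
encoded = pd.get_dummies(filled, prefix="fq")

# "fq_Missing" is now a feature column like any other category.
print(encoded.columns.tolist())
```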

Other approaches (brief)

Random value imputation.
Sentinel numeric values (e.g., 99 or −1) have historically been used for numeric features; for categorical data, textual labels like “Missing” are preferred.
More advanced multivariate or model‑based imputations exist and can be used for complicated cases (not covered here).
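
Random value imputation is only named above, so the following is one possible reading of it rather than the method from the video: each missing entry is replaced by a value drawn at random from the observed entries of the same column. The data and the fixed seed are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
s = pd.Series(["Gd", "TA", np.nan, "TA", np.nan, "Gd", "TA"], name="quality")

missing_mask = s.isna()

# Draw replacements (with replacement) from the observed values, so that the
# category proportions are roughly preserved on average.
draws = s.dropna().sample(n=int(missing_mask.sum()), replace=True, random_state=0)

s_random = s.copy()
s_random.loc[missing_mask] = draws.to_numpy()
```

As with the other imputers, the pool of observed values would normally come from the training data only.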

Step‑by‑step methods (implementation notes)

Mode (most frequent) imputation — general steps

Inspect category frequencies to confirm a dominant category and low missing proportion.
Fit the imputer on training data only, then transform training/validation/test sets.
Validate by comparing distributions (e.g., KDE plots) of the target for original vs imputed categories.

Example with scikit‑learn:

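from sklearn.impute import SimpleImputer

# Assumes X_train and X_valid are DataFrames and categorical_cols is a list of column names.
imputer = SimpleImputer(strategy='most_frequent')
# Learn the most frequent category of each column from the training data only:
imputer.fit(X_train[categorical_cols])
X_train[categorical_cols] = imputer.transform(X_train[categorical_cols])
# Apply the same fitted imputer to validation/test:
X_valid[categorical_cols] = imputer.transform(X_valid[categorical_cols])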

Notes:

Avoid mode imputation when categories are balanced or missing rate is high.
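
The first step (inspecting frequencies and the missing proportion) could look like the sketch below. The helper name `summarize_categorical` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def summarize_categorical(s: pd.Series) -> dict:
    """Return the missing rate and the share of the most frequent observed category."""
    shares = s.value_counts(normalize=True, dropna=True)
    return {
        "missing_rate": float(s.isna().mean()),
        "top_category": shares.index[0],
        "top_share": float(shares.iloc[0]),
    }

# Hypothetical column: one dominant label and a single missing value out of 20 rows.
garage = pd.Series(["Gd"] * 17 + ["TA"] * 2 + [np.nan], name="garage_quality")
summary = summarize_categorical(garage)
print(summary)
```

A low missing rate together with a high top share is the situation in which mode imputation is described as acceptable.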

Missing‑category imputation — general steps

Create a new label and fill NA values:
- pandas: X[col] = X[col].fillna('Missing') (fillna returns a new Series, so assign the result back)
- scikit‑learn: SimpleImputer(strategy='constant', fill_value='Missing')
Fit models normally; the model will learn the effect of the “Missing” category.
Preferable when missingness is informative or missing rate is high.
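
A minimal sketch of the scikit‑learn variant, fitted on training data and reused on validation data. The tiny DataFrames are made up, and `dtype=object` is used here only to keep the string columns in a plain object dtype.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/validation frames with missing categories.
X_train = pd.DataFrame({"fireplace_quality": ["Gd", np.nan, "TA", np.nan]}, dtype=object)
X_valid = pd.DataFrame({"fireplace_quality": [np.nan, "Gd"]}, dtype=object)
cols = ["fireplace_quality"]

# A constant imputer writes the same label into every missing cell.
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
X_train[cols] = imputer.fit_transform(X_train[cols])
X_valid[cols] = imputer.transform(X_valid[cols])

print(X_train["fireplace_quality"].tolist())
print(X_valid["fireplace_quality"].tolist())
```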

Evaluation / diagnostics

Before imputation: compare the target distribution (e.g., sale price KDEs) for rows where the categorical feature is present vs missing.
After imputation: compare the target distribution of the filled‑in category with that category's distribution before imputation to see how much the imputation altered it.
Use these diagnostics to decide which imputation method is appropriate for each column.
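
These diagnostics might be sketched as follows. The data is synthetic (rows with a missing category are given lower prices on purpose), pandas' built‑in KDE plotting is used instead of seaborn, and the `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data: rows with a missing category tend to have lower prices.
df = pd.DataFrame({
    "quality": ["Gd"] * 40 + ["TA"] * 30 + [np.nan] * 30,
    "price": np.concatenate([
        rng.normal(200, 20, 40),
        rng.normal(170, 20, 30),
        rng.normal(130, 20, 30),
    ]),
})

mode_label = df["quality"].mode().iloc[0]
imputed = df["quality"].fillna(mode_label)

# KDE of the target for the mode category before/after imputation, plus the missing rows.
fig, ax = plt.subplots()
df.loc[df["quality"] == mode_label, "price"].plot.kde(ax=ax, label=f"{mode_label} (original)")
df.loc[imputed == mode_label, "price"].plot.kde(ax=ax, label=f"{mode_label} (after mode imputation)")
df.loc[df["quality"].isna(), "price"].plot.kde(ax=ax, label="missing rows")
ax.set_xlabel("price")
ax.legend()

# Numeric companion to the plot: how far did imputation move the category's mean target?
mean_before = df.loc[df["quality"] == mode_label, "price"].mean()
mean_after = df.loc[imputed == mode_label, "price"].mean()
```

If the two curves for the mode category nearly overlap, mode imputation changed little; a visible shift, as in this synthetic case, suggests the missing rows differ from the dominant category.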

Demonstration summary (housing dataset example)

Data used: subset of a housing regression dataset (columns shown: fireplace quality, garage quality, sale price).

Garage quality
- Missing rate: low (~5%).
- One category dominated (mode = e.g., “GD”).
- Mode imputation produced little change in sale‑price distribution — acceptable here.
Fireplace quality
- Missing rate: high (~50%).
- Categories more balanced.
- Mode imputation over‑inflated a dominant category and changed the sale‑price distribution substantially — not appropriate.
- Recommendation: prefer “Missing” category or advanced imputation methods for this feature.
Code practices shown:
- Use SimpleImputer(strategy='most_frequent') for mode imputation.
- Use SimpleImputer(strategy='constant', fill_value='Missing') or the pandas fillna('Missing') method to create a missing category.
- Integrate imputers into ColumnTransformer / pipelines to ensure consistent transforms for train/test.
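
One way to wire both imputers into a single pipeline is sketched below. The data is made up, and `OneHotEncoder` and `LinearRegression` are stand‑ins chosen for this illustration; the source does not state which encoder or model was used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test data.
X_train = pd.DataFrame({
    "garage_quality": ["Gd", "Gd", np.nan, "TA", "Gd", "Gd"],
    "fireplace_quality": ["Gd", np.nan, "TA", np.nan, np.nan, "TA"],
}, dtype=object)
y_train = np.array([210.0, 180.0, 150.0, 140.0, 170.0, 190.0])
X_test = pd.DataFrame({
    "garage_quality": [np.nan, "TA"],
    "fireplace_quality": [np.nan, "Gd"],
}, dtype=object)

# Low missingness, dominant category: mode imputation.
mode_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
# High missingness: explicit "Missing" category.
missing_branch = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("mode", mode_branch, ["garage_quality"]),
    ("missing", missing_branch, ["fireplace_quality"]),
])
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])

# fit() learns the imputation values from the training data only;
# predict() reuses them on the test data.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping the imputers inside the pipeline means cross‑validation refits them on each training fold, which avoids leaking test‑set frequencies into the imputation.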

Best practice takeaways

Choose imputation per column — no single method fits all situations:
- Low missingness + dominant category → mode imputation is OK.
- High missingness or missing not at random → create a “Missing” category (or use advanced methods).
Always inspect frequencies and the target distributions before and after imputation.
Fit imputers on training data only; apply the fitted transformer to validation/test sets.
Consider multivariate or model‑based imputation for complex cases.
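
The per‑column rule of thumb above could be encoded as a small helper. The function name and both thresholds are hypothetical; the source gives only qualitative guidance ("low" vs "high" missingness), so the cut‑offs here are illustrative and should be tuned per dataset.

```python
import numpy as np
import pandas as pd

def choose_strategy(s: pd.Series, max_missing: float = 0.10, min_top_share: float = 0.60) -> str:
    """Suggest a SimpleImputer strategy; both thresholds are illustrative assumptions."""
    if s.notna().sum() == 0:
        return "constant"  # nothing observed, so there is no mode to use
    missing_rate = s.isna().mean()
    top_share = s.value_counts(normalize=True).iloc[0]
    if missing_rate <= max_missing and top_share >= min_top_share:
        return "most_frequent"
    return "constant"  # i.e. fill with an explicit "Missing" label

# Hypothetical columns mirroring the two cases discussed in the notes.
garage = pd.Series(["Gd"] * 17 + ["TA"] * 2 + [np.nan])
fireplace = pd.Series(["Gd"] * 3 + ["TA"] * 2 + [np.nan] * 5)

print(choose_strategy(garage), choose_strategy(fireplace))
```

A helper like this only gives a starting point; the target‑distribution checks remain the deciding diagnostic.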

Speakers / sources featured

Video presenter (unnamed — the channel’s author) presenting concepts and demonstration.
Data: well‑known housing regression dataset (Ames/Boston‑style).
Tools referenced: pandas, seaborn/matplotlib (KDE plots), scikit‑learn (SimpleImputer, ColumnTransformer, pipelines, train/test split).