Summary of "Handling Missing Categorical Data | Simple Imputer | Most Frequent Imputation | Missing Category Imp"
Handling missing categorical data
Two primary strategies:
1. Most‑frequent (mode) imputation: replace missing entries with the most common category.
2. Missing‑category imputation: create a new category (e.g., “Missing”) and assign it to missing entries.
Main ideas
- Use mode imputation when missingness is small and one category clearly dominates.
- Use a dedicated “Missing” category when missingness is large or missingness may be informative (not missing at random).
- Implementation examples are shown with scikit‑learn’s SimpleImputer and pandas, and distributions are compared before and after imputation (e.g., KDE plots of the target by category).
Detailed lessons, rules and rationale
Numerical vs categorical
- Numerical features: common imputations are mean or median (median often preferred when there are outliers or when missingness is small).
- Categorical features: mean/median do not apply — use mode (most frequent) or an explicit missing category.
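The split between numerical and categorical strategies can be sketched as follows. This is a minimal illustration on made-up data; the column names `area` and `quality` and all values are assumptions, not taken from the video's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values). The categorical column is kept as object
# dtype so that np.nan is the missing-value marker.
df = pd.DataFrame({
    "area": [50.0, 60.0, np.nan, 70.0, 400.0],  # numeric, with one outlier
    "quality": pd.Series(["TA", "TA", np.nan, "Gd", "TA"], dtype="object"),
})

# Numeric column: the median is not pulled upward by the 400.0 outlier.
num_imputer = SimpleImputer(strategy="median")
df[["area"]] = num_imputer.fit_transform(df[["area"]])

# Categorical column: mean/median are undefined for strings, so use the mode.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["quality"]] = cat_imputer.fit_transform(df[["quality"]])
```

Here the missing `area` is filled with 65.0 (the median of the four observed values) and the missing `quality` with "TA".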
Most‑frequent (mode) imputation
- When to use:
  - Missing proportion is low.
  - One category clearly dominates the column.
- Pros:
  - Simple to implement and scale.
  - Works well when missing values are few and the dominant category reasonably represents them.
- Cons:
  - Inflates the dominant category and changes the distribution.
  - Can bias the model.
  - Performs poorly if categories are balanced or missingness is large.
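The inflation effect listed under the cons can be seen with a small pandas sketch. The data below is invented for illustration: a fairly balanced column with 30% missing values, which is exactly the situation where mode imputation is discouraged.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 4 x "TA", 3 x "Gd", 3 missing.
s = pd.Series(["TA"] * 4 + ["Gd"] * 3 + [np.nan] * 3, dtype="object")

mode_value = s.mode().iloc[0]   # most frequent observed category (NaN is ignored)
imputed = s.fillna(mode_value)

# Category shares among observed values vs. after imputation.
before = s.value_counts(normalize=True)
after = imputed.value_counts(normalize=True)

print(before.to_dict())
print(after.to_dict())
```

In this toy case the share of "TA" rises from 4/7 of the observed values to 7/10 of all rows, even though nothing suggests the missing rows were actually "TA".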
Missing‑category imputation
- What it is:
  - Create an explicit new category (for example, “Missing”) and fill NA values with it.
- When to use:
  - Missingness is large.
  - Missingness is likely informative (not missing at random).
- Pros:
  - Signals to the model that the value was missing.
  - Simple to implement.
- Cons:
  - Adds a synthetic category that the model treats as a real category.
  - May not always be meaningful and can still produce mediocre results depending on context.
Other approaches (brief)
- Random value imputation.
- Sentinel numeric values (e.g., 99 or −1) historically used for numeric features — for categorical data, textual labels like “Missing” are preferred.
- More advanced multivariate or model‑based imputations exist and can be used for complicated cases (not covered here).
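Random value imputation is only named here, so the following is one common interpretation rather than the presenter's method: fill each missing entry with a value drawn at random from the observed entries of the same column. The helper name `random_sample_impute` and its `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def random_sample_impute(s: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries by sampling from observed values (hypothetical helper)."""
    out = s.copy()
    missing = out.isna()
    # Nothing to do if no value is missing, or if there is nothing to sample from.
    if missing.any() and not missing.all():
        rng = np.random.default_rng(seed)
        observed = out[~missing].to_numpy()
        out[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out

s = pd.Series(["a", "b", np.nan, "a", np.nan], dtype="object")
filled = random_sample_impute(s)
```

Because values are drawn in proportion to their observed frequency, this approach tends to preserve the category distribution better than mode imputation, at the cost of adding randomness to the data.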
Step‑by‑step methods (implementation notes)
Mode (most frequent) imputation — general steps
- Inspect category frequencies to confirm a dominant category and low missing proportion.
- Fit the imputer on training data only, then transform training/validation/test sets.
- Validate by comparing distributions (e.g., KDE plots) of the target for original vs imputed categories.
Example with scikit‑learn:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(X_train[categorical_cols])
X_train[categorical_cols] = imputer.transform(X_train[categorical_cols])
# apply same transform to validation/test:
X_valid[categorical_cols] = imputer.transform(X_valid[categorical_cols])
Notes:
- Avoid mode imputation when categories are balanced or missing rate is high.
Missing‑category imputation — general steps
- Create a new label and fill NA values:
  - pandas: X[col] = X[col].fillna('Missing') (fillna returns a new Series, so assign the result back)
  - scikit‑learn: SimpleImputer(strategy='constant', fill_value='Missing')
- Fit models normally; the model will learn the effect of the “Missing” category.
- Preferable when missingness is informative or missing rate is high.
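A minimal sketch of both routes on toy data, assuming a hypothetical column named `fireplace_quality` with invented values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame(
    {"fireplace_quality": pd.Series(["Gd", np.nan, "TA", np.nan], dtype="object")}
)
X_test = pd.DataFrame(
    {"fireplace_quality": pd.Series([np.nan, "Gd"], dtype="object")}
)

# pandas route: fillna returns a new Series; the original is left unchanged.
train_pd = X_train["fireplace_quality"].fillna("Missing")

# scikit-learn route: a fitted transformer that can be reused on new data.
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
train_sk = imputer.fit_transform(X_train[["fireplace_quality"]]).ravel()
test_sk = imputer.transform(X_test[["fireplace_quality"]]).ravel()
```

Both routes give the same filled values; the scikit‑learn version is mainly useful when the imputer needs to live inside a pipeline.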
Evaluation / diagnostics
- Before imputation: compare the target distribution (e.g., sale price KDEs) for rows where the categorical feature is present vs missing.
- After imputation: compare the target distribution for replaced values vs the original to see how much the imputation altered the distribution.
- Use these diagnostics to decide which imputation method is appropriate for each column.
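The diagnostics above are visual in the video (KDE plots). As a rough numeric stand-in, the sketch below compares medians of the target instead of full densities. The data and the column names `garage_quality` and `sale_price` are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "garage_quality": pd.Series(["TA", "TA", np.nan, "Gd", "TA", np.nan], dtype="object"),
    "sale_price": [200.0, 220.0, 100.0, 300.0, 210.0, 120.0],
})

# Before imputation: target summary where the feature is missing vs present.
status = df["garage_quality"].isna().map({True: "missing", False: "present"})
before = df.groupby(status)["sale_price"].median()

# After mode imputation: target summary for the mode category,
# original rows only vs original plus imputed rows.
mode_value = df["garage_quality"].mode().iloc[0]
original_rows = df.loc[df["garage_quality"] == mode_value, "sale_price"]
imputed_rows = df.loc[df["garage_quality"].fillna(mode_value) == mode_value, "sale_price"]

shift = imputed_rows.median() - original_rows.median()
print(before.to_dict(), shift)
```

A large gap between the "missing" and "present" groups hints that missingness is informative, and a large `shift` indicates that mode imputation distorts the target distribution for the dominant category. Plotting `original_rows` and `imputed_rows` with seaborn's `kdeplot` would give the visual version of the same comparison.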
Demonstration summary (housing dataset example)
Data used: subset of a housing regression dataset (columns shown: fireplace quality, garage quality, sale price).
- Garage quality
  - Missing rate: low (~5%).
  - One category dominated (mode = e.g., “GD”).
  - Mode imputation produced little change in sale‑price distribution, so it is acceptable here.
- Fireplace quality
  - Missing rate: high (~50%).
  - Categories more balanced.
  - Mode imputation over‑inflated a dominant category and changed the sale‑price distribution substantially, so it is not appropriate.
  - Recommendation: prefer “Missing” category or advanced imputation methods for this feature.
- Code practices shown:
  - Use SimpleImputer(strategy='most_frequent') for mode imputation.
  - Use SimpleImputer(strategy='constant', fill_value='Missing') or pandas fillna('Missing') to create a missing category.
  - Integrate imputers into ColumnTransformer / pipelines to ensure consistent transforms for train/test.
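One way to combine the two imputers in a ColumnTransformer is sketched below. The column names and values are hypothetical stand-ins for the garage and fireplace features, and the exact setup used in the video may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

X_train = pd.DataFrame({
    "garage_quality": pd.Series(["TA", "TA", np.nan, "Gd"], dtype="object"),
    "fireplace_quality": pd.Series(["Gd", np.nan, np.nan, "TA"], dtype="object"),
})
X_test = pd.DataFrame({
    "garage_quality": pd.Series([np.nan, "Gd"], dtype="object"),
    "fireplace_quality": pd.Series([np.nan, "Gd"], dtype="object"),
})

# Per-column strategy: mode for the low-missingness column,
# an explicit "Missing" label for the high-missingness one.
preprocess = ColumnTransformer([
    ("mode", SimpleImputer(strategy="most_frequent"), ["garage_quality"]),
    ("missing", SimpleImputer(strategy="constant", fill_value="Missing"), ["fireplace_quality"]),
])

train_out = preprocess.fit_transform(X_train)  # statistics learned from training data only
test_out = preprocess.transform(X_test)        # same fitted imputers reused on test data
```

The test row with both values missing comes out as "TA" and "Missing": the mode is the one learned from the training set, not recomputed on the test set.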
Best practice takeaways
- Choose imputation per column; no single method fits all situations:
  - Low missingness + dominant category → mode imputation is OK.
  - High missingness or missing not at random → create a “Missing” category (or use advanced methods).
- Always inspect frequencies and the target distributions before and after imputation.
- Fit imputers on training data only; apply the fitted transformer to validation/test sets.
- Consider multivariate or model‑based imputation for complex cases.
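The per-column rule of thumb can be turned into a small helper, sketched below. The function name `suggest_strategy` and both thresholds (`max_missing`, `min_mode_share`) are illustrative assumptions; the source gives no numeric cut-offs, so the result should be treated as a starting point for the diagnostics, not a decision.

```python
import numpy as np
import pandas as pd

def suggest_strategy(s: pd.Series, max_missing: float = 0.10,
                     min_mode_share: float = 0.60) -> str:
    """Suggest an imputation strategy for one categorical column.

    Both thresholds are hypothetical defaults, not values from the source.
    """
    missing_rate = s.isna().mean()
    observed = s.dropna()
    # Share of the most frequent category among observed values.
    mode_share = observed.value_counts(normalize=True).iloc[0] if len(observed) else 0.0
    if missing_rate <= max_missing and mode_share >= min_mode_share:
        return "most_frequent"
    return "missing_category"

# Toy columns loosely mirroring the two cases in the demonstration.
garage = pd.Series(["TA"] * 90 + ["Gd"] * 5 + [np.nan] * 5, dtype="object")
fireplace = pd.Series(["Gd"] * 25 + ["TA"] * 25 + [np.nan] * 50, dtype="object")

print(suggest_strategy(garage), suggest_strategy(fireplace))
```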
Speakers / sources featured
- Video presenter (unnamed — the channel’s author) presenting concepts and demonstration.
- Data: well‑known housing regression dataset (Ames/Boston‑style).
- Tools referenced: pandas, seaborn/matplotlib (KDE plots), scikit‑learn (SimpleImputer, ColumnTransformer, pipelines, train/test split).
Category
Educational