Summary of "Missing Indicator | Random Sample Imputation | Handling Missing Data Part 4"

Overview

This video (part 4 of a short series on handling missing data) covers three topics:

  1. Random-sample imputation (applies to numeric and categorical features)
  2. Missing-indicator features (binary flags for missingness)
  3. Automatic selection of imputation strategy using GridSearchCV + pipelines

Two example datasets are used to demonstrate the concepts:


1) Random-sample imputation (random value imputation)

Idea

For each missing value in a column, pick a value at random from the observed (non-missing) values in that same column and use it to fill the missing entry. Because the fill values come from the column itself, the approach works for both numerical and categorical features.
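A minimal sketch of the idea in pandas, not the code from the video. The helper name random_sample_impute, the seed argument, and the toy columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def random_sample_impute(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill missing entries with random draws from the column's observed values."""
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing]
    # Nothing to do if there are no gaps, or no observed values to draw from.
    if missing.any() and len(observed) > 0:
        rng = np.random.default_rng(seed)
        draws = rng.choice(observed.to_numpy(), size=int(missing.sum()), replace=True)
        filled.loc[missing] = draws
    return filled


# Toy data (hypothetical): one numeric and one categorical column with gaps.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "deck": ["C", None, "E", "C", None],
})

imputed = df.copy()
for col in imputed.columns:
    imputed[col] = random_sample_impute(df[col])

print(imputed)
```

With a train/test split, the pool of observed values would normally come from the training data only, so that test rows are filled from the same distribution the model was fitted on.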

Why use it

How to implement (practical steps)

Production / determinism tip

If values are sampled at prediction time, the same input can produce different outputs from one call to the next. To avoid this, derive a deterministic seed from the input (e.g., a hash of the row) so the same input with a missing value always maps to the same sampled value.
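One way this tip could look in code, as a sketch under stated assumptions: row_seed and impute_row are hypothetical helpers, pool stands in for the observed training values, and hashlib is used because Python's built-in hash() is salted per process for strings and would not be stable across restarts.

```python
import hashlib

import numpy as np
import pandas as pd


def row_seed(row: pd.Series) -> int:
    """Derive a stable integer seed from the row's contents (hypothetical helper)."""
    # Canonical text form of the row; NaN renders as 'nan', so equal rows hash equally.
    text = "|".join(f"{key}={value}" for key, value in row.items())
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def impute_row(row: pd.Series, column: str, pool: np.ndarray) -> pd.Series:
    """Fill row[column] from pool, using a seed that depends only on the row."""
    out = row.copy()
    if pd.isna(out[column]):
        rng = np.random.default_rng(row_seed(row))
        out[column] = rng.choice(pool)
    return out


pool = np.array([22.0, 35.0, 58.0])  # assumed: observed training values for "age"
request = pd.Series({"age": np.nan, "fare": 7.25, "sex": "male"})

first = impute_row(request, "age", pool)
second = impute_row(request, "age", pool)
print(first["age"], second["age"])  # the two values are identical
```

The seed depends on every field in the row, so two requests that differ in any other feature may receive different fill values, while repeated requests with identical content always agree.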

Advantages

Disadvantages / cautions


2) Missing-indicator (missingness flag)

Idea

For any feature with missing values, create a new binary column that marks whether the original value was missing (True/1) or not (False/0). Then impute the original column and keep the indicator as an additional feature.
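The sketch below shows the flag built by hand in pandas and, for comparison, via scikit-learn's SimpleImputer with add_indicator=True, which appends the indicator column to the imputed output. The toy Age values are made up, and mean imputation is used only as an example fill.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values).
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# Manual version: record missingness first, then impute the original column.
manual = df.copy()
manual["Age_missing"] = manual["Age"].isna().astype(int)
manual["Age"] = manual["Age"].fillna(manual["Age"].mean())

# scikit-learn version: column 0 is the imputed Age, column 1 is the indicator.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
auto = imputer.fit_transform(df[["Age"]])

print(manual)
print(auto)
```

The flag has to be created before the gaps are filled. Once the column is imputed, the information about which rows were missing is gone.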

Why it helps

How to implement (practical steps)

Example

On the Titanic dataset, adding an Age_missing flag, imputing Age with the mean, and then fitting logistic regression increased performance by ~2% in the toy example shown.

Caveats


3) Automatic selection of imputation strategy via GridSearchCV (pipeline + parameter search)

Goal

Let cross-validation determine the best imputation strategies (and other preprocessing choices) jointly with the model hyperparameters.

Approach (step-by-step)

  1. Build preprocessing pipelines for numerical and categorical columns:
    • Numeric pipeline example: imputer (strategy = mean / median / constant) → scaler (StandardScaler).
    • Categorical pipeline example: imputer (strategy = most_frequent / constant) → encoding or pass-through.
  2. Combine numeric and categorical pipelines with sklearn.compose.ColumnTransformer.
  3. Create a final Pipeline with the ColumnTransformer step followed by an estimator (e.g., LogisticRegression).
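  4. Define a parameter grid including imputation strategy choices and estimator hyperparameters. Keys follow scikit-learn's step__substep__parameter convention, so inside the final Pipeline they must start with the name given to the ColumnTransformer step (written here as preprocessor, an assumed name). Example:
    • preprocessor__numeric__imputer__strategy: ['mean', 'median']
    • preprocessor__categorical__imputer__strategy: ['most_frequent', 'constant']
    • logistic__C: [0.1, 1, 10] (optional)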
  5. Run GridSearchCV (or RandomizedSearchCV for larger spaces) on the pipeline with cross-validation.
  6. Fit on training data and inspect grid_search.best_params_.
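The steps above can be sketched as follows. This is an illustration on synthetic data, not the notebook from the video: the column names, the step names (preprocessor, numeric, categorical, imputer, logistic), and the use of OneHotEncoder are assumptions. Because the ColumnTransformer sits inside the final Pipeline, each grid key carries the preprocessor__ prefix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns) with gaps injected at random.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.exponential(30, n),
    "embarked": pd.Series(rng.choice(["S", "C", "Q"], n), dtype=object),
})
y = (X["fare"] + rng.normal(0, 10, n) > 30).astype(int)
X.loc[rng.random(n) < 0.2, "age"] = np.nan
X.loc[rng.random(n) < 0.1, "embarked"] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

# Steps 1-3: per-type pipelines, ColumnTransformer, final Pipeline.
numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("numeric", numeric, ["age", "fare"]),
    ("categorical", categorical, ["embarked"]),
])
model = Pipeline([
    ("preprocessor", preprocessor),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Step 4: imputation strategies and C are searched jointly.
param_grid = {
    "preprocessor__numeric__imputer__strategy": ["mean", "median"],
    "preprocessor__categorical__imputer__strategy": ["most_frequent", "constant"],
    "logistic__C": [0.1, 1, 10],
}

# Steps 5-6: cross-validated search on the training split only.
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))
```

Since the imputers live inside the pipeline, each cross-validation fold fits them on its own training portion, which keeps imputation statistics from leaking out of the validation fold.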

Notes & cautions


Practical tips and takeaways


Code / library references (informal)


Datasets used in demonstrations


Pros and cons summary


Speakers / sources
