Summary of "Missing Indicator | Random Sample Imputation | Handling Missing Data Part 4"
Overview
This video (part 4 of a short series on handling missing data) covers three topics:
- Random-sample imputation (applies to numeric and categorical features)
- Missing-indicator features (binary flags for missingness)
- Automatic selection of imputation strategy using GridSearchCV + pipelines
Two example datasets are used to demonstrate the concepts:
- Titanic dataset — Age, Fare, Survived (Age has ≈20% missing)
- House-prices dataset — GarageQual, FireplaceQual (some features have large missingness, e.g. ~50%)
1) Random-sample imputation (random value imputation)
Idea
For each missing value in a column, draw a value at random from the observed (non-missing) values of that same column and use it to fill the gap. This works for both numerical and categorical features.
Why use it
- Preserves the original marginal distribution of the feature much better than mean/median imputation (you sample from the observed values).
- Suits linear models in particular; the video favors random imputation more for linear algorithms.
How to implement (practical steps)
- Extract the observed values: df[col].dropna()
- Sample as many values as there are missing entries, drawing from the observed values rather than from the whole frame: df[col].dropna().sample(n=number_of_missing)
- Assign the sampled values into the missing positions: df.loc[df[col].isna(), col] = sampled_values.values
- Repeat consistently for train and test sets.
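The steps above can be sketched as a small helper. This is a minimal illustration, not the video's exact code: the function name random_sample_impute, the toy values, and the choice to draw both the train and test fills from the training pool (with replacement) are assumptions.

```python
import numpy as np
import pandas as pd


def random_sample_impute(train, test, col, seed=0):
    """Fill NaNs in `col` with values drawn from the observed training values."""
    train, test = train.copy(), test.copy()
    pool = train[col].dropna()  # observed (non-missing) training values
    for i, frame in enumerate((train, test)):
        mask = frame[col].isna()
        # replace=True lets the draw succeed even when the pool is small
        drawn = pool.sample(n=int(mask.sum()), replace=True, random_state=seed + i)
        # .to_numpy() drops the sampled index so the assignment is positional
        frame.loc[mask, col] = drawn.to_numpy()
    return train, test


# Toy data standing in for the Titanic Age column (values are made up)
train = pd.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan, 35.0]})
test = pd.DataFrame({"Age": [np.nan, 54.0, np.nan]})
train_imp, test_imp = random_sample_impute(train, test, "Age")
print(train_imp["Age"].tolist(), test_imp["Age"].tolist())
```

Copying the frames first keeps the original data untouched, which makes it easy to compare distributions before and after imputation.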
Production / determinism tip
If sampling at prediction time, you may get inconsistent outputs for the same input. To avoid this, derive a deterministic seed from the input (e.g., a hash of the row) so the same missing input always maps to the same sampled value.
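One way to get that determinism is sketched below. The helper names row_seed and impute_one, the use of SHA-256, and the toy pool are assumptions made for illustration; the video only describes the idea of seeding from the input.

```python
import hashlib

import numpy as np


def row_seed(row_values):
    """Derive a stable 32-bit seed from a row's contents."""
    # repr() + hashlib is stable across processes, unlike the built-in hash()
    key = repr(tuple(row_values)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def impute_one(row_values, pool):
    """Pick a fill value from `pool`, seeded by the row itself."""
    rng = np.random.default_rng(row_seed(row_values))
    return float(rng.choice(pool))


age_pool = np.array([22.0, 38.0, 26.0, 35.0, 54.0])  # observed training ages (made up)
row = ("male", 3, 7.25)  # the row's other features, with Age missing
first = impute_one(row, age_pool)
second = impute_one(row, age_pool)
print(first == second)  # the same row always receives the same fill value
```

The fill is still a random draw across different rows, but repeated predictions for one row no longer disagree.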
Advantages
- Maintains the feature distribution; histograms before/after look similar.
Disadvantages / cautions
- Weakens the relationships between that feature and other features, because the filled values are drawn independently of the rest of the row (random noise is injected into the dependency structure).
- May be inappropriate when a column has very high missingness (e.g., ~50%) — sampling can change category frequencies and introduce bias.
- Deployment requires access to the pool of observed training values; storing a large training set on the server can be memory-heavy.
- Not always ideal for tree-based models — trees may be less compatible with this randomization approach.
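The main advantage and the first caution can both be checked on synthetic data. The sketch below is an assumption-laden illustration (made-up features x and y, values missing completely at random), not a result from the video: it compares the variance of x after mean and random-sample imputation, and the correlation of x with a dependent feature y.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(50, 10, n)                  # synthetic feature to be imputed
y = 2 * x + rng.normal(0, 5, n)            # a second feature that depends on x
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.2, "x"] = np.nan  # roughly 20% missing at random

observed = df["x"].dropna()
mask = df["x"].isna()

# Mean imputation piles all fills onto one value and shrinks the variance
mean_filled = df["x"].fillna(observed.mean())

# Random-sample imputation draws the fills from the observed values
random_filled = df["x"].copy()
random_filled[mask] = observed.sample(
    n=int(mask.sum()), replace=True, random_state=0
).to_numpy()

print("variance:", observed.var(), mean_filled.var(), random_filled.var())
print("corr with y:", observed.corr(df["y"]), random_filled.corr(df["y"]))
```

The variance after random-sample imputation stays close to the observed variance, while the correlation with y drops because the filled values ignore y.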
2) Missing-indicator (missingness flag)
Idea
For any feature with missing values, create a new binary column that marks whether the original value was missing (True/1) or not (False/0). Then impute the original column and keep the indicator as an additional feature.
Why it helps
- The model can learn that missingness itself carries information; patterns of missingness may be predictive.
- In practice and competitions, adding missing indicators sometimes yields measurable performance gains.
How to implement (practical steps)
- For each column with missing data:
  - Create an indicator: df[col + '_missing'] = df[col].isna()
  - Impute the original column (e.g., SimpleImputer(strategy='mean')).
  - Train the model using both the imputed column(s) and the indicator column(s).
- In scikit-learn:
  - Use sklearn.impute.MissingIndicator to generate indicators during preprocessing.
  - SimpleImputer(add_indicator=True) appends the indicator columns directly; alternatively, use ColumnTransformer to add them explicitly.
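Both routes can be sketched on a toy frame. The values below are made up and only stand in for the Titanic columns; in real use the imputer would be fitted on the training split and then applied to the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for Titanic (values are made up)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
})

# Manual route: flag first, then impute
manual = df.copy()
manual["Age_missing"] = manual["Age"].isna().astype(int)
manual["Age"] = manual["Age"].fillna(manual["Age"].mean())

# scikit-learn route: add_indicator=True appends one flag column
# for each feature that had missing values during fit
imputer = SimpleImputer(strategy="mean", add_indicator=True)
transformed = imputer.fit_transform(df)  # columns: Age, Fare, flag for Age
print(transformed)
```

Fare has no missing values, so only Age receives an indicator column in the scikit-learn output.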
Example
On the Titanic dataset, adding an Age_missing flag, then imputing Age with the mean and fitting logistic regression increased performance by ~2% in the toy example shown.
Caveats
- Not guaranteed to help in all problems — try it when models stall.
- Use indicators consistently across train and test.
3) Automatic selection of imputation strategy via GridSearchCV (pipeline + parameter search)
Goal
Let cross-validation determine the best imputation strategies (and other preprocessing choices) jointly with the model hyperparameters.
Approach (step-by-step)
- Build preprocessing pipelines for numerical and categorical columns:
  - Numeric pipeline example: imputer (strategy = mean / median / constant) → scaler (StandardScaler).
  - Categorical pipeline example: imputer (strategy = most_frequent / constant) → encoding or pass-through.
- Combine numeric and categorical pipelines with sklearn.compose.ColumnTransformer.
- Create a final Pipeline with the ColumnTransformer step followed by an estimator (e.g., LogisticRegression).
- Define a parameter grid including imputation strategy choices and estimator hyperparameters. Example:
  - numeric__imputer__strategy: ['mean', 'median']
  - categorical__imputer__strategy: ['most_frequent', 'constant']
  - logistic__C: [0.1, 1, 10] (optional)
- Run GridSearchCV (or RandomizedSearchCV for larger spaces) on the pipeline with cross-validation.
- Fit on training data and inspect grid_search.best_params_.
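A minimal end-to-end sketch of this approach follows. The data is synthetic (not the real Titanic set), and the step names preprocess, numeric, categorical, imputer, and logistic are assumptions. Because the ColumnTransformer is itself a named pipeline step here, the grid keys carry its name as a prefix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for Titanic-like data (not the real dataset)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Age": rng.normal(30, 10, n),
    "Fare": rng.exponential(30, n),
    "Embarked": pd.Series(rng.choice(["S", "C", "Q"], n), dtype=object),
})
y = (X["Fare"] + rng.normal(0, 20, n) > 30).astype(int)
X.loc[rng.random(n) < 0.2, "Age"] = np.nan
X.loc[rng.random(n) < 0.1, "Embarked"] = np.nan

numeric = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
])
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("numeric", numeric, ["Age", "Fare"]),
    ("categorical", categorical, ["Embarked"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("logistic", LogisticRegression(max_iter=1000)),
])

# Keys follow <step>__<sub-step>__<parameter>
param_grid = {
    "preprocess__numeric__imputer__strategy": ["mean", "median"],
    "preprocess__categorical__imputer__strategy": ["most_frequent", "constant"],
    "logistic__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Because the imputers sit inside the pipeline, each cross-validation fold fits them on its own training portion only, so no statistics leak from the validation portion.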
Notes & cautions
- Grid search can be slow for many combinations; use RandomizedSearchCV or narrow the grid for large problems.
- Best strategies depend on the data and transforms, so results vary by dataset.
- Use pipelines and ColumnTransformer so the same preprocessing is applied in CV splits and at prediction time.
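For larger search spaces, the randomized variant might look like the sketch below. The synthetic data, the step names, the log-uniform range for C, and the idea of also searching over add_indicator are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with ~15% of entries knocked out
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(0, 0.5, 200) > 0).astype(int)
X[rng.random(X.shape) < 0.15] = np.nan

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("logistic", LogisticRegression(max_iter=1000)),
])
param_distributions = {
    "imputer__strategy": ["mean", "median"],
    "imputer__add_indicator": [False, True],  # also test missing-indicator flags
    "logistic__C": loguniform(1e-2, 1e2),     # sampled, not enumerated
}
# n_iter caps the number of sampled configurations regardless of space size
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=8, cv=5, random_state=0
)
search.fit(X, y)
print(search.best_params_)
```

The cost is controlled by n_iter rather than by the product of all option counts, which is what makes this variant practical when the grid grows.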
Practical tips and takeaways
- Random-sample imputation is easy to code with pandas and preserves marginal distributions, but be careful about relationship distortion and deployment reproducibility.
- For categorical columns with very high missingness, random sampling can substantially change category frequencies — consider alternatives.
- Adding missing-indicator features is a quick trick that can yield gains; try it if performance is stagnant.
- Use sklearn Pipelines + ColumnTransformer to keep preprocessing consistent and CV-compatible.
- Use GridSearchCV or RandomizedSearchCV to select imputation strategies jointly with model hyperparameters instead of guessing.
Code / library references (informal)
- pandas: df.sample, df.isna(), df.loc[...] assignment
- scikit-learn: sklearn.impute.MissingIndicator, sklearn.impute.SimpleImputer, sklearn.pipeline.Pipeline, sklearn.compose.ColumnTransformer, sklearn.model_selection.GridSearchCV, LogisticRegression, StandardScaler
Datasets used in demonstrations
- Titanic dataset — Age imputation + logistic regression
- House-prices dataset — GarageQual, FireplaceQual; demonstration of categorical missingness distortion
Pros and cons summary
- Random-sample imputation
  - Pros: preserves marginal distribution; simple
  - Cons: breaks feature relationships; requires training values at deployment; can distort category frequencies with heavy missingness; potentially less suitable for tree-based models
- Missing-indicator
  - Pros: lets model exploit informative missingness; easy to add; can improve performance
  - Cons: not always helpful; increases feature count
- Grid-search for imputation
  - Pros: finds best imputation strategies automatically in a CV-aware way; tests strategies jointly with model settings
  - Cons: can be computationally expensive
Speakers / sources
- Single instructor / YouTuber presenting explanations and coding examples
- Datasets used as examples: Titanic and House-prices (Kaggle)
Category
Educational