Summary of "Handling missing data | Numerical Data | Simple Imputer"
Purpose and scope
The video explains methods to handle missing numerical data (univariate and multivariate imputation) with practical coding examples using the Titanic dataset. It focuses on simple imputation techniques for numerical columns and how to apply them in pandas and scikit-learn (SimpleImputer, ColumnTransformer, Pipeline). Two more advanced topics (random-sample imputation and automatic selection of the best imputation technique) are deferred to another video.
Key concepts introduced
Univariate vs multivariate imputation
- Univariate: impute a missing value in a column using only information from that same column (e.g., mean/median of that column).
- Multivariate: use information from multiple columns to impute (not deeply covered in this video; mentioned as a different class of methods).
Simple imputation strategies for numerical data
- Mean imputation
- Median imputation
- Arbitrary (constant) imputation — replace missing values with a chosen constant
- End-of-distribution / extreme-value imputation — replace missing values with an extreme value taken from the tail of the distribution
- Random-sample imputation — mentioned and deferred to the next video
When to use mean vs median
- Mean: use when the column is approximately symmetric (e.g., roughly normal) and values are missing at random.
- Median: preferred if the distribution is skewed or contains outliers (median is more robust).
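One rough way to turn this rule into code is to look at sample skewness. The sketch below is illustrative only: the helper name `choose_center`, the threshold of 1.0, and the synthetic data are assumptions, not something shown in the video.

```python
import numpy as np
import pandas as pd

def choose_center(series: pd.Series, skew_threshold: float = 1.0) -> str:
    """Suggest 'mean' or 'median' from sample skewness (a heuristic, not a rule)."""
    skew = series.dropna().skew()
    return "median" if abs(skew) > skew_threshold else "mean"

rng = np.random.default_rng(0)
symmetric = pd.Series(rng.normal(30, 10, 1000))   # roughly normal column
skewed = pd.Series(rng.lognormal(3, 1, 1000))     # long right tail

print(choose_center(symmetric))
print(choose_center(skewed))
```

A histogram or boxplot is still worth checking by eye; a single skewness number can hide outliers in an otherwise symmetric column.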
Rationale for arbitrary / end-of-distribution imputation
- Use when you want to explicitly mark missingness so the model can detect “missingness as signal” (especially for missing-not-at-random).
- End-of-distribution: pick a value from the tail (e.g., mean ± 3⋅std or beyond typical outlier thresholds) so it stands apart from normal values.
Practical cautions
- Simple imputations are easy but change the distribution, reduce variance, and can alter relationships (correlations/covariances) with other features.
- If a column has many missing values (more than roughly 5–10%), mean/median imputation distorts the column enough that it becomes less reliable for production.
- Always fit imputation parameters on the training set only and apply the same transform to validation/test sets.
Practical implementation steps
Preparatory steps
- Inspect the dataset and quantify missingness per column (percent missing).
- Decide whether missingness is roughly MCAR (missing completely at random), MAR (missing at random), or MNAR (missing not at random). This affects choice of strategy.
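Percent missing per column can be computed in one line with pandas. The small DataFrame below is made-up data with Titanic-like column names, used only so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical data; in practice df would be the loaded dataset.
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0, 14.0, 4.0],
    "fare": [7.25, 71.28, 7.92, np.nan, 8.05, 51.86, 21.07, 11.13, 30.07, 16.70],
    "family": [1, 1, 0, 1, 0, 0, 4, 2, 1, 2],
})

# isnull() gives a boolean frame; the column mean of booleans is the missing fraction.
missing_pct = df.isnull().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```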
Mean/median imputation (pandas / scikit-learn)
Compute replacement values on training data:
mean_val = train[col].mean()
median_val = train[col].median()
Pandas replacement (quick/simple):
df[col] = df[col].fillna(mean_val)  # fillna returns a new Series, so assign it back
# or, for the median: df[col] = df[col].fillna(median_val)
scikit-learn (recommended for pipelines/production):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean') # or 'median'
imputer.fit(X_train[[col]])
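X_train[col] = imputer.transform(X_train[[col]]).ravel()  # transform returns a 2-D array
# reuse the fitted imputer on the test set; do not refit
X_test[col] = imputer.transform(X_test[[col]]).ravel()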
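As a self-contained sketch of the same steps, the example below uses a tiny made-up `age` column (not the Titanic data) to show that the test set is filled with the mean learned from the training set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for a train/test split.
X_train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
X_test = pd.DataFrame({"age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train[["age"]])            # learns the mean from training rows only

X_train["age"] = imputer.transform(X_train[["age"]]).ravel()
X_test["age"] = imputer.transform(X_test[["age"]]).ravel()

print(imputer.statistics_)               # learned fill value(s), one per column
print(X_test["age"].tolist())            # [30.0, 50.0]
```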
Applying different imputations per column (ColumnTransformer)
Define separate imputers and combine with ColumnTransformer:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
num_imputer_mean = SimpleImputer(strategy='mean')
num_imputer_median = SimpleImputer(strategy='median')
ct = ColumnTransformer([
    ('mean_imp', num_imputer_mean, [colA]),      # colA, colB are placeholder column names
    ('median_imp', num_imputer_median, [colB])
], remainder='passthrough')                      # leave all other columns unchanged
ct.fit(X_train)
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)
Put the ColumnTransformer into a Pipeline together with an estimator for production.
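A minimal sketch of such a pipeline follows. The synthetic data, the column names, the 150/50 split, and the choice of `LogisticRegression` as the estimator are all assumptions made for illustration; the video's own model is not specified here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for numeric Titanic-style features.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.normal(30, 10, n),
    "fare": rng.lognormal(3, 1, n),
    "family": rng.integers(0, 5, n).astype(float),
})
y = (X["fare"] > X["fare"].median()).astype(int)   # made-up binary target

# Remove roughly 5% of values from two columns.
X.loc[rng.random(n) < 0.05, "age"] = np.nan
X.loc[rng.random(n) < 0.05, "fare"] = np.nan

X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train, y_test = y.iloc[:150], y.iloc[150:]

preprocess = ColumnTransformer([
    ("mean_imp", SimpleImputer(strategy="mean"), ["age"]),
    ("median_imp", SimpleImputer(strategy="median"), ["fare"]),
], remainder="passthrough")

model = Pipeline([
    ("impute", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X_train, y_train)     # imputers are fitted on training data only
preds = model.predict(X_test)   # the same fitted imputers are applied to test data
print(preds[:10])
```

Because the imputers live inside the pipeline, calling `fit` on training data and `predict` on new data enforces the train-only fitting rule automatically.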
Arbitrary / constant imputation
Use SimpleImputer with a constant value:
SimpleImputer(strategy='constant', fill_value=some_value)
Choose fill_value distinct from normal range if you want to flag missingness.
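A short sketch of constant imputation, assuming an age-like column where -1 cannot occur naturally; the data and the fill value are illustrative choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[22.0], [np.nan], [35.0], [np.nan]])   # toy single-column data

# -1.0 lies outside the valid range of ages, so it flags the missing rows.
imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
X_filled = imputer.fit_transform(X)

print(X_filled.ravel().tolist())   # [22.0, -1.0, 35.0, -1.0]
```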
End-of-distribution (extreme-value) imputation
Choose an extreme value from the column tail:
- Option A (approx. normal): replacement ≈ mean ± 3 * standard_deviation
- Option B (skewed data): use IQR-based fences (Q1 − 1.5⋅IQR and Q3 + 1.5⋅IQR) and pick a value beyond these fences
Replace missing values with that extreme value so they appear as distinct outliers to the model.
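Both options can be sketched with plain pandas. The helper `tail_value`, its `factor` parameter, and the toy series are hypothetical; the sketch uses the upper tail only and returns the fence itself for the IQR case, so a larger `factor` is needed to go strictly beyond it.

```python
import numpy as np
import pandas as pd

def tail_value(train_col: pd.Series, method: str = "normal", factor: float = 1.5) -> float:
    """Return an upper-tail replacement value computed from training data only."""
    if method == "normal":
        # Option A: mean + 3 standard deviations
        return float(train_col.mean() + 3 * train_col.std())
    # Option B: upper IQR fence, Q3 + factor * IQR
    q1, q3 = train_col.quantile(0.25), train_col.quantile(0.75)
    return float(q3 + factor * (q3 - q1))

train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])   # toy training column

normal_val = tail_value(train, "normal")
iqr_val = tail_value(train, "iqr")
filled = train.fillna(normal_val)

print(normal_val, iqr_val)
print(filled.tolist())
```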
Diagnostics / checks after imputation (always run)
- Compare distributions before vs after imputation (overlay histograms or density plots). If they diverge drastically, re-evaluate the strategy.
- Check variance changes: mean/median imputation reduces variance; constant or end-of-distribution imputation with a value far from the bulk of the data typically inflates it.
- Compare boxplots to see effects on ranges/outliers.
- Recompute correlations/covariances between the imputed column and other features to detect relation changes.
- If relations change significantly, try different imputation methods or multivariate approaches.
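The numeric checks can be sketched as below, on synthetic data in which `fare` is built to correlate with `age` and about 20% of `age` is removed; all names and numbers are assumptions for illustration. Plots are omitted to keep the sketch dependency-free, though `Series.plot(kind="kde")` on the before and after columns would cover the distribution comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
age = rng.normal(30, 10, n)
fare = 2 * age + rng.normal(0, 10, n)          # correlated with age by construction
df = pd.DataFrame({"age": age, "fare": fare})

# Remove about 20% of age values at random.
df.loc[rng.random(n) < 0.2, "age"] = np.nan

mean_filled = df["age"].fillna(df["age"].mean())

# pandas skips NaN in var() and corr(), so "before" uses observed rows only.
var_before, var_after = df["age"].var(), mean_filled.var()
corr_before = df["age"].corr(df["fare"])
corr_after = mean_filled.corr(df["fare"])

print(f"variance:    {var_before:.2f} -> {var_after:.2f}")
print(f"correlation: {corr_before:.3f} -> {corr_after:.3f}")
```

Both numbers shrink after mean imputation, since every filled row sits exactly at the mean and contributes nothing to the spread or to the covariance with `fare`.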
Best practices emphasized
- Fit imputers on training data only; then transform validation/test sets.
- Use scikit-learn components (SimpleImputer, ColumnTransformer, Pipeline) for reproducibility and deployment.
- Prefer mean/median for small amounts of MCAR/MAR missingness and when distributional distortion is acceptable.
- Use arbitrary/end-of-distribution when missingness itself is informative (MNAR) and you want to flag it.
- If missingness proportion is large, consider more sophisticated methods (multivariate imputation, model-based imputation).
- Visualize and quantify the effect of imputation on distributional shape, variance, and correlations before finalizing.
Formulas / rules discussed
- Extreme normal-based choice:
- replacement ≈ mean ± 3 * standard_deviation
- IQR rule for outliers:
- lower fence = Q1 − 1.5⋅IQR
- upper fence = Q3 + 1.5⋅IQR
- choose a value beyond these fences to mark missingness
What the speaker demonstrated
- Example dataset: Titanic-derived numeric features (age, fare, family size, etc.).
- Intentionally removed ~5% of values to demonstrate imputation.
- Showed both pandas-style replacement and scikit-learn SimpleImputer + ColumnTransformer approaches.
- Compared resulting statistics (mean, variance), distribution plots, and correlations before/after imputation.
- Showed how to apply different strategies to different columns using ColumnTransformer.
- Discussed indicator features (briefly) to record presence/absence of original missingness.
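For the indicator features mentioned in the last point, one option in scikit-learn is the `add_indicator` parameter of `SimpleImputer`, which appends a 0/1 column for each feature that had missing values during fitting. Whether the video used this parameter or a separate `MissingIndicator` is not stated, so treat this as a sketch on toy data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[20.0], [np.nan], [40.0]])   # toy single-column data

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)

# Column 0: imputed values; column 1: 1.0 where the original value was missing.
print(X_out.tolist())   # [[20.0, 0.0], [30.0, 1.0], [40.0, 0.0]]
```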
Limitations and pitfalls
- Mean/median imputation changes the shape of the distribution, reduces variance, and can alter correlations, which may degrade model performance in production if many values are missing.
- Arbitrary / constant imputation also changes distribution and covariance structure; use only when marking missingness is desired.
- Choosing an arbitrary extreme value requires care — picking an unrealistic number may harm model behavior unless marking missingness is the intent.
- Simple techniques are easy and fast but less reliable for high missingness rates; consider multivariate or model-based imputation then.
Topics deferred to next video(s)
- Random-sample imputation (draw random observed values to fill missing entries).
- How to automatically select the best imputation technique (model selection among imputers).
Speakers / sources featured
- Primary speaker: the YouTuber presenting the lesson (unnamed in transcript).
- Mentioned name in subtitles: Vijay Arora (context unclear).
- Tools / libraries referenced: scikit-learn (SimpleImputer, ColumnTransformer, Pipeline) and pandas.
- Example dataset: Titanic dataset.