Summary of "Handling missing data | Numerical Data | Simple Imputer"

Purpose and scope

The video explains methods to handle missing numerical data (univariate and multivariate imputation) with practical coding examples using the Titanic dataset. It focuses on simple imputation techniques for numerical columns and how to apply them in pandas and scikit-learn (SimpleImputer, ColumnTransformer, Pipeline). Two more advanced topics (random-sample imputation and automatic selection of the best imputation technique) are deferred to another video.


Key concepts introduced

Univariate vs multivariate imputation

Simple imputation strategies for numerical data

When to use mean vs median

Rationale for arbitrary / end-of-distribution imputation

Practical cautions


Practical implementation steps

Preparatory steps

Mean/median imputation (pandas / scikit-learn)

Compute replacement values on training data:

mean_val = train[col].mean()
median_val = train[col].median()

Pandas replacement (quick/simple):

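df[col] = df[col].fillna(mean_val)    # fillna returns a new Series, so assign it back
df[col] = df[col].fillna(median_val)  # alternative: use the median instead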
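As a minimal sketch of the pandas approach, the example below computes the mean and median on a training split and reuses them on a test split. The tiny "Age" column and its values are made up for illustration (they are not the Titanic data from the video), and the `_mean` / `_median` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a numerical column such as "Age".
train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 80.0]})
test = pd.DataFrame({"Age": [np.nan, 54.0, 2.0]})

# Statistics come from the training split only; pandas skips NaN by default.
mean_val = train["Age"].mean()      # (22 + 38 + 35 + 80) / 4 = 43.75
median_val = train["Age"].median()  # midpoint of 35 and 38 = 36.5

# fillna returns a new Series, so the result is stored in new columns here.
train["Age_mean"] = train["Age"].fillna(mean_val)
train["Age_median"] = train["Age"].fillna(median_val)
test["Age_mean"] = test["Age"].fillna(mean_val)
test["Age_median"] = test["Age"].fillna(median_val)
```

The single large value (80) pulls the mean well above the median, which is the usual reason to prefer the median for skewed columns.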

scikit-learn (recommended for pipelines/production):

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')  # or 'median'
imputer.fit(X_train[[col]])
X_train[col] = imputer.transform(X_train[[col]]).ravel()  # transform returns a 2D array
X_test[col] = imputer.transform(X_test[[col]]).ravel()    # reuse the imputer fitted on the training data
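The fit-on-train, transform-both pattern can be sketched end to end as follows. The "Fare" values are invented for illustration; the learned fill value is read from the imputer's `statistics_` attribute.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test splits with one numerical column.
X_train = pd.DataFrame({"Fare": [7.25, 71.28, np.nan, 8.05]})
X_test = pd.DataFrame({"Fare": [np.nan, 13.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train[["Fare"]])  # learns the median from the training data only

# One learned value per fitted column; here the median of 7.25, 8.05, 71.28.
learned = imputer.statistics_[0]

# transform returns a 2D array, so flatten it before assigning to one column.
X_train["Fare"] = np.ravel(imputer.transform(X_train[["Fare"]]))
X_test["Fare"] = np.ravel(imputer.transform(X_test[["Fare"]]))
```

Because the test split is filled with the value learned from the training split, no information from the test data leaks into preprocessing.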

Applying different imputations per column (ColumnTransformer)

Define separate imputers and combine with ColumnTransformer:

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

num_imputer_mean = SimpleImputer(strategy='mean')
num_imputer_median = SimpleImputer(strategy='median')

ct = ColumnTransformer([
    ('mean_imp', num_imputer_mean, [colA]),
    ('median_imp', num_imputer_median, [colB])
], remainder='passthrough')

ct.fit(X_train)
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)
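A runnable version of the snippet above, assuming a toy frame with hypothetical columns "Age", "Fare" and "SibSp" in place of colA and colB, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up data; column names are placeholders for colA / colB.
X_train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
    "SibSp": [0, 1, 2],
})
X_test = pd.DataFrame({"Age": [np.nan], "Fare": [np.nan], "SibSp": [1]})

ct = ColumnTransformer([
    ("mean_imp", SimpleImputer(strategy="mean"), ["Age"]),
    ("median_imp", SimpleImputer(strategy="median"), ["Fare"]),
], remainder="passthrough")  # untouched columns are appended after the imputed ones

X_train_trans = ct.fit_transform(X_train)  # fit on the training split only
X_test_trans = ct.transform(X_test)        # reuse the learned mean and median
```

By default the result is a NumPy array rather than a DataFrame, with the transformed columns first (in the order the transformers are listed) followed by the passthrough columns, so column names need to be tracked separately.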

Put the ColumnTransformer into a Pipeline together with an estimator for production.
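One way this could look is sketched below. The choice of LogisticRegression as the estimator, the step names, and the toy data are assumptions made for illustration; the summary does not say which estimator the video uses.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up features and binary target.
X_train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0],
    "Fare": [7.25, 71.28, np.nan, 53.1, 8.05, 51.86],
})
y_train = pd.Series([0, 1, 1, 1, 0, 0])
X_test = pd.DataFrame({"Age": [np.nan, 30.0], "Fare": [15.0, np.nan]})

preprocess = ColumnTransformer([
    ("mean_imp", SimpleImputer(strategy="mean"), ["Age"]),
    ("median_imp", SimpleImputer(strategy="median"), ["Fare"]),
])

# The pipeline fits the imputers and the estimator together on training data,
# then applies the same fitted imputers automatically at prediction time.
model = Pipeline([
    ("impute", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),  # estimator is an assumption
])
model.fit(X_train, y_train)
preds = model.predict(X_test)
```

Bundling the steps this way means a single fitted object can be saved and reused, and missing values in new data are handled without any manual preprocessing.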

Arbitrary / constant imputation

Use SimpleImputer with a constant value:

SimpleImputer(strategy='constant', fill_value=some_value)

Choose a fill_value outside the column's normal range if you want missing entries to stand out as a flag for missingness.
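A minimal sketch of constant imputation, assuming -1.0 as the sentinel (a value that cannot occur in a made-up non-negative "Age" column):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column; ages are never negative, so -1.0 cannot be a real value.
X_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0]})

constant_imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
filled = constant_imputer.fit_transform(X_train[["Age"]])  # 2D array, one column
```

The sentinel here is only an illustrative choice; what counts as "outside the normal range" depends on the column.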

End-of-distribution (extreme-value) imputation

Choose an extreme value from the tail of the column's distribution, then replace missing values with it so they appear as distinct outliers to the model.
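The summary does not state how the extreme value is chosen. The sketch below assumes two commonly used conventions, which may differ from what the video shows: mean + 3 standard deviations for roughly normal columns, and Q3 + 1.5 × IQR for skewed ones. The data is made up.

```python
import numpy as np
import pandas as pd

# Made-up training and test columns.
train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0, 50.0]})
test = pd.DataFrame({"Age": [np.nan, 25.0]})

# Assumed convention 1: roughly normal column, use mean + 3 * std.
mean, std = train["Age"].mean(), train["Age"].std()
upper_gaussian = mean + 3 * std

# Assumed convention 2: skewed column, use Q3 + 1.5 * IQR.
q1, q3 = train["Age"].quantile(0.25), train["Age"].quantile(0.75)
upper_iqr = q3 + 1.5 * (q3 - q1)

# Fill both splits with the value derived from the training split.
train_filled = train["Age"].fillna(upper_gaussian)
test_filled = test["Age"].fillna(upper_gaussian)
```

Either value sits beyond the observed training data, so imputed rows stay distinguishable from genuine observations.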


Diagnostics / checks after imputation (always run)
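As one possible form such checks could take, the sketch below compares the variance of a column, and its correlation with another column, before and after mean imputation. The data is made up, and for brevity the mean is computed on the whole frame rather than on a separate training split.

```python
import numpy as np
import pandas as pd

# Made-up data with three missing "Age" values out of eight rows.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13],
})
df["Age_mean"] = df["Age"].fillna(df["Age"].mean())

# Variance: filling with the mean adds points at the center, so variance shrinks.
var_before = df["Age"].var()
var_after = df["Age_mean"].var()

# Correlation with another column can shift after imputation as well.
corr_before = df["Age"].corr(df["Fare"])
corr_after = df["Age_mean"].corr(df["Fare"])

print(f"variance: {var_before:.2f} -> {var_after:.2f}")
print(f"corr with Fare: {corr_before:.3f} -> {corr_after:.3f}")
```

A large drop in variance or a marked change in correlation suggests the imputation is distorting the column; comparing distribution plots before and after gives the same information visually.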


Best practices emphasized


Formulas / rules discussed


What the speaker demonstrated


Limitations and pitfalls


Topics deferred to next video(s)


Speakers / sources featured


