Summary of "Handling Missing Data | Part 1 | Complete Case Analysis"
Overview
This video is the first in a planned 5‑video mini‑series on handling missing data and feature engineering. It introduces two broad strategies for dealing with missing values:
- Remove rows with missing values (Complete Case Analysis, CCA, also called listwise deletion).
- Impute missing values — either univariate (one column at a time) or multivariate (use other variables / ML models).
This episode focuses entirely on Complete Case Analysis (CCA): what it is, when to use it, how to apply it, pros and cons, and a short practical demo on a real dataset.
Key concepts and lessons
What is Complete Case Analysis (CCA)
- Also called listwise deletion.
- Drop any observation (row) that has a missing value in any of the selected columns. After CCA you work only on rows that are “complete” (no missing entries in the chosen fields).
Two imputation families (overview; covered in later videos)
- Univariate imputation: fill one column at a time (e.g., mean/median/mode, random value, end‑of‑distribution). (Will be demonstrated with SimpleImputer.)
- Multivariate imputation: fill values using other variables or ML models (iterative methods, model‑based imputers). (Will be covered later; classes/tools will be shown.)
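To make the two families concrete, here is a minimal sketch using scikit-learn's SimpleImputer (univariate, mentioned in the video) and IterativeImputer (one example of a model-based multivariate imputer; the exact classes demoed later in the series are not confirmed here). The toy DataFrame is illustrative, not the video's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy frame with missing values (illustrative only)
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                   "b": [10.0, 20.0, np.nan, 40.0]})

# Univariate: fill each column from its own statistics (here, the median)
uni = SimpleImputer(strategy="median").fit_transform(df)

# Multivariate: model each column from the others, iterating until stable
multi = IterativeImputer(random_state=0).fit_transform(df)
```

The univariate imputer never looks across columns, while the iterative imputer regresses each column on the rest, which is why it can exploit correlations between features.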
Missingness mechanism matters
CCA is appropriate when data are Missing Completely At Random (MCAR) — the missingness is random with respect to values and other variables. If missingness is not MCAR (e.g., Missing At Random (MAR) or Missing Not At Random (MNAR)), CCA can bias results and distort inferences.
Important caveat for production: if you train a model after dropping missing rows, the model never learned how to handle missing values. In production, if inputs contain missing values, the model may fail or behave poorly. This is a major reason many practitioners favor imputation methods over wholesale deletion.
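One common way to address this caveat is to put the imputation step inside the model pipeline, so that inference-time inputs with missing values are handled the same way as training data. A minimal sketch (the data and model choice are illustrative assumptions, not from the video):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative dataset with NaNs in the features
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan],
              [5.0, 6.0], [2.0, 1.0], [3.0, np.nan]])
y = np.array([0, 0, 1, 1, 0, 1])

# Imputation lives inside the pipeline, so inputs with NaNs
# still work at inference time instead of crashing the model
model = make_pipeline(SimpleImputer(strategy="median"),
                      LogisticRegression())
model.fit(X, y)
pred = model.predict(np.array([[np.nan, 2.5]]))  # missing value in production input
```

Had the NaN rows simply been dropped before training a bare LogisticRegression, the final `predict` call would raise an error.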
Heuristics / practical rules of thumb
- Prefer CCA only when the fraction of missing data in chosen columns is small. A common practical threshold is ~5% missing.
- If a column is almost entirely missing (e.g., ~98% missing), drop the column instead of trying to impute or using CCA.
- Always compare distributions before and after CCA to check whether CCA changed the data distribution (indicating non‑random missingness / bias).
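The ~5% rule of thumb above is easy to operationalize in pandas. A minimal sketch on synthetic data (the column names and missingness fraction are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=1000),
                   "y": rng.normal(size=1000)})
# Inject ~3% missingness into one column, MCAR-style
df.loc[df.sample(frac=0.03, random_state=0).index, "x"] = np.nan

# Missingness (%) per column
missing_pct = df.isna().mean() * 100

# Candidate columns for CCA under the ~5% rule of thumb
cca_cols = missing_pct[(missing_pct > 0) & (missing_pct < 5)].index.tolist()
```

Columns with no missing values need no handling, and columns far above the threshold would be candidates for imputation (or outright removal, in the ~98% case) rather than CCA.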
Detailed procedural checklist (how to evaluate and apply CCA)
- Inspect missingness:
- Compute missingness (%) per column.
- Identify candidate columns for CCA (those with small missing fraction).
- Choose which columns to include in the CCA (i.e., which columns’ missing entries will cause row deletion). Typically select only the columns with small missingness that you are comfortable dropping rows for.
- Apply CCA:
- Drop rows that have missing values in the chosen columns (for example, in pandas: DataFrame.dropna(subset=[...])).
- Quantify data loss:
- Compare the remaining row count to the original dataset size (report % retained / % removed).
- Compare distributions before vs after deletion:
- Numerical: plot histograms / density / PDFs before and after deletion and check for substantial changes.
- Categorical: compare category proportions before and after deletion; proportions should remain roughly stable.
- Decision rule:
- If distributions remain similar and missingness is plausibly MCAR, CCA is acceptable.
- If distributions shift significantly or missingness has a pattern, do not use CCA — prefer imputation (univariate or multivariate).
- Production consideration:
- If you plan to deploy the model in production, ensure the pipeline either prevents missing inputs or implements the same handling/imputation logic; do not rely solely on CCA during training unless inputs are guaranteed complete in production.
- If a column is extremely sparse (very high missing%), drop the column instead of using CCA or imputation.
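The checklist above can be sketched end to end on synthetic data. The column names, distributions, and thresholds below are illustrative assumptions; the comparisons mirror the before/after checks described in the checklist:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
df = pd.DataFrame({
    "hours": rng.exponential(50, n),                              # numerical feature
    "level": rng.choice(["BSc", "MSc", "PhD"], n, p=[0.6, 0.3, 0.1]),  # categorical
})
# Inject ~3% MCAR missingness into each column (different rows)
df.loc[df.sample(frac=0.03, random_state=1).index, "hours"] = np.nan
df.loc[df.sample(frac=0.03, random_state=2).index, "level"] = np.nan

# 1) Apply CCA on the chosen subset of columns
subset = ["hours", "level"]
complete = df.dropna(subset=subset)

# 2) Quantify data loss
retained = len(complete) / len(df) * 100

# 3) Numerical check: summary statistics should barely move under MCAR
hours_shift = abs(df["hours"].mean() - complete["hours"].mean())

# 4) Categorical check: category proportions should stay roughly stable
before = df["level"].value_counts(normalize=True)
after = complete["level"].value_counts(normalize=True)
max_prop_shift = (before - after).abs().max()
```

If `hours_shift` and `max_prop_shift` are small and `retained` is high, the decision rule says CCA is acceptable; large shifts would point toward imputation instead.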
Practical example from the video (job application dataset)
Dataset features mentioned
enrolment_id, city, city_development_index, gender, relevant_experience, enrolled_university, education_level, major_discipline, experience (years), company_size, company_type, training_hours, target (hired or not).
Observed missingness (approximate)
- gender: ~23% missing
- major_discipline, company_size, company_type: ~30% missing
- city_development_index, enrolled_university, education_level: ~2% missing
- experience: very low missingness
- training_hours: ~4% missing
Decision in the demo
- Select columns with <5% missingness for CCA: city_development_index, enrolled_university, education_level, experience, training_hours.
- Drop rows missing any of these selected columns (CCA), reducing dataset size.
- Compare histograms/density plots for numerical features (training_hours, city_development_index, experience) before and after the deletion — distributions largely overlap, indicating missingness is likely MCAR for these features.
- For categorical variables (enrolled_university and education_level), compare category proportion changes before and after; proportions stayed fairly stable, supporting use of CCA here.
- Conclusion from demo: CCA was reasonable for these chosen columns because missingness appeared random and the distribution did not change materially after deletion.
Warning: do not run CCA on columns with large missing fractions (e.g., gender, company_size, company_type in this dataset) because that would remove a large portion of observations.
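This warning is easy to quantify: with missingness levels like those in the demo (~23% and ~30%), dropping incomplete rows discards a large share of the data. A small simulation with illustrative synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 10_000
df = pd.DataFrame({"gender": rng.choice(["M", "F"], n),
                   "company_size": rng.choice(["S", "M", "L"], n)})
# Simulate the demo's missingness levels: ~23% and ~30%, on independent rows
df.loc[df.sample(frac=0.23, random_state=1).index, "gender"] = np.nan
df.loc[df.sample(frac=0.30, random_state=2).index, "company_size"] = np.nan

# Fraction of observations lost if CCA were run on these two columns
lost = 1 - len(df.dropna(subset=["gender", "company_size"])) / len(df)
```

With roughly independent missingness, only about 0.77 × 0.70 ≈ 54% of rows survive, so CCA on these columns would throw away nearly half the dataset.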
Advantages of Complete Case Analysis
- Very simple and fast to apply (just drop rows).
- No complex imputation procedure required.
- If missingness is MCAR, CCA preserves the original distribution and should not introduce bias.
Disadvantages and risks
- Can remove a large fraction of the dataset, losing statistical power.
- If missingness is not MCAR, CCA will bias the dataset and model results.
- Models trained after CCA never learn to handle missing inputs — a major problem for production where missing data may appear.
- Not suitable when missingness exceeds practical thresholds or when the missingness mechanism is informative.
Series plan (future videos)
- Part 2: Univariate/simple imputation methods (SimpleImputer and techniques like mean/median/mode/random/end‑of‑distribution).
- Part 3: Multivariate imputation — filling using ML algorithms (model‑based imputers, iterative imputation methods).
- Later videos: other imputation methods and use of missing indicator features.
- The speaker mentions specific imputer classes will be demoed (SimpleImputer and classes for multivariate imputation), though some class names in the subtitles are garbled.
Speakers / sources
- Single speaker: the YouTube channel host / instructor (unnamed in the subtitles).
Category
Educational