Summary of "Handling Missing Data | Part 1 | Complete Case Analysis"

Overview

This video is the first in a planned 5‑video mini‑series on handling missing data and feature engineering. It introduces two broad strategies for dealing with missing values: removing the affected rows (Complete Case Analysis) and imputing the missing values.

This episode focuses entirely on Complete Case Analysis (CCA): what it is, when to use it, how to apply it, pros and cons, and a short practical demo on a real dataset.

Key concepts and lessons

What is Complete Case Analysis (CCA)

Two imputation families (overview; covered in later videos)

Missingness mechanism matters

CCA is appropriate when data are Missing Completely At Random (MCAR), that is, when the chance of a value being missing depends neither on the value itself nor on any other variable. If the missingness is instead Missing At Random (MAR) or Missing Not At Random (MNAR), the rows that survive deletion are no longer a representative sample, so CCA can bias results and distort inferences.
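
The effect of the missingness mechanism can be illustrated with a small simulation. The sketch below is not from the video: the variable, the 20% missing rate, and the rule that makes larger values more likely to go missing are all assumptions chosen for illustration. It compares the mean of the surviving rows under an MCAR pattern and under a value-dependent (MNAR) pattern.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical numeric feature (made-up values, roughly normal around 50)
hours = pd.Series(rng.normal(loc=50, scale=10, size=n))

# MCAR: every value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.20

# MNAR: the chance of being missing depends on the value itself
# (assumed rule: values above 55 go missing 50% of the time, others 5%)
mnar_prob = np.where(hours > 55, 0.50, 0.05)
mnar_mask = rng.random(n) < mnar_prob

true_mean = hours.mean()
mcar_mean = hours[~mcar_mask].mean()  # mean after dropping MCAR-missing rows
mnar_mean = hours[~mnar_mask].mean()  # mean after dropping MNAR-missing rows

print(f"full data mean:      {true_mean:.2f}")
print(f"after CCA (MCAR):    {mcar_mean:.2f}")
print(f"after CCA (MNAR):    {mnar_mean:.2f}")
```

Under these assumptions the MCAR mean should stay close to the full-data mean, while the MNAR mean is pulled downward because large values are removed more often.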

Important caveat for production: if you train a model after dropping missing rows, the model never learned how to handle missing values. In production, if inputs contain missing values, the model may fail or behave poorly. This is a major reason many practitioners favor imputation methods over wholesale deletion.
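
One way to make this caveat concrete is an input guard in front of the model. The sketch below is an assumption, not something shown in the video: `split_complete` and `REQUIRED_COLS` are hypothetical names, and the column list simply mirrors columns that were handled by row deletion during training. It separates incoming rows that are complete from rows that would need rejection or imputation.

```python
import pandas as pd

# Hypothetical: columns whose missing rows were dropped during training
REQUIRED_COLS = ["training_hours", "education_level"]


def split_complete(batch: pd.DataFrame, required: list[str]):
    """Split a batch into (complete rows, rows with missing values in `required`)."""
    absent = [c for c in required if c not in batch.columns]
    if absent:
        raise KeyError(f"batch is missing required columns: {absent}")
    # True for any row with at least one missing value in the required columns
    incomplete = batch[required].isna().any(axis=1)
    return batch[~incomplete], batch[incomplete]


# Made-up incoming batch for illustration
batch = pd.DataFrame({
    "training_hours": [10.0, None, 42.0],
    "education_level": ["Graduate", "Masters", None],
})
ok_rows, rejected_rows = split_complete(batch, REQUIRED_COLS)
print(len(ok_rows), "row(s) safe to score;", len(rejected_rows), "row(s) need handling")
```

Only the complete rows would be passed to a model trained after CCA; the rest need an explicit policy such as rejection or imputation.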

Heuristics / practical rules of thumb

Detailed procedural checklist (how to evaluate and apply CCA)

  1. Inspect missingness:
    • Compute missingness (%) per column.
    • Identify candidate columns for CCA (those with small missing fraction).
  2. Choose which columns to include in the CCA (i.e., which columns’ missing entries will cause row deletion). Typically select only the columns with small missingness that you are comfortable dropping rows for.
  3. Apply CCA:
    • Drop rows that have missing values in the chosen columns (for example, in Pandas: DataFrame.dropna(subset=[...])).
  4. Quantify data loss:
    • Compare the remaining row count to the original dataset size (report % retained / % removed).
  5. Compare distributions before vs after deletion:
    • Numerical: plot histograms / density / PDFs before and after deletion and check for substantial changes.
    • Categorical: compare category proportions before and after deletion; proportions should remain roughly stable.
  6. Decision rule:
    • If distributions remain similar and missingness is plausibly MCAR, CCA is acceptable.
    • If distributions shift significantly or the missingness follows a pattern, do not use CCA; prefer imputation (univariate or multivariate).
  7. Production consideration:
    • If you plan to deploy the model in production, ensure the pipeline either prevents missing inputs or implements the same handling/imputation logic; do not rely solely on CCA during training unless inputs are guaranteed complete in production.
  8. If a column is extremely sparse (a very high percentage of its values are missing), drop the column instead of using CCA or imputation.
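
Steps 1 to 5 of the checklist can be sketched in Pandas as follows. This is a minimal illustration, not the video's code: the data are synthetic (column names are borrowed from the demo, values and missing rates are made up), and the 5% cutoff `MAX_MISSING_PCT` is an assumed threshold, not one stated in this summary. Summary statistics stand in for the histograms so the example stays self-contained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Synthetic stand-in data; values and missing rates are invented
df = pd.DataFrame({
    "training_hours": rng.integers(1, 300, size=n).astype(float),
    "education_level": rng.choice(["Graduate", "Masters", "High School"], size=n),
    "company_type": rng.choice(["Pvt Ltd", "Startup", "NGO"], size=n),
})
df.loc[rng.random(n) < 0.02, "training_hours"] = np.nan
df.loc[rng.random(n) < 0.03, "education_level"] = np.nan
df.loc[rng.random(n) < 0.30, "company_type"] = np.nan

# Step 1: missingness (%) per column
missing_pct = df.isna().mean().mul(100).sort_values()

# Step 2: candidate columns (assumed cutoff, adjust to your own tolerance)
MAX_MISSING_PCT = 5.0
cca_cols = missing_pct[(missing_pct > 0) & (missing_pct < MAX_MISSING_PCT)].index.tolist()

# Step 3: drop rows with missing values in the chosen columns only
df_cca = df.dropna(subset=cca_cols)

# Step 4: quantify data loss
retained_pct = 100 * len(df_cca) / len(df)

# Step 5a: numerical column, compare summary statistics before vs after
num_compare = pd.DataFrame({
    "before": df["training_hours"].describe(),
    "after": df_cca["training_hours"].describe(),
})

# Step 5b: categorical column, compare category proportions before vs after
cat_compare = pd.DataFrame({
    "before": df["education_level"].value_counts(normalize=True),
    "after": df_cca["education_level"].value_counts(normalize=True),
})
cat_compare["ratio"] = cat_compare["after"] / cat_compare["before"]

print(missing_pct.round(2))
print("CCA columns:", cca_cols)
print(f"rows retained: {retained_pct:.1f}%")
print(num_compare.round(2))
print(cat_compare.round(3))
```

Ratios close to 1 in `cat_compare`, together with similar statistics in `num_compare`, are the kind of stability step 6 asks for. For a visual check, `df["training_hours"].plot.hist()` before and after deletion serves the same purpose.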

Practical example from the video (job application dataset)

Dataset features mentioned

enrolment_id, city, city_development_index, gender, relevant_experience, enrolled_university, education_level, major_discipline, experience (years), company_size, company_type, training_hours, target (hired or not).

Observed missingness (approximate)

Decision in the demo

Warning: do not run CCA on columns with large missing fractions (e.g., gender, company_size, company_type in this dataset) because that would remove a large portion of observations.
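
The arithmetic behind this warning can be sketched quickly. The example below assumes that missingness is independent across columns, in which case the expected fraction of rows surviving CCA is the product of each column's non-missing fraction. The rates used are hypothetical placeholders, not the dataset's actual figures, and `expected_retained_fraction` is an illustrative helper.

```python
import math


def expected_retained_fraction(missing_rates):
    """Expected share of rows kept by CCA, assuming independent missingness per column."""
    return math.prod(1.0 - p for p in missing_rates)


# Hypothetical rates: two low-missingness columns only
low = expected_retained_fraction([0.02, 0.03])

# Hypothetical rates: the same two plus three high-missingness columns
high = expected_retained_fraction([0.02, 0.03, 0.25, 0.30, 0.30])

print(f"retained with low-missingness columns only: {low:.1%}")
print(f"retained when high-missingness columns are included: {high:.1%}")
```

Real datasets rarely satisfy the independence assumption exactly, so the actual loss should always be measured directly by comparing row counts before and after deletion.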

Advantages of Complete Case Analysis

Disadvantages and risks

Series plan (future videos)

Speakers / sources

Category

Educational

