Summary of "One Hot Encoding | Handling Categorical Data | Day 27 | 100 Days of Machine Learning"

One-hot encoding (OHE) and handling nominal categorical data

Main ideas / concepts

Dummy variable trap / multicollinearity

Creating one-hot columns for all k categories produces columns that are linearly dependent (the dummy columns sum to 1 in every row). This perfect multicollinearity can harm linear models (e.g., linear regression, logistic regression). The usual remedy is to drop one dummy column, keeping k-1 columns for k categories.
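The trap is easy to demonstrate with a toy frame (the data below is illustrative, not from the video; the column name just mirrors the dataset):

```python
import pandas as pd

# Illustrative toy data with k = 3 fuel categories.
df = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})

full = pd.get_dummies(df["fuel"], dtype=int)                      # k columns
reduced = pd.get_dummies(df["fuel"], dtype=int, drop_first=True)  # k - 1 columns

# With all k dummies, every row sums to 1: the columns are linearly
# dependent, which is exactly the multicollinearity problem.
print(full.sum(axis=1).tolist())        # [1, 1, 1, 1]
print(full.shape[1], reduced.shape[1])  # 3 2
```

With `drop_first=True` the dropped category becomes the baseline: it is encoded implicitly as all-zeros, so no information is lost.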

Implementation considerations for ML pipelines

Practical example (used in the video)

Dataset: used-car dataset with columns such as brand, kilometers_driven, fuel type, owner, and selling price (target).

Exploratory step:

Demonstrated methods:

  1. Pandas pd.get_dummies

    • Example: pd.get_dummies(df[['fuel','owner']])
    • Use drop_first=True to remove one dummy per feature and avoid multicollinearity.
    • Quick and easy for small numbers of categories, but can expand dimensionality for high-cardinality features.
  2. sklearn OneHotEncoder

    • Fit OneHotEncoder on training features (fit_transform on training X) and transform the test set.
    • Recommended to use ColumnTransformer to apply OneHotEncoder to selected categorical columns while passing through numeric columns in one step.
    • After transformation, extract the resulting matrix, convert to a DataFrame (use get_feature_names_out), and concatenate with numeric columns.

High-cardinality handling in the example:

Step-by-step methodology (concise actionable list)

  1. Inspect categorical columns:
    • Use df[column].value_counts() to see unique categories and their frequencies.
  2. Decide encoding method:
    • Ordinal → ordinal encoding; Nominal → one-hot encoding.
  3. If using pandas:
    • pd.get_dummies(df[categorical_columns], drop_first=True) to create dummy columns and avoid the dummy trap.
  4. If using sklearn in a pipeline (recommended):
    • Split data into train and test.
    • Create the encoder: OneHotEncoder(drop='first', sparse_output=False, dtype=desired_dtype), adding handle_unknown='ignore' if unseen categories may appear at inference. (In scikit-learn versions before 1.2, use sparse=False instead of sparse_output=False.)
    • Use ColumnTransformer to apply OneHotEncoder only to selected categorical columns and passthrough numeric columns.
    • On training data: column_transformer.fit_transform(X_train).
    • On test data: column_transformer.transform(X_test).
    • Convert transformed arrays back to DataFrame if you need column names, then concatenate with target/other features.
  5. Handle high-cardinality features:
    • Replace rare categories by grouping them into an "Other"/"Rare" bucket based on a frequency threshold, or keep the top-k most frequent categories and map the rest to "Other".
    • Then apply one-hot encoding to the reduced set of categories.
  6. Avoid the dummy variable trap:
    • Use drop='first' in OneHotEncoder or drop_first=True in pd.get_dummies so you get k-1 dummies for k categories.
  7. Persist transformations:
    • Save the fitted encoder/ColumnTransformer or the full pipeline so the same mapping is used at inference time.
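The high-cardinality handling in step 5 can be sketched like this; the brand values and the frequency threshold are made up for illustration:

```python
import pandas as pd

# Illustrative brand column with a long tail of rare values.
brands = pd.Series(
    ["Maruti"] * 5 + ["Hyundai"] * 4 + ["Honda"] * 3 + ["Jeep", "Ambassador"],
    name="brand",
)

counts = brands.value_counts()
keep = counts[counts >= 3].index           # frequency threshold of 3

# Map everything outside the kept set to a single "Other" bucket...
reduced = brands.where(brands.isin(keep), other="Other")

# ...then one-hot encode the much smaller category set.
dummies = pd.get_dummies(reduced, drop_first=True)
print(sorted(reduced.unique()))   # ['Honda', 'Hyundai', 'Maruti', 'Other']
print(dummies.shape[1])           # 3  (k - 1 dummies for k = 4 categories)
```

Note that the kept-category set (like the encoder itself) must be computed on the training data and reused as-is on the test set, or the train and test columns will not line up.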
