Summary of "One Hot Encoding | Handling Categorical Data | Day 27 | 100 Days of Machine Learning"
One-hot encoding (OHE) and handling nominal categorical data
Main ideas / concepts
- Most machine learning algorithms require numeric inputs, so categorical data must be encoded numerically.
- Two broad types of categorical variables:
- Ordinal: categories with an intrinsic order — use ordinal encoding.
- Nominal: categories with no order — use one-hot encoding (OHE).
- One-hot encoding converts each category into a separate binary column (a vector representation).
Dummy variable trap / multicollinearity
- Creating one-hot columns for all categories produces linearly dependent columns (the dummy columns sum to 1 in every row).
- This perfect multicollinearity can harm linear models (e.g., linear regression, logistic regression).
- The usual remedy is to drop one dummy column, keeping k-1 columns for k categories.
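A minimal sketch of the trap, using an invented `fuel` column (values and name are illustrative, not from the video): with all k dummies, every row sums to 1, so the columns are linearly dependent; `drop_first=True` removes one column and breaks the dependence.

```python
import pandas as pd

# Toy nominal feature; values and column name are illustrative
df = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG", "Petrol"]})

full = pd.get_dummies(df["fuel"])          # k columns: CNG, Diesel, Petrol
print(full.sum(axis=1).tolist())           # every row sums to 1 -> linear dependence

reduced = pd.get_dummies(df["fuel"], drop_first=True)  # k-1 columns: Diesel, Petrol
print(list(reduced.columns))
```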
- High-cardinality (many distinct categories) problem:
- OHE can create a large number of columns when a feature has many unique values.
- Practical strategy: keep only the most frequent categories and group the rest into an “Other” (or “Rare”) category to reduce dimensionality.
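The grouping strategy above can be sketched as follows; the brand names and the frequency threshold are made up for illustration.

```python
import pandas as pd

# Illustrative high-cardinality column; names and threshold are invented
s = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 4 + ["Opel"] * 1 + ["Fiat"] * 1)

counts = s.value_counts()
threshold = 2
rare = counts[counts < threshold].index      # categories seen fewer than `threshold` times

grouped = s.where(~s.isin(rare), "Other")    # map every rare category to one label
print(grouped.value_counts().to_dict())
```

The reduced series can then be one-hot encoded with far fewer dummy columns.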
Implementation considerations for ML pipelines
- Fit encoders only on training data: use fit_transform on training and transform on test/new data (do not re-fit on test).
- Use sklearn.preprocessing.OneHotEncoder together with sklearn.compose.ColumnTransformer to apply encoding only to selected columns and leave others untouched.
- Important OneHotEncoder parameters:
- drop='first' in OneHotEncoder (the analogue of drop_first=True in pd.get_dummies) to avoid the dummy trap.
- sparse=False (sklearn < 1.2) or sparse_output=False (sklearn ≥ 1.2) to obtain dense arrays instead of sparse matrices.
- dtype to control column data type.
- handle_unknown='ignore' to tolerate categories unseen during fitting (they encode as all-zero rows).
- Convert the encoded array back to a DataFrame when you need column names, and concatenate with numeric columns.
- Persist the fitted encoder or the whole pipeline so you can apply identical transformations at prediction time.
Practical example (used in the video)
Dataset: used-car dataset with columns such as brand, kilometers_driven, fuel type, owner, and selling price (target).
Exploratory step:
- Use df[column].value_counts() to inspect categorical columns and see category frequencies.
Demonstrated methods:
- Pandas pd.get_dummies
- Example: pd.get_dummies(df[['fuel', 'owner']])
- Use drop_first=True to remove one dummy per feature and avoid multicollinearity.
- Quick and easy for small numbers of categories, but can expand dimensionality for high-cardinality features.
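A short sketch of the pandas approach on a hypothetical slice of the used-car data (all values invented):

```python
import pandas as pd

# Hypothetical slice of the used-car data
df = pd.DataFrame({
    "fuel":  ["Petrol", "Diesel", "Petrol"],
    "owner": ["First", "Second", "First"],
})

dummies = pd.get_dummies(df[["fuel", "owner"]], drop_first=True)
print(list(dummies.columns))   # one set of k-1 columns per feature
```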
- sklearn OneHotEncoder
- Fit OneHotEncoder on training features (fit_transform on training X) and transform the test set.
- Recommended to use ColumnTransformer to apply OneHotEncoder to selected categorical columns while passing through numeric columns in one step.
- After transformation, extract the resulting matrix, convert to a DataFrame (use get_feature_names_out), and concatenate with numeric columns.
High-cardinality handling in the example:
- Compute value_counts for brand.
- Replace brands with low counts (e.g., < 100) with a single category label like “Other”.
- Then apply one-hot encoding to the reduced set of categories to reduce the number of dummy columns and speed processing.
Step-by-step methodology (concise actionable list)
- Inspect categorical columns:
- Use df[column].value_counts() to see unique categories and their frequencies.
- Decide encoding method:
- Ordinal → ordinal encoding; Nominal → one-hot encoding.
- If using pandas:
- pd.get_dummies(df[categorical_columns], drop_first=True) to create dummy columns and avoid the dummy trap.
- If using sklearn in a pipeline (recommended):
- Split data into train and test.
- Create the encoder, e.g. OneHotEncoder(drop='first', sparse_output=False, dtype=desired_dtype, handle_unknown='ignore' if needed); on sklearn < 1.2 use sparse=False instead of sparse_output=False.
- Use ColumnTransformer to apply OneHotEncoder only to selected categorical columns and passthrough numeric columns.
- On training data: column_transformer.fit_transform(X_train).
- On test data: column_transformer.transform(X_test).
- Convert transformed arrays back to DataFrame if you need column names, then concatenate with target/other features.
- Handle high-cardinality features:
- Group rare categories into a single "Other"/"Rare" label based on a frequency threshold, or keep the top-k most frequent categories and map the rest to "Other".
- Then apply one-hot encoding to the reduced set of categories.
- Avoid the dummy variable trap:
- Use drop='first' in OneHotEncoder or drop_first=True in pd.get_dummies so you get k-1 dummies for k categories.
- Persist transformations:
- Save the fitted encoder/ColumnTransformer or the full pipeline so the same mapping is used at inference time.
Tips and cautions
- Always fit encoders only on training data and apply the learned mapping to test/new data.
- When combining encoded arrays with numeric columns, track column order and names (use get_feature_names_out or construct names manually).
- ColumnTransformer simplifies applying different transformers to different columns and integrates cleanly with sklearn pipelines.
- Grouping rare categories reduces dimensionality but may discard information — choose thresholds and strategies based on domain knowledge and experimentation.
Speakers / sources
- Video narrator / instructor (YouTube channel author) — sole speaker in the video.