Summary of "Encoding Categorical Data | Ordinal Encoding | Label Encoding"
Encoding Categorical Data (Ordinal Encoding, Label Encoding)
Main ideas and concepts
- Categorical features must be converted to numeric values before using most machine learning algorithms.
- Two main categorical types:
- Nominal: categories have no order or relationship (e.g., state, gender, product category).
- Ordinal: categories have a meaningful order (e.g., education levels, review ratings).
- Appropriate encoding methods:
- Use OrdinalEncoder (or a custom mapping) for ordinal features.
- Use one-hot encoding for nominal features (covered in a separate video).
- LabelEncoder is intended for target labels (y) in classification and is not generally appropriate for input features.
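The "custom mapping" option for an ordinal feature can be as small as a dictionary. The sketch below assumes pandas is available; the sample values and the names reviews and review_order are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy column, invented for illustration.
reviews = pd.Series(["good", "poor", "average", "good"], name="reviews")

# Custom mapping: the dictionary states the order explicitly (poor < average < good).
review_order = {"poor": 0, "average": 1, "good": 2}

# Series.map replaces each value with its code from the dictionary.
encoded = reviews.map(review_order)
print(encoded.tolist())
```

Values missing from the dictionary come back as NaN, so a mapping like this should be checked against every category present in the data.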
Important practical rules
- Always split data into train and test before fitting encoders to avoid data leakage. Fit encoders on the training set, then transform both train and test with the fitted encoder.
- If you don't explicitly tell OrdinalEncoder the category order, codes follow the sorted (alphabetical, for strings) order of the values seen during fitting, which usually does not match the intended ranking.
- Use ColumnTransformer and Pipeline to apply different transformations to different columns in a clean, maintainable way.
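One way these rules might fit together is sketched below. The data frame is a hypothetical toy dataset that borrows the column names used in this summary (gender, reviews, education, purchased). LogisticRegression is only a placeholder model, since the summary does not name one. handle_unknown="ignore" is an added safeguard, not something the video specifies.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "reviews": ["poor", "good", "average", "good",
                "poor", "average", "good", "poor"],
    "education": ["school", "postgraduate", "undergraduate", "school",
                  "postgraduate", "undergraduate", "school", "postgraduate"],
    "purchased": ["no", "yes", "no", "yes", "no", "yes", "yes", "no"],
})
X = df.drop(columns="purchased")
y = df["purchased"]

# Split first, so that nothing is learned from the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Different encoders for different columns.
preprocess = ColumnTransformer(
    transformers=[
        ("ordinal",
         OrdinalEncoder(categories=[
             ["poor", "average", "good"],
             ["school", "undergraduate", "postgraduate"],
         ]),
         ["reviews", "education"]),
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
    ]
)

# The pipeline fits the encoders on the training data only,
# then reuses the fitted encoders when predicting on the test data.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),  # placeholder model (assumption)
])
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because the encoders live inside the pipeline, calling fit on the training data and predict on the test data keeps the train/test separation without extra bookkeeping.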
Step-by-step methodology / instructions
- Identify categorical columns and their types
- Determine which features are nominal vs ordinal using domain knowledge.
- Example dataset columns:
- gender (nominal)
- reviews (ordinal: poor/average/good)
- education (ordinal: school/undergraduate/postgraduate)
- purchased (yes/no; nominal target)
- Split data into train and test
- Use train_test_split (or equivalent).
- Always learn encoders only on the training set.
- Prepare OrdinalEncoder for ordinal features
- Import OrdinalEncoder from sklearn.preprocessing (or implement a custom mapping).
- When constructing the encoder, pass a categories parameter that explicitly specifies the ordered categories for each ordinal column (a list of lists), for example:
- reviews: ['poor', 'average', 'good']
- education: ['school', 'undergraduate', 'postgraduate']
- Fit the encoder on X_train and transform both X_train and X_test.
- Result: ordinal categories convert to integers reflecting the given order (e.g., poor → 0, average → 1, good → 2).
- Handle nominal features
- Use OneHotEncoder (or another appropriate encoder) for nominal features to avoid introducing artificial order.
- Apply OneHotEncoder via ColumnTransformer or a pipeline for selected nominal columns.
- Encode the target variable (if needed)
- Use LabelEncoder only for target labels (y) in classification tasks.
- Fit LabelEncoder on y_train and transform y_train and y_test.
- Do not use LabelEncoder for input features unless you intentionally want a numeric mapping and understand the implications.
- Combine transformations cleanly
- Use ColumnTransformer to assign different transformers (OrdinalEncoder, OneHotEncoder, etc.) to different subsets of columns.
- Use Pipeline to chain preprocessing and modeling steps and ensure reproducible transforms.
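The split and encoding steps above can also be done by hand, without ColumnTransformer, as in the rough sketch below. The data frame is a hypothetical toy dataset built from the example column names, and the variable names are illustrative. handle_unknown="ignore" is an added safeguard rather than something the video specifies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
    "reviews": ["poor", "good", "average", "good",
                "poor", "average", "good", "poor"],
    "education": ["school", "postgraduate", "undergraduate", "school",
                  "postgraduate", "undergraduate", "school", "postgraduate"],
    "purchased": ["no", "yes", "no", "yes", "no", "yes", "yes", "no"],
})
X = df.drop(columns="purchased")
y = df["purchased"]

# Split before fitting any encoder.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Ordinal features: one ordered category list per column, in column order.
ordinal_cols = ["reviews", "education"]
ordinal_encoder = OrdinalEncoder(categories=[
    ["poor", "average", "good"],
    ["school", "undergraduate", "postgraduate"],
])
ordinal_train = ordinal_encoder.fit_transform(X_train[ordinal_cols])
ordinal_test = ordinal_encoder.transform(X_test[ordinal_cols])

# Nominal feature: one-hot encoded, so no artificial order is introduced.
onehot_encoder = OneHotEncoder(handle_unknown="ignore")
gender_train = onehot_encoder.fit_transform(X_train[["gender"]]).toarray()
gender_test = onehot_encoder.transform(X_test[["gender"]]).toarray()

# Target: LabelEncoder is fitted on y_train, then applied to y_test.
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
```

Each encoder is fitted once on the training portion and only applied to the test portion. OrdinalEncoder returns its codes as floats by default, so poor appears as 0.0 rather than 0.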
Examples & warnings highlighted
- Example mappings:
- reviews (poor < average < good) → ordinal numbers.
- education (school < undergraduate < postgraduate) → ordinal numbers.
- If categories/order aren't supplied to OrdinalEncoder, the codes follow the sorted order of the values rather than the intended ranking.
- Historically some practitioners used LabelEncoder on input features; this is discouraged because it can imply an order that doesn't exist.
- Always fit transformers on training data only, then apply to the test set.
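The warning about unspecified category order can be checked with a small experiment. This sketch assumes scikit-learn's default categories="auto" behaviour, which sorts the values seen during fitting. The one-column data frame is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical one-column frame listing each review level once.
reviews = pd.DataFrame({"reviews": ["poor", "average", "good"]})

# Default: categories are inferred from the data and sorted alphabetically.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(reviews).ravel().tolist()

# Explicit: the order is stated, so the codes follow poor < average < good.
explicit_encoder = OrdinalEncoder(categories=[["poor", "average", "good"]])
explicit_codes = explicit_encoder.fit_transform(reviews).ravel().tolist()

print(default_encoder.categories_[0].tolist(), default_codes)
print(explicit_encoder.categories_[0].tolist(), explicit_codes)
```

With the default settings, poor receives the highest code because it sorts last alphabetically, which reverses part of the intended ranking.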
Tools / classes mentioned
- sklearn.preprocessing.OrdinalEncoder
- sklearn.preprocessing.LabelEncoder
- sklearn.model_selection.train_test_split
- ColumnTransformer
- Pipeline
- OneHotEncoder
Speakers / sources featured
- Video presenter / YouTuber (unnamed) — sole speaker explaining concepts and demonstrating code examples.
- Referenced documentation/tools: scikit-learn classes (OrdinalEncoder, LabelEncoder, train_test_split, ColumnTransformer, OneHotEncoder).
Category
Educational