Summary of "Power Transformer | Box-Cox Transform | Yeo-Johnson Transform"
Overview — main ideas and lessons
The video introduces the “power transformer” family of data transformations (Box–Cox and Yeo–Johnson) used to make feature distributions more Gaussian-like and to improve performance of algorithms that assume or benefit from near-normal inputs (for example, linear and logistic regression).
Key points:
- Box–Cox and Yeo–Johnson are parametric, monotonic power transformations that include common transforms (log, square root, etc.) as special cases by adjusting a parameter λ (lambda). Lambda is estimated per feature to best approximate normality.
- Box–Cox requires strictly positive data; Yeo–Johnson supports zero and negative values. If you need to use Box–Cox on data with zeros or negatives, shift the data by adding a small positive constant first.
- Lambda values are typically chosen by maximum likelihood (or related estimation techniques).
- scikit-learn's `PowerTransformer` (with `method='box-cox'` or `method='yeo-johnson'`) can be used to apply these transforms and, by default, standardize the transformed features to zero mean and unit variance.
- Practical recommendation: check feature distributions; if skewed, try power transforms and compare downstream model performance. Pick the transform that yields the best validation metric.
If features are skewed and your model benefits from normal-ish inputs, try Box–Cox or Yeo–Johnson and validate which gives the best downstream results.
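The positivity requirement can be demonstrated directly. A minimal sketch (synthetic data and variable names are my own, not from the video): Box–Cox fits strictly positive data, raises on data containing zeros or negatives, and Yeo–Johnson handles the latter case without any shift.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_pos = rng.exponential(scale=2.0, size=(200, 1))  # strictly positive, right-skewed

# Box-Cox is valid here because every value is > 0
PowerTransformer(method='box-cox').fit(X_pos)

# Shifting the data below zero makes Box-Cox invalid...
X_mixed = X_pos - 1.0
try:
    PowerTransformer(method='box-cox').fit(X_mixed)
except ValueError:
    # ...but Yeo-Johnson handles zeros and negatives directly
    PowerTransformer(method='yeo-johnson').fit(X_mixed)
```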
Key concepts / definitions
- Power transformer: a family of parametric, monotonic transformations parameterized by λ that make data more Gaussian-like.
- Box–Cox transform: a power transform requiring strictly positive inputs. Special cases include log, square root, etc. λ is estimated per feature.
- Yeo–Johnson transform: variant that supports zero and negative values; otherwise serves the same purpose as Box–Cox.
- Lambda (λ): the exponent/power parameter governing the transformation for each feature; estimated to optimize normality (often via maximum likelihood).
- Standardization: `PowerTransformer` by default standardizes (zero mean, unit variance) after transforming (controlled by `standardize=True`).
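For reference, the standard definitions behind the two transforms (the λ = 0 and λ = 2 cases are the logarithmic limits of the general formulas):

```latex
\text{Box--Cox:}\quad
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda} & \lambda \neq 0 \\[4pt]
\ln x & \lambda = 0
\end{cases}
\qquad (x > 0)

\text{Yeo--Johnson:}\quad
x^{(\lambda)} =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \neq 0 \\[4pt]
\ln(x + 1) & x \ge 0,\ \lambda = 0 \\[4pt]
-\dfrac{(-x + 1)^{2 - \lambda} - 1}{2 - \lambda} & x < 0,\ \lambda \neq 2 \\[4pt]
-\ln(-x + 1) & x < 0,\ \lambda = 2
\end{cases}
```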
Step-by-step methodology
- Inspect your dataset
- Plot histograms or density plots for each feature to identify skewness/non-normality.
- Check for zeros or negative values (important for Box–Cox).
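The inspection step can be sketched numerically as well as visually. A minimal sketch (toy data stands in for a real feature matrix): report per-feature skewness and minimum, since high |skew| suggests a power transform may help and a minimum ≤ 0 rules out Box–Cox without shifting.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# toy stand-in for a feature matrix: one right-skewed and one symmetric column
X = np.column_stack([rng.exponential(2.0, 500), rng.normal(0.0, 1.0, 500)])

for j in range(X.shape[1]):
    col = X[:, j]
    # high |skew| -> candidate for a power transform;
    # min <= 0 -> Box-Cox not applicable without a shift
    print(f"feature {j}: skew={skew(col):.2f}, min={col.min():.2f}")
```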
- Choose transformation strategy
- If all values in a feature are strictly positive → Box–Cox is possible.
- If features contain zeros or negatives → use Yeo–Johnson, or shift the data by a small positive constant before Box–Cox.
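This decision rule can be captured in a small helper (the function name is my own, not from the video):

```python
import numpy as np

def pick_method(col: np.ndarray) -> str:
    """Box-Cox needs strictly positive values; otherwise use Yeo-Johnson."""
    return 'box-cox' if col.min() > 0 else 'yeo-johnson'

pick_method(np.array([0.5, 1.2, 3.0]))   # 'box-cox'
pick_method(np.array([0.0, 1.2, 3.0]))   # 'yeo-johnson' (contains a zero)
```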
- Prepare the scikit-learn transformer
- Import and create transformer objects:

```python
from sklearn.preprocessing import PowerTransformer

pt_box = PowerTransformer(method='box-cox', standardize=True)
pt_yj = PowerTransformer(method='yeo-johnson', standardize=True)
```

- If using Box–Cox and a feature has zeros (min = 0), add a small epsilon (e.g., `1e-6` or a domain-appropriate constant) to that feature before fitting.
- Fit and transform
- Fit the transformer on the training features:

```python
pt.fit(X_train)
```

- Transform the training and test/validation sets:

```python
X_train_t = pt.transform(X_train)
X_test_t = pt.transform(X_test)
```

- `PowerTransformer` estimates a separate λ for each feature internally (accessible after fitting).
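The fitted per-feature λ values are exposed through the `lambdas_` attribute. A quick sketch (synthetic data and variable names are my own):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_train = rng.exponential(scale=2.0, size=(300, 3))  # three skewed features

pt = PowerTransformer(method='box-cox', standardize=True).fit(X_train)
print(pt.lambdas_)  # one estimated lambda per feature
```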
- Train and evaluate a downstream model
- Fit a model (e.g., `LinearRegression`) on the transformed training data.
- Evaluate via cross-validation or a hold-out test set (apply the same transform pipeline before scoring).
- Compare metrics (R², RMSE, etc.) against a model trained on untransformed features.
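The train-and-evaluate step might look like the following sketch, using a `Pipeline` so the transform is fit only on the training split (the data here is synthetic, not the video's concrete dataset):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(400, 3))            # skewed features
y = np.log1p(X).sum(axis=1) + rng.normal(0, 0.1, 400)    # nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# the transform is fit on X_tr only, then reused on X_te at predict time
pipe = make_pipeline(PowerTransformer(method='yeo-johnson'), LinearRegression())
pipe.fit(X_tr, y_tr)
print(r2_score(y_te, pipe.predict(X_te)))
```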
- Compare transformations
- Try both Box–Cox and Yeo–Johnson (and possibly other transforms like log, sqrt).
- Compare model metrics and inspect post-transform feature distributions.
- Choose the transform that yields the best validation performance.
- Additional practical tips
- Use Pipelines to ensure the same transform is applied to train and test splits.
- When using Box–Cox, store/record any small shift applied so you can invert or apply the same preprocessing to new data.
- If `PowerTransformer` is created with `standardize=True` (the default), an additional `StandardScaler` after the transform is unnecessary.
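Relatedly, `PowerTransformer.inverse_transform` undoes both the power transform and the standardization, which is useful for mapping transformed values back to the original scale (sketch with synthetic data):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 2))

pt = PowerTransformer(method='yeo-johnson', standardize=True).fit(X)
X_t = pt.transform(X)
X_back = pt.inverse_transform(X_t)

print(np.allclose(X, X_back))  # round-trips to the original values
```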
Implementation notes from the video
- Dataset: concrete strength dataset with features such as cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age; target is concrete strength.
- Tools used: `sklearn.preprocessing.PowerTransformer`, `sklearn.linear_model.LinearRegression`, and `cross_val_score` for evaluation.
- The presenter printed per-feature λ values after fitting (e.g., cement ≈ 0.170; each feature has its own λ).
- Visualizations: histograms before and after transformation to show improved normality for some features (not all features change equally).
- Results: baseline regression score improved after applying Box–Cox and, in many cases, improved further with Yeo–Johnson. Exact numbers varied, but transforms often improved R²/performance.
- Handling zeros: when a feature had min = 0, the presenter added a very small constant to allow Box–Cox.
Observations and recommendations
- Power transforms are especially helpful in real-world tabular data where many features are skewed.
- If your algorithm benefits from normality (linear or logistic regression, some feature-sensitive models), include power transformations in preprocessing experiments.
- Use Yeo–Johnson when you need to handle zeros/negatives without shifting; use Box–Cox when all values are positive (it may perform slightly differently).
- Always validate with cross-validation and compare transforms — choose the transform that yields the best downstream validation metric.
- `PowerTransformer` often outperforms simple hand-picked transforms (log, sqrt), but try alternatives as well.
Speakers / sources featured
- Presenter / YouTube channel host (unnamed) — instructor who explains concepts and runs the notebook demo.
- scikit-learn (`sklearn.preprocessing.PowerTransformer`) — implementation used for Box–Cox and Yeo–Johnson transforms.
Category
Educational