Summary of "Function Transformer | Log Transform | Reciprocal Transform | Square Root Transform"
What this video covers
- Introduction to mathematical feature transformations (feature engineering) that change numeric column distributions toward normality to help certain ML algorithms.
- Focus on sklearn.preprocessing.FunctionTransformer (this video) and a preview of sklearn.preprocessing.PowerTransformer (next video), which implements Box‑Cox and Yeo‑Johnson.
- Demonstration using the Titanic dataset (Age, Fare → predict Survived), with code examples using pandas, numpy/math, matplotlib/seaborn and scikit‑learn (FunctionTransformer, ColumnTransformer, LogisticRegression, DecisionTreeClassifier, cross_val_score).
Why transform numeric features
Many statistical and linear models (e.g., linear regression, logistic regression) assume or perform better with approximately normally distributed features. Transformations can:
- Reduce skew and compress large values.
- Convert multiplicative relationships into additive ones (for example, log turns multiplicative scale into additive), which can improve linear-model performance.
- Be unnecessary for tree-based models (random forest, decision trees), which are generally insensitive to distribution shape.
Transforms explained
Log transform
- Most common for right‑skewed data (large outliers, wide range).
- Cannot be applied to zero or negative values; use log1p (log(1 + x)) when zeros are present.
- Effect: compresses large values, reduces skew — can improve linear models.
Reciprocal transform (1/x)
- Inverts scale: large values become small and vice versa.
- Can heavily change relationships; use with caution.
Power transforms (square, square root)
- Square: spreads large values out further; occasionally useful for left-skewed data, but rarely helpful for correcting right skew.
- Square root: mild variance stabilization; sometimes helpful but not always effective.
More advanced transforms (covered in the next video)
- Box‑Cox and Yeo‑Johnson (implemented by PowerTransformer) — often effective for making data more normal.
- Johnson transform and other custom transforms (briefly mentioned).
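A minimal sketch of how these transforms affect skewness, using a synthetic right-skewed (lognormal) sample rather than the video's Titanic columns:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical right-skewed sample; stands in for a column like Fare
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

print(f"raw skew:        {skew(x):.2f}")            # strongly right-skewed
print(f"log1p skew:      {skew(np.log1p(x)):.2f}")  # much closer to symmetric
print(f"sqrt skew:       {skew(np.sqrt(x)):.2f}")   # milder reduction
print(f"reciprocal skew: {skew(1.0 / x):.2f}")      # inverts scale; can re-skew
```

For this sample, log1p shrinks the skewness the most, sqrt helps partially, and the reciprocal leaves the data heavily skewed — matching the video's advice to test each candidate transform rather than assume one works.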
How to identify skew / non‑normality
- Visual inspection: histogram / PDF / density plot.
- Summary statistics: skewness and kurtosis.
- Q–Q plot (probability plot): the most reliable visual. If points follow the straight line, data is close to normal. Deviations on one side indicate skew; deviations at both ends indicate heavy tails.
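These checks can be computed without a plot window. A sketch using scipy (the exponential sample is a hypothetical stand-in for a skewed column such as Fare):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed = rng.exponential(scale=2.0, size=1000)  # stand-in for a skewed column

# Summary statistics: near 0 for normal data, clearly positive here
print("skewness:", stats.skew(skewed))
print("kurtosis:", stats.kurtosis(skewed))

# probplot returns the Q-Q quantile pairs plus a straight-line fit;
# r close to 1 means the points hug the line (data is near-normal)
(osm, osr), (slope, intercept, r) = stats.probplot(skewed, dist="norm")
print("Q-Q fit r:", r)
```

Passing `plot=plt` (with matplotlib) to `stats.probplot` draws the Q-Q plot the video inspects visually.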
Practical implementation steps
- Inspect your column distributions (histogram, Q–Q plot, skewness).
- Handle missing values first (e.g., fillna or impute for Age).
- Choose candidate transforms based on the distribution (e.g., log for right skew).
- Wrap the transform with sklearn.preprocessing.FunctionTransformer:
- Example: FunctionTransformer(np.log1p) or FunctionTransformer(np.log)
- You can also pass a custom function (e.g., lambda x: x**2 + 2*x).
- Use sklearn.compose.ColumnTransformer to apply transforms only to target columns (avoid transforming everything).
- Fit/transform training data, transform test data, then train models.
- Evaluate with cross‑validation (e.g., 10‑fold) to get reliable performance estimates and avoid overfitting to a single split.
- Compare results across models and transforms — sometimes transforms can hurt performance.
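The steps above can be sketched end to end. This uses synthetic stand-ins for the Titanic columns (Age near-normal, Fare right-skewed) rather than the actual dataset, and a Pipeline so cross-validation stays leak-free:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic stand-ins for the video's columns (not the real Titanic data)
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "Age": rng.normal(30, 10, n).clip(0.5),      # already near-normal
    "Fare": rng.lognormal(2.5, 1.0, n),          # heavily right-skewed
})
y = (X["Fare"] + rng.normal(0, 30, n) > X["Fare"].median()).astype(int)

# Apply log1p only to Fare; leave Age untouched via remainder="passthrough"
pre = ColumnTransformer(
    [("log_fare", FunctionTransformer(np.log1p), ["Fare"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=1000))])

# 10-fold cross-validation, as in the video
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")
```

Swapping the transform (or the classifier, e.g. DecisionTreeClassifier) and re-running `cross_val_score` is the comparison loop the video walks through.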
Key experimental findings (Titanic example)
- Fare was highly right‑skewed; applying log (or log1p) to Fare moved its distribution closer to normal and improved logistic regression accuracy (a small but measurable improvement).
- Age was already close to normal; transforming Age sometimes worsened performance.
- Decision tree performance barely changed with transforms, consistent with trees being distribution‑agnostic.
- Results vary by dataset and column — test multiple transforms and validate with cross‑validation.
Practical notes from the experiment:
- Use ColumnTransformer to avoid unnecessary computation and accidental transformation of irrelevant columns.
- Beware zeros/negatives (use log1p or add a small constant when needed).
- You can provide any custom function to FunctionTransformer — experimentation is encouraged.
Tips & caveats
- Always check distributions before and after transform (plots + metrics).
- Cross‑validate; don’t trust a single train/test split.
- Transforms are not a silver bullet; they help some models/datasets and can hurt others.
- For zeros/negatives, pick suitable transforms: Yeo‑Johnson can handle negatives; Box‑Cox cannot.
- Keep transformations column-specific and lightweight to save compute.
Note: FunctionTransformer is a convenient wrapper around numpy/math functions, while PowerTransformer (Box‑Cox and Yeo‑Johnson) can automatically find a parameterized power transform to make data more normal.
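A brief preview of that difference, on hypothetical heavy-tailed data containing negatives:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical heavy-tailed sample that includes negative values
rng = np.random.default_rng(1)
data = rng.normal(0, 1, (200, 1)) ** 3

# Yeo-Johnson accepts negatives and fits its lambda parameter from the data
pt = PowerTransformer(method="yeo-johnson")
out = pt.fit_transform(data)
print("fitted lambda:", pt.lambdas_[0])

# Box-Cox requires strictly positive input and raises on this sample
try:
    PowerTransformer(method="box-cox").fit(data)
except ValueError as e:
    print("box-cox:", e)
```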
Tools / libraries referenced
- scikit‑learn: FunctionTransformer, PowerTransformer, ColumnTransformer, LogisticRegression, DecisionTreeClassifier, cross_val_score
- pandas for data handling
- numpy (np.log, np.log1p, etc.)
- matplotlib / seaborn for plotting (histograms, density plots, Q–Q plots)
Main speaker / sources
- Presenter: YouTuber / instructor (unnamed) giving a hands‑on feature‑engineering tutorial.
- Primary libraries/tools referenced: scikit‑learn, pandas, numpy, matplotlib/seaborn.
- Dataset used: Titanic dataset.
Next video preview
Deep dive into PowerTransformer with demonstrations of Box‑Cox and Yeo‑Johnson transforms.