Summary of "Feature Scaling - Standardization | Day 24 | 100 Days of Machine Learning"
Overview
- Tutorial on feature scaling with a focus on standardization (z-score scaling).
- Part of Day 24 of the “100 Days of Machine Learning” series.
- Covers: what standardization is, why it’s needed, geometric intuition, a hands-on Python/sklearn demo, and guidance on when to apply it.
- A follow-up video will cover normalization (min–max and other techniques).
Definition & formula
- Standardization (z-score scaling) transforms each value x_i of a feature to:
z = (x_i - mean) / std, where the mean and standard deviation are computed per feature.
- After transformation each feature has mean ≈ 0 and standard deviation ≈ 1.
- Conceptually two steps: mean-centering, then scaling by the standard deviation.
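A minimal sketch of the transform using NumPy directly, with made-up values (illustrative only; the video's demo uses scikit-learn's StandardScaler):

```python
# Minimal sketch of the z-score transform using NumPy directly
# (illustrative values; the video's demo uses scikit-learn's StandardScaler).
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # one feature column

mean = x.mean()          # mean-centering step
std = x.std()            # population std (ddof=0), same default as StandardScaler
z = (x - mean) / std     # z-score scaling

print(z.mean())          # ~0
print(z.std())           # ~1
```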
Why scaling matters
- Many ML algorithms depend on distances or gradients; features with different scales can bias results or slow/stop convergence.
- Examples where scaling matters:
- Distance-based methods: KNN, K-means and other clustering, similarity measures based on Euclidean distance (a small distance example follows this list).
- Gradient-based optimization: logistic regression (with gradient descent), neural networks — scaling helps convergence and stabilizes learning.
- PCA: relies on variance, so standardization is important before applying PCA.
- Examples where scaling is usually not necessary:
- Tree-based models and many ensemble tree methods (Decision Trees, Random Forest, Gradient Boosting, XGBoost, LightGBM) — scaling typically has little or no effect.
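To make the distance point concrete, here is an illustrative sketch (made-up numbers, not from the video) of how a feature measured in large units dominates Euclidean distance until both features are standardized:

```python
# Illustrative only (made-up numbers, not from the video): an unscaled feature
# measured in large units dominates the Euclidean distance between two points.
import numpy as np

# feature 1: age in years, feature 2: income in dollars
a = np.array([25.0, 50_000.0])
b = np.array([45.0, 52_000.0])
print(np.linalg.norm(a - b))      # ~2000.1, almost entirely the income gap

# Standardize each feature with assumed column statistics (hypothetical values).
age_mean, age_std = 35.0, 10.0
inc_mean, inc_std = 51_000.0, 5_000.0
a_z = np.array([(a[0] - age_mean) / age_std, (a[1] - inc_mean) / inc_std])
b_z = np.array([(b[0] - age_mean) / age_std, (b[1] - inc_mean) / inc_std])
print(np.linalg.norm(a_z - b_z))  # ~2.04, both features now contribute
```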
Geometric intuition
- In feature space, standardization shifts points to center around zero (mean-centering) and rescales axes so each has unit variance.
- The transformation preserves distribution shape but rescales the spread along each axis.
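A quick check of the shape-preservation claim (an illustration, not from the video): z-scoring a deliberately skewed sample leaves its skewness unchanged.

```python
# Quick illustration (not from the video) that z-scoring preserves the shape
# of a distribution: the skewness is identical before and after scaling.
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=10_000)   # deliberately skewed feature

z = (x - x.mean()) / x.std()                  # standardize

print(round(skew(x), 3), round(skew(z), 3))   # same value: only center/spread changed
```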
Hands-on / Practical tips
- Always split data into train/test before scaling.
- Fit the scaler on the training set only: scaler.fit(X_train) learns the mean/std from the training data.
- Apply the same transform to both train and test: scaler.transform(X_train) and scaler.transform(X_test) (see the end-to-end sketch after this list).
- If using pandas, convert scaled NumPy arrays back to a DataFrame for easier inspection.
- Visualize distributions before and after (PDF plots, describe()) to verify mean ≈ 0 and std ≈ 1.
- Handle outliers carefully: extreme values affect the mean/std and therefore the scaling result.
- Use sklearn.preprocessing.StandardScaler as the standard tool; you can implement a custom scaler class if needed.
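An end-to-end sketch of the workflow above, assuming a pandas DataFrame df whose target column is named 'target' (hypothetical names; adapt to your dataset):

```python
# Sketch of the workflow above, assuming a pandas DataFrame `df` whose target
# column is named 'target' (hypothetical names; adapt to your dataset).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["target"])
y = df["target"]

# 1. Split first so the test set never influences the scaling parameters.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Fit the scaler on the training set only (learns per-feature mean/std).
scaler = StandardScaler()
scaler.fit(X_train)

# 3. Apply the same transform to both splits; wrap back into DataFrames.
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

# 4. Verify mean ≈ 0 and std ≈ 1 on the training split.
print(X_train_scaled.describe().round(2))
```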
Demonstrated effects on model performance
- Example experiment (logistic regression):
- Unscaled features: ~65% accuracy.
- Standardized features: improved accuracy (~81% reported in the demo).
- The demo also shows injecting extreme values/outliers to illustrate how scaling behaves and to emphasize treating outliers explicitly when necessary.
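A hedged sketch of the kind of comparison the demo runs, reusing the splits from the sketch above: the same logistic regression trained on raw vs. standardized features. Exact accuracies depend on the dataset; the ~65% and ~81% figures are from the video's data, not guaranteed by this code.

```python
# Hedged sketch of the comparison the demo runs: the same logistic regression
# on raw vs. standardized features (reusing the splits from the sketch above).
# Exact accuracies depend on the dataset; ~65% vs ~81% are the video's figures.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf_raw = LogisticRegression().fit(X_train, y_train)
print("unscaled:", accuracy_score(y_test, clf_raw.predict(X_test)))

clf_scaled = LogisticRegression().fit(X_train_scaled, y_train)
print("scaled:  ", accuracy_score(y_test, clf_scaled.predict(X_test_scaled)))
```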
Caveats & recommendations
- Standardization changes only center and scale, not the shape of the distribution.
- Always fit the scaler on training data and apply the same transform to validation/test/new data (a Pipeline-based sketch follows this list).
- Standardization generally does not harm models and is a safe default when unsure, but it is unnecessary for many tree-based models.
- Normalization (min–max scaling and other techniques) will be covered in the next video; different methods are suited to different situations.
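One way to enforce the fit-on-training-only rule automatically (an assumption on my part, not shown in the video) is to bundle the scaler and model in a scikit-learn Pipeline, reusing X and y from the earlier sketch:

```python
# One way (an assumption, not shown in the video) to guarantee the scaler is
# always fit on training data only: bundle it with the model in a Pipeline.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),      # refit on each training fold
    ("clf", LogisticRegression()),
])

# cross_val_score refits the whole pipeline per fold, so scaling statistics
# never leak from the held-out fold into training.
print(cross_val_score(pipe, X, y, cv=5))
```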
Code / tooling mentioned
- Python, pandas, scikit-learn (StandardScaler, train_test_split).
- Plotting PDFs and using describe() to inspect distributions.
- Converting scaler output (NumPy arrays) back to a pandas DataFrame for inspection.
Next steps / resources
- Next video: normalization and a comparison of min–max scaling vs standardization, with guidance on when to use each.
- Related playlist topics on the channel: logistic regression, gradient descent, PCA, etc., for deeper study.
Main speaker / source
- The instructor is the unnamed presenter of the “100 Days of Machine Learning” series; demonstrations use Python and scikit-learn.