Summary of "Hyperparameter Tuning & Optimization"

Main ideas and lessons

Good ML models require more than choosing an algorithm
- After selecting an algorithm, you must find the best configuration, defined by hyperparameters, so the model performs optimally.
Model evaluation should happen in stages
1. Prototyping / development: pick candidate models and hyperparameters using validation (offline evaluation).
2. Deployment / real-world use: evaluate the final selected model using a separate testing phase.
Hyperparameter tuning improves performance
- Systematically search for hyperparameter values that maximize predictive performance and generalization.
Avoid data leakage
- Do not mix training data with evaluation data; otherwise results are overly optimistic.
Different evaluation strategies exist for limited data
- Use techniques like hold-out validation, cross-validation, and bootstrapping/resampling to simulate independent evaluation datasets.
A/B testing validates improvements in real conditions
- Compare a new model against an old model using control vs experimental user groups and apply statistical hypothesis testing.

Video structure / key topics covered

1) Prototyping phase (model development overview)

The prototyping phase is the initial cycle where you build, try, and evaluate a prototype model before full implementation.

Processes in prototyping (detailed)

Historical data → splitting
- Start with collected historical data.
- Split into:
  - Training data: used to fit the model.
  - Validation data: used to measure performance after training.
Feature selection
- Select the most relevant features from the dataset.
- Avoid features that:
  - add noise,
  - increase complexity,
  - reduce accuracy.
- Example concept: features like academic grades and attendance may be more predictive of student graduation than phone numbers or home addresses.
Model selection
- Choose an algorithm based on:
  - data type,
  - analysis purpose,
  - problem complexity.
- Examples:
  - Decision tree for interpretable classification.
  - Linear regression for predicting continuous values.
  - Neural networks for more complex problems.
- Test and compare multiple candidate models.
Training method
- Train the model using training data.
- Consider:
  - how training data is divided,
  - number of iterations,
  - batch size,
  - optimization technique(s).
- Goal: learn patterns while avoiding:
  - overfitting,
  - underfitting.
Validation stage
- Evaluate during development to ensure generalization.
- Use techniques such as:
  - hold-out validation,
  - cross-validation.
- Use validation results to:
  - adjust hyperparameters,
  - choose the best model.
Hyperparameter tuner / iterative improvement
- Try combinations of hyperparameters.
- If validation is poor, change hyperparameters, retrain, and re-evaluate (iterative loop).
Final step before deployment
- After selecting the best model and the best hyperparameters:
  - retrain using all available data (merge training + validation) so the final model benefits from more data.

2) Offline evaluation (evaluation using data available before deployment)

Offline evaluation runs experiments before applying the model to real conditions.

Common offline evaluation techniques

Hold-out validation
Cross-validation
Bootstrapping / resampling

What you test during offline evaluation

Different feature sets
- Try various combinations to see which influence performance most.
Different algorithms
- Compare model types to find the best fit for data/system requirements.
Different hyperparameter configurations
- Tune settings like:
  - number of neurons (neural networks),
  - learning rate,
  - number of trees (random forest),
  - K in KNN (k-nearest neighbors).

Why evaluate fairly on two datasets (train vs independent validation/testing)

The model must perform well on new, unseen data.
If you evaluate only on training data, it may memorize rather than learn patterns (overfitting).
Therefore evaluation data should be:
- statistically independent from training data.
Outcome: estimate generalization error (how well it generalizes).

If you only have one dataset: generate additional evaluation splits

Use methods to create additional “independent” evaluation scenarios:

Hold-out method: e.g., train 80%, test 20%
Resampling:
- cross-validation
- bootstrapping
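
A minimal sketch of how these three splitting schemes could be written with the Python standard library only. The function names (`holdout_split`, `kfold_splits`, `bootstrap_split`) and the dataset size of 100 are illustrative assumptions, not something shown in the video. Each function returns index lists rather than data, so it can be applied to any dataset.

```python
import random

def holdout_split(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train and test parts."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]

def kfold_splits(n, k=5, seed=0):
    """Yield (train, validation) index lists; each index is validated once."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def bootstrap_split(n, seed=0):
    """Sample n indices with replacement; never-sampled indices evaluate."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return train, out_of_bag

# Illustrative usage on a dataset of 100 rows (size is an assumption).
train_idx, test_idx = holdout_split(100)          # 80% train, 20% test
folds = list(kfold_splits(100, k=5))              # 5 train/validation pairs
boot_train, boot_oob = bootstrap_split(100)       # resample + out-of-bag rows
```

In all three cases the evaluation indices never overlap the training indices of the same split, which is the property that keeps the estimate from being overly optimistic.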

3) Differences between validation and testing (and proper pipeline)

Validation (during prototyping)
- Purpose: compare candidate models and hyperparameter settings.
- Uses validation datasets to select:
  - the best model,
  - the best hyperparameters.
Testing (after prototyping is complete)
- Purpose: measure the final performance of the chosen model on data that was not used for training or for model selection.

Core principle

Never mix training data with evaluation data
- Otherwise results become unfair/overly optimistic.

Example described in the video (ImageNet competition “cheating” scandal): a team repeatedly submitted models and adjusted them based on test-set feedback. This amounted to hyperparameter tuning on the test set, causing:

- test-set overfitting,
- misleadingly good reported performance,
- less true scientific progress.

4) Hyperparameter tuning and optimization

Hyperparameter tuning vs cross-validation

Cross-validation
- A data-splitting mechanism to create training/validation splits.
- Goal: evaluate model performance more fairly/stably.
Hyperparameter tuning
- The search process for finding the best hyperparameter combination.

Hyperparameters (examples mentioned)

learning rate
number of neurons
K value in KNN
tree depth
batch size

Regularization hyperparameters (concept)

Regularization hyperparameters control model capacity/complexity.
Too much capacity can overfit; too little can underfit.
Goal: balance flexibility and generalization.

Methodology: Hyperparameter tuning approaches (detailed bullets)

Grid search

How it works
- Define a grid of candidate values for each hyperparameter.
- Example described:
  - decision tree parameter (e.g., “number of leaves”): try values like 10, 20, 30, … up to 100.
- Test all combinations in the grid.
- For each combination:
  - train the model,
  - evaluate it using the validation set or via cross-validation,
- Select the “winner” (best performance).
Why it can be expensive
- Many combinations → many training/evaluation runs → high computation time.
Notes on scaling
- For hyperparameters like regularization strength, the video suggests spacing candidate values exponentially (a logarithmic scale, e.g., 0.001, 0.01, 0.1, 1), because useful values can span several orders of magnitude.
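
The grid search procedure can be sketched as follows. This is a toy illustration under stated assumptions, not the video's code: the model is a one-dimensional ridge regression with a closed-form fit, the data are synthetic, and the hyperparameter names `lam` and `fit_intercept` are invented for the example. The regularization values are log-spaced, and every combination in the grid is trained and scored on a hold-out validation set.

```python
import itertools
import random

def fit_ridge(xs, ys, lam, fit_intercept):
    """Closed-form 1-D ridge regression: y ~ w * x + b, penalty lam * w**2."""
    x_mean = sum(xs) / len(xs) if fit_intercept else 0.0
    y_mean = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + lam)
    b = y_mean - w * x_mean
    return w, b

def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
rng = random.Random(0)
xs = [rng.uniform(-3, 3) for _ in range(60)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]
x_train, y_train = xs[:40], ys[:40]
x_val, y_val = xs[40:], ys[40:]

grid = {
    "lam": [10 ** e for e in range(-3, 3)],   # log-spaced: 0.001 ... 100
    "fit_intercept": [False, True],
}

# Train and evaluate every combination in the grid (6 x 2 = 12 runs).
results = []
for lam, fit_intercept in itertools.product(grid["lam"], grid["fit_intercept"]):
    model = fit_ridge(x_train, y_train, lam, fit_intercept)
    results.append((mse(model, x_val, y_val), lam, fit_intercept))

# The "winner" is the combination with the lowest validation error.
best_score, best_lam, best_intercept = min(results)
```

The number of runs is the product of the list lengths, which is why adding hyperparameters or candidate values makes grid search expensive quickly.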

Random search

How it works
- Define a hyperparameter search space (ranges).
  - Example given: learning rate 0.001 to 1; batch size 16, 32, or 64 (transcribed as “163,264”); number of neurons 64 to 512 (transcribed as “64/512”).
- Randomly sample a subset of combinations from the space.
- Train/evaluate only those sampled combinations.
- Select the best-performing sampled configuration.
Why it’s cheaper
- Far fewer trials than grid search.
Pros/cons
- Pros: efficient in large hyperparameter spaces; often good results.
- Cons: may miss the true best combination since it doesn’t test everything.
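
A minimal random search sketch over a search space like the one in the example. Two things here are assumptions rather than content from the video: the learning rate is sampled log-uniformly, and `validation_loss` is a hypothetical stand-in formula that replaces an actual train-and-validate run so the example stays self-contained.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration from the search space."""
    return {
        # Log-uniform between 0.001 and 1, so each decade is equally likely.
        "learning_rate": 10 ** rng.uniform(-3, 0),
        "batch_size": rng.choice([16, 32, 64]),
        "n_neurons": rng.randint(64, 512),
    }

def validation_loss(config):
    """Hypothetical stand-in for 'train the model, return validation loss'."""
    lr_term = (math.log10(config["learning_rate"]) + 2) ** 2   # best near 0.01
    size_term = ((config["n_neurons"] - 256) / 256) ** 2       # best near 256
    batch_term = 0.05 * abs(config["batch_size"] - 32) / 32    # best at 32
    return lr_term + size_term + batch_term

rng = random.Random(42)
n_trials = 20   # the compute budget: only this many configurations are tried

trials = []
for _ in range(n_trials):
    config = sample_config(rng)
    trials.append((validation_loss(config), config))

# Keep the sampled configuration with the lowest validation loss.
best_loss, best_config = min(trials, key=lambda t: t[0])
```

The budget `n_trials` is fixed in advance and does not grow with the number of hyperparameters, which is the main cost advantage over testing a full grid.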

“Smart tuning” (sequential/efficient search)

Core idea
- Evaluate a small number of hyperparameter combinations.
- Use the results to choose subsequent combinations that look more promising.
Goal
- Reduce the number of evaluations and save compute time while still finding good/best hyperparameters.
Rationale
- Model training is expensive, especially for deep learning, so fewer trials are desirable.
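
One very simple way to illustrate the "use earlier results to pick later trials" idea is a zoom-in search: evaluate a few candidates, then narrow the search range around the best one found so far. This sketch is an assumption chosen for brevity, not the method named in the video, and it is much cruder than model-based approaches such as Bayesian optimization. `validation_loss` is again a hypothetical stand-in for an expensive training run.

```python
import random

def validation_loss(log_lr):
    """Hypothetical stand-in for an expensive train-and-validate run."""
    return (log_lr + 2.0) ** 2   # minimum at learning rate 10**-2

rng = random.Random(0)
low, high = -3.0, 0.0            # search log10(learning rate) in [-3, 0]
history = []                     # (loss, log_lr) for every evaluated trial

for round_index in range(4):
    # Evaluate a small batch of candidates inside the current range.
    for _ in range(4):
        log_lr = rng.uniform(low, high)
        history.append((validation_loss(log_lr), log_lr))
    # Use the results so far: shrink the range around the best trial.
    best_loss, best_log_lr = min(history)
    half_width = (high - low) / 4
    low = max(-3.0, best_log_lr - half_width)
    high = min(0.0, best_log_lr + half_width)

best_learning_rate = 10 ** best_log_lr
```

Only 16 evaluations are spent in total, and later rounds concentrate them where earlier results looked most promising.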

Optimization levels: conceptual distinction

Hyperparameter tuning = external optimization (meta-optimization)
- Outer loop choosing hyperparameters.
- Each trial triggers training.
Model training = internal optimization
- Inner loop where the model learns parameters (weights/biases/coefficients) by minimizing loss.
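
The two levels can be seen side by side in a small sketch. Everything specific here is assumed for illustration: a linear model trained by plain gradient descent, synthetic data, and four candidate learning rates. The `train` function is the inner loop that learns the parameters `w` and `b`; the `for` loop at the bottom is the outer loop that chooses the hyperparameter.

```python
import random

def train(xs, ys, learning_rate, n_steps=200):
    """Inner loop: gradient descent on mean squared error learns w and b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(params, xs, ys):
    """Mean squared error of the learned line on the given points."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (an assumption): y = 3x - 1 plus a little noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [3 * x - 1 + rng.gauss(0, 0.1) for x in xs]
x_train, y_train, x_val, y_val = xs[:35], ys[:35], xs[35:], ys[35:]

# Outer loop: each candidate hyperparameter value triggers a full training run
# and is scored on validation data, never on the training data.
scores = {}
for learning_rate in [0.0001, 0.001, 0.01, 0.1]:
    params = train(x_train, y_train, learning_rate)
    scores[learning_rate] = mse(params, x_val, y_val)

best_learning_rate = min(scores, key=scores.get)
```

With only 200 steps, the smaller learning rates stop before the parameters converge, so the outer loop ends up preferring the largest candidate in this particular setup.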

5) A/B testing

What it is

A/B testing compares two versions of a model/design using real users.
- A (control group): old model currently in use
- B (experimental group): new model

Why it’s used

To answer: Is the new model better than the old model?
Use statistical hypothesis testing to make data-driven decisions.

Hypotheses described

H0: new model does not change the average of the main evaluation metric.
H1: new model does change the average (i.e., has real impact).

Decision rule:

If H0 is rejected → evidence suggests the new model changes the metric; it counts as better only if the change is in the desired direction.
If H0 is not rejected → not enough evidence to claim improvement.

Step-by-step process outlined

Randomly divide users
- Control group (old model) vs experimental group (new model)
- Random assignment is important for fairness/balanced groups.
Observe outcomes/metrics
- Examples mentioned:
  - click rate
  - usage time
  - number of purchases
  - accuracy
Run statistical tests
- Compute a test statistic (examples mentioned: z/t statistics) and derive a p-value
- Purpose: assess whether the observed difference between groups is larger than chance alone would plausibly produce.
Make a decision
- If p-value < 0.05 (a conventional threshold), results are treated as statistically significant
- Then the new model can be implemented, provided the difference favors it.
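
As a hedged illustration of the test-and-decide steps, the sketch below runs a two-sided two-proportion z-test on click counts. The choice of this particular test, the function name `two_proportion_z_test`, and the counts are all assumptions made for the example; the video only mentions z/t statistics in general. The test relies on a normal approximation, which is reasonable for large groups.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for H0: both groups share the same success rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    # Standard normal CDF of |z|, computed with the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Invented counts for illustration: clicks out of users shown each model.
z, p_value = two_proportion_z_test(successes_a=200, n_a=2000,
                                   successes_b=260, n_b=2000)

significant = p_value < 0.05                 # reject H0 at the 5% level
new_model_better = significant and z > 0     # direction must favor group B
```

The sign of `z` matters for the decision: a significant result with a negative `z` would mean the new model performed worse, not better.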

Final caution

A/B testing depends on:
- valid experimental design,
- correct statistical assumptions,
- correct interpretation.
Otherwise results can be misleading and lead to wrong business decisions.

Speakers / sources featured

No specific person or external source is named.
The content appears to be delivered by an unnamed instructor/narrator speaking directly throughout the subtitles.