Summary of "[ИАД, весна 2025] Рекомендательные системы, 7"

Lecture summary — Recommender Systems: Introduction to Bandits (Spring 2025)

High-level takeaways

Bandit problems are a natural next step after classical recommendation algorithms for settings with session-less users (e.g., banner/ad selection). They model adaptive choices where each action yields a reward (click/purchase) and we want to balance exploration and exploitation.
Three progressively complex classes of problems were described:
1. Classic multi-armed bandit (MAB): finite arms, each with an unknown reward distribution; objective is to maximize cumulative reward / minimize regret.
2. Non-stationary or state-dependent bandits: reward distributions change with time or environment state.
3. Full RL / MDP-like setting: actions influence future states and rewards may be delayed (not covered in depth).
The lecture covers:
- Basic stochastic MAB algorithms (greedy, ε-greedy, UCB family).
- A Bayesian alternative: Thompson Sampling.
- Contextual bandits with linear models (LinUCB-style), including disjoint vs. hybrid models.
Practical cautions: randomize tie-breaking to avoid getting stuck; tune priors and exploration constants (e.g., ε, alpha in UCB); choose algorithms consistent with reward and context assumptions (Bernoulli rewards, linearity, stationarity).

Formal stochastic MAB problem

Setup

K arms.
n rounds (time steps).
Each arm i has an unknown reward distribution with mean µ_i.
Each round select one arm and observe its reward (in the basic setting, immediately).
Goal: maximize cumulative reward, equivalently minimize regret relative to always picking the best arm.

Common assumptions

Rewards are often assumed bounded (e.g., in [0,1]).
Bernoulli rewards (0/1) are a common simplifying assumption for algorithms and analysis.

Algorithms (stochastic MAB)

1. Naive greedy (sample mean)

Maintain empirical average reward for each arm.
Each round choose the arm with the highest empirical average.
Drawback: can prematurely exploit a suboptimal arm and never explore other arms.

2. ε-greedy

Maintain empirical averages for each arm.
At each round:
- With probability 1 − ε: pick the arm with highest empirical average (greedy).
- With probability ε: pick a random arm (explore).
Simple and easy to implement; a small ε provides steady exploration and can overcome greedy’s pitfalls.

3. UCB (Upper Confidence Bound) family

Core idea: compute an index for each arm = empirical mean + exploration bonus.
Poorly-sampled arms get a larger bonus, encouraging exploration.
Bonus derived from concentration inequalities (e.g., Hoeffding). A typical informal form:
- index_i(t) = mean_i(t) + sqrt((alpha * log t) / (2 * n_i(t)))
- alpha (or similar constant) controls exploration strength.
Choose the arm with the largest index each round and update its statistics.
Variants use different confidence intervals (Bernoulli-specific bounds can give tighter bonuses).

4. Thompson Sampling (Bayesian / posterior sampling)

Often used for Bernoulli rewards but adaptable to other settings.
Model per-arm unknown parameter (e.g., θ_i = success probability for Bernoulli).
Use a conjugate prior (Beta(a_i, b_i)) for each arm.
At each round:
- Sample θ_i ~ Beta(a_i, b_i) for each arm.
- Select the arm with the largest sampled θ.
- Observe reward r ∈ {0,1}: update counts (if r=1 increment a_i, else increment b_i).
Simple updates and empirically effective; well-suited to Bernoulli bandits.

Contextual bandits — linear model (LinUCB-style)

Motivation

Incorporate per-round context (user/item features, time) so the chosen action can be personalized.

Linearity assumption

Expected reward is linear in features: E[reward | context x, arm a] = xᵀ θ_a.

Two modeling approaches

Disjoint (per-arm) linear models:
- Learn independent parameter vectors θ_a for each arm.
- Maintain for each arm a covariance matrix A_a and an accumulated vector b_a (or their inverses).
- Prediction = xᵀ θ_hat; exploration bonus = scaled sqrt(xᵀ A_a^{-1} x) (confidence ellipsoid).
- Choose arm maximizing prediction + bonus; update A_a and b_a for the chosen arm.
Hybrid model:
- Share some parameters across arms (a global component plus arm-specific components).
- Useful when some features are common across arms and interactions exist.

Implementation tips

Randomize tie-breaking to avoid deterministic loops.
Update covariance matrices carefully; the bonus is the multivariate generalization of scalar UCB bonuses.

Empirical behavior and comparisons

Typical observed relationships:
- Random baseline: low, constant reward.
- ε-greedy: better than random but can be suboptimal if ε is not chosen well.
- UCB and Thompson: often outperform ε-greedy; Thompson is frequently competitive in Bernoulli settings.
- Contextual methods (LinUCB/hybrid): outperform non-contextual methods when linearity holds and contexts are informative.
All methods can converge prematurely to a constant action if exploration is insufficient.
Performance depends on number of arms, reward distributions, and how informative contexts are.

Practical considerations and extensions

Use bandits in recommendation/advertising when full user history is unavailable (session-less users).
Typical applications: adaptive experimentation, online tuning, and region-specific serving policies.
Extensions and advanced topics (referenced but not covered in depth):
- Non-stationary bandits.
- Delayed rewards.
- Regret analyses and theoretical guarantees.
- Thompson Sampling variants and their theory.
- Lin-style theoretical analyses for contextual bandits.

Tip (practical)

Randomize tie-breaking and ensure sufficient exploration (via ε, UCB constants, or priors) to avoid getting stuck on suboptimal actions.

Concise algorithmic checklists (implementation)

ε-greedy

Initialize counts and empirical means for each arm.
Each round: with probability ε pick a random arm; otherwise pick argmax empirical mean.
Update the chosen arm’s count and empirical mean with the observed reward.

UCB

Initialize counts and empirical means for each arm.
Each round compute index = empirical_mean + exploration_bonus (depends on counts and round t).
Choose argmax index; update counts and empirical mean.
Tune exploration constant (alpha) if using a parametrized formula.

Thompson Sampling (Bernoulli case)

Initialize Beta(a=1, b=1) priors for each arm (or other prior).
Each round sample θ_i from each arm’s Beta; pick arm with largest sampled θ.
Observe reward r ∈ {0,1}; if r=1 increment a for that arm else increment b.

LinUCB / linear contextual UCB

For each arm maintain matrix A_a and vector b_a (or their inverses).
Each round get context x for each arm, compute θ_hat = A_a^{-1} b_a, predict xᵀ θ_hat.
Compute confidence term sqrt(xᵀ A_a^{-1} x) scaled by exploration parameter.
Select arm maximizing prediction + exploration term; observe reward and update A_a and b_a.

References and notes

Lecture recording (lecturer unspecified in subtitles).
“San’s book” referenced for detailed examples (exact text unclear).
PSE 2012 article (revisiting Thompson Sampling and prior work).
Papers on Thompson Sampling and linear contextual bandit analyses were recommended for deeper reading.
Subtitles contained several misrecognitions (e.g., “HBN inequality” = Hoeffding inequality). Core algorithm names are: ε-greedy, UCB, Thompson Sampling, and contextual/linear bandits (LinUCB: disjoint and hybrid variants).

Share this summary

Is the summary off?

If you think the summary is inaccurate, you can reprocess it with the latest model.

Summarize another video

Summary of "[ИАД, весна 2025] Рекомендательные системы, 7"

Lecture summary — Recommender Systems: Introduction to Bandits (Spring 2025)

High-level takeaways

Formal stochastic MAB problem

Algorithms (stochastic MAB)

1. Naive greedy (sample mean)

2. ε-greedy

3. UCB (Upper Confidence Bound) family

4. Thompson Sampling (Bayesian / posterior sampling)

Contextual bandits — linear model (LinUCB-style)

Empirical behavior and comparisons

Practical considerations and extensions

Concise algorithmic checklists (implementation)

References and notes

Category

Share this summary

Is the summary off?

Video

Summary of "[ИАД, весна 2025] Рекомендательные системы, 7"

Lecture summary — Recommender Systems: Introduction to Bandits (Spring 2025)

High-level takeaways

Formal stochastic MAB problem

Algorithms (stochastic MAB)

1. Naive greedy (sample mean)

2. ε-greedy

3. UCB (Upper Confidence Bound) family

4. Thompson Sampling (Bayesian / posterior sampling)

Contextual bandits — linear model (LinUCB-style)

Empirical behavior and comparisons

Practical considerations and extensions

Concise algorithmic checklists (implementation)

References and notes

Category ?

Share this summary

Is the summary off?

Video

Category