Summary of "[ИАД, весна 2025] Рекомендательные системы, 7"
Lecture summary — Recommender Systems: Introduction to Bandits (Spring 2025)
High-level takeaways
- Bandit problems are a natural next step after classical recommendation algorithms for settings with session-less users (e.g., banner/ad selection). They model adaptive choices where each action yields a reward (click/purchase) and we want to balance exploration and exploitation.
- Three progressively complex classes of problems were described:
- Classic multi-armed bandit (MAB): finite arms, each with an unknown reward distribution; objective is to maximize cumulative reward / minimize regret.
- Non-stationary or state-dependent bandits: reward distributions change with time or environment state.
- Full RL / MDP-like setting: actions influence future states and rewards may be delayed (not covered in depth).
- The lecture covers:
- Basic stochastic MAB algorithms (greedy, ε-greedy, UCB family).
- A Bayesian alternative: Thompson Sampling.
- Contextual bandits with linear models (LinUCB-style), including disjoint vs. hybrid models.
- Practical cautions: randomize tie-breaking to avoid getting stuck; tune priors and exploration constants (e.g., ε, alpha in UCB); choose algorithms consistent with reward and context assumptions (Bernoulli rewards, linearity, stationarity).
Formal stochastic MAB problem
Setup
- K arms.
- n rounds (time steps).
- Each arm i has an unknown reward distribution with mean µ_i.
- Each round select one arm and observe its reward (in the basic setting, immediately).
- Goal: maximize cumulative reward, equivalently minimize regret relative to always picking the best arm.
Common assumptions
- Rewards are often assumed bounded (e.g., in [0,1]).
- Bernoulli rewards (0/1) are a common simplifying assumption for algorithms and analysis.
Algorithms (stochastic MAB)
1. Naive greedy (sample mean)
- Maintain empirical average reward for each arm.
- Each round choose the arm with the highest empirical average.
- Drawback: can prematurely exploit a suboptimal arm and never explore other arms.
2. ε-greedy
- Maintain empirical averages for each arm.
- At each round:
- With probability 1 − ε: pick the arm with highest empirical average (greedy).
- With probability ε: pick a random arm (explore).
- Simple and easy to implement; a small ε provides steady exploration and can overcome greedy’s pitfalls.
3. UCB (Upper Confidence Bound) family
- Core idea: compute an index for each arm = empirical mean + exploration bonus.
- Poorly-sampled arms get a larger bonus, encouraging exploration.
- Bonus derived from concentration inequalities (e.g., Hoeffding). A typical informal form:
- index_i(t) = mean_i(t) + sqrt((alpha * log t) / (2 * n_i(t)))
- alpha (or similar constant) controls exploration strength.
- Choose the arm with the largest index each round and update its statistics.
- Variants use different confidence intervals (Bernoulli-specific bounds can give tighter bonuses).
4. Thompson Sampling (Bayesian / posterior sampling)
- Often used for Bernoulli rewards but adaptable to other settings.
- Model per-arm unknown parameter (e.g., θ_i = success probability for Bernoulli).
- Use a conjugate prior (Beta(a_i, b_i)) for each arm.
- At each round:
- Sample θ_i ~ Beta(a_i, b_i) for each arm.
- Select the arm with the largest sampled θ.
- Observe reward r ∈ {0,1}: update counts (if r=1 increment a_i, else increment b_i).
- Simple updates and empirically effective; well-suited to Bernoulli bandits.
Contextual bandits — linear model (LinUCB-style)
Motivation
- Incorporate per-round context (user/item features, time) so the chosen action can be personalized.
Linearity assumption
- Expected reward is linear in features: E[reward | context x, arm a] = xᵀ θ_a.
Two modeling approaches
- Disjoint (per-arm) linear models:
- Learn independent parameter vectors θ_a for each arm.
- Maintain for each arm a covariance matrix A_a and an accumulated vector b_a (or their inverses).
- Prediction = xᵀ θ_hat; exploration bonus = scaled sqrt(xᵀ A_a^{-1} x) (confidence ellipsoid).
- Choose arm maximizing prediction + bonus; update A_a and b_a for the chosen arm.
- Hybrid model:
- Share some parameters across arms (a global component plus arm-specific components).
- Useful when some features are common across arms and interactions exist.
Implementation tips
- Randomize tie-breaking to avoid deterministic loops.
- Update covariance matrices carefully; the bonus is the multivariate generalization of scalar UCB bonuses.
Empirical behavior and comparisons
- Typical observed relationships:
- Random baseline: low, constant reward.
- ε-greedy: better than random but can be suboptimal if ε is not chosen well.
- UCB and Thompson: often outperform ε-greedy; Thompson is frequently competitive in Bernoulli settings.
- Contextual methods (LinUCB/hybrid): outperform non-contextual methods when linearity holds and contexts are informative.
- All methods can converge prematurely to a constant action if exploration is insufficient.
- Performance depends on number of arms, reward distributions, and how informative contexts are.
Practical considerations and extensions
- Use bandits in recommendation/advertising when full user history is unavailable (session-less users).
- Typical applications: adaptive experimentation, online tuning, and region-specific serving policies.
- Extensions and advanced topics (referenced but not covered in depth):
- Non-stationary bandits.
- Delayed rewards.
- Regret analyses and theoretical guarantees.
- Thompson Sampling variants and their theory.
- Lin-style theoretical analyses for contextual bandits.
Tip (practical)
Randomize tie-breaking and ensure sufficient exploration (via ε, UCB constants, or priors) to avoid getting stuck on suboptimal actions.
Concise algorithmic checklists (implementation)
ε-greedy
- Initialize counts and empirical means for each arm.
- Each round: with probability ε pick a random arm; otherwise pick argmax empirical mean.
- Update the chosen arm’s count and empirical mean with the observed reward.
UCB
- Initialize counts and empirical means for each arm.
- Each round compute index = empirical_mean + exploration_bonus (depends on counts and round t).
- Choose argmax index; update counts and empirical mean.
- Tune exploration constant (alpha) if using a parametrized formula.
Thompson Sampling (Bernoulli case)
- Initialize Beta(a=1, b=1) priors for each arm (or other prior).
- Each round sample θ_i from each arm’s Beta; pick arm with largest sampled θ.
- Observe reward r ∈ {0,1}; if r=1 increment a for that arm else increment b.
LinUCB / linear contextual UCB
- For each arm maintain matrix A_a and vector b_a (or their inverses).
- Each round get context x for each arm, compute θ_hat = A_a^{-1} b_a, predict xᵀ θ_hat.
- Compute confidence term sqrt(xᵀ A_a^{-1} x) scaled by exploration parameter.
- Select arm maximizing prediction + exploration term; observe reward and update A_a and b_a.
References and notes
- Lecture recording (lecturer unspecified in subtitles).
- “San’s book” referenced for detailed examples (exact text unclear).
- PSE 2012 article (revisiting Thompson Sampling and prior work).
- Papers on Thompson Sampling and linear contextual bandit analyses were recommended for deeper reading.
- Subtitles contained several misrecognitions (e.g., “HBN inequality” = Hoeffding inequality). Core algorithm names are: ε-greedy, UCB, Thompson Sampling, and contextual/linear bandits (LinUCB: disjoint and hybrid variants).
Category
Educational
Share this summary
Is the summary off?
If you think the summary is inaccurate, you can reprocess it with the latest model.
Preparing reprocess...