Summary of "[ИАД, весна 2025] Рекомендательные системы, 8" (in English: "[IAD, spring 2025] Recommender Systems, 8")

Main topic

Evaluation of recommender systems with an emphasis on making offline evaluation closer to online evaluation — i.e., counterfactual / off‑policy evaluation — particularly in the contextual bandit setting.

Problem statement / formal setup

Logged interaction data D is collected by a known (or partially known) logging policy π0. Each data point is a triplet (x, a, r): context/state x, action a, observed reward r.
Goal: estimate the expected reward (performance) of a new policy π_test using only the logged data — in other words, answer “what would happen if we deployed π_test?”
Fundamental challenge (missing counterfactuals): the log contains rewards only for actions taken by π0. Rewards for actions that π_test would take but π0 rarely or never took are unobserved, making reliable offline metrics difficult.
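
The setup above can be written compactly as follows. The symbols V and n, and the explicit expectation over contexts, are notation assumed here for illustration and are not taken from the lecture.

```latex
\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^{n}, \qquad a_i \sim \pi_0(\cdot \mid x_i),
\qquad
V(\pi_{\text{test}}) = \mathbb{E}_{x}\, \mathbb{E}_{a \sim \pi_{\text{test}}(\cdot \mid x)}\, \mathbb{E}\left[\, r \mid x, a \,\right].
```

The log contains samples of r only for actions drawn from π0, whereas V(π_test) requires them for actions drawn from π_test; this gap is the missing-counterfactuals problem.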

Evaluation modes (recap)

Offline evaluation: evaluate on historical logged data (cheap but biased/limited).
Online evaluation (A/B testing): run the new policy in production and measure live performance (unbiased but costly and risky).
User studies: solicit explicit feedback from a sample of users (useful for qualitative insights, limited scale).

Main approaches for offline / counterfactual evaluation

1) Direct method (simulator / reward model)
- Idea: build a model that predicts reward r given (x, a); then use the model to simulate rewards for the actions π_test would take.
- Pros: potentially low variance; can provide estimates for actions never seen in the log.
- Cons: hard to build an accurate model of user responses; biased if the reward model is incorrect; if a very good simulator exists, it might itself replace the need for a recommender.
- Practical note: industrial simulators have been tried but are often difficult to maintain.
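
A minimal sketch of the direct method, assuming a small discrete set of contexts and actions. The tabular reward model, the global-mean fallback for unseen pairs, and the names `fit_reward_model` and `direct_method` are all hypothetical choices made for illustration; a real system would likely use a learned regression model instead.

```python
from collections import defaultdict

def fit_reward_model(logs):
    """Tabular reward model: mean logged reward per (context, action) pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, a, r in logs:
        sums[(x, a)] += r
        counts[(x, a)] += 1
    table = {key: sums[key] / counts[key] for key in sums}
    # Assumption: unseen (context, action) pairs fall back to the global mean.
    global_mean = sum(r for _, _, r in logs) / len(logs)
    return lambda x, a: table.get((x, a), global_mean)

def direct_method(logs, pi_test, actions, r_hat):
    """Average, over logged contexts, of the model's expected reward under pi_test."""
    total = 0.0
    for x, _, _ in logs:
        total += sum(pi_test(a, x) * r_hat(x, a) for a in actions)
    return total / len(logs)

# Toy log of (context, action, reward) triplets.
logs = [("u1", "A", 1.0), ("u1", "B", 0.0), ("u2", "A", 0.0), ("u2", "B", 1.0)]
actions = ["A", "B"]

# Hypothetical deterministic test policy: always recommend "A".
pi_test = lambda a, x: 1.0 if a == "A" else 0.0

r_hat = fit_reward_model(logs)
dm_value = direct_method(logs, pi_test, actions, r_hat)
print(dm_value)
```

The estimate is only as good as `r_hat`: any systematic error in the reward model carries over directly into the estimated policy value.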

2) Inverse propensity scoring (IPS), or “reweighting”
- Idea: use the propensity scores π0(a|x), i.e. the probability that the logging policy chose action a given context x. For each logged event, weight the observed reward by π_test(a|x) / π0(a|x), then average over the log. The intuition is to reweight observed outcomes according to how likely π_test would have been to choose those same actions, relative to π0.
- Pros: unbiased under its assumptions (correct propensities, and support: π0(a|x) > 0 wherever π_test(a|x) > 0).
- Cons: high variance if propensities are small or π_test and π0 differ substantially; undefined when π0(a|x) = 0 for actions that π_test might pick.
- Variance-reduction techniques:
  - Clipping: truncate large importance ratios (or floor small denominators) via a hyperparameter λ. This reduces variance at the cost of bias, and λ requires tuning.
  - Self‑normalized IPS (SNIPS): divide by the sum of the importance weights instead of the number of samples, so that the normalized weights sum to one. This reduces variance but introduces some bias.
- Practical note: log the exploration parameters so that propensities can be reconstructed.
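
The three estimators above can be sketched in a few lines. This is an illustrative sketch, not the lecture's code: it assumes each logged event is stored as a tuple (x, a, r, p0), where p0 is the logged propensity π0(a|x), and the function names and toy data are hypothetical.

```python
def importance_weights(logs, pi_test):
    """w_i = pi_test(a_i | x_i) / pi0(a_i | x_i) for every logged event."""
    return [pi_test(a, x) / p0 for x, a, _, p0 in logs]

def ips(logs, pi_test):
    """Plain IPS: average of importance-weighted rewards."""
    w = importance_weights(logs, pi_test)
    return sum(wi * r for wi, (_, _, r, _) in zip(w, logs)) / len(logs)

def clipped_ips(logs, pi_test, lam):
    """Clipped IPS: importance weights are truncated at lam (bias for variance)."""
    w = [min(wi, lam) for wi in importance_weights(logs, pi_test)]
    return sum(wi * r for wi, (_, _, r, _) in zip(w, logs)) / len(logs)

def snips(logs, pi_test):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    w = importance_weights(logs, pi_test)
    total_weight = sum(w)
    if total_weight == 0:
        return float("nan")  # pi_test never agrees with any logged action
    return sum(wi * r for wi, (_, _, r, _) in zip(w, logs)) / total_weight

# Toy log of (context, action, reward, logged propensity) tuples.
logs = [
    ("u1", "A", 1.0, 0.5),
    ("u1", "B", 0.0, 0.5),
    ("u2", "A", 0.0, 0.8),
    ("u2", "B", 1.0, 0.2),
]

# Hypothetical deterministic test policy: always recommend "B".
pi_test = lambda a, x: 1.0 if a == "B" else 0.0

ips_value = ips(logs, pi_test)
clipped_value = clipped_ips(logs, pi_test, lam=3.0)
snips_value = snips(logs, pi_test)
print(ips_value, clipped_value, snips_value)
```

On this toy log the plain IPS estimate exceeds 1 even though every reward lies in [0, 1], because a single event with a small propensity receives a large weight. Clipping and self-normalization both pull the estimate back into a plausible range.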

3) Doubly robust (DR) estimators (combination)
- Idea: combine the direct method (reward model) with IPS by correcting the model’s predictions with importance-weighted residuals from the logged data.
- Benefits: often lower variance than IPS and more robust to model misspecification than the direct method; unbiased if either the logged propensities or the reward model is correct, which is where the name “doubly robust” comes from.
- Practical note: DR frequently gives the best empirical performance among counterfactual estimators.
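
A minimal sketch of the doubly robust estimator in its standard form: the direct-method term plus an importance-weighted residual. The log format (x, a, r, p0), the function name, and the deliberately crude constant reward model are assumptions made for illustration.

```python
def doubly_robust(logs, pi_test, actions, r_hat):
    """DR estimate: model prediction under pi_test + weighted residual correction."""
    total = 0.0
    for x, a, r, p0 in logs:
        # Direct-method term: expected model reward under the test policy.
        dm_term = sum(pi_test(b, x) * r_hat(x, b) for b in actions)
        # IPS-style correction using the residual on the logged action.
        weight = pi_test(a, x) / p0
        correction = weight * (r - r_hat(x, a))
        total += dm_term + correction
    return total / len(logs)

# Toy log of (context, action, reward, logged propensity) tuples.
logs = [
    ("u1", "A", 1.0, 0.5),
    ("u1", "B", 0.0, 0.5),
    ("u2", "A", 0.0, 0.8),
    ("u2", "B", 1.0, 0.2),
]
actions = ["A", "B"]

# Hypothetical deterministic test policy: always recommend "B".
pi_test = lambda a, x: 1.0 if a == "B" else 0.0

# Deliberately crude reward model (assumption): predicts 0.5 everywhere.
r_hat = lambda x, a: 0.5

dr_value = doubly_robust(logs, pi_test, actions, r_hat)
print(dr_value)
```

If `r_hat` were perfect, the residuals would average out to zero and the estimate would reduce to the direct method; if `r_hat` is poor but the propensities are correct, the correction term removes the model's bias in expectation.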

Practical recommendations and required logging

To enable reliable counterfactual evaluation, log as much relevant information as possible:

Context x, chosen action a, and observed reward r.
The candidate set shown to the user (to account for filtering, ranking constraints, multi‑step selection).
The logging policy’s propensity π0(a|x), or exploration parameters that allow propensities to be reconstructed.
Model weights/ordering or enough metadata to reproduce which items were eligible and with what probabilities.
If exploration/randomization was used, record the randomization distribution.

If these items are not logged, IPS-style methods become problematic or impossible; direct methods may still be attempted but will be limited by missing data.
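
As an illustration of the checklist above, here is a hypothetical log record and a helper that reconstructs a missing propensity from logged exploration parameters. The field names are invented for this sketch, and the reconstruction assumes an epsilon-greedy logging policy that explores uniformly over the candidate set, which the lecture does not specify.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogRecord:
    context: dict                        # features describing the request / user
    candidates: List[str]                # items that were eligible to be shown
    action: str                          # item actually shown
    reward: float                        # observed feedback (e.g. click = 1.0)
    propensity: Optional[float] = None   # pi0(action | context), if logged directly
    epsilon: Optional[float] = None      # exploration rate, if logged
    greedy_action: Optional[str] = None  # item the model would have picked

def epsilon_greedy_propensity(action, greedy_action, candidates, epsilon):
    """Probability of `action` under epsilon-greedy with uniform exploration."""
    explore = epsilon / len(candidates)
    exploit = (1.0 - epsilon) if action == greedy_action else 0.0
    return explore + exploit

def resolve_propensity(rec):
    """Use the logged propensity if present, else try to reconstruct it."""
    if rec.propensity is not None:
        return rec.propensity
    if rec.epsilon is not None and rec.greedy_action is not None:
        return epsilon_greedy_propensity(
            rec.action, rec.greedy_action, rec.candidates, rec.epsilon
        )
    return None  # not enough information: IPS-style estimators cannot use this record

rec = LogRecord(
    context={"user_segment": "new"},
    candidates=["i1", "i2", "i3", "i4"],
    action="i2",
    reward=1.0,
    epsilon=0.2,
    greedy_action="i1",
)
p = resolve_propensity(rec)
print(p)
```

A record for which `resolve_propensity` returns `None` can still feed a reward model, but not an IPS, SNIPS, or DR estimator.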

Empirical observations

MSE of estimators vs. number of samples:
- More data generally improves accuracy for most estimators.
- Direct method can plateau when model bias dominates; its error may stop decreasing with more data.
- DR and SNIPS are often among the lowest-error estimators in experiments, and their errors are frequently close to each other.
When logging and target policies differ a lot:
- Direct/simulator methods can help because they provide estimates for unseen items, but a poor simulator is risky.
- IPS suffers when π_test selects many actions that π0 rarely or never selected, due to high variance or undefined ratios.
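
The qualitative pattern in these observations can be reproduced on a toy problem. The simulation below is a hypothetical sketch, not the experiment shown in the lecture: it uses a context-free three-action bandit with made-up reward probabilities, a uniform logging policy, and a deliberately misspecified direct method that ignores the action and simply averages the logged rewards.

```python
import random

MEANS = [0.2, 0.5, 0.8]          # assumed true click probabilities per action
PI0 = [1 / 3, 1 / 3, 1 / 3]      # logging policy: uniform
PI_TEST = [0.1, 0.1, 0.8]        # test policy: mostly picks the best action
TRUE_VALUE = sum(p * m for p, m in zip(PI_TEST, MEANS))

def simulate_log(n, rng):
    """Draw n (action, reward) pairs under the logging policy."""
    logs = []
    for _ in range(n):
        a = rng.choices(range(len(MEANS)), weights=PI0)[0]
        r = 1.0 if rng.random() < MEANS[a] else 0.0
        logs.append((a, r))
    return logs

def estimates(logs):
    """IPS, SNIPS and a misspecified (action-blind) direct method."""
    n = len(logs)
    w = [PI_TEST[a] / PI0[a] for a, _ in logs]
    weighted_sum = sum(wi * r for wi, (_, r) in zip(w, logs))
    return {
        "ips": weighted_sum / n,
        "snips": weighted_sum / sum(w),
        "pooled_dm": sum(r for _, r in logs) / n,  # ignores the action entirely
    }

def mse_table(sizes=(100, 1000), trials=200, seed=0):
    """Mean squared error of each estimator for each log size."""
    rng = random.Random(seed)
    table = {}
    for n in sizes:
        squared = {"ips": 0.0, "snips": 0.0, "pooled_dm": 0.0}
        for _ in range(trials):
            for name, value in estimates(simulate_log(n, rng)).items():
                squared[name] += (value - TRUE_VALUE) ** 2
        table[n] = {name: s / trials for name, s in squared.items()}
    return table

mse = mse_table()
for n, row in mse.items():
    print(n, row)
```

In this setup the IPS and SNIPS errors shrink as the log grows, while the action-blind direct method converges to the value of the logging policy rather than the test policy, so its error stops improving.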

Summary of pros and cons (high level)

A/B test (online)
- Pros: unbiased, direct measurement.
- Cons: expensive, risks harming user experience.
Direct (simulator / reward model)
- Pros: low variance if accurate; can estimate unseen actions.
- Cons: hard to model accurately; biased if wrong; may duplicate recommender functionality.
IPS (and variants like SNIPS)
- Pros: unbiased when assumptions hold; principled reweighting.
- Cons: needs logged propensities; can have very high variance when policies differ or propensities are tiny/zero.
Doubly robust
- Pros: combines strengths of direct and IPS; often reduces variance and increases robustness.
- Cons: depends on both a reasonable reward model and good propensity estimates (though more forgiving than direct alone).

Literature, tools and datasets mentioned

General counterfactual / off‑policy policy evaluation literature and tutorials (two tutorials were referenced in the lecture).
Example graphs cited from a source transcribed as “Ereksis” (possibly RecSys; the name is approximate from the transcript).
Off‑policy libraries/tools (approximate name from the transcript: “offp webbit”).
Open Bandit Dataset (transcribed as “openbandit.set”): logged data collected under multiple policies (Random and Bernoulli Thompson Sampling), useful for evaluating and comparing off‑policy estimators.

Course / context notes

Earlier course material covered standard logged-data (first‑party) recommendation paradigms (sites with signed-in users and full logs).
The most recent lectures focused on the right‑side scenario: recommendations through third‑party sites, ads, partial/noisy user identity, and privacy constraints.
Administrative reminders: homework and technical specification deadlines; slides include links to resources and GitHub.

Speakers and sources

Lecturer (primary speaker / course instructor) — delivered the lecture and explained methods.
Students / audience — asked brief logistical questions during Q&A.
Referenced external sources/tools:
- “Ereksis” (source of some graphs).
- Off‑policy web/library referenced as “offp webbit” (name approximate from transcript).
- Open Bandit Dataset (transcribed as “openbandit.set”).
- General counterfactual / off‑policy evaluation literature and two tutorials.