Summary of "[IAD, Spring 2025] Recommender Systems, 8" (original title: "[ИАД, весна 2025] Рекомендательные системы, 8")

Main topic

Evaluation of recommender systems with an emphasis on making offline evaluation closer to online evaluation — i.e., counterfactual / off‑policy evaluation — particularly in the contextual bandit setting.

Problem statement / formal setup

Evaluation modes (recap)

Main approaches for offline / counterfactual evaluation

1) Direct method (simulator / reward model)
- Idea: build a model that predicts the reward r given (x, a), then use it to simulate the rewards for the actions π_test would take.
- Pros:
  - Potentially low variance.
  - Can provide estimates for actions never seen in the log.
- Cons:
  - Hard to build an accurate model of user responses.
  - Biased if the reward model is incorrect.
  - If a very good simulator existed, it could itself replace the recommender.
- Practical note: industrial simulators have been tried but are often difficult to maintain.
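A minimal sketch of the direct method, assuming discrete contexts and actions and a deliberately simple reward model: the mean logged reward per (x, a) pair, with a global-mean fallback for pairs never seen in the log. The names `fit_reward_model` and `direct_method`, and the `pi_test(a, x)` calling convention, are hypothetical; a real system would plug in a learned model instead of the lookup table.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# A logged event: (context x, action a, observed reward r).
Event = Tuple[Hashable, Hashable, float]
# A policy maps (action a, context x) to the probability pi(a|x).
Policy = Callable[[Hashable, Hashable], float]
RewardModel = Callable[[Hashable, Hashable], float]


def fit_reward_model(log: Sequence[Event]) -> RewardModel:
    """Toy reward model r_hat(x, a): mean logged reward for each (x, a),
    falling back to the global mean reward for pairs never seen in the log."""
    sums: Dict[Tuple[Hashable, Hashable], float] = defaultdict(float)
    counts: Dict[Tuple[Hashable, Hashable], int] = defaultdict(int)
    for x, a, r in log:
        sums[(x, a)] += r
        counts[(x, a)] += 1
    global_mean = sum(r for _, _, r in log) / len(log)

    def r_hat(x: Hashable, a: Hashable) -> float:
        n = counts.get((x, a), 0)
        return sums[(x, a)] / n if n else global_mean

    return r_hat


def direct_method(log: Sequence[Event], actions: List[Hashable],
                  pi_test: Policy, r_hat: RewardModel) -> float:
    """DM estimate of pi_test's value: for every logged context, the expected
    model-predicted reward under pi_test. Logged rewards only serve to fit r_hat."""
    total = 0.0
    for x, _, _ in log:
        total += sum(pi_test(a, x) * r_hat(x, a) for a in actions)
    return total / len(log)


if __name__ == "__main__":
    log = [("u1", "A", 1.0), ("u1", "B", 0.0), ("u2", "A", 0.0), ("u2", "B", 1.0)]
    best = {"u1": "A", "u2": "B"}  # hypothetical deterministic test policy
    pi_test = lambda a, x: 1.0 if best[x] == a else 0.0
    print(direct_method(log, ["A", "B"], pi_test, fit_reward_model(log)))  # 1.0
```

The fallback for unseen pairs is where the direct method's bias enters: whatever the model guesses for an action absent from the log is taken at face value.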

2) Inverse propensity scoring (IPS), a.k.a. “reweighting”
- Idea: use the propensity scores π0(a|x), i.e. the probability that the logging policy chose action a in context x. Weight each logged reward by π_test(a|x) / π0(a|x) and average over the log. Intuitively, observed outcomes are reweighted according to how likely π_test would have been to choose the same actions.
- Pros:
  - Unbiased under standard assumptions: correct propensities and full support (π0(a|x) > 0 wherever π_test(a|x) > 0).
- Cons:
  - High variance when propensities are small or when π_test and π0 differ substantially.
  - Breaks down when π0(a|x) = 0 for actions that π_test might pick: such actions never appear in the log, so reweighting cannot recover their rewards.
- Variance-reduction techniques:
  - Clipping: truncate large importance ratios (or floor small denominators) with a hyperparameter λ; this reduces variance at the cost of bias and requires tuning.
  - Self-normalized IPS (SNIPS): normalize the weights so they sum to one (divide by the sum of the weights instead of the number of events); this reduces variance but introduces bias.
- Practical note: log the exploration parameters so that propensities can be reconstructed.
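IPS and its two variants can be sketched as follows, assuming each logged event stores the logging propensity p0 = π0(a|x) next to (x, a, r). In formula form, IPS is V̂ = (1/n) Σ_i r_i · π_test(a_i|x_i) / π0(a_i|x_i); clipping caps each weight at λ (the hypothetical `clip` argument here), and SNIPS divides by the sum of the weights instead of n. All function names are illustrative, not taken from a specific library.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

# A logged event: (context x, action a, reward r, logging propensity p0 = pi0(a|x)).
Event = Tuple[Hashable, Hashable, float, float]
# A policy maps (action a, context x) to the probability pi(a|x).
Policy = Callable[[Hashable, Hashable], float]


def importance_weights(log: Sequence[Event], pi_test: Policy,
                       clip: Optional[float] = None) -> List[float]:
    """w_i = pi_test(a_i|x_i) / pi0(a_i|x_i), optionally capped at `clip` (lambda)."""
    weights = []
    for x, a, _, p0 in log:
        if p0 <= 0:
            # A logged action must have had positive probability; p0 <= 0 means broken logs.
            raise ValueError(f"non-positive propensity for action {a!r} in context {x!r}")
        w = pi_test(a, x) / p0
        weights.append(min(w, clip) if clip is not None else w)
    return weights


def ips(log: Sequence[Event], pi_test: Policy, clip: Optional[float] = None) -> float:
    """(Clipped) IPS: importance-weighted rewards averaged over the n logged events."""
    weights = importance_weights(log, pi_test, clip)
    return sum(w * r for w, (_, _, r, _) in zip(weights, log)) / len(log)


def snips(log: Sequence[Event], pi_test: Policy) -> float:
    """Self-normalized IPS: divide by the sum of the weights instead of n."""
    weights = importance_weights(log, pi_test)
    total_w = sum(weights)
    if total_w == 0:
        raise ValueError("all importance weights are zero; SNIPS is undefined")
    return sum(w * r for w, (_, _, r, _) in zip(weights, log)) / total_w


if __name__ == "__main__":
    # The logging policy was uniform over two actions, so p0 = 0.5 everywhere.
    log = [("u1", "A", 1.0, 0.5), ("u1", "B", 0.0, 0.5),
           ("u2", "A", 0.0, 0.5), ("u2", "B", 1.0, 0.5)]
    best = {"u1": "A", "u2": "B"}  # hypothetical deterministic test policy
    pi_test = lambda a, x: 1.0 if best[x] == a else 0.0
    print(ips(log, pi_test))            # 1.0
    print(ips(log, pi_test, clip=1.5))  # 0.75: weights of 2.0 are capped at 1.5
    print(snips(log, pi_test))          # 1.0
```

On this toy log, clipping pulls the estimate from 1.0 down to 0.75, a small illustration of the bias that clipping trades for lower variance.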

3) Doubly robust (DR) estimators (a combination of the two)
- Idea: combine the direct method (reward model) with IPS by correcting the model's predictions with importance-weighted residuals (observed reward minus predicted reward) from the logged data.
- Benefits:
  - Often lower variance than IPS and more robust to reward-model misspecification than the direct method.
  - Unbiased if either the propensities or the reward model is correct, hence the name "doubly robust".
- Practical note: DR frequently gives the best empirical performance among counterfactual estimators.
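A sketch of the standard DR estimator, assuming the same event format as an IPS log (x, a, r, p0) and an arbitrary reward model r̂(x, a): V̂_DR = (1/n) Σ_i [ Σ_a π_test(a|x_i) r̂(x_i, a) + w_i (r_i − r̂(x_i, a_i)) ], with w_i = π_test(a_i|x_i) / π0(a_i|x_i). The first term is the direct-method prediction and the second is the importance-weighted residual correction. The function name and the toy data are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Event = Tuple[Hashable, Hashable, float, float]      # (x, a, r, p0 = pi0(a|x))
Policy = Callable[[Hashable, Hashable], float]       # pi(a, x) -> probability
RewardModel = Callable[[Hashable, Hashable], float]  # r_hat(x, a) -> predicted reward


def doubly_robust(log: Sequence[Event], actions: List[Hashable],
                  pi_test: Policy, r_hat: RewardModel) -> float:
    """DR estimate: model prediction under pi_test plus weighted residual correction."""
    total = 0.0
    for x, a, r, p0 in log:
        # Direct-method part: the model's expected reward under pi_test in context x.
        dm_term = sum(pi_test(b, x) * r_hat(x, b) for b in actions)
        # IPS-weighted correction of the model's error on the logged action.
        w = pi_test(a, x) / p0
        total += dm_term + w * (r - r_hat(x, a))
    return total / len(log)


if __name__ == "__main__":
    log = [("u1", "A", 1.0, 0.5), ("u1", "B", 0.0, 0.5),
           ("u2", "A", 0.0, 0.5), ("u2", "B", 1.0, 0.5)]
    best = {"u1": "A", "u2": "B"}  # hypothetical deterministic test policy
    pi_test = lambda a, x: 1.0 if best[x] == a else 0.0
    constant_model = lambda x, a: 0.5  # deliberately misspecified reward model
    print(doubly_robust(log, ["A", "B"], pi_test, constant_model))  # 1.0
```

On this toy log the constant model alone would predict 0.5 for π_test; the residual term corrects the estimate to 1.0, the same value plain IPS gives with these correct propensities.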

Practical recommendations and required logging

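To enable reliable counterfactual evaluation, log as much relevant information as possible, at minimum:
- the context x,
- the action a that was chosen,
- the observed reward r,
- the logging propensity π0(a|x), or the exploration parameters needed to reconstruct it.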

If these items are not logged, IPS-style methods become problematic or impossible; direct methods may still be attempted but will be limited by missing data.
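As an illustration of what such a log record could contain, here is a sketch assuming an ε-greedy logging policy (an assumption; the summary does not say which exploration scheme the lecture used). With K actions, ε-greedy picks the greedy action with probability 1 − ε + ε/K and every other action with probability ε/K, so storing both the propensity and ε keeps the IPS weights recoverable. `LoggedEvent`, `epsilon_greedy_probs`, `log_decision`, and `greedy_action` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Sequence


@dataclass(frozen=True)
class LoggedEvent:
    context: Hashable
    action: Hashable
    reward: float
    propensity: float  # pi0(action | context) at decision time
    epsilon: float     # exploration parameter in effect when the action was chosen


def epsilon_greedy_probs(greedy_action: Hashable, actions: Sequence[Hashable],
                         epsilon: float) -> Dict[Hashable, float]:
    """pi0(a|x) for epsilon-greedy: explore uniformly with probability epsilon."""
    k = len(actions)
    return {a: epsilon / k + (1.0 - epsilon if a == greedy_action else 0.0)
            for a in actions}


def log_decision(context: Hashable, actions: Sequence[Hashable],
                 greedy_action: Hashable, epsilon: float,
                 get_reward: Callable[[Hashable, Hashable], float],
                 rng: random.Random) -> LoggedEvent:
    """Sample an action, observe its reward, and record everything IPS needs."""
    probs = epsilon_greedy_probs(greedy_action, actions, epsilon)
    action = rng.choices(list(actions), weights=[probs[a] for a in actions], k=1)[0]
    reward = get_reward(context, action)
    return LoggedEvent(context, action, reward, probs[action], epsilon)


if __name__ == "__main__":
    event = log_decision("u1", ["A", "B"], greedy_action="A", epsilon=0.5,
                         get_reward=lambda x, a: 1.0 if a == "A" else 0.0,
                         rng=random.Random(0))
    print(event)
```

The propensity is computed from the same distribution the action is sampled from, which is what guarantees that the logged p0 matches the policy that actually acted.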

Empirical observations

Summary of pros and cons (high level)

Literature, tools and datasets mentioned

Course / context notes

Speakers and sources
