Summary of "Обзорная консультация по курсу" ("Overview consultation for the course")

Overview

This was an hour-and-a-half consultation by Alexey (course instructor / industry researcher) giving a compact tour of recommender-system ideas and answering students’ questions. He explained classical and modern approaches, practical training/inference pipelines, evaluation issues, and production considerations. The session was interactive with multiple participant questions (including Dmitry and Anya).

Main ideas, concepts and lessons

1. Problem statement (recsys fundamentals)

Typical input: many users, a large catalog of items, and a sparse user×item interaction matrix (ratings or implicit feedback).
Goal: predict missing entries (ratings or likelihoods) to recommend the most relevant items.

2. Nearest-neighbor methods (user-KNN and item-KNN)

Intuition: predict a user’s rating by looking at K most similar users (or similar items).
Similarity measures: standard vector similarities (cosine similarity commonly used).
Practical points:
- Use K > 1 (K can be large — e.g., dozens to hundreds) to reduce variance.
- You can weight all neighbors by similarity instead of using a strict K cutoff.
- Item-KNN is analogous to user-KNN (swap users ↔ items).

3. Matrix factorization and latent-factor models

Core idea: represent users and items as low-dimensional vectors (latent factors) and reconstruct ratings via dot product.
Two principal approaches:
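- Linear-algebra SVD / truncated SVD: keeping the top-k singular vectors (e.g., k = 128) gives the best rank-k approximation in the Frobenius norm. This assumes a fully specified matrix, so missing entries must first be filled in (for example with zeros).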
- Optimization-based MF (e.g., FunkSVD, ALS):
  - Minimize reconstruction error + regularization.
  - Use gradient-based methods or alternating least squares (ALS): with factor matrices P and Q, fix P and solve for Q, then fix Q and solve for P, repeating until convergence.
  - Regularization is required to avoid overfitting.
Implicit-feedback variants:
- Re-weight implicit interactions and/or sample negatives.
- ALS with confidence reweighting of implicit interactions is commonly used, for example via the implicit library.
Practical notes: negative sampling strategy and handling missing entries are important for quality.
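
As a rough illustration of the linear-algebra route, the sketch below applies a truncated SVD to a tiny dense matrix with NumPy. The toy matrix, the choice k = 2, the names P and Q for user and item factors, and the treatment of missing entries as zeros are all assumptions made for the example, not details from the consultation.

```python
import numpy as np

# Toy user x item matrix; zeros stand in for missing entries (an assumption).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent factors to keep

# Thin SVD: R = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep the top-k factors: user vectors P (n_users x k), item vectors Q (n_items x k).
P = U[:, :k] * s[:k]
Q = Vt[:k, :].T

# Rank-k reconstruction: every score is a dot product of a user and an item vector.
R_hat = P @ Q.T

# Frobenius error of the rank-k approximation equals the norm of the dropped singular values.
err = np.linalg.norm(R - R_hat)
dropped = np.sqrt(np.sum(s[k:] ** 2))
```

Here `err` and `dropped` agree, which is the sense in which the truncated SVD is optimal. On real, sparse data the zero-filling assumption is the weak point, which is one reason optimization-based factorization over observed entries is often preferred.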

4. Autoencoders and modern deep models

Autoencoders can be applied to recommendation but are less dominant than other neural approaches.
Sequence models and transformers:
- RNN-based sequential recommenders (GRU-based models such as GRU4Rec) predict the next item from a user’s interaction sequence.
- SASRec (self-attention sequential model) and BERT-like masked models (BERT4Rec) are strong performers for session/sequence recommendation.
Training objectives:
- Next-item prediction (shifted sequence): predict item t given history up to t−1.
- Masked modeling: mask some items and predict them (BERT-style).
Practicalities:
- Inputs: trainable item embeddings plus positional embeddings. SASRec often has no explicit per-user embedding; user context is captured by the sequence representation.
- Scoring over a full catalog is expensive — use negative sampling and in-batch negatives to reduce computation.
- Performance depends on careful sampling, batching (session-parallel mini-batches), and context length limits (e.g., 128).
Empirical caution: around 2018–2019 many complex models failed to beat strong baselines; transformer/attention-based sequential models later showed clear improvements in many domains.

5. Negative sampling and training losses

Negative sampling (how negatives are chosen) is crucial for neural recommenders: full softmax over millions of items is costly.
Techniques: in-batch negatives, pre-sampled negatives, hard negatives.
In-batch negatives treat other positives within a batch as negatives for a given example, reducing sampling cost.
Contrastive and sigmoid-based losses are used; effectiveness depends on domain and availability of explicit negative feedback.
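
The in-batch idea can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the function name `in_batch_softmax_loss`, the softmax cross-entropy form of the loss, and the toy embeddings are hypothetical and were not specified in the consultation.

```python
import numpy as np

def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean cross-entropy where row i's positive is item i and the
    other items in the same batch serve as its negatives."""
    logits = user_vecs @ item_vecs.T                      # (B, B): every user vs. every batch item
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the softmax numerically
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                   # positives sit on the diagonal

items = np.eye(4)                                          # four orthogonal toy item embeddings
aligned = in_batch_softmax_loss(10.0 * items, items)       # each user vector points at its own item
uninformed = in_batch_softmax_loss(np.zeros((4, 4)), items)  # all scores equal, loss is log(4)
```

No extra sampling step is needed because the negatives come for free from the batch; the trade-off is that popular items show up as negatives more often than rare ones.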

6. Two-stage (two-level) systems: candidate generation + ranking

Common production pattern:
- Stage 1 — Candidate generation: fast models produce many candidates from the catalog (SASRec, matrix factorization, item-KNN, embeddings, heuristics).
- Stage 2 — Ranking/re-ranking: a slower, feature-rich model (logistic regression, gradient-boosted trees) re-ranks candidates and outputs the final top-K.
Combining multiple candidate sources:
- Union candidate sets from several generators (intersection alone is often insufficient).
- Rank aggregation: inverse-rank averaging, weighted averaging, or learning a combiner (logistic regression / boosting).
Recommended training pipeline:
- Time-based splits into consecutive parts. Train candidate generators on earlier periods.
- Generate candidates on a subsequent period to form the second-stage training data.
- Construct a binary target for each (user, candidate) pair: did the user actually interact with that candidate in the later period held out for labelling?
- Train a combiner (logistic regression or boosting) on features like inverse ranks, model scores, and user/item/context features.
Practical tips:
- Use many candidates (hundreds) — 10 is usually too few.
- Add contextual and content features at the second stage for better personalization.
- Retrain periodically (daily/weekly) depending on data drift and business needs.

7. Ensembling / combining models

Combine multiple good models via learned weighting (logistic regression or boosting on ranks/scores), not by naive intersection.
Rank aggregation ideas: inverse-rank averages, learned weights, or a second-level model using features from each candidate source.
Be mindful of ranking ties and differing score scales — a learned combiner helps handle scale differences.
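
A minimal sketch of inverse-rank aggregation, assuming each candidate source returns a best-first list of item ids. The function name `inverse_rank_fusion`, the optional per-source weights, and the toy lists are hypothetical. The code sums 1/rank, which orders items the same way as averaging over a fixed number of sources.

```python
from collections import defaultdict

def inverse_rank_fusion(ranked_lists, weights=None):
    """Combine ranked candidate lists by a weighted sum of 1/rank.
    An item missing from a list contributes nothing for that list."""
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for w, ranking in zip(weights, ranked_lists):
        for rank, item in enumerate(ranking, start=1):
            scores[item] += w / rank
    # Sort by fused score (descending); break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

knn_candidates = ["a", "b", "c"]
mf_candidates = ["b", "d", "a"]
fused = inverse_rank_fusion([knn_candidates, mf_candidates])
```

Because only ranks are used, the sources' raw score scales never need to be reconciled. A learned combiner replaces the fixed weights with coefficients fitted on labelled data.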

8. Evaluation and offline considerations

Offline evaluation is tricky because logged feedback comes from previous policies; data is biased by prior recommendations.
For bandit/ad-serving contexts, use inverse propensity scoring (IPS) and variants (with clipping/truncation to control variance) before online A/B tests.
Metric implementation differences across libraries can cause discrepancies; prevent time leakage and use consistent metric definitions.
Define the evaluation protocol carefully for hit@k (hit rate) metrics: which items count as candidates and how k is counted.
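
To make the protocol question concrete, here is one possible definition of hit@k. It is an assumption for illustration, not the definition used in the course: a user counts as a hit if any of the top-k recommendations is in that user's held-out set, and users with no held-out items are skipped. Other libraries may make different choices, which is exactly where discrepancies come from.

```python
def hit_at_k(recommended, relevant, k):
    """1.0 if any of the top-k recommended items is in the relevant set, else 0.0."""
    return float(any(item in relevant for item in recommended[:k]))

def mean_hit_at_k(recs_by_user, truth_by_user, k):
    """Average hit@k over users that have at least one held-out item."""
    users = [u for u, items in truth_by_user.items() if items]
    if not users:
        return 0.0
    hits = sum(hit_at_k(recs_by_user.get(u, []), truth_by_user[u], k) for u in users)
    return hits / len(users)

recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"], "u3": ["g", "h", "i"]}
truth = {"u1": {"c"}, "u2": {"x"}, "u3": {"g"}, "u4": set()}  # u4 has nothing held out
hit1 = mean_hit_at_k(recs, truth, k=1)
hit3 = mean_hit_at_k(recs, truth, k=3)
```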

9. Multi-armed bandits and exploration

Use-case: small action sets (10–100) where exploration vs exploitation is critical (ads, promotions).
Algorithms:
- Epsilon-greedy: explore randomly with fixed probability.
- UCB (upper confidence bound): add an exploration bonus that shrinks as an arm is pulled more often (in UCB1 the bonus is proportional to sqrt(ln t / n), where n is the arm's pull count and t is the total number of pulls).
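
The two algorithms can be sketched as follows. This is an illustrative sketch: the UCB1 variant with bonus sqrt(2 ln t / n), the helper names, and the simulated click rates are assumptions, not details given in the consultation.

```python
import math
import random

def epsilon_greedy(counts, values, epsilon, rng=random):
    """Pick a random arm with probability epsilon, else the best mean so far."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: values[a])

def ucb1(counts, values):
    """Pick the arm maximising mean reward + sqrt(2 ln t / n_a)."""
    for arm, n in enumerate(counts):
        if n == 0:                      # play every arm once before using the bonus
            return arm
    t = sum(counts)
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward of the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

# Simulated Bernoulli arms with made-up click rates (illustrative only).
rng = random.Random(0)
true_rates = [0.1, 0.5, 0.9]
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for _ in range(2000):
    arm = ucb1(counts, values)
    update(counts, values, arm, 1.0 if rng.random() < true_rates[arm] else 0.0)
```

Swapping `ucb1(counts, values)` for `epsilon_greedy(counts, values, 0.1, rng)` in the loop gives the fixed-probability explorer with the same bookkeeping.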
Offline evaluation of bandit policies requires careful IPS-style reweighting.
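
A minimal sketch of a clipped IPS estimator, assuming the log stores the probability with which the logging policy chose each action. The tuple layout, the function name `clipped_ips`, and the toy log are hypothetical.

```python
def clipped_ips(logs, target_prob, clip=10.0):
    """Estimate a target policy's mean reward from logged bandit data.

    logs: iterable of (context, action, reward, logging_prob) tuples, where
          logging_prob is the probability the logging policy gave to `action`.
    target_prob: function (context, action) -> probability under the new policy.
    clip: cap on the importance weight; lowers variance at the cost of some bias.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_prob in logs:
        weight = min(target_prob(context, action) / logging_prob, clip)
        total += weight * reward
        n += 1
    return total / n

# Toy log: a uniform logging policy over two actions (made-up data).
logs = [
    ("c", "a", 1.0, 0.5),
    ("c", "b", 0.0, 0.5),
    ("c", "a", 1.0, 0.5),
    ("c", "b", 0.0, 0.5),
]

def always_a(context, action):
    """Target policy that always plays action 'a'."""
    return 1.0 if action == "a" else 0.0

estimate = clipped_ips(logs, always_a)
clipped_estimate = clipped_ips(logs, always_a, clip=1.0)
```

With the default cap the estimate recovers the reward of always playing "a". With `clip=1.0` the weights are truncated and the estimate is biased downward, which shows the bias/variance trade-off that clipping introduces.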

10. Simulation and generative ideas

Simulator-based evaluation (train simulators of user behavior) exists but has pitfalls and is not universally used in industry.
Speculative discussion on diffusion/generative models to produce item vectors or simulate interactions: research exists but it’s not an established practical win. Challenges include vector-space alignment and realism of simulated behavior.

Practical recommendations and “rules of thumb”

Start simple: KNN and matrix factorization are good baselines and provide insight into the data.
Use a two-stage architecture in practice: candidate generation (fast, recall-focused) + rich re-ranker (precision, feature-rich).
Negative sampling strategy and in-batch negatives are essential for scalable neural training.
Use session-parallel batching for variable-length sequential data; limit context length (e.g., 128) and consider truncation/stacking.
Evaluate offline carefully: avoid leakage, consider propensity reweighting for logged-policy bias, and confirm offline wins with online experiments.
Retrain models periodically; frequency depends on dataset dynamics (daily/weekly).
Leverage domain-specific features in the second-stage model for better personalization.
Be pragmatic about complexity: large ensembles (e.g., Netflix Prize-style) can be hard to operate in production — simpler methods often succeed.

Methodologies / instruction lists

A. Implementing item- or user-KNN

Represent users (rows) or items (columns) as vectors of observed ratings/interactions.
Choose a similarity metric (cosine similarity is a common choice).
For each target user-item:
- Find K nearest neighbors (users for user-KNN or items for item-KNN).
- Aggregate neighbors’ ratings: average or weighted average (weights = similarity).
- Optionally use all neighbors weighted by similarity instead of only the K nearest.
Tune K and weighting; validate with cross-validation.

B. Training a matrix-factorization model (implicit/explicit)

Prepare training data: observed ratings (explicit) or implicit interactions; decide how to treat missing entries.
Choose approach:
- SVD/truncated SVD for linear-algebra approximation.
- Optimization-based MF (FunkSVD/ALS) with regularization.
If using ALS:
- Initialize latent matrices P, Q.
- Iteratively fix P and solve for Q (least squares), then fix Q and solve for P.
- Include regularization in the objective.
For implicit feedback:
- Re-weight positive interactions, sample negatives, or use implicit-ALS variants (libraries available).
Evaluate on held-out data and use time-based splits where appropriate.
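
The ALS loop for explicit ratings can be sketched as below. The assumptions are labelled: P holds user factors and Q item factors, the objective is squared error on observed entries plus an L2 penalty `reg` on both factor matrices, and the function name, hyperparameters, and toy data are illustrative. For implicit feedback a dedicated implementation is a better starting point than this sketch.

```python
import numpy as np

def als_explicit(R, mask, k=2, reg=0.1, n_iters=20, seed=0):
    """Factorise R ~ P @ Q.T using only observed entries (mask == True)."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))   # user factors (assumed naming)
    Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors (assumed naming)
    ridge = reg * np.eye(k)
    for _ in range(n_iters):
        # Fix P and solve a small ridge regression for every item vector.
        for i in range(n_items):
            Pi = P[mask[:, i]]                      # factors of users who rated item i
            Q[i] = np.linalg.solve(Pi.T @ Pi + ridge, Pi.T @ R[mask[:, i], i])
        # Fix Q and solve for every user vector.
        for u in range(n_users):
            Qu = Q[mask[u]]                         # factors of items rated by user u
            P[u] = np.linalg.solve(Qu.T @ Qu + ridge, Qu.T @ R[u, mask[u]])
    return P, Q

# Toy rank-1 ratings with two entries hidden from training (made-up data).
R = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 1.0])
mask = np.ones_like(R, dtype=bool)
mask[0, 2] = False
mask[2, 0] = False

P, Q = als_explicit(R, mask)
rmse_observed = float(np.sqrt(np.mean((R - P @ Q.T)[mask] ** 2)))
```

Each inner solve exactly minimizes the regularized objective for one row with everything else fixed, so the objective never increases from one sweep to the next. The `reg` term also keeps the k x k systems solvable when a user or item has very few observations.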

C. Building a two-stage recommender (candidate generation + ranking)

Partition data by time into consecutive periods:
- Period A: train candidate generators.
- Period B: generate candidates to create second-stage training data.
- Period C: validation for the second-stage model.
- Period D: final test.
Candidate generation:
- Use fast models (SASRec, item-KNN, matrix factorization, heuristics) to produce many candidates per user (e.g., ~1,000).
Create labeled training set for second stage:
- For each user, record the candidates produced by the Period A generators and whether the user actually interacted with each one in Period B (binary hit target), so that Period C stays untouched for validation.
Feature engineering for second stage:
- Include inverse rank / score from each candidate source, item/user features, contextual features, and historical statistics.
Train model:
- Option A: logistic regression for interpretability.
- Option B: boosting (XGBoost / LightGBM) for higher performance and richer feature handling.
Validate on Period C, run the final test on Period D, then deploy: generate candidates and use the second-stage model to re-rank.
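
The labelled second-stage table can be assembled roughly as follows. This is a sketch under assumptions: candidate lists arrive as best-first lists per source, the only features shown are inverse ranks, and the function name and data layout are hypothetical. Model scores and user, item, and context features would be appended to the same rows.

```python
def build_second_stage_rows(candidates_by_source, positives_by_user):
    """Turn per-source ranked candidate lists into labelled (user, item) rows.

    candidates_by_source: {source_name: {user: [item, ...] ranked best-first}}
    positives_by_user:    {user: set of items the user interacted with in the
                           labelling period}
    Each row carries one inverse-rank feature per source (0.0 if that source
    did not propose the item) and a binary target.
    """
    sources = sorted(candidates_by_source)
    users = set().union(*(per_user.keys() for per_user in candidates_by_source.values()))
    rows = []
    for user in sorted(users):
        inv_rank = {}  # item -> {source: 1 / rank}
        for source in sources:
            ranked = candidates_by_source[source].get(user, [])
            for rank, item in enumerate(ranked, start=1):
                inv_rank.setdefault(item, {})[source] = 1.0 / rank
        for item in sorted(inv_rank):  # union of candidates across sources
            features = [inv_rank[item].get(s, 0.0) for s in sources]
            target = int(item in positives_by_user.get(user, set()))
            rows.append((user, item, features, target))
    return sources, rows

candidates = {
    "item_knn": {"u1": ["a", "b"]},
    "mf": {"u1": ["b", "c"]},
}
positives = {"u1": {"b"}}
sources, rows = build_second_stage_rows(candidates, positives)
```

The resulting feature vectors and targets can then be passed to a logistic regression or a gradient-boosting library. Keeping the labelling period strictly later than the generators' training data is what prevents time leakage.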

D. Training sequential / transformer recommenders

Build sequences of item interactions per session/user.
Choose objective:
- Next-item prediction (shifted sequence).
- Masked modeling (BERT4Rec-style).
Inputs:
- Trainable item embeddings + positional embeddings.
Negative sampling:
- Use in-batch negatives or sampled negatives to avoid full softmax over the catalog.
Batching:
- Use session-parallel mini-batches or pad/truncate to fixed context length.
Output scoring:
- For large catalogs, use approximations for the output layer (candidate-softmax, sampled softmax, in-batch negatives).

Speakers and sources featured

Alexey — main presenter / course instructor (works in industry/research on recommender systems).
Dmitry (“Dim”) — participant.
Anya — participant (asked about negative sampling).
Multiple unnamed students / participants — asked questions throughout the consultation.

Other models and references mentioned during the discussion:

SASRec, GRU4Rec-like RNNs, BERT4Rec / BERT-like masked models, FunkSVD/ALS, implicit library, Netflix Prize; references to researchers/groups such as “Dina” and “Google” were mentioned in passing.
