Summary of “Is ML Alive in 2026? A Statistical Study” («Жив ли ML в 2026? Статистическое исследование»)
What this video is about
A single presenter (author of the free learning platform nareka.ru) performed a large statistical study of the Russian machine‑learning job market (snapshot: March 2026). Goal: measure who the typical ML candidate is, how employers behave, how visible candidates are on the main job platform, domain dynamics (who moves where), and what predicts vacancy response volumes.
Data, scale and core numbers
- Final “competitive core” after aggressive cleaning:
- ~944 candidate resumes
- 654 vacancies
- 380 employers
- Initial raw harvest was much larger; cleaning removed ~97% of initial noisy/irrelevant items.
- Data combined three sides: resumes, vacancies, and platform scoring outputs.
Methodology
Data collection
- Queried multiple endpoints (couldn’t rely on a single source).
- Used domain keywords (ML Research, Data Scientist, ML Engineer, etc.), then cleaned results.
Cleaning / filtering pipeline
- Five manual/automated passes removed irrelevant roles (backend, full‑stack, pure analysts, unrelated engineers), spam accounts and “zombie” vacancies.
- Platform frontend scoring used as a first filter; a threshold of 0.25 was adopted to drop ~90% of garbage while keeping plausible candidates.
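The first-pass cut described above can be sketched as a simple threshold filter. The data rows and field names here are made up for illustration; the platform's real scoring output is not public, only the 0.25 cutoff is from the talk:

```python
# Sketch of the first-pass filter: drop items whose platform frontend
# score falls below the adopted cutoff (hypothetical data and fields).
THRESHOLD = 0.25  # cutoff from the study, said to drop ~90% of garbage

candidates = [
    {"id": 1, "frontend_score": 0.81},
    {"id": 2, "frontend_score": 0.12},  # likely noise, dropped
    {"id": 3, "frontend_score": 0.30},
    {"id": 4, "frontend_score": 0.05},  # likely noise, dropped
]

def first_pass_filter(items, threshold=THRESHOLD):
    """Keep only items whose frontend score clears the cutoff."""
    return [c for c in items if c["frontend_score"] >= threshold]

kept = first_pass_filter(candidates)
print([c["id"] for c in kept])  # → [1, 3]
```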
Multi‑scoring capture
- Collected three platform scoring outputs per candidate‑vacancy:
- recommendation neural network
- textual relevance / search logit
- front‑end packaging / visibility score
- Evaluated ~27,000 candidate–vacancy pairs.
- Intersection of top‑100 by all three systems = 168 candidates (the “core elite”).
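The three-way intersection described above amounts to ranking candidates under each scoring system and intersecting the top lists. A minimal sketch with toy scores (all names and numbers invented):

```python
# Sketch of the top-list intersection: rank candidates per scoring
# system, take the top k of each, intersect the sets.
def top_ids(scores, k):
    """IDs of the k highest-scored candidates for one scoring system."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Toy scores for 6 candidates under the three systems (made-up numbers).
reco  = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.2, "e": 0.1, "f": 0.3}
text  = {"a": 0.6, "b": 0.9, "c": 0.1, "d": 0.8, "e": 0.2, "f": 0.3}
front = {"a": 0.7, "b": 0.6, "c": 0.9, "d": 0.1, "e": 0.8, "f": 0.2}

core = top_ids(reco, 3) & top_ids(text, 3) & top_ids(front, 3)
print(core)  # → {'a'}
```

In the study the same operation ran over top‑100 lists per system, yielding the 168-candidate core.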
Text extraction & structuring
- Parsed job titles, free‑text descriptions and resume duty descriptions into structured JSON.
- Used an LLM (Gemini 3 Fast mentioned) with prompt engineering and structured‑output validation; automatic retry when schema mismatches occurred.
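The validate-and-retry loop can be sketched as below. The schema fields and the `call_llm` stand-in are assumptions for illustration; the study used Gemini 3 Fast, whose API is not shown here:

```python
# Sketch of structured-output extraction with automatic retry on
# schema mismatch. `call_llm` is a stub standing in for the real model.
import json

REQUIRED_FIELDS = {"title", "duties", "skills"}  # assumed schema

def validate(payload: dict) -> bool:
    return REQUIRED_FIELDS.issubset(payload)

def extract_with_retry(call_llm, text: str, max_retries: int = 3) -> dict:
    """Ask the model for JSON; re-ask on parse failure or schema mismatch."""
    last_error = None
    for _ in range(max_retries):
        raw = call_llm(text)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if validate(payload):
            return payload
        last_error = ValueError("schema mismatch")
    raise RuntimeError(f"extraction failed after {max_retries} tries: {last_error}")

# Stub model: fails once with non-JSON, then returns a valid payload.
attempts = iter(['not json',
                 '{"title": "ML Engineer", "duties": [], "skills": ["python"]}'])
result = extract_with_retry(lambda _: next(attempts), "raw vacancy text")
print(result["title"])  # → ML Engineer
```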
Statistical validation
- Applied sampling formulas (Cochran formula referenced) to decide how many LLM outputs to manually check.
- Manual checks on random samples (sample sizes ~200 at 95% confidence) produced acceptable extraction accuracy.
- Bootstrap used to build confidence intervals for medians and other statistics.
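The two validation tools named above can be sketched together. The margin of error of 7% is an assumption chosen so Cochran's formula lands near the ~200 manual checks mentioned in the talk; the age data below is synthetic:

```python
# Cochran's sample-size formula and a percentile bootstrap for a
# median's confidence interval (toy data, stdlib only).
import random
import statistics

def cochran_n(z: float = 1.96, p: float = 0.5, e: float = 0.07) -> int:
    """Cochran's n0 = z^2 * p * (1 - p) / e^2 (infinite-population form)."""
    return round(z * z * p * (1 - p) / (e * e))

def bootstrap_median_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the median."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(data, k=len(data))) for _ in range(n_boot)
    )
    return medians[int(n_boot * alpha / 2)], medians[int(n_boot * (1 - alpha / 2)) - 1]

print(cochran_n())  # → 196, i.e. roughly the ~200 manual checks
rng = random.Random(1)
ages = [rng.gauss(26, 4) for _ in range(300)]  # synthetic age sample
lo, hi = bootstrap_median_ci(ages)
print(lo, hi)  # CI around the sample median, near 26
```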
Data engineering / reproducibility
- Strict contracts and field validation at each pipeline stage.
- Produced a single master candidate file and a master vacancy file.
- Analysis automated in Python (large codebase, ~13k lines); used statsmodels, scikit‑learn, XGBoost, etc.
Analytical models
- Regression models used to explain “responses per vacancy” (dependent variable). Key predictors identified in results.
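The study ran this kind of regression in statsmodels on the full dataset; the stdlib-only toy below solves the OLS normal equations directly on made-up vacancy rows (`responses ~ intercept + declared_salary + remote`) just to show the model's shape:

```python
# Toy OLS for "responses per vacancy": solve (X^T X) beta = X^T y
# by Gaussian elimination. All numbers are invented for illustration.
def ols(X, y):
    """Least squares for tiny problems via the normal equations."""
    k = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(k)]
         for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(k)]
    for i in range(k):                       # forward elimination, partial pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        beta[i] = (b[i] - sum(A[i][c] * beta[c] for c in range(i + 1, k))) / A[i][i]
    return beta

# Columns: intercept, declared_salary (0/1), remote (0/1); y = responses.
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
y = [420, 310, 290, 160, 450, 180]
beta = ols(X, y)
print(beta)  # both effect coefficients come out positive on this toy data
```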
Main findings — demographics and careers
Age
- Median age overall ≈ 26 years (bootstrap interval ~26–27).
- NLP / LLM subdomain median ≈ 24 (very young).
- Large cohort concentrated at ages 21–25 — a “boom” cohort that entered ML around 2021–2023.
- About 30% of candidates hid their age; after imputation, roughly 70% of ages were known (disclosed or restored).
Geography and gender
- Moscow dominant (~57% of vacancies and candidates); St. Petersburg second.
- Women underrepresented (~10–12%).
- Some candidates relocated to Kazakhstan, Belarus, Uzbekistan, Georgia and compete remotely.
Education
- ~80% have higher education; ~34% have a master’s; ~8% have a PhD / candidate of sciences.
- Top universities (HSE, MIPT, MSU and others) supply a noticeable share.
- Majority graduated in STEM/engineering/applied math fields.
Experience
- Median total professional experience ≈ 4.3 years; median ML‑specific experience ≈ 3.3 years.
- About 30% are “re‑rollers” (moved into ML from backend/devops/analytics/research).
- 50% switched between ML subdomains during their careers.
Job market structure and domains
Domain split and dominance
- NLP / LLM: dominant subdomain (roughly half of vacancies), high retention and strong inbound mobility.
- Classic ML (tabular), computer vision (CV), recommender systems and MLOps are distributed unevenly.
- MLOps is relatively scarce (vacancies > qualified candidates), representing an opportunity.
- Sber is a major employer (~15–20% share; ~83 vacancies in the sample); Ozon is next but much smaller.
Vacancy profile
- Many vacancies are senior‑oriented: >50% request senior-level candidates; mid positions are relatively rare (~17%).
- Typical tasks: model training, data pipelines, architecture, monitoring, deployment. Data labeling appears in a minority (~9%).
- Remote work is common (≈43%).
Responses & competition
- Median responses per vacancy ≈ 212 (a few hundred applicants per posting).
- Strong predictors of high response counts: declared salary and remote format. When controlled for, domain differences are small.
- Top vacancies by responses tend to be broad / easy-to-apply postings and attract many unqualified applicants.
Salaries
- Salary disclosure is rare: ≈17% of candidates disclosed desired salary; employers also often omit ranges.
- Public vacancy ranges appear lower than actual closed offers (understated by ~100–200k RUB in some comparisons).
- Approximate medians (from limited data): vacancy postings ~220–320k RUB; candidate expectation median ~270k RUB. Interpret with caution (small sample).
Skills, resume writing and scoring
Skills
- 20,000 raw skill mentions reduced to ~109 normalized skills covering ~80% of mentions.
- Important skills: mathematical statistics, probability, Python, SQL, Linux, ML libraries and domain tools (with domain variance).
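The normalization step (20,000 raw mentions collapsed to ~109 canonical skills) boils down to an alias map plus counting. The aliases below are illustrative guesses, not the study's actual mapping:

```python
# Sketch of skill normalization: map raw free-text mentions to
# canonical skill names, then count (alias table is hypothetical).
from collections import Counter

ALIASES = {
    "python3": "python", "питон": "python", "py": "python",
    "postgres": "sql", "postgresql": "sql", "mysql": "sql",
    "sklearn": "scikit-learn", "scikit learn": "scikit-learn",
}

def normalize(mention: str) -> str:
    key = mention.strip().lower()
    return ALIASES.get(key, key)

raw = ["Python3", "питон", "PostgreSQL", "sklearn", "Python3", "Linux"]
counts = Counter(normalize(m) for m in raw)
print(counts.most_common())  # "python" leads with 3 mentions
```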
Platform scoring and resume behavior
- Three distinct scoring systems (recommendation NN, textual search/logit, packaging/visibility) are not strongly correlated overall but agree on a small core (168 candidates).
- Candidates appearing in all three top lists tend to have strong packaging, quantified results (metrics, percentages), explicit experience, polished resumes and sometimes photos.
- Packaging (title, keywords, completeness, photo) affects visibility.
Practical resume advice implied
- Quantify achievements (numbers, percentages).
- Mention production details: deployment, monitoring, architecture.
- Ensure ATS/scoring‑friendly formatting (titles, keywords, completeness).
- Be aware that photos and other packaging elements on the platform can influence visibility.
Dynamics, mobility and career lessons
- NLP / LLM is a growth hub that attracts and retains people from other ML domains — competition will increase there.
- MLOps has higher demand than supply — good opportunity for those with backend/DevOps experience.
- Grade promotion dynamics
- Many candidates advance a grade (e.g., mid→senior) when changing employers; grade increases on job transitions are common.
- Median time‑to‑senior can be relatively short (order of ~1–1.5 years after reaching mid level), suggesting market mobility accelerates promotions versus internal-only paths.
- Market health and outlook
- The Russian ML market (March 2026 snapshot) is active, not “dead”.
- Entry threshold is high (math, statistics, probability); the field is more academically oriented than frontend development.
Technical / tool notes
- LLM extraction: Gemini 3 Fast used to parse descriptions into structured outputs with validation loops.
- Statistical techniques: Cochran formula for sample sizing, bootstrap for medians, regressions to identify predictors.
- Analysis implemented in a large automated Python pipeline with validation contracts; libraries referenced include statsmodels, scikit‑learn and XGBoost.
Practical takeaways / actionable lessons
If entering ML in Russia (2026):
- Expect strong competition from academically trained, young cohorts (median age 24–26).
- Focus on core mathematics/statistics plus practical ML engineering (deployment, monitoring).
- Produce an ATS‑friendly resume with quantified results and production experience.
- Consider MLOps if you have backend/DevOps skills — demand is high and supply limited.
- Disclosing a realistic expected salary can help with matching; on the employer side, vacancies that declare a salary attract more applicants.
- For faster grade progression, changing companies strategically often yields faster jumps than internal promotion.
For researchers/analysts conducting market studies:
- Document selection and cleaning procedures carefully — raw counts can be misleading if spam/garbage are not removed; scoring thresholds and cleaning rules matter.
Limitations and cautions
- Salary conclusions rely on a small disclosed subset — interpret with caution.
- Platform proprietary scoring is complex and not fully transparent; non‑obvious signals (photo, packaging) can influence scores.
- This study is a snapshot (March 2026) tied to the platform and candidate behavior at that time.
- Presenter emphasized transparency: selection and cleaning were documented for reproducibility.
Extras and follow‑ups
- Additional analyses planned/available: candidate archetypes (e.g., “re‑rollers”), company donor/retention analyses, deeper elite‑candidate studies.
- Full dataset/map/study and comments are available on the presenter’s platform (nareka.ru).
Speakers / sources mentioned
- Presenter / video author: creator of nareka.ru and author of the analysis (sole speaker).
- Companies explicitly mentioned: Sber, Ozon.
- Tools / platform elements: nareka.ru, platform ATS and scoring (recommendation NN, textual search/logit, packaging visibility), SBERT embeddings (referenced), Gemini 3 Fast (LLM).
- Statistical / ML libraries referenced: statsmodels, scikit‑learn, XGBoost.