Summary of "Junior Data Scientist | Собеседование (Interview) | karpov.courses"
Main ideas, concepts, and lessons
1) What a “junior data scientist” interview looks like (process + sections)
The speaker explains that the session is a full interview, but unlike a “real-life” interview conversation, it is structured as four sections/practices with limited time and no extended freeform discussion.
- Structure: 4 sections (sometimes split over time; sometimes back-to-back), often about 2+ hours in total
- Common content of the 4 sections:
  - Python + business-theory basics, with small coding problems
  - Algorithms / ML basics (intro-level algorithm questions)
  - Data work (practical questions working with data)
  - Experimental design / A/B testing (often including designing an experiment, especially for junior+)
2) Section 1 (Python): mutable vs immutable + a memory/ID-driven mindset
A major Python focus is mutability and how it affects object identity in memory, including subtle behavior when modifying objects inside functions.
Key concepts covered
- Immutable types: int, float, str, tuple
- Mutable types: list, dict
- (Also mentioned in passing: other “collections” such as sets/containers, with minor correction during discussion.)
Practical meaning of mutability
- If you modify a mutable object (e.g., add a key to a dict), the change happens in place, so it is visible through every name that references that object.
Why it matters in interviews
- Inside a function, modifying inputs can cause unpredictable side effects unless you explicitly copy.
Demonstrated tasks
Task A: Determine mutability behavior
- Define a dictionary a
- Pass it to a function foo(a) that adds z = 99 (via insertion into the dict)
- Compare:
  - whether the output and original are equal
  - and/or whether their memory identity changes using id()
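A minimal sketch of Task A, assuming foo inserts the key "z" directly into the dict it receives. The starting contents of a are made up for illustration.

```python
def foo(d):
    # Insert a new key directly into the dict that was passed in.
    d["z"] = 99
    return d

a = {"x": 1, "y": 2}   # illustrative starting contents
id_before = id(a)

b = foo(a)

print(a)                   # {'x': 1, 'y': 2, 'z': 99}  (the original changed)
print(b == a)              # True  (same contents)
print(b is a)              # True  (same object, not a copy)
print(id(a) == id_before)  # True  (in-place mutation keeps the identity)
```

Because the function never creates a new dict, the returned value and the input are the same object, so the equality check and the id() check both come out true.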
Task B: Make output not mutate the input
- Modify the function so it returns a transformed dict without modifying the original
- Approaches discussed:
  - Shallow copy: b = a.copy()
  - Deep copy: b = copy.deepcopy(a) (safer general approach; mentioned as an idea)
- Verify using id(a) vs id(b)
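One way Task B could look, as a sketch. The function names foo_shallow and foo_deep and the nested list inside a are illustrative assumptions, added only to show where the two copy approaches differ.

```python
import copy

def foo_shallow(d):
    # Copy the top-level dict, then modify only the copy.
    b = d.copy()
    b["z"] = 99
    return b

def foo_deep(d):
    # Recursively copy the dict and everything nested inside it.
    b = copy.deepcopy(d)
    b["z"] = 99
    return b

a = {"x": 1, "nested": [1, 2]}
b = foo_shallow(a)
c = foo_deep(a)

print("z" in a)                    # False (the input is untouched)
print(id(a) != id(b))              # True  (b is a different object)
print(b["nested"] is a["nested"])  # True  (shallow copy shares nested objects)
print(c["nested"] is a["nested"])  # False (deep copy duplicates them)
```

A shallow copy is enough when the function only adds or replaces top-level keys. If it might mutate nested values, the deep copy avoids leaking changes back into the input.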
Memory-management discussion (theory)
- Garbage collection: memory can be reclaimed when objects are no longer referenced.
- Allocation/resizing: lists can “reserve” capacity and grow when thresholds are reached (conceptually like doubling/over-allocation).
- Garbage collector behavior: traverses references; details depend on how objects/instances/classes are referenced.
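The two ideas above can be observed directly. This is a sketch that assumes CPython: the exact sizes and the timing of reclamation are implementation details and may differ on other interpreters. The Node class is a hypothetical placeholder.

```python
import gc
import sys
import weakref

# 1) Over-allocation: record the list's reported size after each append.
items = []
sizes = []
for i in range(64):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# The size changes only occasionally, so there are far fewer distinct
# sizes than appends: the list reserves spare capacity when it grows.
distinct_sizes = len(set(sizes))
print(f"{len(sizes)} appends, {distinct_sizes} distinct sizes")

# 2) Reclamation: an object with no remaining references can be freed.
class Node:
    pass

node = Node()
probe = weakref.ref(node)   # a weak reference does not keep the object alive
print(probe() is not None)  # True (still referenced by the name node)

del node                    # drop the last strong reference
gc.collect()                # also run the cycle collector, to be safe
print(probe() is None)      # True (the object has been reclaimed)
```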
3) Section 4 (Experimental design / A/B testing): recommender evaluation case
The speaker uses a recommender-systems experiment as an example.
Case setup
- An old recommendation model is used on a site.
- A new model is launched.
- You must design an experiment to evaluate whether the new model is better.
Steps/questions included in experiment design
- Choose evaluation metrics
  - Define what “success” means
  - Examples:
    - Click / sticky
    - Purchase / conversion
- Define groups and ensure homogeneity
  - Split users into two groups:
    - Control: old recommender
    - Treatment: new recommender
  - Ensure comparability (similar composition across characteristics such as gender, age, geography)
  - Avoid strong imbalance (e.g., “boys vs girls” imbalance)
- Decide assignment ratio
  - Often 50/50
  - But traffic volume matters: you may allocate a portion if you need enough events for statistical power.
- Determine required sample size
  - Estimate how many users/events are needed to detect a meaningful difference with sufficient power.
- Run the experiment
  - Collect metric outcomes per group after assignment.
- Statistical comparison
  - Use appropriate tests depending on the metric type (rates/shares like CTR, conversion)
  - The discussion mentions converting outcomes into distributions and possible use of chi-square, including normality assumptions and limitations.
- Discuss distribution/normality assumptions
  - If using parametric methods, you need assumptions like approximate normality
  - Normality may not hold automatically, so alternate approaches or careful interpretation may be necessary.
- If time series or geo tests arise
  - Variants become more complex:
    - Geographic segmentation (different regions)
    - Switchback variants with control-like comparisons
    - Evaluations without a control group (harder)
    - Time-series experiment design (rolling windows, train/predict across time)
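The sample-size and comparison steps above can be sketched for a conversion-style metric. This is an illustration under stated assumptions, not the method shown in the video: it uses the standard normal-approximation formula for two proportions and a pooled two-proportion z-test (closely related to the chi-square test the discussion mentions). The conversion rates and counts are made-up inputs.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect p_control vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    spread_null = math.sqrt(2 * p_bar * (1 - p_bar))
    spread_alt = math.sqrt(p_control * (1 - p_control)
                           + p_treatment * (1 - p_treatment))
    numerator = (z_alpha * spread_null + z_beta * spread_alt) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled z-test for a difference in rates; returns (z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical planning input: 10% baseline conversion, hoping for 12%.
n_needed = sample_size_per_group(0.10, 0.12)
print("users per group:", n_needed)

# Hypothetical observed counts after the experiment.
z, p_value = two_proportion_z_test(400, 4000, 480, 4000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

The normal approximation is only reasonable when each group has plenty of conversions and non-conversions, which is the normality caveat raised in the discussion.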
Additional insights
- Randomization
  - Consistent per-user assignment can be done via hashing the user ID
  - Helps prevent cross-contamination when traffic is sufficiently randomized
- Bootstrap
  - Mentioned as a way to compare distributions of performance (speaker is somewhat unsure but frames it as answering “how different” questions)
- Cross-validation
  - Mentioned as a way to validate robustly and reduce overfitting
  - Includes stratification and time-series CV concepts
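A rough sketch of the first two ideas: hash-based assignment and a percentile bootstrap for the difference in conversion rate. The salt string, function names, and synthetic 0/1 outcomes are all assumptions made for illustration.

```python
import hashlib
import random

def assign_group(user_id, salt="recsys-exp-1", treatment_share=0.5):
    """Deterministically map a user to a group: same user, same group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def bootstrap_diff_ci(control, treatment, n_resamples=500, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        c = rng.choices(control, k=len(control))       # resample with replacement
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return diffs[lo_idx], diffs[hi_idx]

# Assignment is stable across calls and splits traffic roughly in half.
groups = [assign_group(uid) for uid in range(10_000)]
share = groups.count("treatment") / len(groups)
print(assign_group(42) == assign_group(42), round(share, 2))

# Synthetic outcomes: 1 = converted, 0 = did not convert.
control = [1] * 50 + [0] * 450      # 10% conversion
treatment = [1] * 75 + [0] * 425    # 15% conversion
low, high = bootstrap_diff_ci(control, treatment)
print(f"95% interval for the difference: [{low:.3f}, {high:.3f}]")
```

Changing the salt per experiment reshuffles users, so the same people do not always land in the treatment group across experiments.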
4) Section 3 (Data work): SQL joins + window functions + group-by logic implementation
The speaker covers common SQL/pandas-style interview topics.
SQL join sizes (given counts)
Given:
- Table A has 100 rows
- Table B has 50 rows
- Intersection has 25 rows
Expected results (conceptually, assuming join keys are unique in each table, so each match is one-to-one):
- INNER JOIN: 25
- LEFT JOIN (A left B): 100, since every row of A is kept exactly once (the number stated by the speaker was inconsistent)
- OUTER JOIN / FULL JOIN (FULL OUTER JOIN): union-like reasoning gives 100 + 50 − 25 = 125 (speaker’s numbers were inconsistent)
- CROSS JOIN: 100 × 50 = 5000
Note: The arithmetic around OUTER/FULL JOIN was messy in the talk, but the key teaching point is to reason using intersection/union.
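The intersection/union reasoning can be checked with a small simulation. This sketch assumes join keys are unique within each table and models each table as a plain list of keys; the key ranges are chosen only so that exactly 25 keys overlap.

```python
a_keys = list(range(0, 100))     # table A: 100 rows
b_keys = list(range(75, 125))    # table B: 50 rows, 25 of them also in A

a_set, b_set = set(a_keys), set(b_keys)

# INNER JOIN: only keys present in both tables.
inner = [(k, k) for k in a_keys if k in b_set]

# LEFT JOIN: every row of A, with None where B has no match.
left = [(k, k if k in b_set else None) for k in a_keys]

# FULL OUTER JOIN: the left join plus rows of B that found no match in A.
full = left + [(None, k) for k in b_keys if k not in a_set]

# CROSS JOIN: every pair of rows, no join condition.
cross = [(ka, kb) for ka in a_keys for kb in b_keys]

print(len(inner))  # 25
print(len(left))   # 100
print(len(full))   # 125
print(len(cross))  # 5000
```

With duplicate keys the inner, left, and full counts would grow, because each row would match several rows on the other side.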
Window functions concept
- Window functions compute values over a window (partitioned and ordered)
- Common analytics example: moving average
- Highlighted as a frequent junior question.
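To make the moving-average example concrete, here is a pure-Python sketch of roughly what AVG(value) OVER (PARTITION BY part ORDER BY ord ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) computes. The column names and sample rows are hypothetical.

```python
from collections import defaultdict, deque

def moving_average(rows, window=3):
    """rows: iterable of (partition_key, order_key, value) tuples."""
    # One fixed-length buffer per partition holds the most recent values.
    recent = defaultdict(lambda: deque(maxlen=window))
    result = []
    # Sorting by (partition, order) mimics PARTITION BY ... ORDER BY ...
    for part, order, value in sorted(rows, key=lambda r: (r[0], r[1])):
        buf = recent[part]
        buf.append(value)                      # oldest value drops out when full
        result.append((part, order, value, sum(buf) / len(buf)))
    return result

rows = [
    ("u1", 1, 10), ("u1", 2, 20), ("u1", 3, 30), ("u1", 4, 40),
    ("u2", 1, 5), ("u2", 2, 15),
]
for row in moving_average(rows):
    print(row)
# ('u1', 1, 10, 10.0)
# ('u1', 2, 20, 15.0)
# ('u1', 3, 30, 20.0)
# ('u1', 4, 40, 30.0)
# ('u2', 1, 5, 5.0)
# ('u2', 2, 15, 10.0)
```

Unlike GROUP BY, the output keeps one row per input row, which is the property that distinguishes window functions from plain aggregation.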
Implementing “group by sum” without pandas (algorithm exercise)
Concept:
- Inputs:
  - b: group labels (“keys”)
  - a: numeric values (“values”)
- Output:
  - For each unique key in b, compute the sum of corresponding values from a
Methodology / algorithm:
- Initialize an empty dict: dt = {}
- For each (key, value) pair:
  - If the key isn’t in dt, initialize it with 0
  - Accumulate: dt[key] += value
- Verify it matches expected aggregation
Common pitfalls:
- Doing dt[g] += 1 (or similar) fails if the key doesn’t exist yet
- Fix by using dt.get(key, 0) or initializing before incrementing
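The algorithm and the pitfall fix above fit in a few lines. The function name group_by_sum and the sample inputs are illustrative; a and b follow the naming used in the exercise.

```python
def group_by_sum(keys, values):
    """Sum values per key without pandas."""
    dt = {}
    for key, value in zip(keys, values):
        # dt.get(key, 0) supplies 0 for unseen keys, avoiding a KeyError.
        dt[key] = dt.get(key, 0) + value
    return dt

b = ["x", "y", "x", "z", "y", "x"]   # group labels
a = [1, 2, 3, 4, 5, 6]               # numeric values

print(group_by_sum(b, a))  # {'x': 10, 'y': 7, 'z': 4}
```

collections.defaultdict(int) would be an equivalent way to handle missing keys; dt.get is used here because it is the fix named in the discussion.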
5) Algorithms & ML theory (linear regression, loss functions, gradient descent, overfitting)
The speaker transitions to core theoretical questions.
Linear regression basics
- Model form: y = a0 + b*x
- Coefficients are learned during training.
Error/loss functions for learning
- Either:
  - Sum of squared errors (easier to optimize)
  - Sum of absolute errors (harder to optimize)
- Motivation for squared error:
  - It has a smooth “parabola” shape, which is easier to minimize.
Gradient descent concept (optimization loop)
- Start with initial coefficients
- Repeat:
  - Predict with current coefficients
  - Compute error (loss)
  - Compute gradient (derivative vector)
  - Update coefficients to reduce loss
- Stop when error converges or no longer improves
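The loop above can be sketched for the model y = a0 + b*x with mean squared error. The learning rate, tolerance, and toy data are assumptions picked so the example converges quickly. This is a plain batch version, not necessarily what the interview expected.

```python
def fit_linear_gd(xs, ys, lr=0.05, max_iters=10_000, tol=1e-12):
    """Fit y = a0 + b*x by batch gradient descent on mean squared error."""
    a0, b = 0.0, 0.0                       # initial coefficients
    n = len(xs)
    prev_loss = float("inf")
    for _ in range(max_iters):
        # Predict with the current coefficients and measure the error.
        errors = [(a0 + b * x) - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        # Stop when the loss no longer improves meaningfully.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
        # Gradient of the MSE with respect to a0 and b.
        grad_a0 = 2 * sum(errors) / n
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        # Step against the gradient to reduce the loss.
        a0 -= lr * grad_a0
        b -= lr * grad_b
    return a0, b, loss

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # generated from y = 1 + 2*x
a0, b, loss = fit_linear_gd(xs, ys)
print(f"a0 = {a0:.3f}, b = {b:.3f}, loss = {loss:.2e}")
```

If the learning rate is too large, the updates overshoot and the loss grows instead of shrinking, so lr usually needs tuning to the scale of the data.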
Overfitting and mitigation
Overfitting:
- Model fits training data too well, performing worse on unseen/test data.
Mitigation ideas mentioned:
- Regularization (penalize large weights)
- Feature selection / removing correlated features (mentioned conceptually)
- Train/test split discipline
- Cross-validation (including stratification)
- Time-series validation conceptually for temporal data
- Collect more/new production data to test generalization
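As a sketch of the cross-validation idea, the generator below produces plain k-fold train/test index splits. It is deliberately the simplest variant: it does not stratify by class and it shuffles, so it is not suitable for time-series data as written. The function name is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)           # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]      # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, k=5):
    print(len(train_idx), len(test_idx))   # 8 2 on every fold
```

Each sample lands in the test set exactly once, so the averaged test score uses all of the data without ever scoring a model on rows it was trained on.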
Trees and Random Forest (split criteria + why ensembles help)
- Decision trees:
  - Recursively split data using informative features and split points
  - Uses impurity/uncertainty measures like entropy (and mentions related uncertainty concepts)
  - Leaves make class/value decisions
- Why random forest works:
  - Each tree is trained with randomness:
    - random subset of features
    - potentially random subset of samples (bootstrap)
  - Key condition stated:
    - If each tree performs better than random guessing (e.g., > 0.5 accuracy for classification), the ensemble tends to improve
- When logistic regression might outperform trees
  - Mentioned as an asterisk case:
    - many parameters, huge data, and certain noise/feature conditions can make trees less effective
  - Intuition given:
    - linear models can generalize well by leveraging lots of data, while small/noisy tree splits may fail
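Two of the ideas above can be put into numbers: entropy as a split criterion, and the "better than random" condition for ensembles. The majority-vote calculation assumes the trees err independently. That is an idealization, since real trees in a forest are correlated and gain less than this suggests.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

def majority_vote_accuracy(p, n_trees):
    """P(majority is correct) for n_trees independent voters, n_trees odd."""
    need = n_trees // 2 + 1
    return sum(math.comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
               for k in range(need, n_trees + 1))

parent = [0, 0, 1, 1]
print(information_gain(parent, [0, 0], [1, 1]))   # 1.0 (a perfect split)

for p in (0.4, 0.6):
    print(p, round(majority_vote_accuracy(p, 101), 3))
```

With individual accuracy above 0.5 the vote of many independent trees is far more reliable than any single tree, while below 0.5 the same mechanism makes the ensemble worse.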
6) Closing feedback: learning plan / what to prepare next
Recommended preparation areas:
- Python
  - Go beyond basics
  - Watch lectures and implement core algorithms
- ML algorithms
  - Practice implementing from scratch:
    - gradient descent
    - stochastic gradient descent
    - (bootstrap mentioned)
    - tree construction logic
- Statistics / experiments / data work
  - Practice A/B testing and implement analytics/statistics tasks with real datasets
  - Learn SQL/window functions through practice in tools/simulators
- Modeling & metrics
  - Learn quality metrics (e.g., RMS/RMSE) and interpret them
  - Understand which metric/loss to use and why
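As a small illustration of the metrics point, here are RMSE and, for contrast, mean absolute error (MAE, added here as a comparison and not named in the summary). The sample predictions are made up so that one large miss dominates.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [3.0, 5.0, 7.0, 13.0]   # three exact predictions, one miss of 4

print(rmse(y_true, y_pred))  # 2.0
print(mae(y_true, y_pred))   # 1.0
```

The gap between the two values reflects the single outlier: squaring makes RMSE more sensitive to rare large errors, which is one reason the choice of metric depends on how costly such errors are.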
Speakers / sources featured
- Dmitry (interviewer/participant; asked questions and guided parts of the exercise)
- Karpov (referenced as the course/platform: “karpov.courses” / “Karpov’s courses”)
- Sasha (another participant referenced near the end)
- An unnamed lecturer/interview host (main narrator; explains concepts and runs the structured segments)
Category
Educational