Summary of "Junior Data Scientist | Собеседование | karpov.courses"

Main ideas, concepts, and lessons

1) What a “junior data scientist” interview looks like (process + sections)

The speaker explains that this is a full interview, but unlike a real-life interview conversation, it is structured as four time-limited sections (practices) with no extended free-form discussion.


2) Section 1 (Python): mutable vs immutable + a memory/ID-driven mindset

A major Python focus is mutability and how it affects object identity in memory, including subtle behavior when modifying objects inside functions.
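The identity idea can be illustrated with a minimal sketch, assuming CPython's built-in `id()` and the `is` operator as the way to check whether two names refer to the same object. The variable names are illustrative and do not come from the talk.

```python
# Mutable object: an in-place change keeps the same identity.
nums = [1, 2, 3]
list_id_before = id(nums)
nums.append(4)                                  # mutates the existing list
list_same_object = id(nums) == list_id_before   # the list is still the same object

# Immutable object: a "change" builds a new object and rebinds the name.
point = (1, 2, 3)
original_point = point                          # keep a reference to the old tuple
point = point + (4,)                            # creates a new tuple
tuple_same_object = point is original_point     # the name now points elsewhere

# Two names bound to one mutable object see each other's changes.
alias = nums
alias.append(5)

print(list_same_object, tuple_same_object, nums)
```

The aliasing case at the end is the one that tends to surprise people in interviews: no copy is made on assignment, so `alias` and `nums` are one list.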

Key concepts covered

Practical meaning of mutability

Why it matters in interviews

Demonstrated tasks

Task A: Determine mutability behavior
Task B: Make output not mutate the input
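The two tasks can be sketched roughly as follows. The exact functions used in the interview are not given in this summary, so `add_item_in_place` and `add_item_copy` are hypothetical stand-ins for "a function that mutates its argument" and "a version that does not".

```python
def add_item_in_place(items, value):
    """Appends to the caller's list, so the input is mutated."""
    items.append(value)
    return items


def add_item_copy(items, value):
    """Returns a new list and leaves the caller's list untouched."""
    result = list(items)   # shallow copy: nested mutable elements are still shared
    result.append(value)
    return result


data = [1, 2]
out_a = add_item_in_place(data, 3)   # out_a and data are the same object
out_b = add_item_copy(data, 4)       # out_b is a separate list

print(data, out_a is data, out_b)
```

A shallow copy is enough here because the elements are integers. For nested lists or dicts, `copy.deepcopy` from the standard library is the usual choice.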

Memory-management discussion (theory)


3) Section 4 (Experimental design / A/B testing): recommender evaluation case

The speaker uses a recommender-systems experiment as an example.

Case setup

Steps/questions included in experiment design

  1. Choose evaluation metrics

    • Define what “success” means
    • Examples:
      • Clicks / stickiness
      • Purchase / conversion
  2. Define groups and ensure homogeneity

    • Split users into two groups:
      • Control: old recommender
      • Treatment: new recommender
    • Ensure comparability (similar composition across characteristics such as gender, age, geography)
    • Avoid strong imbalance between groups (e.g., one group skewed heavily toward one gender)
  3. Decide assignment ratio

    • Often 50/50
    • But traffic volume matters: you may send only a portion of traffic to the treatment, as long as each group still collects enough events for statistical power.
  4. Determine required sample size

    • Estimate how many users/events are needed to detect a meaningful difference with sufficient power.
  5. Run the experiment

    • Collect metric outcomes per group after assignment.
  6. Statistical comparison

    • Use appropriate tests depending on the metric type (rates/shares like CTR, conversion)
    • The discussion mentions treating outcomes as distributions and possibly using a chi-square test, along with the assumptions (such as normality) and limitations involved.
  7. Discuss distribution/normality assumptions

    • If using parametric methods, you need assumptions like approximate normality
    • Normality may not hold automatically, so alternate approaches or careful interpretation may be necessary.
  8. If time series or geo tests arise

    • Variants become more complex:
      • Geographic segmentation (different regions)
      • Switchback variants with control-like comparisons
      • Evaluations without a control group (harder)
      • Time-series experiment design (rolling windows, train/predict across time)
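Steps 4 and 6 above can be sketched for a conversion-style metric. This is a minimal illustration using only the standard library, not the method from the talk: it assumes the common normal-approximation sample-size formula for two proportions and a Pearson chi-square test on a 2x2 table without continuity correction. The baseline rate, target rate, and event counts are made-up numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users per group needed to detect p_control vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def chi_square_2x2(success_a, total_a, success_b, total_b):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    observed = [
        [success_a, total_a - success_a],
        [success_b, total_b - success_b],
    ]
    row_totals = [total_a, total_b]
    col_totals = [success_a + success_b,
                  (total_a - success_a) + (total_b - success_b)]
    grand_total = total_a + total_b
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical plan: 10% baseline conversion, hoping to detect 12%.
n_needed = sample_size_per_group(0.10, 0.12)

# Hypothetical outcome: 400/4000 conversions in control, 480/4000 in treatment.
statistic, p_value = chi_square_2x2(400, 4000, 480, 4000)
print(n_needed, round(statistic, 3), round(p_value, 4))
```

The chi-square approximation is only reasonable when every expected cell count is reasonably large, which is one of the limitations the discussion alludes to.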

Additional insights


4) Section 3 (Data work): SQL joins + window functions + group-by logic implementation

The speaker covers common SQL/pandas-style interview topics.

SQL join sizes (given counts)

Given:

Expected results (conceptually):

Note: The arithmetic around OUTER/FULL JOIN was messy in the talk, but the key teaching point is to reason using intersection/union.
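The intersection/union reasoning can be checked with a small sketch. The counts from the talk are not preserved in this summary, so the key lists below are hypothetical, and `join_sizes` is an illustrative helper that assumes joins on a single non-NULL key.

```python
from collections import Counter


def join_sizes(left_keys, right_keys):
    """Row counts for INNER, LEFT, RIGHT and FULL joins on one key column."""
    left, right = Counter(left_keys), Counter(right_keys)
    # Each matching key contributes (rows on the left) * (rows on the right).
    inner = sum(left[k] * right[k] for k in left.keys() & right.keys())
    left_only = sum(c for k, c in left.items() if k not in right)
    right_only = sum(c for k, c in right.items() if k not in left)
    return {
        "inner": inner,
        "left": inner + left_only,
        "right": inner + right_only,
        "full": inner + left_only + right_only,
    }


# Unique keys: sizes follow directly from set intersection and union.
unique_case = join_sizes([1, 2, 3, 4, 5], [4, 5, 6])

# Duplicate keys: matched rows multiply, so the result can exceed both inputs.
duplicate_case = join_sizes([1, 1, 2], [1, 1, 1])

print(unique_case)
print(duplicate_case)
```

With unique keys the inner join equals the size of the intersection and the full join equals the size of the union. Duplicates are the usual trap, since matches multiply.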

Window functions concept

Implementing “group by sum” without pandas (algorithm exercise)

Concept:

Methodology / algorithm:
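One common way to implement the exercise is a single pass with a dictionary accumulator. The sketch below is an assumption about the intended approach, not a transcript of the speaker's solution, and the field names `user` and `amount` are made up.

```python
def group_by_sum(rows, key_field, value_field):
    """Sums value_field per distinct key_field, like GROUP BY key with SUM(value)."""
    totals = {}
    for row in rows:
        key = row[key_field]
        # .get(key, 0) handles the first time a key is seen.
        totals[key] = totals.get(key, 0) + row[value_field]
    return totals


rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]
totals = group_by_sum(rows, "user", "amount")
print(totals)
```

This runs in linear time over the rows and keeps one entry per group in memory.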

Common pitfalls:


5) Algorithms & ML theory (linear regression, loss functions, gradient descent, overfitting)

The speaker transitions to core theoretical questions.

Linear regression basics

Error/loss functions for learning

Gradient descent concept (optimization loop)

  1. Start with initial coefficients
  2. Repeat:
    • Predict with current coefficients
    • Compute error (loss)
    • Compute gradient (derivative vector)
    • Update coefficients to reduce loss
  3. Stop when error converges or no longer improves
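The loop above can be sketched for one-feature linear regression. This is a minimal illustration that assumes mean squared error as the loss and a fixed learning rate; the data, the learning rate, and the stopping tolerance are arbitrary choices, not values from the talk.

```python
def fit_linear_regression(xs, ys, learning_rate=0.05, max_iters=10_000, tol=1e-12):
    """Fits y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0                      # 1. start with initial coefficients
    n = len(xs)
    prev_loss = float("inf")
    for _ in range(max_iters):           # 2. repeat
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]   # predict, then error
        loss = sum(e * e for e in errors) / n                # mean squared error
        if abs(prev_loss - loss) < tol:  # 3. stop when the loss no longer improves
            break
        prev_loss = loss
        # Gradient of the MSE with respect to w and b.
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss


# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, final_loss = fit_linear_regression(xs, ys)
print(round(w, 3), round(b, 3))
```

If the learning rate is too large the updates overshoot and the loss grows instead of shrinking, so in practice it is tuned or the features are scaled first.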

Overfitting and mitigation

Overfitting:

Mitigation ideas mentioned:

Trees and Random Forest (split criteria + why ensembles help)


6) Closing feedback: learning plan / what to prepare next

Recommended preparation areas:

