Summary of "PDS week 1.3"
Summary — main ideas, concepts and lessons
What data science is
Data science is an interdisciplinary field using scientific methods, processes and systems to extract knowledge or insights from data in various forms.
- Informal view: data scientists combine stronger-than-average skills in statistics and programming plus domain knowledge — making them rare “unicorns.”
- Core components (Venn diagram): computer science, math & statistics (often machine learning), and subject-matter expertise. The intersection of these three is data science.
Key roles and skills of a data scientist
- Technical: programming (Python/R), statistics, machine learning, data engineering.
- Domain: subject-matter expertise specific to the application area.
- Communication: storytelling and visualization to explain findings and recommend actions to stakeholders.
- Scientific mindset: use of the scientific method (hypothesis, testing, validation), not just ad-hoc analysis.
Realistic expectations and pitfalls
- Correlation ≠ causation. Large datasets often produce statistically significant but spurious relationships (e.g., the classic spurious correlation between the number of pirates and global temperature).
- Data science is powerful but not magic — companies have both major successes and notable failures (examples: Target’s predictive analytics successes and Target Canada’s closure; Google’s early Flu Trends success and later underperformance).
- Maintain realistic goals and measures of success; models are rarely perfect.
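The correlation-vs-causation pitfall is easy to demonstrate with synthetic data: two independently generated random walks (nothing in the lecture, purely an illustration) often show a strong Pearson correlation purely by chance.

```python
import random

def random_walk(n, seed):
    """Generate a cumulative-sum random walk of length n."""
    rng = random.Random(seed)
    walk, total = [], 0.0
    for _ in range(n):
        total += rng.gauss(0, 1)
        walk.append(total)
    return walk

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Two completely unrelated series can still correlate strongly by chance.
a = random_walk(500, seed=1)
b = random_walk(500, seed=2)
r = pearson_r(a, b)
```

Trending series (sales, temperatures, search volumes) behave like these walks, which is exactly why "statistically significant" relationships in large datasets need a causal sanity check.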
Common example applications
- Customer behavior and recommendation systems (search, product/movie recommendations).
- Retail predictive analytics (Target identifying pregnancy signals).
- Epidemiological forecasting (Google Flu Trends — early success, later issues).
- Shopping center patterns: weekly visit cycles (7-day periodicity).
- Movie recommender user behavior: new users explore more genres; preferences often stabilize after ~2–3 weeks.
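The weekly visit cycle mentioned above can be detected with a simple autocorrelation scan; this is a minimal sketch using synthetic daily visit counts (the data is invented, only the 7-day periodicity idea comes from the lecture).

```python
import math

def autocorr(series, lag):
    """Autocorrelation of a series at a given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

# Synthetic daily visit counts with a built-in weekly cycle (illustrative only).
visits = [100 + 30 * math.sin(2 * math.pi * day / 7) for day in range(70)]

# The lag with the strongest autocorrelation reveals the dominant period.
peak_lag = max(range(1, 15), key=lambda lag: autocorr(visits, lag))
```

On real shopping-centre data the peak would be noisier, but a clear spike at lag 7 is the signature of the weekly pattern the lecture describes.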
Methodology — typical data science process (iterative / non-linear)
1. Set / identify the research goal
- Clarify the business context, purpose, expected outcomes and how results will be used.
- Ask: What? Why? How?
- Define measures of success (metrics, improvement targets).
- Agree deliverables, resources and timeline via a project charter (report, source code, prototype, deployment).
2. Retrieve and acquire data
- Identify required internal and external data sources.
- Consider access, permissions, differing definitions across teams, and practical barriers.
- Data may require collection (experiments) or access requests.
3. Data preparation (pre-processing)
- Data cleansing: remove errors, fix inconsistencies (e.g., gender labels, impossible ages), detect outliers.
- Data transformation: convert variables for modeling needs (e.g., transforms to satisfy linear assumptions; compute per-capita from totals).
- Data combining: join/merge datasets carefully and perform sanity checks.
- Rule of thumb: garbage in → garbage out. Fix errors early to reduce downstream cost.
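A minimal cleansing sketch for the issues listed above (inconsistent gender labels, impossible ages). The field names and validity rules are assumptions for illustration, not from the lecture:

```python
def clean_record(rec):
    """Normalize gender labels and null out impossible ages (assumed rules)."""
    gender_map = {"m": "male", "male": "male", "f": "female", "female": "female"}
    cleaned = dict(rec)
    # Map free-text labels ("M", "Female", ...) onto a canonical vocabulary.
    cleaned["gender"] = gender_map.get(str(rec.get("gender", "")).strip().lower())
    # Flag impossible ages instead of silently keeping them (garbage in -> garbage out).
    age = rec.get("age")
    cleaned["age"] = age if isinstance(age, (int, float)) and 0 <= age <= 120 else None
    return cleaned

rows = [{"gender": "M", "age": 34}, {"gender": "Female", "age": -5}]
cleaned = [clean_record(r) for r in rows]
```

Setting bad values to `None` (rather than dropping or guessing) keeps the problem visible for the exploration step, which is where the "fix errors early" advice pays off.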
4. Data exploration
- Use summary statistics (mean, median, std) and visualizations (histograms, line/bar charts, scatter plots).
- Identify distributions, outliers, missingness and trends (e.g., weekly patterns); generate hypotheses.
- Iteratively return to data preparation if new issues are discovered.
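The summary statistics and outlier detection above can be sketched with the standard library alone; the sample sales figures and the 1.5×IQR outlier rule are illustrative assumptions, not lecture material:

```python
import statistics

def summarize(values):
    """Basic summary stats plus a simple 1.5*IQR outlier check."""
    qs = statistics.quantiles(values, n=4)   # quartiles
    q1, q3 = qs[0], qs[2]
    iqr = q3 - q1
    outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "outliers": outliers,
    }

sales = [12, 14, 13, 15, 14, 13, 95]  # one suspicious spike
stats = summarize(sales)
```

Note how the mean is dragged far from the median by a single spike; that gap is itself a quick signal to go back to data preparation, exactly the iteration the process describes.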
5. Data modeling
- Choose model family driven by the research question and data type (classification, regression, clustering, recommendation).
- Consider constraints: numerical vs categorical data, explainability requirements, production/deployment environment.
- Train/evaluate using training/validation/test splits or held-out sets; use subsets when full data is huge.
- Use appropriate evaluation metrics (classification: precision, recall, F1; regression: RMSE, MAE; business KPIs).
- Balance performance, interpretability and deployability (e.g., deep learning may be less explainable).
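The hold-out split and the classification metrics named above can be written in a few lines; this is a plain-Python sketch (no stratification, binary labels only), not the specific evaluation pipeline from the course:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle and hold out a test set (simple sketch, no stratification)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]

def f1_score(y_true, y_pred, positive=1):
    """Precision, recall and F1 for a binary classification task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

The model never sees the held-out slice during training, which is what makes the reported F1 an honest estimate rather than an overfit one.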
6. Presentation and automation
- Storytelling: produce clear, stakeholder-focused reports and visualizations explaining methods, results, and recommended actions.
- Report structure: cover page, abstract/executive summary (purpose, method, results, conclusions, recommendations), introduction, detailed methodology (reproducible level), results (facts), discussion/conclusion (interpretation, implications, future work).
- Automation / productionization: package code/process so results can be re-run on new data and deliver artifacts (code, models, dashboards).
- Use appropriate output forms: tables, figures and structured narratives depending on audience.
Training and evaluation practicalities
- Use held-out data to evaluate models and avoid overfitting.
- Training on subsets is common when the full dataset is too large.
- Choose evaluation metrics aligned with the business objective (e.g., a 5% improvement target).
- Sanity checks and incremental validation are critical.
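The "5% improvement target" idea can be made explicit as a tiny acceptance check against a baseline metric; the function name and threshold semantics are assumptions for illustration:

```python
def meets_improvement_target(baseline, new, target=0.05):
    """True if the new metric beats the baseline by at least the target fraction.

    E.g. target=0.05 encodes a '5% relative improvement' success measure
    agreed in the project charter (illustrative, not the course's exact rule).
    """
    return (new - baseline) / baseline >= target
```

Encoding the success measure as code makes it a sanity check that can run automatically on every retrained model, tying evaluation back to the agreed business objective.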
Reporting and reproducibility
- Abstract should concisely cover: purpose, method, results, conclusions and recommendations.
- Methodology must be detailed enough to allow independent reproduction of the work.
- Results section: report factual findings without interpretation.
- Discussion/conclusion: interpret results, explain applicability, limitations and future research directions.
- Deliverables should include contact/author information, date and table of contents.
Practical lessons and best practices
- Fix errors as early as possible; do sanity checks after joins/transformations.
- Visualize data to detect outliers and patterns quickly.
- Be explicit about assumptions, measures of success and deployment constraints.
- Storytelling is as important as technical modeling to make impact.
- Expect iteration — the process is rarely strictly linear.
- Use appropriate tools and libraries (Python/R); Python is noted for extensive libraries.
Sources, examples and references mentioned
- Target (retail predictive analytics; pregnancy-targeting story).
- Charles Duhigg / New York Times Magazine excerpt (quoted material about retailers and Target).
- Google (Google Flu Trends example — initial promise, later failure).
- Microsoft Cortana project (example student/project collaboration referenced).
- Student letter about participating in MIT Melbourne Data Centre / a data competition (project that followed the six-step process).
- 60-second introduction to data science video (short clip used to map the six steps).
- RMIT resources (guidance suggested for writing reports).
- Course lecturer/instructor (primary speaker presenting the lecture and slides).
Speakers / sources featured
- Course lecturer / presenter (primary speaker in the video).
- Charles Duhigg (quoted material).
- Target and Google (case-study companies).
- Microsoft Cortana project / associated authors.
- Students involved in the MIT Melbourne Data Centre / competition.
- Narrator of the 60-second data science introduction video.
- RMIT and general course reading/list references.
Category: Educational