Summary of "How to make technical decisions? - Oresztesz Margaritisz, EPAM | Craft Conference 2024"
Executive summary
The talk provides a practical framework for making technology and tooling decisions (e.g., databases, orchestrators, libraries, build systems) in a way that avoids costly “wrong choice” outcomes (performance issues, hidden complexity, refactoring dead-ends). It also helps secure management buy-in by translating decisions into measurable quality attributes and operational constraints.
Business impact of wrong technology choices (why it hurts)
Wrong choices tend to create long-lived problems and organizational friction:
- Stickiness / inertia: managers don’t want “months of refactoring” that delays shipping features.
- Cognitive noise: daily disruption from debugging/troubleshooting instead of feature delivery.
- Technical debt + complexity: solving one visible issue reveals broader “iceberg” complexity (config-heavy systems, unknown IDE/settings, lack of performance testing).
- “Technical debt” is hard to sell: leadership often rejects requests framed as “adaptation” or “tech debt”.
Core “playbook” for a good technology choice
A good choice behaves like a puzzle piece:
- Solves the intended problem without extra baggage
- Fits interfaces and surrounding architecture
- Is not bloated, not overly complex
- Is new enough / maintainable enough
- Works in practice (operationally and development-wise)
Decision metrics & KPI categories (what to measure)
Instead of relying on Microsoft’s full “quality attributes” list, the speaker proposes 4 practical categories.
1) Design (fit + ecosystem)
- Community maturity / time invested (meaningful ecosystems need ~5 years)
- API feature coverage (what you need, without unnecessary complexity)
- Replaceability (cost/effort to remove if wrong)
- Simplicity (explicitly: simplicity is king)
2) Learning curve (developer productivity)
- Documentation quality (existence + usability)
- Fast feedback loop: startup/build/test cycle time; example threshold mentioned: 30 seconds is not fast (prefer milliseconds/seconds).
- Team expertise availability (internal capability matters)
3) Runtime (operational and technical constraints)
- Hard constraints (examples: HTTP/S limitations, single sign-on integration, OpenTelemetry compatibility, required interfaces)
- Performance needs (latency/throughput and responsiveness)
- Stability expectations
- Security constraints
- Single points of failure risk
4) Support (failure handling & operations)
- Debugging / logging / monitoring / exception visibility
- Testability
- Avoid “non-pluggable complexity” (e.g., “snowflake” systems like plugin-heavy CI/CD that can crash or take minutes due to plugin ordering/startup)
How to measure when you don’t have an internal toolkit (data sources)
The talk suggests lightweight external proxies and checks.
- Maturity / popularity
  - Google Trends
  - GitHub language stats
  - GitHub activity/open issues
  - Stack Overflow tag trends
- Security
  - Check open vulnerabilities; avoid high/critical vulnerabilities that are not fixed
- Performance (without expensive measurement)
  - Use existing benchmarks (but keep in mind that published benchmark results can be misleading)
  - Use third-party performance aggregation (example: the TechEmpower benchmarks site) for rough baselines
- Stability / roadmap risk
  - Review the issue tracker: closure time and unresolved issues impacting upcoming versions
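The issue-tracker check can be turned into two numbers: how long issues typically take to close, and how many are still open. The following is a minimal sketch, not something shown in the talk. The `ISSUES` records and the `closure_stats` helper are hypothetical; in practice the dates would come from an export of whatever tracker the candidate project uses.

```python
from datetime import date
from statistics import median

# Hypothetical issue records as (opened, closed) pairs; None means still open.
# These are made-up values for illustration, not data from any real project.
ISSUES = [
    (date(2024, 1, 1), date(2024, 1, 4)),
    (date(2024, 1, 10), date(2024, 1, 11)),
    (date(2024, 2, 1), date(2024, 2, 15)),
    (date(2024, 3, 1), None),
]


def closure_stats(issues):
    """Return (median days to close, number of still-open issues).

    Assumes at least one issue in the list has been closed.
    """
    days_to_close = [
        (closed - opened).days
        for opened, closed in issues
        if closed is not None
    ]
    open_count = sum(1 for _, closed in issues if closed is None)
    return median(days_to_close), open_count


median_days, still_open = closure_stats(ISSUES)
print(f"median days to close: {median_days}, still open: {still_open}")
```

The median is used rather than the mean so that a single very old issue does not dominate the figure. That is a design choice of this sketch, not a recommendation from the talk.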
Concrete example: choosing a container orchestrator
Decision question: Kubernetes vs simpler container options (e.g., Docker Compose)
- If you’re a small team and need speed:
  - start from the simplest option that meets requirements
  - validate with a subset of metrics (e.g., simplicity, usability, fast feedback)
- Choose Kubernetes only if it matches the needs better; otherwise, avoid unnecessary complexity.
- Include migration effort implicitly via “replaceability/cost” logic.
  - Example reasoning: containers can run across runtimes, so migration might be less risky than expected.
Advanced techniques (when deeper analysis is warranted)
Used as optional “depth,” not the default:
- Capacity planning (not precise; can become an endless tunnel)
  - model-based estimates from existing performance metrics
  - check SLA / latency targets
  - consider cloud RTT (round-trip time) across infra layers
- Cost estimation
  - use cloud calculators (example: AWS Pricing Calculator)
  - compare service SLAs (example: DynamoDB availability vs other components)
- Queueing models (queueing theory)
  - tools/libraries can estimate throughput/latency without expensive full performance runs
- Architecture Decision Records (ADRs / decision logs)
  - snapshot decisions for future engineers
- Technology Radar
  - lightweight internal tracking of:
    - popularity trend (rising/falling)
    - alternatives
    - why selected
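To illustrate the queueing-model idea from the list above, here is a minimal sketch using the textbook M/M/1 model (one server, Poisson arrivals, exponentially distributed service times). The summary does not say which model or tool the speaker had in mind, so the choice of M/M/1, the function name `mm1_estimate`, and the example rates are all assumptions made for illustration.

```python
def mm1_estimate(arrival_rate: float, service_rate: float) -> dict:
    """Estimate steady-state behaviour of an M/M/1 queue.

    arrival_rate: average requests arriving per second (lambda)
    service_rate: average requests one server can complete per second (mu)
    """
    if arrival_rate <= 0 or service_rate <= 0:
        raise ValueError("rates must be positive")
    if arrival_rate >= service_rate:
        # The queue grows without bound; the formulas below do not apply.
        raise ValueError("unstable: arrival rate must be below service rate")

    utilization = arrival_rate / service_rate
    # Mean time a request spends in the system (waiting + being served).
    mean_time_in_system_s = 1.0 / (service_rate - arrival_rate)
    # Little's law: mean number in system = arrival rate * mean time in system.
    mean_requests_in_system = arrival_rate * mean_time_in_system_s
    return {
        "utilization": utilization,
        "mean_time_in_system_s": mean_time_in_system_s,
        "mean_requests_in_system": mean_requests_in_system,
    }


# Hypothetical load: 80 requests/s against a server that handles 100 requests/s.
print(mm1_estimate(arrival_rate=80.0, service_rate=100.0))
```

Real systems rarely match the M/M/1 assumptions exactly, so the output is best read as a rough baseline for comparing options rather than a prediction.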
How to run the decision process (lean approach)
The emphasis is that the approach matters more than the specific tool choice.
Set-based design + Last Responsible Moment (lean techniques)
- Investigate multiple alternatives in parallel
  - each sub-team/engineer focuses on a different metric or alternative
  - then consolidate findings
- Defer the final choice until you have enough evidence
  - decide late when possible; earlier only when constraints force it
Concrete tactics to defer/contain risk
- Parallel investigation with a structured summary
- Feature toggles
  - hide the tech choice behind a flag
  - allow turning implementations on/off even in production
- A/B testing
  - validate behavior/requirements with real usage before finalizing
- Deprecation planning
  - for public-facing technology, deprecation can take weeks/months and be expensive
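The feature-toggle tactic above can be sketched as two interchangeable implementations behind one interface, selected by a flag at runtime. This is an illustrative sketch only. The class names, the `save` method, and the `USE_CANDIDATE_STORE` environment variable are hypothetical and do not come from the talk.

```python
import os
from typing import Mapping, Optional, Protocol


class Store(Protocol):
    """The interface both technology candidates must satisfy."""

    def save(self, key: str, value: str) -> str: ...


class LegacyStore:
    """Stand-in for the technology currently in use (hypothetical)."""

    def save(self, key: str, value: str) -> str:
        return f"legacy:{key}={value}"


class CandidateStore:
    """Stand-in for the technology under evaluation (hypothetical)."""

    def save(self, key: str, value: str) -> str:
        return f"candidate:{key}={value}"


def pick_store(env: Optional[Mapping[str, str]] = None) -> Store:
    """Choose the implementation from a flag; defaults to the legacy one."""
    env = os.environ if env is None else env
    enabled = env.get("USE_CANDIDATE_STORE", "false").lower() == "true"
    return CandidateStore() if enabled else LegacyStore()


# Callers depend only on the Store interface, so flipping the flag
# swaps the implementation without touching calling code.
store = pick_store({"USE_CANDIDATE_STORE": "true"})
print(store.save("user", "42"))
```

Because the flag is read at call time, the candidate can be switched off again without a redeploy, which is what makes the final decision deferrable.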
Organizational / leadership alignment guidance
- Management buy-in improves when you present decisions as numbers (not “technical debt” rhetoric).
  - Example: instead of qualitative “adaptation,” show a trend line of increasing technical adaptation tickets.
- Cost framing matters:
  - leadership often rejects “unbounded” estimates (e.g., “$500k/month cloud” type reactions)
- Translation rule:
  - convert technical risk into measurable business impact (time, cost, operational stability)
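One way to produce the ticket trend line mentioned above is a least-squares slope over monthly ticket counts. The sketch below assumes the counts are already available as a list. The function name `trend_slope` and the sample numbers are hypothetical, not figures from the talk.

```python
def trend_slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...).

    A positive result means the series is rising by roughly that many
    units per period. Requires at least two values.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Hypothetical counts of maintenance-related tickets per month.
monthly_tickets = [4, 6, 9, 11, 15, 18]
slope = trend_slope(monthly_tickets)
print(f"tickets are growing by about {slope:.1f} per month")
```

A single number such as "about three more tickets every month" is easier to discuss with leadership than a qualitative claim that debt is increasing.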
“Awesome decision-making recipe” (step-by-step playbook)
Recommended high-level recipe:
- Don’t rely on one person’s opinion or deep dive.
- Don’t depend on Googling or ChatGPT for the best choice (the answer is context-dependent).
- Limit metrics based on time:
  - if you have ~1–1.5 months: consider up to 4 metrics
  - if you have ~1–2 weeks: consider ~1–3 metrics
- Run measurement collection across the team:
  - score each technology per category (e.g., 1–10 or 1–5 stars)
  - share results via an Excel/Confluence-like sheet
  - aggregate and pick a winner
- If disagreements remain:
  - revisit using lean tactics: feature toggles, A/B testing, and structured elimination
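The "score, aggregate, pick a winner" step can be sketched as a small scoring matrix. In this sketch each team member gives a 1–5 score per category, scores are averaged within each category, and categories are weighted equally. The equal weighting, the helper names, and every number in `SCORES` are assumptions for illustration; they are not results or recommendations from the talk.

```python
from statistics import mean

# Hypothetical 1-5 scores from two team members, per technology and category.
SCORES = {
    "Docker Compose": {
        "design": [4, 5],
        "learning_curve": [5, 5],
        "runtime": [3, 2],
        "support": [3, 4],
    },
    "Kubernetes": {
        "design": [4, 4],
        "learning_curve": [2, 2],
        "runtime": [5, 5],
        "support": [4, 4],
    },
}


def aggregate(scores):
    """Average votes per category, then average the categories equally."""
    totals = {}
    for tech, per_category in scores.items():
        category_means = [mean(votes) for votes in per_category.values()]
        totals[tech] = mean(category_means)
    return totals


def pick_winner(scores):
    """Return the technology with the highest aggregate score."""
    totals = aggregate(scores)
    return max(totals, key=totals.get)


for tech, total in aggregate(SCORES).items():
    print(f"{tech}: {total:.3f}")
print("winner:", pick_winner(SCORES))
```

If one category matters more for a given project, for example runtime constraints, a weighted average could replace the equal weighting. A close result is a signal to fall back on the lean tactics rather than to trust the arithmetic.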
Examples from Q&A (additional actionable insights)
- Brainstorming that led to real adoption
  - The speaker encouraged people to “throw crazy ideas” in meetings.
  - Outcome example: repeated suggestion of monorepo for a microservices context; the team discussed merging because it was harder to find code across repos.
- Best brainstorming structure
  - Use a mind map:
    - central core idea
    - branching into options/alternatives
  - Rules:
    - “no stupid ideas” in phase 1
    - judging/elimination in phase 2
- How to involve management
  - show statistically backed trends
  - provide cost estimates within constraints
- When you’ve dug too deep
  - use a time-box (limit duration; avoid long single-alternative research)
  - if you’re building without automated tests, you’re likely over-investing
- How to involve management/engineering without conflict
  - avoid arguing/pushing personal preference (“I’ve used this for 10 years”)
  - use collaboration + experimentation
Key concrete thresholds/targets mentioned
While no explicit business KPIs like CAC/LTV/churn appear, several operational thresholds were suggested:
- Maturity proxy: ecosystems should have meaningful usage over ~5 years
- Learning speed target: avoid environments needing ~30 seconds to start; prefer much faster feedback
- Time-box guidance for decision research:
  - typical cap for “single alternative deep dive”: ~1–1.5 months maximum (and ideally shorter)
- Practical metric scope based on time:
  - ~4 metrics for ~1–1.5 months
  - 1–3 metrics for ~1–2 weeks
- Deprecation risk: public-facing tech deprecation can take weeks/months
Presenters / sources (as mentioned)
- Oresz (Oresztesz) Margaritisz, Chief Software Engineer, EPAM
External references mentioned:
- Microsoft (quality attributes/metrics approach referenced)
- InfoQ (technology radar / templates mentioned)
- Thoughtworks (technology radar referenced)
- EPAM (technology radar mentioned as available internally/free)
- AWS (pricing calculator, SLA examples mentioned)
- TechEmpower (performance benchmarking site mentioned)
Books mentioned in Q&A/closing:
- “How We Decide” (title as spoken)
- “Emotional Intelligence” (title as spoken; exact subtitle not provided)