Summary of "5 AI Engineer Projects to Build in 2026 | Ex-Google, Microsoft"
High-level summary
- The video presents a practical roadmap of five portfolio projects that demonstrate production-grade AI engineering skills hiring managers care about in 2026.
- Emphasis is on systems, observability, evaluation, and measurable trade-offs — not just flashy demos.
- Each project targets a distinct, in-demand skill: retrieval (RAG) systems, local model inference, monitoring/observability, fine-tuning/alignment, and real-time multimodal systems.
- For each project the speaker outlines three phases: a minimal working demo, engineering/production improvements, and a final rigorous evaluation/reporting step.
The five recommended projects
1) Production-grade Retrieval-Augmented Generation (RAG)
Purpose: Build a domain-specific “ask-my-doc” system that returns answers grounded in retrieved evidence (citations), demonstrating faithfulness.
Phases
- Phase 1 (fundamentals)
- Ingest documents (PDF/MD/web).
- Chunk into ~500–800 token pieces with ~100-token overlap.
- Embed chunks and store in a vector DB.
- Retrieve top-k and produce answers that cite source paragraphs.
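The chunking step can be sketched as follows (a minimal illustration: real pipelines chunk tokenizer output, and the `chunk_size`/`overlap` defaults simply match the ranges above):

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into overlapping chunks.

    Consecutive chunks share `overlap` tokens, so a sentence cut at one
    chunk boundary still appears whole in the neighboring chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```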
- Phase 2 (production)
- Implement hybrid retrieval (BM25 + vector search).
- Add a cross-encoder re-ranker to rescore retrieved chunks.
- Enforce citation rules (decline when unsupported).
- Version-control prompts/configs.
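One standard way to merge BM25 and vector-search rankings is reciprocal rank fusion; the video recommends hybrid retrieval but does not prescribe a specific fusion method, so treat this as one reasonable option:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g., BM25 and vector search).

    Each document earns 1/(k + rank) per list it appears in; k=60 is the
    constant from the original RRF paper and damps the weight of top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists float to the top, which is why hybrid retrieval tends to beat either method alone before the re-ranker even runs.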
- Phase 3 (shipping & QA)
- Create a golden evaluation dataset (50–200 Q/A pairs).
- Offline faithfulness evaluation (are claims supported by retrieved chunks).
- CI gating so PRs fail if quality drops.
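The CI gate might look like this sketch, where the metric names and thresholds are illustrative and `results` is the per-question output of the offline evaluation:

```python
def regression_gate(results, thresholds):
    """Compare per-metric means on the golden set against minimum thresholds.

    Returns (passed, failures) so a CI step can exit non-zero on regression.
    """
    failures = []
    for metric, minimum in thresholds.items():
        values = [r[metric] for r in results]
        mean = sum(values) / len(values)
        if mean < minimum:
            failures.append(f"{metric}: {mean:.3f} < {minimum}")
    return (not failures, failures)
```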
Suggested stack/tools and metrics
- Orchestration: LangChain or LangGraph.
- Vector DB: ChromaDB (or alternatives).
- Re-ranker: Cohere re-ranker or sentence-transformers cross-encoder.
- RAG evaluation framework (video referenced a tool).
- Key metrics: retrieval precision/recall, citation coverage, faithfulness, CI pass/fail.
2) Local / offline small-model assistant
Purpose: Run LLMs locally for privacy/latency/cost/edge use cases and benchmark real-world trade-offs.
Phases
- Phase 1
- Install a local runtime (the video suggests Ollama).
- Run a small (3–8B) model (e.g., Llama 3.2 3B, Llama 3 8B, or Mistral 7B).
- Build a CLI or small API wrapper.
- Benchmark tokens/sec, time-to-first-token, total latency.
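A runtime-agnostic sketch of the benchmarking step, assuming the client exposes the response as a token generator (as streaming APIs typically do):

```python
import time

def benchmark_stream(token_stream):
    """Measure time-to-first-token, total latency, and tokens/sec for a
    generator that yields tokens as the model produces them."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": ttft,
        "total_latency_s": total,
        "tokens_per_sec": count / total if total > 0 else 0.0,
    }
```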
- Phase 2
- Enforce deterministic structure (JSON schema).
- Validate outputs (e.g., Pydantic) and implement retry on invalid outputs.
- Experiment with temperature to show stochasticity effects.
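A stdlib sketch of the validate-and-retry loop (the video suggests Pydantic for this; a hand-rolled schema check keeps the example self-contained, and the field names are invented):

```python
import json

REQUIRED_FIELDS = {"payee": str, "amount": (int, float)}  # hypothetical schema

def validate(raw):
    """Parse model output and check required fields; raise ValueError on failure."""
    obj = json.loads(raw)  # raises on malformed JSON
    for field, types in REQUIRED_FIELDS.items():
        if field not in obj or not isinstance(obj[field], types):
            raise ValueError(f"bad or missing field: {field}")
    return obj

def generate_validated(generate, prompt, retries=3):
    """Call the model, validate its output, and retry with the error fed back
    into the prompt when validation fails."""
    for _ in range(retries):
        raw = generate(prompt)
        try:
            return validate(raw)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            prompt = f"{prompt}\n\nLast output was invalid ({err}). Return only valid JSON."
    raise RuntimeError(f"no valid output after {retries} attempts")
```

With Pydantic, `validate` collapses to a single `Model.model_validate_json(raw)` call; the retry structure stays the same.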
- Phase 3
- Model comparison study: benchmark three models on the same hardware (memory, throughput, and output quality on 30–50 prompts).
- Report quantized variants (GGUF Q4/Q5) and quality vs speed trade-offs.
Deliverables
- Performance benchmarks, validation + retry pipeline, concise technical report comparing models.
3) Monitoring & observability for RAG systems
Purpose: Prove you can operate and debug production AI systems — tracing, metrics, dashboards, regression gating.
Phases
- Phase 1
- Instrument the pipeline: trace which chunks were retrieved, re-ranker ordering, prompt used, model response, token counts.
- Tools suggested: LangSmith, Langfuse, Braintrust (Langfuse recommended as self-hostable).
- Phase 2
- Track quality and SRE-style metrics: latency percentiles (P50/P95), cost per request, citation coverage, failure rate.
- Build dashboards to investigate anomalies.
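P50/P95 can be computed by hand when a metrics backend isn't available; this sketch uses the same linear-interpolation method as NumPy's default percentile:

```python
def percentile(values, pct):
    """Linear-interpolation percentile over a sample of measurements."""
    vals = sorted(values)
    if not vals:
        raise ValueError("empty sample")
    k = (len(vals) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (vals[hi] - vals[lo]) * (k - lo)

def latency_summary(latencies_ms):
    """The two SRE-style tail metrics the dashboards should chart."""
    return {"p50": percentile(latencies_ms, 50), "p95": percentile(latencies_ms, 95)}
```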
- Phase 3
- Connect eval dataset to CI as regression gates.
- Version prompts/configs alongside code for auditability.
Deliverables
- Traceable request logs, dashboards, regression gating in CI.
4) Fine-tuning and alignment project
Purpose: Demonstrate when fine-tuning is needed and produce measurable, task-specific improvements (not just “make model smarter”).
Suggested tasks
- Structured JSON extraction from messy text.
- Tool-call accuracy (function selection + parameter filling).
Phases
- Phase 1 (SFT)
- Supervised fine-tuning with a clean dataset (2k–10k examples).
- Use parameter-efficient fine-tuning (LoRA/QLoRA-style).
- Evaluate on JSON validity, exact match, refusal correctness.
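The JSON-validity and exact-match metrics can be scored with a few lines of stdlib code (the test data below is invented):

```python
import json

def score_predictions(preds, golds):
    """Compute JSON-validity rate and exact-match rate for extraction outputs.

    `preds` are raw model strings; `golds` are the expected parsed objects.
    """
    valid = exact = 0
    for pred, gold in zip(preds, golds):
        try:
            obj = json.loads(pred)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against both metrics
        valid += 1
        if obj == gold:
            exact += 1
    n = len(golds)
    return {"json_validity": valid / n, "exact_match": exact / n}
```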
- Phase 2 (preference tuning)
- Collect pairwise preference data (good vs worse outputs).
- Use DPO / reward/preference optimization to improve behavior beyond SFT baseline.
- Re-evaluate improvements.
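Pairwise preference data for DPO-style training is typically stored as prompt/chosen/rejected records (this layout matches TRL's `DPOTrainer` convention; the example content is invented):

```python
# One pairwise preference record: the "chosen" output is valid, schema-conforming
# JSON; the "rejected" output is the fluent-but-unstructured answer the SFT
# baseline might produce.
preference_pair = {
    "prompt": "Extract the payment as JSON: 'Paid $42.50 to Acme on 2024-03-01.'",
    "chosen": '{"payee": "Acme", "amount": 42.50, "date": "2024-03-01"}',
    "rejected": "Sure! The payment was $42.50 to Acme on March 1st.",
}
```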
Notes and tooling
- Show training curves, before/after metrics, troubleshooting notes.
- Suggested tooling: Hugging Face TRL, Axolotl (or similar), managed compute (e.g., Fireworks AI).
5) Real-time multimodal application (streaming / low-latency)
Purpose: Handle streaming data, latency budgets, and resilience for real-time use cases.
Three track examples
- Voice assistant: ASR → LLM → TTS.
- Computer-vision + LLM reasoning.
- Streaming log analyzer: real-time anomaly detection + natural-language explanations.
Phases
- Phase 1
- Implement an end-to-end streaming pipeline (WebSockets recommended) so data flows in and responses stream back in real time.
- Phase 2
- Latency breakdown and budgeting — measure ASR latency, LLM time-to-first-token, TTS time-to-first-byte, visualize per-request breakdown (P50/P95).
- Phase 3
- Resilience and graceful degradation — timeouts, fallbacks, replay mode for debugging, and recovery strategies when components fail.
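The timeout-plus-fallback pattern can be sketched with asyncio (component names are placeholders; a real pipeline would also log the degradation for the replay mode mentioned above):

```python
import asyncio

async def call_with_fallback(primary, fallback, timeout_s=2.0):
    """Run the primary component under a latency budget; on timeout or
    failure, degrade gracefully to a cheaper fallback."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except Exception:  # asyncio.TimeoutError is an Exception subclass
        return await fallback()
```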
Deliverables
- Working streaming demo, latency visualizations, resilience/recovery mechanisms.
Cross-cutting engineering practices emphasized
- Grounding and citation enforcement to avoid hallucination.
- Hybrid retrieval (BM25 + vector) + re-ranker to improve precision.
- Version prompts and configs as part of system architecture.
- CI-based regression gating using a golden evaluation set.
- Observability: tracing each request, token accounting, P50/P95 latency, cost per request, citation coverage, failure rates.
- Reproducible benchmarks and technical reports (numbers matter).
- Defensive engineering: schema validation, retries, graceful degradation, replay/debug modes.
Tools, frameworks, and resources (mentioned)
- Orchestration / RAG: LangChain, LangGraph.
- Vector stores: ChromaDB (and other vector DBs).
- Re-rankers: Cohere re-ranker, sentence-transformers cross-encoders.
- RAG evaluation: the video referenced a specific tool whose name may be mis-transcribed in the subtitles.
- Local model runtimes: Ollama.
- Models: Llama 3 variants, Mistral 7B.
- Quantization formats: GGUF Q4/Q5.
- Observability: LangSmith, Langfuse (recommended; open-source and self-hostable), Braintrust.
- ASR/TTS: Deepgram, Whisper (ASR); various speech synthesis tools referenced.
- Streaming/orchestration: WebSockets.
- Fine-tuning libraries: Hugging Face TRL, Axolotl (or similar), LoRA / QLoRA techniques.
- Managed compute/platform: Fireworks AI.
Note: some tool names from auto-generated subtitles may be slightly misspelled — the video description reportedly lists exact links and correct names.
Key deliverables to include in each portfolio project
- Working demo + public source code (GitHub).
- Measured benchmarks (latency, throughput, memory, tokens/sec).
- Golden evaluation dataset and automated tests that run in CI.
- Dashboards/tracing for observability and incident investigation.
- Technical report with before/after metrics, training curves, and failure notes.
- Clear README documenting decisions, trade-offs, and how to reproduce results.
Main speaker / sources
Speaker: Ashwarashan (name as transcribed) — more than 10 years building ML/AI systems; MS in Data Science from Columbia; experience at Microsoft, Google, and IBM; led AI developer relations at Fireworks AI.
Companies/tools referenced in context: Microsoft, Google, IBM, Fireworks AI, and many open-source tools/frameworks listed above.
Conclusion
Build these five complementary projects to show full-lifecycle AI engineering: correctness/faithfulness (RAG), local inference trade-offs, operational observability, fine-tuning/alignment rigor, and real-time multimodal engineering. Each project includes concrete phases, measurable metrics, and tooling suggestions to make your portfolio stand out to hiring managers.