Summary of "5 AI Engineer Projects to Build in 2026 | Ex-Google, Microsoft"
High-level summary
- The video presents a practical roadmap of five portfolio projects that demonstrate production-grade AI engineering skills hiring managers care about in 2026.
- Emphasis is on systems, observability, evaluation, and measurable trade-offs — not just flashy demos.
- Each project targets a distinct, in-demand skill: retrieval (RAG) systems, local model inference, monitoring/observability, fine-tuning/alignment, and real-time multimodal systems.
- For each project the speaker outlines three phases: a minimal working demo, engineering/production improvements, and a final rigorous evaluation/reporting step.
The five recommended projects
1) Production-grade Retrieval-Augmented Generation (RAG)
Purpose: Build a domain-specific “ask-my-doc” system that returns answers grounded in retrieved evidence (citations), demonstrating faithfulness.
Phases
- Phase 1 (fundamentals)
- Ingest documents (PDF/MD/web).
- Chunk into ~500–800 token pieces with ~100-token overlap.
- Embed chunks and store in a vector DB.
- Retrieve top-k and produce answers that cite source paragraphs.
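The chunking step can be sketched as follows (a minimal illustration: real pipelines chunk tokenizer output, and the `chunk_size`/`overlap` defaults simply match the ranges above):

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into overlapping chunks.

    Consecutive chunks share `overlap` tokens, so a sentence cut at one
    chunk boundary still appears whole in the neighboring chunk.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```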
- Phase 2 (production)
- Implement hybrid retrieval (BM25 + vector search).
- Add a cross-encoder re-ranker to rescore retrieved chunks.
- Enforce citation rules (decline when unsupported).
- Version-control prompts/configs.
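One standard way to merge BM25 and vector-search rankings is reciprocal rank fusion; the video recommends hybrid retrieval but does not prescribe a specific fusion method, so treat this as one reasonable option:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked result lists (e.g., BM25 and vector search).

    Each document earns 1/(k + rank) per list it appears in; k=60 is the
    constant from the original RRF paper and damps the weight of top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear high in both lists float to the top, which is why hybrid retrieval tends to beat either method alone before the re-ranker even runs.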
- Phase 3 (shipping & QA)
- Create a golden evaluation dataset (50–200 Q/A pairs).
- Offline faithfulness evaluation (are claims supported by retrieved chunks).
- CI gating so PRs fail if quality drops.
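The CI gate might look like this sketch, where the metric names and thresholds are illustrative and `results` is the per-question output of the offline evaluation:

```python
def regression_gate(results, thresholds):
    """Compare per-metric means on the golden set against minimum thresholds.

    Returns (passed, failures) so a CI step can exit non-zero on regression.
    """
    failures = []
    for metric, minimum in thresholds.items():
        values = [r[metric] for r in results]
        mean = sum(values) / len(values)
        if mean < minimum:
            failures.append(f"{metric}: {mean:.3f} < {minimum}")
    return (not failures, failures)
```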
Suggested stack/tools and metrics
- Orchestration: LangChain or LangGraph.
- Vector DB: ChromaDB (or alternatives).
- Re-ranker: Cohere re-ranker or sentence-transformers cross-encoder.
- RAG evaluation framework (video referenced a tool).
- Key metrics: retrieval precision/recall, citation coverage, faithfulness, CI pass/fail.
2) Local / offline small-model assistant
Purpose: Run LLMs locally for privacy/latency/cost/edge use cases and benchmark real-world trade-offs.
Phases
- Phase 1
- Install a local runtime (the video suggests Ollama).
- Run a small (3–8B) model (e.g., Llama 3.2 3B, Llama 3 8B, or Mistral 7B).
- Build a CLI or small API wrapper.
- Benchmark tokens/sec, time-to-first-token, total latency.
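A runtime-agnostic sketch of the benchmarking step, assuming the client exposes the response as a token generator (as streaming APIs typically do):

```python
import time

def benchmark_stream(token_stream):
    """Measure time-to-first-token, total latency, and tokens/sec for a
    generator that yields tokens as the model produces them."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": ttft,
        "total_latency_s": total,
        "tokens_per_sec": count / total if total > 0 else 0.0,
    }
```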
- Phase 2
- Enforce deterministic structure (JSON schema).
- Validate outputs (e.g., Pydantic) and implement retry on invalid outputs.
- Experiment with temperature to show stochasticity effects.
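A stdlib sketch of the validate-and-retry loop (the video suggests Pydantic for this; a hand-rolled schema check keeps the example self-contained, and the field names are invented):

```python
import json

REQUIRED_FIELDS = {"payee": str, "amount": (int, float)}  # hypothetical schema

def validate(raw):
    """Parse model output and check required fields; raise ValueError on failure."""
    obj = json.loads(raw)  # raises on malformed JSON
    for field, types in REQUIRED_FIELDS.items():
        if field not in obj or not isinstance(obj[field], types):
            raise ValueError(f"bad or missing field: {field}")
    return obj

def generate_validated(generate, prompt, retries=3):
    """Call the model, validate its output, and retry with the error fed back
    into the prompt when validation fails."""
    for _ in range(retries):
        raw = generate(prompt)
        try:
            return validate(raw)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            prompt = f"{prompt}\n\nLast output was invalid ({err}). Return only valid JSON."
    raise RuntimeError(f"no valid output after {retries} attempts")
```

With Pydantic, `validate` collapses to a single `Model.model_validate_json(raw)` call; the retry structure stays the same.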
- Phase 3
- Model comparison study: benchmark three models on the same hardware (memory, throughput, and output quality on 30–50 prompts).
- Report quantized variants (GGUF Q4/Q5) and quality vs speed trade-offs.
Deliverables
- Performance benchmarks, validation + retry pipeline, concise technical report comparing models.
3) Monitoring & observability for RAG systems
Purpose: Prove you can operate and debug production AI systems — tracing, metrics, dashboards, regression gating.
Phases
- Phase 1
- Instrument the pipeline: trace which chunks were retrieved, re-ranker ordering, prompt used, model response, token counts.
- Tools suggested: LangSmith, Langfuse, Braintrust (Langfuse recommended as self-hostable).
- Phase 2
- Track quality and SRE-style metrics: latency percentiles (P50/P95), cost per request, citation coverage, failure rate.
- Build dashboards to investigate anomalies.
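P50/P95 can be computed by hand when a metrics backend isn't available; this sketch uses the same linear-interpolation method as NumPy's default percentile:

```python
def percentile(values, pct):
    """Linear-interpolation percentile over a sample of measurements."""
    vals = sorted(values)
    if not vals:
        raise ValueError("empty sample")
    k = (len(vals) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(vals) - 1)
    return vals[lo] + (vals[hi] - vals[lo]) * (k - lo)

def latency_summary(latencies_ms):
    """The two SRE-style tail metrics the dashboards should chart."""
    return {"p50": percentile(latencies_ms, 50), "p95": percentile(latencies_ms, 95)}
```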
- Phase 3
- Connect eval dataset to CI as regression gates.
- Version prompts/configs alongside code for auditability.
Deliverables
- Traceable request logs, dashboards, regression gating in CI.
4) Fine-tuning and alignment project
Purpose: Demonstrate when fine-tuning is needed and produce measurable, task-specific improvements (not just “make model smarter”).
Suggested tasks
- Structured JSON extraction from messy text.
- Tool-call accuracy (function selection + parameter filling).
Phases
- Phase 1 (SFT)
- Supervised fine-tuning with a clean dataset (2k–10k examples).
- Use parameter-efficient fine-tuning (LoRA/QLoRA-style).
- Evaluate on JSON validity, exact match, refusal correctness.
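The JSON-validity and exact-match metrics can be scored with a few lines of stdlib code (the test data below is invented):

```python
import json

def score_predictions(preds, golds):
    """Compute JSON-validity rate and exact-match rate for extraction outputs.

    `preds` are raw model strings; `golds` are the expected parsed objects.
    """
    valid = exact = 0
    for pred, gold in zip(preds, golds):
        try:
            obj = json.loads(pred)
        except json.JSONDecodeError:
            continue  # invalid JSON counts against both metrics
        valid += 1
        if obj == gold:
            exact += 1
    n = len(golds)
    return {"json_validity": valid / n, "exact_match": exact / n}
```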
- Phase 2 (preference tuning)
- Collect pairwise preference data (good vs worse outputs).
- Use DPO / reward/preference optimization to improve behavior beyond SFT baseline.
- Re-evaluate improvements.
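Pairwise preference data for DPO-style training is typically stored as prompt/chosen/rejected records (this layout matches TRL's `DPOTrainer` convention; the example content is invented):

```python
# One pairwise preference record: the "chosen" output is valid, schema-conforming
# JSON; the "rejected" output is the fluent-but-unstructured answer the SFT
# baseline might produce.
preference_pair = {
    "prompt": "Extract the payment as JSON: 'Paid $42.50 to Acme on 2024-03-01.'",
    "chosen": '{"payee": "Acme", "amount": 42.50, "date": "2024-03-01"}',
    "rejected": "Sure! The payment was $42.50 to Acme on March 1st.",
}
```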
Notes and tooling
- Show training curves, before/after metrics, troubleshooting notes.
- Suggested tooling: Hugging Face TRL, Axolotl (or similar), managed compute (e.g., Fireworks AI).
5) Real-time multimodal application (streaming / low-latency)
Purpose: Handle streaming data, latency budgets, and resilience for real-time use cases.
Three track examples
- Voice assistant: ASR → LLM → TTS.
- Computer-vision + LLM reasoning.
- Streaming log analyzer: real-time anomaly detection + natural-language explanations.
Phases
- Phase 1
- Implement an end-to-end streaming pipeline (WebSockets recommended) so data flows in and responses stream back in real time.
- Phase 2
- Latency breakdown and budgeting — measure ASR latency, LLM time-to-first-token, TTS time-to-first-byte, visualize per-request breakdown (P50/P95).
- Phase 3
- Resilience and graceful degradation — timeouts, fallbacks, replay mode for debugging, and recovery strategies when components fail.
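The timeout-plus-fallback pattern can be sketched with asyncio (component names are placeholders; a real pipeline would also log the degradation for the replay mode mentioned above):

```python
import asyncio

async def call_with_fallback(primary, fallback, timeout_s=2.0):
    """Run the primary component under a latency budget; on timeout or
    failure, degrade gracefully to a cheaper fallback."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except Exception:  # asyncio.TimeoutError is an Exception subclass
        return await fallback()
```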
Deliverables
- Working streaming demo, latency visualizations, resilience/recovery mechanisms.
Cross-cutting engineering practices emphasized
- Grounding and citation enforcement to avoid hallucination.
- Hybrid retrieval (BM25 + vector) + re-ranker to improve precision.
- Version prompts and configs as part of system architecture.
- CI-based regression gating using a golden evaluation set.
- Observability: tracing each request, token accounting, P50/P95 latency, cost per request, citation coverage, failure rates.
- Reproducible benchmarks and technical reports (numbers matter).
- Defensive engineering: schema validation, retries, graceful degradation, replay/debug modes.
Tools, frameworks, and resources (mentioned)
- Orchestration / RAG: LangChain, LangGraph.
- Vector stores: ChromaDB (and other vector DBs).
- Re-rankers: Cohere re-ranker, sentence-transformers cross-encoders.
- RAG evaluation: the video referenced a specific tool whose name may be mis-transcribed in the subtitles.
- Local model runtimes: Ollama.
- Models: Llama 3 variants, Mistral 7B.
- Quantization formats: GGUF Q4/Q5.
- Observability: LangSmith, Langfuse (recommended; open-source and self-hostable), Braintrust.
- ASR/TTS: Deepgram, Whisper (ASR); various speech synthesis tools referenced.
- Streaming/orchestration: WebSockets.
- Fine-tuning libraries: Hugging Face TRL, Axolotl (or similar), LoRA / QLoRA techniques.
- Managed compute/platform: Fireworks AI.
Note: some tool names from auto-generated subtitles may be slightly misspelled — the video description reportedly lists exact links and correct names.
Key deliverables to include in each portfolio project
- Working demo + public source code (GitHub).
- Measured benchmarks (latency, throughput, memory, tokens/sec).
- Golden evaluation dataset and automated tests that run in CI.
- Dashboards/tracing for observability and incident investigation.
- Technical report with before/after metrics, training curves, and failure notes.
- Clear README documenting decisions, trade-offs, and how to reproduce results.
Main speaker / sources
Speaker: Ashwarashan (name as transcribed) — more than 10 years building ML/AI systems; MS in Data Science from Columbia; experience at Microsoft, Google, and IBM; led AI developer relations at Fireworks AI.
Companies/tools referenced in context: Microsoft, Google, IBM, Fireworks AI, and many open-source tools/frameworks listed above.
Conclusion
Build these five complementary projects to show full-lifecycle AI engineering: correctness/faithfulness (RAG), local inference trade-offs, operational observability, fine-tuning/alignment rigor, and real-time multimodal engineering. Each project includes concrete phases, measurable metrics, and tooling suggestions to make your portfolio stand out to hiring managers.