Summary of "LangSmith Crash Course | LangSmith Tutorial for Beginners | Observability in GenAI | CampusX"

Tech focus of the video (LangSmith crash course)

The video introduces LangSmith as an end-to-end Observability + Evaluation platform for LLM applications (including LangChain, LangGraph, RAG pipelines, and agentic workflows).

The core message: LLM systems are typically non-deterministic black boxes, so diagnosing issues such as latency regressions, cost spikes, and hallucinations is hard without internal traces.


Why LangSmith is needed (problem scenarios)

The speaker motivates observability through three production-style failure cases:

  1. Latency regression in a multi-stage LLM workflow

    • Example: generating job-specific cover letters by processing:
      • job descriptions (JD)
      • student portfolios
      • then producing/proofreading letters
    • Symptoms: latency increases from ~2 minutes to 7–10 minutes with no visibility into which stage caused the extra time.
    • Key issue: without step-level breakdown, you can’t determine whether delays are in JD parsing, portfolio fetching, matching, or proofreading.
  2. Cost explosion in an agent that loops until “perfect”

    • Example: a research assistant fetches papers, extracts key points, and summarizes them, consuming paid tokens for every report.
    • Symptoms: the OpenAI/API cost per report jumps (e.g., from 50 paise to ₹2) because some reports trigger more iterations.
    • Root cause (hypothetical, as walked through in the video): a change to the agent’s prompt/policy tightened the quality criteria, so for certain topics the agent kept regenerating until they were met.
    • Key issue: without internal traces, you can’t see where extra token-consuming loops occurred.
  3. Hallucinations in a RAG system

    • Example: an HR chatbot answers questions from company policies using RAG.
    • Symptoms: chatbot hallucinates (e.g., wrong leave policy), potentially spreading misinformation.
    • Two hallucination sources:
      • Retriever errors: fails to fetch relevant context
      • Generator errors: the LLM answers incorrectly even when relevant context exists
    • Key issue: debugging is hard because RAG problems can originate in either retrieval or generation, and intermediate evidence isn’t exposed.
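The retriever-versus-generator split in the third case can be made concrete once the intermediate evidence is visible. The following is a minimal sketch, not anything shown in the video: the function name, the substring-matching rule, and the leave-policy figures are all assumptions made for illustration.

```python
def classify_rag_failure(retrieved_chunks, gold_evidence, answer, gold_answer):
    """Attribute a wrong answer to the retriever or the generator.

    Assumption: naive case-insensitive substring matching is good enough
    for illustration; a real check would need something more robust.
    """
    if gold_answer.lower() in answer.lower():
        return "ok"
    # Was the passage that contains the right answer retrieved at all?
    evidence_found = any(
        gold_evidence.lower() in chunk.lower() for chunk in retrieved_chunks
    )
    return "generator_error" if evidence_found else "retriever_error"


# Hypothetical policy text and answers.
good_context = ["Employees get 18 days of paid leave per year."]
bad_context = ["Office hours are 9 to 5."]

case_generator = classify_rag_failure(
    good_context, "18 days of paid leave", "You get 25 days.", "18 days"
)
case_retriever = classify_rag_failure(
    bad_context, "18 days of paid leave", "You get 25 days.", "18 days"
)
case_ok = classify_rag_failure(
    good_context, "18 days of paid leave", "You get 18 days of paid leave.", "18 days"
)
print(case_generator, case_retriever, case_ok)
```

The point of the sketch is that this classification is only possible when the retrieved chunks are recorded alongside the final answer, which is exactly what a trace provides.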

Observability definition (as presented)

Observability = the ability to understand a system’s internal state by examining its external outputs (logs, metrics, and traces), so that you can diagnose why something happened even if the failure was not anticipated.


What LangSmith provides (observability details)

LangSmith traces and records, at granular levels:


Core concepts: Project → Trace → Run

The video teaches these LangSmith abstractions:
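In LangSmith's terminology, a project groups traces, a trace is one end-to-end execution of the application, and a run is a single step inside a trace. The toy data model below is only a sketch of that nesting; the class and field names are assumptions and do not mirror LangSmith's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    """One step inside a trace (for example a prompt, an LLM call, a parser)."""
    name: str
    latency_s: float
    children: List["Run"] = field(default_factory=list)


@dataclass
class Trace:
    """One end-to-end execution, represented by its root run."""
    root: Run

    def flatten(self):
        """Return all runs in depth-first, left-to-right order."""
        out, stack = [], [self.root]
        while stack:
            run = stack.pop()
            out.append(run)
            stack.extend(reversed(run.children))
        return out


@dataclass
class Project:
    """A named collection of traces."""
    name: str
    traces: List[Trace] = field(default_factory=list)


# Hypothetical chain with three child runs and made-up latencies.
chain = Run("chain", 1.2, [Run("prompt", 0.0), Run("llm", 1.1), Run("parser", 0.1)])
project = Project("demo-project", [Trace(chain)])
run_names = [r.name for r in project.traces[0].flatten()]
print(run_names)
```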


Tutorials/demos shown (practical integration workflows)

1) Tracing a simple LangChain-style chain (LLM + prompt + parser)

The speaker runs a minimal chain and confirms the LangSmith UI shows:

2) Sequential chain tracing with multiple LLM calls

3) RAG application tracing and debugging

A RAG app is built over a local PDF:

LangSmith is used to explain retriever+generator behavior:

Two RAG-specific issues discovered in the demo

  1. Partial tracing

    • By default, only parts implemented as “runnables” appear in traces.
    • PDF loading/chunking/embedding steps were initially not fully traced.
  2. Inefficient recomputation

    • Each query reloads/chunks/embeds again, causing long latency.

Fix: improved tracing with traceable + function-level instrumentation

A RAG v2 approach:

Result in UI:

The video then emphasizes that ideally there should be a hierarchy: one top-level trace containing both the setup and the query sub-traces.

Fix: caching the vector index (latency reduction)

A RAG v4 approach uses a vector DB/index persistence strategy (mentions FAISS / a stored index):
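A minimal sketch of the persist-then-reload idea follows, using only the standard library in place of FAISS. The function names, the JSON file layout, the content-hash cache key, and the toy "embedding" are all assumptions for illustration and are not taken from the video's code.

```python
import hashlib
import json
import os
import tempfile


def build_index(text):
    """Stand-in for the expensive load/chunk/embed step (hypothetical)."""
    build_index.calls += 1
    chunks = [text[i:i + 20] for i in range(0, len(text), 20)]
    return {"chunks": chunks, "vectors": [[float(len(c))] for c in chunks]}


build_index.calls = 0  # counts how often the expensive path runs


def load_or_build_index(text, cache_dir):
    """Reuse a stored index when the document content is unchanged."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    path = os.path.join(cache_dir, f"index_{key}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    index = build_index(text)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index


cache_dir = tempfile.mkdtemp()
doc = "Employees accrue paid leave monthly. " * 3  # hypothetical document text
first = load_or_build_index(doc, cache_dir)   # builds and stores the index
second = load_or_build_index(doc, cache_dir)  # served from disk
print(build_index.calls)
```

Keying the cache on a hash of the content means an edited document gets a fresh index, while repeated queries over the same document skip the rebuild.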

UI comparison:


4) Agent tracing (tool-using “ReAct”-style agent)

The video demonstrates tracing an agent that:

LangSmith shows:

Another example forces multi-tool behavior:

The trace reveals when the agent takes a wrong tool path by choosing the wrong city (e.g., Gurgaon vs Karnal), and the intermediate steps it logs make the error debuggable.
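To show why logged intermediate steps matter, here is a small sketch with the LLM's decisions replaced by a scripted list. The `get_weather` tool, its canned data, the question, and the choice of which city is the wrong one are all hypothetical.

```python
def get_weather(city):
    """Hypothetical tool returning canned data."""
    data = {"Gurgaon": "34C, hazy", "Karnal": "31C, clear"}
    return data.get(city, "unknown")


TOOLS = {"get_weather": get_weather}


def run_agent(planned_calls):
    """Execute tool calls and log every step.

    Assumption: planned_calls stands in for the decisions an LLM would make.
    """
    steps = []
    for tool_name, tool_input in planned_calls:
        output = TOOLS[tool_name](tool_input)
        steps.append({"tool": tool_name, "input": tool_input, "output": output})
    return steps


def find_input_mismatches(question, steps):
    """Flag steps whose tool input never appears in the user's question."""
    return [s for s in steps if s["input"].lower() not in question.lower()]


question = "What is the weather in Karnal?"
steps = run_agent([("get_weather", "Gurgaon")])  # the agent picked the wrong city
bad = find_input_mismatches(question, steps)
ok = find_input_mismatches(question, run_agent([("get_weather", "Karnal")]))
print(bad)
```

The final answer alone would look plausible; only the recorded tool input exposes that the agent queried a different city than the one asked about.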


LangGraph + LangSmith integration concept

Key mapping described:

Example graph: an essay scoring workflow with nodes evaluating:

Then aggregating into overall feedback and average scores.


Beyond observability: other LangSmith capabilities (LLM Ops)

The video groups LangSmith’s other capabilities under an “LLM Ops” umbrella:

  1. Monitoring & Alerting

    • Monitors traces aggregated over time:
      • average latency, token usage, cost, error rate, success rate
    • Alerts trigger when metrics drift beyond thresholds (e.g., latency > X seconds)
  2. Evaluation

    • Addresses LLM non-determinism and regression risk
    • Uses standardized datasets and evaluation metrics such as:
      • faithfulness, relevance, completeness, etc.
    • Supports:
      • LLM-as-judge
      • semantic similarity checks
      • custom Python evaluators
  3. Prompt Experimentation (A/B testing prompts)

    • Test different prompt versions against a dataset
    • Evaluate and compare performance using evaluation criteria
    • Track results over time
  4. Dataset creation & annotation

    • Build/label datasets for evaluation
    • Import datasets or create empty datasets then add rows from traces
    • Versioned reuse across projects
  5. User feedback integration

    • Capture thumbs up/down and structured feedback from users
    • Tie feedback to traces/runs
    • Aggregate feedback signals for monitoring
  6. Collaboration

    • Share trace links
    • Invite teammates and share dashboards
    • Encourages team workflows (instead of manual screenshots/emails)
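The monitoring and alerting item above boils down to aggregating per-trace metrics and comparing them with thresholds. A minimal sketch of that arithmetic, with made-up trace records, field names, and thresholds (none of which come from LangSmith itself):

```python
from statistics import mean


def summarize(traces):
    """Aggregate per-trace metrics over a time window."""
    errors = sum(1 for t in traces if t["error"])
    return {
        "avg_latency_s": mean(t["latency_s"] for t in traces),
        "total_tokens": sum(t["tokens"] for t in traces),
        "error_rate": errors / len(traces),
    }


def check_alerts(summary, thresholds):
    """Return the names of metrics that exceed their threshold."""
    return [name for name, limit in thresholds.items() if summary[name] > limit]


# Hypothetical trace records.
traces = [
    {"latency_s": 2.0, "tokens": 500, "error": False},
    {"latency_s": 4.0, "tokens": 700, "error": False},
    {"latency_s": 9.0, "tokens": 1800, "error": True},
    {"latency_s": 5.0, "tokens": 1000, "error": False},
]
summary = summarize(traces)
alerts = check_alerts(summary, {"avg_latency_s": 4.0, "error_rate": 0.5})
print(summary, alerts)
```

Here the average latency of 5.0 seconds exceeds the 4.0-second threshold, so only the latency alert fires; the error rate of 0.25 stays under its limit.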
