Summary of "AI Agents 2 - Prompt Engineering."
High-level overview
- Purpose: Prompt engineering is the practice of designing, testing, and iterating on the textual inputs (prompts and templates) given to generative models (mainly LLMs) so that agent systems produce higher‑quality outputs. It is essential for AI agents because an LLM sits at the core of the agent, and prompts steer how it uses tools, memory, and planning to act on user goals.
Core workflow (iterative)
- Assemble data (often including some labeled examples).
- Design a prompt template.
- Run generation with an LLM.
- Optionally extract/parse model outputs.
- Score outputs with a utility function against ground truth.
- Modify the template and repeat until acceptable performance is reached.
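A minimal sketch of this loop, assuming a hypothetical `llm(prompt) -> str` callable and a small labeled dataset (both placeholders, not from the lecture):

```python
from typing import Callable

def exact_match(prediction: str, truth: str) -> float:
    """Utility function: 1.0 if the parsed answer matches ground truth."""
    return float(prediction.strip().lower() == truth.strip().lower())

def evaluate_template(template: str,
                      dataset: list[tuple[str, str]],
                      llm: Callable[[str], str]) -> float:
    """Score one prompt template against a labeled dataset."""
    scores = []
    for question, truth in dataset:
        prompt = template.format(input=question)    # fill the template
        output = llm(prompt)                        # run generation
        answer = output.splitlines()[-1]            # crude extraction/parsing
        scores.append(exact_match(answer, truth))   # score vs. ground truth
    return sum(scores) / len(scores)

# Iterate: modify the template and re-score until performance is acceptable.
candidates = [
    "Answer the question.\nQ: {input}\nA:",
    "You are a careful analyst. Answer concisely.\nQ: {input}\nA:",
]
# best = max(candidates, key=lambda t: evaluate_template(t, dataset, llm))
```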
Key definitions
- Prompt: Input to a generative model that guides output.
- Prompt template: A string with placeholders (role, instruction, context/examples, input data, constraints) that you fill per request; a minimal sketch follows this list.
- Prompt engineering: The iterative cycle of generating outputs from a prompt, evaluating them, and modifying the prompt to improve model outputs.
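A template sketch with the placeholder slots named above (slot names and example values are illustrative, not from the lecture):

```python
TEMPLATE = """\
{role}

Instruction: {instruction}

Examples:
{examples}

Input: {input}

Constraints: {constraints}
"""

prompt = TEMPLATE.format(
    role="You are a support triage assistant.",
    instruction="Classify the ticket as 'bug', 'billing', or 'other'.",
    examples="Ticket: 'I was charged twice.' -> billing",
    input="Ticket: 'The app crashes on login.'",
    constraints="Answer with a single label.",
)
```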
Techniques taxonomy (from a Feb 2025 systematic review)
The literature groups prompting strategies into six major families. Below are important techniques, what they do, and practical takeaways.
1) Zero‑shot techniques (no exemplars in the prompt)
Examples:
- Emotion prompting (append human‑relevance phrases).
- Role prompting (persona).
- Style prompting.
- System2Attention (S2A: remove irrelevant context).
- Theory‑of‑mind simulation (filter out information a character cannot know, then answer from that character's perspective).
- Rephrase‑and‑respond (clarify ambiguous questions).
- Reread (repeat the question in the prompt so the model reads it twice).
- SelfAsk (pose & answer subquestions).
Takeaways:
- Small, cheap tweaks can help.
- Use context‑filtering (S2A) or rephrasing (rephrase‑and‑respond) when the failure mode is noisy context or ambiguous wording; both are sketched below.
- Role/emotion/style often give only minor gains and can require brute‑force search to optimize.
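Two of the zero‑shot techniques above, sketched with the same hypothetical `llm` callable (prompt wordings are illustrative, not canonical):

```python
def rephrase_and_respond(question: str, llm) -> str:
    """Rephrase-and-respond: have the model restate an ambiguous question
    more precisely, then answer the restated version."""
    prompt = (
        "Rephrase and expand the question to remove ambiguity, "
        "then answer the rephrased question.\n"
        f"Question: {question}"
    )
    return llm(prompt)

def s2a_filter_then_answer(question: str, context: str, llm) -> str:
    """System2Attention-style filtering: strip irrelevant context first,
    then answer using only the filtered context."""
    filtered = llm(
        "Extract only the sentences from the context that are relevant "
        f"to the question.\nContext: {context}\nQuestion: {question}"
    )
    return llm(f"Context: {filtered}\nQuestion: {question}\nAnswer:")
```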
2) Few‑shot techniques (include examples)
Examples:
- Exemplar generation (let the model synthesize examples if you have none).
- Exemplar selection & ordering (choose diverse, representative few shots and test permutations).
- Instruction selection (automated instruction search exists, but writing the key instructions yourself usually works better).
Takeaways:
- A few high‑quality, human‑validated, diverse examples help substantially.
- Prefer real, annotated examples; avoid synthetic exemplars unless necessary.
- Ordering and selection matter; validate both on a small labeled test set (see the assembly sketch below).
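A sketch of few‑shot assembly and ordering search, assuming a handful of human‑validated exemplars (contents illustrative):

```python
import itertools

def build_few_shot_prompt(exemplars: list[tuple[str, str]], query: str) -> str:
    """Assemble a few-shot prompt from human-validated (input, label) pairs."""
    shots = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in exemplars)
    return f"{shots}\n\nInput: {query}\nLabel:"

exemplars = [
    ("I was charged twice.", "billing"),
    ("The app crashes on login.", "bug"),
    ("How do I export my data?", "other"),
]
# Ordering matters: for small k, score every permutation on a labeled test set.
orderings = list(itertools.permutations(exemplars))
```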
3) Thought generation / Chain‑of‑Thought (CoT)
Variants and examples:
- Zero‑shot CoT: append “let’s think step‑by‑step” or similar.
- Few‑shot CoT: provide examples that include reasoning chains.
- Analogical prompting, step‑back prompting (first ask a more abstract question, then answer the original), thread of thought (select pertinent info before reasoning).
- Table/tabular conversion (represent text as a table for numeric tasks).
- Program‑of‑thought (represent reasoning as code).
- Tree of thought (search tree of multiple reasoning branches).
- Complexity‑based selection, contrastive prompting, automatic CoT (Auto‑CoT), memory of thought, uncertainty routing.
Takeaways:
- Use CoT for harder reasoning / math / logic tasks.
- Combine with few‑shot exemplars for bigger gains.
- Validate multiple strategies and use sampling/aggregation methods where helpful (a zero‑shot CoT sketch follows).
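A sketch of the two‑stage zero‑shot CoT pattern (elicit reasoning, then extract the answer), again assuming a hypothetical `llm` callable:

```python
def zero_shot_cot(question: str, llm) -> str:
    """Two-stage zero-shot CoT: first elicit step-by-step reasoning,
    then prompt again to extract a clean final answer."""
    base = f"Q: {question}\nA: Let's think step by step."
    reasoning = llm(base)
    final = llm(base + "\n" + reasoning + "\nTherefore, the final answer is:")
    return final.strip()
```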
4) Ensembling (multiple prompts/agents + aggregation)
Examples:
- Demonstration ensembling (multiple few‑shot prompts, aggregate).
- Max mutual information (choose the template that maximizes mutual information between the prompt and the model's outputs).
- Mixture of reasoning experts (specialized experts / prompts).
- Self‑consistency (sample multiple chains, majority vote).
- Universal self‑consistency (let an LLM adjudicate diverse answers).
Takeaways:
- Start with self‑consistency (sampled outputs + majority vote; sketched below).
- Use specialized experts when queries vary by type.
- If compute is constrained, pick a single strong template via max mutual information.
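A self‑consistency sketch, assuming a hypothetical `llm_sample` callable that samples at nonzero temperature:

```python
from collections import Counter

def self_consistency(question: str, llm_sample, n: int = 10) -> str:
    """Self-consistency: sample n independent reasoning chains,
    parse each final answer, and return the majority vote."""
    answers = []
    for _ in range(n):
        output = llm_sample(f"Q: {question}\nA: Let's think step by step.")
        answers.append(output.splitlines()[-1].strip())  # crude answer parse
    return Counter(answers).most_common(1)[0][0]
```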
5) Self‑criticism / verification
Examples:
- Chain of verification (generate an answer, then verify its subclaims with follow‑up questions and revise).
- Self‑refine (model criticizes and rewrites its output).
- Multi‑persona roles (proposer, verifier, reporter loop with stopping criteria).
Takeaways:
- Add self‑critique/verification passes when uncertain; they trade extra compute for improved accuracy.
- They provide useful stopping and validation mechanisms (see the self‑refine sketch below).
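A self‑refine loop sketch with a simple stopping criterion (the DONE convention is an illustrative choice, not from the lecture):

```python
def self_refine(task: str, llm, max_rounds: int = 3) -> str:
    """Self-refine: draft, critique, rewrite, with a stopping criterion."""
    draft = llm(f"Task: {task}\nWrite a first draft.")
    for _ in range(max_rounds):
        critique = llm(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List concrete problems with the draft, or reply DONE if none."
        )
        if "DONE" in critique:
            break  # verifier found nothing left to fix
        draft = llm(
            f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the draft to address the critique."
        )
    return draft
```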
6) Decomposition
Examples:
- Least‑to‑most (decompose problem into subproblems and solve sequentially).
- Program‑of‑thought (encode reasoning as executable code).
- Tree‑of‑thought (branching search of reasoning paths).
Takeaways:
- Decomposition helps complex, multi‑step problems or planning/search.
- Keep substeps atomic (short and single‑purpose) to avoid recreating complexity within subprompts.
- These methods can be compute‑expensive (a least‑to‑most sketch follows).
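A least‑to‑most sketch: decompose, solve subproblems in order, then compose a final answer (prompt wordings illustrative):

```python
def least_to_most(problem: str, llm) -> str:
    """Least-to-most: decompose into atomic subproblems, then solve them
    in order, feeding earlier answers into later subprompts."""
    plan = llm(
        f"Problem: {problem}\n"
        "List the subproblems to solve, one per line, simplest first."
    )
    solved = []
    for sub in (s for s in plan.splitlines() if s.strip()):
        answer = llm(
            f"Problem: {problem}\n"
            "Already solved:\n" + "\n".join(solved) + "\n"
            f"Now solve only this subproblem: {sub}"
        )
        solved.append(f"{sub} -> {answer}")
    return llm(
        f"Problem: {problem}\nSubproblem solutions:\n" + "\n".join(solved)
        + "\nGive the final answer."
    )
```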
Other technical points & tools
- Prompt tuning (soft prompts): trainable vectors prepended to the model's input embeddings. Distinct from prompt engineering; useful for low‑compute task specialization (not covered in depth in the lecture; a sketch follows this list).
- PromptWizard (Microsoft): an example tool that mutates prompts, scores variants, and uses an LLM to critique the top prompts (an automated prompt‑optimization pipeline).
- Prompt mining: analyze large corpora for the phrasings most common in pretraining data and reword prompts to match them, leveraging a richer pretraining signal (an advanced/industrial technique).
- LLMs referenced: Gemini, ChatGPT/GPT, LLaMA, plus code‑tuned models for tabular/numeric tasks.
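For the prompt‑tuning item above, a minimal PyTorch sketch of soft prompts; dimensions and initialization are illustrative choices, not the lecture's:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable 'soft prompt' vectors prepended to token embeddings.
    num_virtual_tokens and hidden_dim are illustrative choices."""
    def __init__(self, num_virtual_tokens: int = 20, hidden_dim: int = 768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, hidden_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden_dim)
        prefix = self.prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
        # Only self.prompt is trained; the base model stays frozen.
        return torch.cat([prefix, input_embeds], dim=1)
```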
Practical guidelines / synthesis (actionable rules)
- Match technique to the failure mode (e.g., use S2A or thread‑of‑thought for noisy/overlong context; rephrase‑and‑respond for ambiguity; SelfAsk/CoT for multi‑hop reasoning).
- Use examples: few‑shot is almost always helpful; prefer real, annotated examples; keep them small, diverse and representative; experiment with ordering.
- Use chain‑of‑thought for conventionally harder reasoning problems and validate with few‑shot CoT.
- Start with cheaper methods (reread, self‑consistency) before adding expensive verification/ensemble/decomposition steps.
- Validate and iterate with a small labeled test set. Define a utility function and explicit stop criteria (sketched below). Avoid blind "slot‑machine" brute‑force prompting; be principled.
- Watch compute/cost tradeoffs: ensembling, tree searches and repeated verification improve accuracy but increase API calls and latency.
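A sketch of principled iteration with explicit stop criteria; the target score and trial budget are illustrative, and `evaluate` is any utility function like the one in the workflow sketch above:

```python
def tune(templates, evaluate, target: float = 0.90, budget: int = 20):
    """Try candidate templates under explicit stop criteria: stop once a
    target utility score is reached or the trial budget is spent.
    `evaluate` maps a template string to a score in [0, 1]."""
    best, best_score = None, 0.0
    for template in templates[:budget]:
        score = evaluate(template)
        if score > best_score:
            best, best_score = template, score
        if best_score >= target:
            break  # good enough; stop rather than brute-force further
    return best, best_score
```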
Reviews, guides, and tutorials referenced
- The lecture summarizes a large systematic review (primary paper: February 2025 systematic review of prompt engineering techniques).
- Microsoft PromptWizard was demoed as an example tool for automated prompt mutation and critique.
- The lecture provides a rapid survey of many academic papers (links indicated in the original slides).
Main speaker / sources
- Speaker: Professor Gassimi (lecture presenter).
- Primary source repeatedly cited: the February 2025 systematic survey paper on prompt engineering techniques. Other referenced items include Microsoft PromptWizard and multiple academic studies.
End of summary.