# Agent Outcomes — What It Means
TL;DR: Agent Outcomes is a pattern where you define what “done” looks like (a rubric), and a separate grader evaluates whether the agent achieved it. The agent iterates until criteria are met or max attempts reached. This turns conversations into goal-oriented work.
## Simple Explanation
Normally, an AI agent works until it thinks it’s done. But “thinks it’s done” is subjective — the agent might stop too early or miss requirements.
Outcomes add structure:
- You define success criteria (the rubric)
- A separate grader evaluates the work (not the agent itself)
- The agent iterates if criteria aren’t met
- Process stops when satisfied OR max iterations reached
It’s like having a QA reviewer built into the agent loop.
## The Problem It Solves
Without outcomes:
```
User: "Build a financial model"
Agent: [works for a while]
Agent: "Done! Here's your model."
User: "But it's missing the sensitivity analysis..."
Agent: "Oh, let me add that..."
User: "And the formatting is wrong..."
```

With outcomes:
```
User: "Build a financial model"

Rubric:
- Contains DCF with 5-year projections
- Includes sensitivity analysis on discount rate
- Formatted with headers and cell references

Agent: [works]
Grader: "Missing sensitivity analysis" → needs_revision
Agent: [adds sensitivity analysis]
Grader: "All criteria met" → satisfied
```

## How It Works (Technical)
In Claude Managed Agents, you send an outcome definition:
```python
client.beta.sessions.events.send(
    session_id=session.id,
    events=[{
        "type": "user.define_outcome",
        "description": "Build a DCF model for Costco in .xlsx",
        "rubric": {"type": "text", "content": RUBRIC},
        "max_iterations": 5,  # default 3, max 20
    }],
)
```

The grader is a separate evaluation process:
- Runs in its own context
- Not influenced by the agent’s decisions
- Returns a per-criterion evaluation
- Feedback is passed back to the agent
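In a DIY implementation, that per-criterion result can be modeled as a small data structure. The names below are illustrative, not the Managed Agents wire format:

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    criterion: str      # the rubric line being checked
    passed: bool        # whether the grader judged it met
    feedback: str = ""  # actionable note for the agent when it fails


@dataclass
class Evaluation:
    results: list  # list of CriterionResult

    @property
    def satisfied(self) -> bool:
        # The outcome is satisfied only when every criterion passes.
        return all(r.passed for r in self.results)

    @property
    def feedback(self) -> str:
        # Collect the feedback from failing criteria to send back to the agent.
        return "\n".join(r.feedback for r in self.results if not r.passed)
```

Keeping one entry per rubric criterion is what makes the feedback actionable: the agent learns *which* requirement failed, not just that something did.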
## Evaluation Results
| Result | What Happens |
|---|---|
| `satisfied` | Work complete, session goes idle |
| `needs_revision` | Agent gets feedback, starts new cycle |
| `max_iterations_reached` | Final attempt, then idle |
| `failed` | Rubric doesn’t fit the task (bad criteria) |
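A client driving the session might branch on these results like this. The status strings mirror the table; the dispatch function itself is a hypothetical sketch, not part of the API:

```python
def handle_outcome(status: str, feedback: str = "") -> str:
    """Map an evaluation result to the client's next action, per the table above."""
    if status == "satisfied":
        return "idle"                 # work complete, session goes idle
    if status == "needs_revision":
        return f"revise: {feedback}"  # agent gets the feedback, starts a new cycle
    if status == "max_iterations_reached":
        return "idle"                 # final attempt is kept as-is, then idle
    if status == "failed":
        return "rewrite_rubric"       # the criteria didn't fit the task
    raise ValueError(f"unknown outcome status: {status}")
```

Note that `failed` is a signal about the *rubric*, not the agent: the right response is to rewrite the criteria, not to retry the same task.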
## Writing Good Rubrics
The rubric is the most important part. Good rubrics are specific and verifiable:
### ✅ Good Criteria
- “CSV contains a ‘price’ column with numeric values”
- “All functions have docstrings explaining parameters”
- “Report includes at least 3 competitor comparisons”
- “Total sums to within 0.01 of expected value”
### ❌ Bad Criteria
- “Data looks good” (subjective)
- “Code is clean” (vague)
- “Report is comprehensive” (unmeasurable)
- “User would be satisfied” (unknowable)
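A useful property of the good criteria above is that some can be checked mechanically, without an LLM grader at all. As a sketch, the CSV criterion might be verified like this (function name and input format are illustrative):

```python
import csv
import io


def csv_has_numeric_price_column(csv_text: str) -> bool:
    """Check the criterion: 'CSV contains a price column with numeric values'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or "price" not in reader.fieldnames:
        return False  # no header row, or no 'price' column
    rows = list(reader)
    if not rows:
        return False  # a header with no data rows doesn't satisfy the criterion
    try:
        for row in rows:
            float(row["price"])  # raises if any value is non-numeric
    except (TypeError, ValueError):
        return False
    return True
```

The rule of thumb: if you can imagine writing a check like this for a criterion, it is specific enough; "data looks good" admits no such check.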
### Rubric Template
```markdown
## Required Outputs
- [ ] [Specific deliverable 1]
- [ ] [Specific deliverable 2]

## Quality Criteria
- [ ] [Measurable quality standard 1]
- [ ] [Measurable quality standard 2]

## Constraints
- [ ] [Specific constraint or limit]
- [ ] [Format requirement]
```

## Why a Separate Grader?
The key insight is separation of concerns:
| Self-Evaluation | Separate Grader |
|---|---|
| Agent judges own work | Independent evaluation |
| Prone to “I think I’m done” | Objective criteria check |
| Single perspective | Fresh perspective |
| Can rationalize shortcuts | No context of effort spent |
This is similar to why code review works — the author is too close to see issues.
## Business Applications
### Quality Assurance at Scale
- Content generation with style guidelines
- Data processing with validation rules
- Report generation with completeness checks
### Reducing Back-and-Forth
- Agent self-corrects before human review
- Fewer revision cycles
- Consistent quality standards
### Audit Trail
- Clear record of what was requested
- Evidence of criteria being met
- Useful for compliance
## Implementing Without Managed Agents
You can implement this pattern in DIY agents:
```python
def agent_with_outcomes(task, rubric, max_iterations=3):
    # Assumes `agent` and `grader` objects exist in the enclosing scope.
    result = None
    for i in range(max_iterations):
        # Agent does work
        result = agent.execute(task)

        # Separate grader evaluates
        evaluation = grader.evaluate(result, rubric)

        if evaluation.satisfied:
            return result

        # Feed back to agent
        task = f"{task}\n\nPrevious attempt feedback: {evaluation.feedback}"

    return result  # Best effort after max iterations
```

The key is that the grader should be a separate LLM call with its own context, not just the agent evaluating itself.
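To see the loop run end to end, here is a toy harness: a variant of the loop with the agent and grader passed in explicitly, plus stubbed implementations of both. The stubs are purely illustrative; in a real system `grader.evaluate` would be a separate LLM call:

```python
class StubAgent:
    """Fake agent: only adds the required section on its second attempt."""

    def __init__(self):
        self.calls = 0

    def execute(self, task):
        self.calls += 1
        draft = "DCF model"
        return draft + " + sensitivity analysis" if self.calls > 1 else draft


class StubGrader:
    """Fake grader: checks one verifiable criterion from the rubric."""

    class Eval:
        def __init__(self, satisfied, feedback):
            self.satisfied = satisfied
            self.feedback = feedback

    def evaluate(self, result, rubric):
        ok = "sensitivity analysis" in result
        return self.Eval(ok, "" if ok else "Missing sensitivity analysis")


def run_with_outcomes(agent, grader, task, rubric, max_iterations=3):
    result = None
    for _ in range(max_iterations):
        result = agent.execute(task)
        evaluation = grader.evaluate(result, rubric)
        if evaluation.satisfied:
            return result
        # Append grader feedback so the next attempt sees what was missing.
        task = f"{task}\n\nPrevious attempt feedback: {evaluation.feedback}"
    return result  # best effort after max iterations


agent, grader = StubAgent(), StubGrader()
output = run_with_outcomes(agent, grader,
                           "Build a financial model",
                           "Includes sensitivity analysis")
print(output)  # → DCF model + sensitivity analysis  (satisfied on attempt 2)
```

The first attempt fails the criterion, the feedback is folded into the task, and the second attempt passes: exactly the revision cycle described above, with no human in the loop.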
## Relationship to Other Concepts
| Concept | Relationship |
|---|---|
| glossary/ai-agent | Outcomes make agents goal-oriented |
| tools/claude-managed-agents | Native Outcomes support |
| automation/ai-agent-organization | Outcomes = one technique for reliability |
| glossary/prompt-engineering | Rubric writing is a form of prompting |
## Key Takeaways
- Outcomes turn agent conversations into goal-oriented work
- A separate grader provides objective evaluation
- Specific, verifiable rubrics are essential
- Agents iterate until criteria are met or max attempts are reached
- This pattern dramatically reduces revision cycles
## Related
- tools/claude-managed-agents — Platform with native Outcomes support
- glossary/ai-agent — What agents are
- automation/ai-agent-organization — Broader reliability techniques
- comparisons/managed-agents-vs-diy — Platform comparison
## Sources
- Claude Managed Agents documentation (Anthropic, April 2026)
- Outcomes feature is currently in research preview