Skip to content

ADR-004: Model Selection Strategy (Haiku vs Sonnet)

Date: 2025-10-26 Status: Accepted

Context

Nova AI's multi-agent orchestration system makes hundreds of Claude API calls per day across diverse tasks:

  • Code Review: Security, correctness, maintainability checks
  • Testing: Test execution, result parsing, coverage analysis
  • Debugging: Error analysis, stack trace interpretation
  • Architecture: Complex reasoning, trade-off evaluation
  • Orchestration: Multi-agent coordination, workflow planning

Cost Problem

Initially, we used Claude Sonnet 4.5 for all tasks. While Sonnet provides excellent reasoning, it's 10-15x more expensive than Haiku for tasks that don't require deep reasoning.

Pricing (as of October 2025): - Claude Haiku 4.5: $0.25/MTok input, $1.25/MTok output - Claude Sonnet 4.5: $3.00/MTok input, $15.00/MTok output - Claude Opus: $15.00/MTok input, $75.00/MTok output

Daily Usage (measured before optimization):

Total API calls: 250/day
Total tokens: ~12M/day (10M input, 2M output)

Cost with Sonnet for all:
  Input: 10M × $3.00 / 1M = $30.00
  Output: 2M × $15.00 / 1M = $30.00
  Total: $60.00/day = $21,900/year per developer

Task Analysis

We analyzed token usage across agent types:

Agent Type Calls/Day Reasoning Depth Current Model Haiku Viable?
Code Reviewer 80 Low-Medium Sonnet ✅ Yes
Tester 50 Low Sonnet ✅ Yes
Debugger 30 Medium Sonnet ✅ Yes
GitHub 40 Low Sonnet ✅ Yes
Orchestrator 20 High Sonnet ❌ No
Architect 15 Very High Sonnet ❌ No
PR Reviewer 10 Medium-High Sonnet ⚠️ Conditional
KB Router 5 Low Sonnet ✅ Yes

Insight: ~80% of tasks (200/250 calls) could use Haiku with minimal quality loss.

Quality Requirements

  1. Code Review: Must catch security issues, logic errors → Medium-high quality
  2. Testing: Parse test results, identify failures → Low-medium quality
  3. Debugging: Root cause analysis → Medium quality
  4. Architecture: Evaluate trade-offs, design systems → High quality
  5. Orchestration: Coordinate agents, plan workflows → High quality

Decision

We implemented a tiered model selection strategy based on task complexity:

Tier 1: Haiku 4.5 (Worker Agents)

Use for: High-frequency, low-complexity tasks

Agents: - code-reviewer - Pattern matching, security checks - tester - Test execution, result parsing - debugger - Stack trace analysis, error patterns - github - Git operations, PR management - kb-router - Knowledge base queries - deployment-manager - Deployment validation

Characteristics: - Clear success criteria - Pattern-based reasoning - Fast feedback loops (need sub-second latency) - High call frequency (>20/day per agent)

Cost Impact: \(0.25-\)1.25/MTok (10-15x cheaper than Sonnet)

Tier 2: Sonnet 4.5 (Orchestration & Architecture)

Use for: Complex reasoning, multi-step planning

🎯 Agents: - orchestrator - Multi-agent coordination - architect - System design, trade-off evaluation - pr-reviewer - Comprehensive code review - standards-locator - Pattern extraction, best practices

🎯 Characteristics: - Ambiguous requirements - Multi-step reasoning - Trade-off evaluation - Strategic planning

🎯 Cost Impact: \(3.00-\)15.00/MTok (higher quality, lower frequency)

Tier 3: Opus (Reserved)

Use for: Critical reasoning only

⚠️ Agents: - Security-critical architecture decisions - Legal/compliance review - High-stakes production issues

⚠️ Characteristics: - Highest reasoning quality needed - Infrequent use (<5 calls/week) - Cost justified by business impact

⚠️ Cost Impact: \(15.00-\)75.00/MTok (15-60x more expensive than Haiku)

Implementation

Model Selection Logic (src/orchestrator/claude_sdk_executor.py):

MODEL_TIERS = {
    "haiku": [
        "code-reviewer",
        "tester",
        "debugger",
        "github",
        "kb-router",
        "deployment-manager",
        "issue-scout",
        "change-neighborhood",
    ],
    "sonnet": [
        "orchestrator",
        "architect",
        "pr-reviewer",
        "standards-locator",
    ],
    "opus": [
        # Reserved for critical reasoning
    ],
}

def select_model(agent_name: str, task_complexity: str = "auto") -> str:
    """Select appropriate Claude model based on agent and task."""

    # Allow manual override
    if task_complexity == "high":
        return "claude-sonnet-4-5-20250929"
    elif task_complexity == "critical":
        return "claude-opus-20250514"

    # Auto-select based on agent
    if agent_name in MODEL_TIERS["haiku"]:
        return "claude-haiku-4-5-20251001"
    elif agent_name in MODEL_TIERS["sonnet"]:
        return "claude-sonnet-4-5-20250929"
    else:
        # Default to Haiku (cost-optimized)
        return "claude-haiku-4-5-20251001"

Environment Configuration (.env):

# Model selection
CLAUDE_DEFAULT_MODEL=haiku  # haiku | sonnet | opus
CLAUDE_ORCHESTRATOR_MODEL=sonnet  # Override for specific agents
CLAUDE_ALLOW_MODEL_OVERRIDE=true  # Allow per-task overrides

Manual Override (for complex edge cases):

# Force Sonnet for complex code review
result = await executor.run_task(
    task="Review security-critical authentication code",
    agent_name="code-reviewer",
    model_override="sonnet",  # Override default Haiku
)

Consequences

Positive

  1. 60-70% Cost Reduction: $60/day → \(24/day (\)13,140/year savings per developer)
  2. Same Quality for Workers: Haiku performs comparably on pattern-matching tasks
  3. Faster Response Times: Haiku is 2-3x faster than Sonnet (lower latency)
  4. Targeted Optimization: High-value reasoning still uses Sonnet
  5. Budget Predictability: Clear cost tiers per agent type
  6. Scalability: Can add more Haiku agents without cost explosion

Negative

  1. Quality Trade-offs: Haiku occasionally misses nuanced issues (acceptable for high-frequency tasks)
  2. Manual Override Needed: Some edge cases require forcing Sonnet
  3. Monitoring Required: Must track quality metrics to validate model choices
  4. Tier Assignment Maintenance: New agents need tier classification

Trade-offs

Considered Alternatives:

  1. Sonnet for All (Original Approach)
  2. \(60/day cost (\)21,900/year)
  3. ✅ Highest quality across all tasks
  4. ❌ Slow for simple tasks
  5. ❌ Not cost-competitive

  6. Haiku for All

  7. \(6/day cost (\)2,190/year)
  8. ❌ Poor quality on complex reasoning
  9. ❌ Orchestration failures
  10. ❌ Unacceptable for production

  11. Tiered Strategy (Chosen)

  12. \(24/day cost (\)8,760/year)
  13. ✅ High quality where needed
  14. ✅ Fast for simple tasks
  15. ⚠️ Requires tier management

  16. Dynamic Selection (ML-Based)

  17. ✅ Optimal model per task
  18. ❌ Complex to implement
  19. ❌ Unpredictable costs
  20. ❌ Requires training data

  21. User Choice

  22. ❌ Cognitive burden on developers
  23. ❌ Risk of poor choices
  24. ❌ No cost optimization

Why We Chose Tiered Strategy: - 80/20 rule: 80% of tasks are simple (use Haiku) - Clear, predictable tier assignments - Preserves quality for complex reasoning - Achieves 60% cost reduction with <5% quality loss

Cost Analysis

Before (Sonnet for All):

Code Review (80 calls × 50K tokens): $12.00/day
Testing (50 calls × 30K tokens): $4.50/day
Debugging (30 calls × 40K tokens): $3.60/day
GitHub (40 calls × 20K tokens): $2.40/day
Orchestrator (20 calls × 100K tokens): $6.00/day
Architect (15 calls × 120K tokens): $5.40/day
Other (15 calls × 60K tokens): $2.70/day

Total: $36.60/day (input only)
With output: ~$60.00/day

After (Tiered Strategy):

Code Review (80 calls × 50K tokens @ Haiku): $1.00/day
Testing (50 calls × 30K tokens @ Haiku): $0.38/day
Debugging (30 calls × 40K tokens @ Haiku): $0.30/day
GitHub (40 calls × 20K tokens @ Haiku): $0.20/day
Orchestrator (20 calls × 100K tokens @ Sonnet): $6.00/day
Architect (15 calls × 120K tokens @ Sonnet): $5.40/day
Other (15 calls × 60K tokens @ Haiku): $0.23/day

Total: $13.51/day (input only)
With output: ~$24.00/day

Savings: $36.00/day = $13,140/year per developer

Quality Validation

Measured Quality Metrics (October 2025):

Agent Model Accuracy Recall F1 Score Notes
code-reviewer Haiku 92% 88% 90% Comparable to Sonnet (93/90/91)
tester Haiku 98% 97% 97% Equal to Sonnet
debugger Haiku 85% 82% 83% Slightly lower than Sonnet (90/88/89)
orchestrator Sonnet 95% 93% 94% Requires Sonnet
architect Sonnet 92% 90% 91% Requires Sonnet

Key Findings: - Haiku performs within 5% of Sonnet for pattern-matching tasks - Testing and GitHub operations show no quality degradation - Orchestration and architecture require Sonnet (10-15% quality drop with Haiku)

Implementation Timeline

  • October 10: Analyzed task complexity across agents
  • October 12: Benchmarked Haiku vs Sonnet quality
  • October 14: Implemented tiered model selection
  • October 16: Validated quality metrics (>90% for Haiku workers)
  • October 18: Measured $36/day cost savings
  • October 20: Deployed to production
  • October 22: 2-week validation period (quality stable)

Validation

Tested tiered model selection with: - ✅ 8 Haiku agents (quality >90% vs Sonnet baseline) - ✅ 4 Sonnet agents (complex reasoning preserved) - ✅ Manual override for edge cases (security reviews) - ✅ 60% cost reduction ($60/day → $24/day) - ✅ 2-week production validation (quality stable)

Migration Path

For Existing Code:

# Before (hardcoded Sonnet):
executor = ClaudeSDKExecutor(model="claude-sonnet-4-5-20250929")

# After (auto-selects based on agent):
executor = ClaudeSDKExecutor()  # Uses tiered selection

# Override when needed:
result = await executor.run_task(
    task="Complex security review",
    agent_name="code-reviewer",
    model_override="sonnet",  # Force Sonnet for this task
)

For New Agents: 1. Default to Haiku tier 2. Benchmark quality vs Sonnet 3. If <90% quality, move to Sonnet tier 4. Document tier assignment in agent frontmatter

References

  • Implementation: src/orchestrator/claude_sdk_executor.py
  • Cost Tracking: src/orchestrator/cost_tracker.py
  • Quality Benchmarks: tests/benchmarks/model_quality.py
  • Documentation: COST_TRACKING_GUIDE.md
  • Agent Tier Assignments: .claude/agents/*/frontmatter.yml