ADR-003: Cost Tracking with LangFuse

Date: 2025-10-26
Status: Accepted

Context

Large Language Model (LLM) API costs can escalate rapidly in production environments, especially with multi-agent orchestration systems. Nova AI executes dozens of agent interactions per workflow, each making multiple API calls to Claude models (Haiku 4.5, Sonnet 4.5).

Cost Visibility Problem

Without detailed tracking, we faced:

  1. Unknown Cost Attribution: Which agents/workflows consume most tokens?
  2. Optimization Blindness: Cannot identify expensive operations to optimize
  3. Budget Overruns: No early warning system for cost spikes
  4. Model Selection Uncertainty: Unclear if Haiku vs Sonnet choices are cost-effective
  5. Cache Effectiveness: Unknown prompt caching hit rates and savings

Initial Attempts

Manual Logging:

# ❌ Fragmented, inconsistent, no aggregation
logger.info(f"Tokens used: {response.usage.total_tokens}")

Response Metadata Only:

# ❌ Per-call tracking, no workflow-level view
cost = response.usage.total_tokens * price_per_token

CSV Export:

# ❌ No real-time visibility, manual analysis
with open("costs.csv", "a") as f:
    f.write(f"{timestamp},{agent},{tokens},{cost}\n")

None of these provided:

  • Real-time dashboards
  • Workflow-level aggregation
  • Agent performance comparison
  • Cache hit rate analytics
  • Cost trend analysis

Requirements

  1. Real-time Cost Tracking: See costs as they accumulate
  2. Multi-level Attribution: Per-agent, per-workflow, per-task
  3. Cache Visibility: Measure prompt caching effectiveness (target: >70% hit rate)
  4. Model Comparison: Compare Haiku vs Sonnet cost/performance trade-offs
  5. Budget Alerts: Notify when approaching cost thresholds (see the sketch after this list)
  6. Historical Analysis: Identify trends and anomalies
  7. Production-Ready: Low overhead, reliable, scalable
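
To make the budget-alert requirement concrete, here is a minimal sketch of the kind of threshold check it implies. The function, logger name, and DAILY_BUDGET_USD env variable are illustrative, not part of the LangFuse API:

import logging
import os

logger = logging.getLogger("nova.cost_alerts")

DAILY_BUDGET_USD = float(os.getenv("DAILY_BUDGET_USD", "50"))  # hypothetical env var
ALERT_THRESHOLD = 0.8  # warn at 80% of budget

def check_budget(daily_cost_usd: float) -> None:
    """Emit an early warning before the daily budget is exhausted."""
    if daily_cost_usd >= DAILY_BUDGET_USD:
        logger.error("Daily LLM budget exceeded: $%.2f / $%.2f",
                     daily_cost_usd, DAILY_BUDGET_USD)
    elif daily_cost_usd >= DAILY_BUDGET_USD * ALERT_THRESHOLD:
        logger.warning("Approaching daily LLM budget: $%.2f / $%.2f",
                       daily_cost_usd, DAILY_BUDGET_USD)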

Decision

We adopted LangFuse as our LLM observability and cost tracking platform.

Why LangFuse?

  • ✅ Open Source: Self-hostable, no vendor lock-in
  • ✅ Real-Time Dashboards: Web UI for live cost monitoring
  • ✅ Trace-Level Attribution: Hierarchical trace → span → generation tracking
  • ✅ Automatic Cost Calculation: Built-in model pricing (Anthropic, OpenAI)
  • ✅ Prompt Caching Metrics: Tracks cache hits, savings, effectiveness
  • ✅ Multi-Agent Support: First-class support for agent orchestration
  • ✅ Low Overhead: <10ms per trace (negligible latency)
  • ✅ Python SDK: Native integration with Claude SDK

Architecture

┌─────────────────────────────────────────────────────────┐
│ ClaudeSDKExecutor                                      │
│                                                         │
│  run_task("Fix bug") ──────────────────────┐          │
│                                              │          │
│  ┌───────────────────────────────────────┐ │          │
│  │ LangFuse Trace                        │ │          │
│  │ - trace_id: workflow-001              │◀┘          │
│  │ - name: "Fix bug"                     │            │
│  │ - metadata: {agent: "debugger"}       │            │
│  │                                        │            │
│  │  ┌────────────────────────────────┐  │            │
│  │  │ Span: KB Search                │  │            │
│  │  │ - latency: 3ms                 │  │            │
│  │  │ - tokens: 0 (cached)           │  │            │
│  │  └────────────────────────────────┘  │            │
│  │                                        │            │
│  │  ┌────────────────────────────────┐  │            │
│  │  │ Generation: Claude API Call    │  │            │
│  │  │ - model: claude-haiku-4-5      │  │            │
│  │  │ - input_tokens: 1200           │  │            │
│  │  │ - output_tokens: 350           │  │            │
│  │  │ - cached_tokens: 800 (66%)     │  │            │
│  │  │ - cost: $0.0042                │  │            │
│  │  └────────────────────────────────┘  │            │
│  └───────────────────────────────────────┘            │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ LangFuse Dashboard (Self-Hosted)                       │
│                                                         │
│  📊 Real-Time Metrics:                                 │
│  - Total cost: $24.50 (today)                          │
│  - Cache hit rate: 72% (target: >70%)                  │
│  - Most expensive agent: code-reviewer ($8.20)         │
│  - Avg latency: 1.2s per workflow                      │
│                                                         │
│  📈 Trends:                                             │
│  - Cost trending down (-15% this week)                 │
│  - Cache hit rate improving (+5%)                      │
│  - Haiku usage up 40% (cost optimization working)      │
└─────────────────────────────────────────────────────────┘
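
The trace → span → generation hierarchy in the diagram maps directly onto the LangFuse Python SDK. A minimal sketch, assuming the v2-style stateful client (trace/span/generation methods) and reusing the values from the diagram:

from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment
langfuse = Langfuse()

# Trace: one per workflow
trace = langfuse.trace(name="Fix bug", metadata={"agent": "debugger"})

# Span: non-LLM work inside the workflow, e.g. the KB search
span = trace.span(name="KB Search")
span.end()

# Generation: one per Claude API call; usage drives the cost calculation
generation = trace.generation(
    name="claude-api-call",
    model="claude-haiku-4-5",
    usage={"input": 1200, "output": 350, "unit": "TOKENS"},
)
generation.end()

langfuse.flush()  # make sure buffered events reach the server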

Implementation

CostTracker Integration (src/orchestrator/cost_tracker.py):

import os
from contextlib import contextmanager

from langfuse import Langfuse

class CostTracker:
    def __init__(self):
        self.langfuse = Langfuse(
            public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
            secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
            host=os.getenv("LANGFUSE_HOST", "http://localhost:3000"),
        )

    @contextmanager
    def track_workflow(self, task: str, agent: str):
        """Create a trace for workflow-level tracking."""
        trace = self.langfuse.trace(
            name=task,
            metadata={"agent": agent, "mode": detect_mode()},  # detect_mode() is a project helper
        )
        try:
            yield trace
        finally:
            self.langfuse.flush()  # push buffered events before the workflow returns

    def track_api_call(self, model: str, response) -> None:
        """Track an individual Claude API call with its token usage."""
        self.langfuse.generation(
            model=model,
            input=response.request.messages,
            output=response.content,
            usage={
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "cache_read_tokens": response.usage.cache_read_input_tokens,
                "cache_creation_tokens": response.usage.cache_creation_input_tokens,
            },
            # LangFuse auto-calculates cost based on model pricing
        )

Executor Integration (src/orchestrator/claude_sdk_executor.py):

from pathlib import Path
from typing import Dict

class ClaudeSDKExecutor:
    def __init__(self, project_root: Path):
        self.project_root = project_root
        self.cost_tracker = CostTracker()
        # self.model and _execute_agent are set up elsewhere in the class (elided)

    async def run_task(self, task: str, agent_name: str) -> Dict:
        """Execute a task with cost tracking."""
        with self.cost_tracker.track_workflow(task, agent_name) as trace:
            # Execute the agent workflow
            response = await self._execute_agent(task, agent_name)

            # Track API call costs against the active trace
            self.cost_tracker.track_api_call(
                model=self.model,
                response=response,
            )

            return response
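
Tracking must never take a workflow down with it. Below is a sketch of the graceful-degradation wrapper implied by the LANGFUSE_ENABLED flag (see .env below) and the "Negative" consequences later in this ADR; safe_track and tracking_enabled are our helper names, not LangFuse API:

import logging
import os

logger = logging.getLogger("nova.cost_tracker")

def tracking_enabled() -> bool:
    return os.getenv("LANGFUSE_ENABLED", "true").lower() == "true"

def safe_track(track_fn, *args, **kwargs):
    """Run a tracking call without letting observability failures break the workflow."""
    if not tracking_enabled():
        return None
    try:
        return track_fn(*args, **kwargs)
    except Exception as exc:  # e.g., LangFuse server unreachable
        logger.warning("Cost tracking skipped: %s", exc)
        return None

With this wrapper, the executor's tracking call above would become safe_track(self.cost_tracker.track_api_call, model=self.model, response=response).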

Environment Configuration (.env):

# LangFuse Configuration
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=http://localhost:3000  # Self-hosted
LANGFUSE_ENABLED=true

Key Metrics Tracked

  1. Cost Attribution:
     • Per-agent costs (e.g., code-reviewer: $8.20/day)
     • Per-workflow costs (e.g., PR review: $2.15)
     • Per-model costs (Haiku: $15/day, Sonnet: $9/day)

  2. Cache Effectiveness (see the arithmetic sketch after this list):
     • Cache hit rate (target: >70%)
     • Cache savings ($4.50/day from 72% hit rate)
     • Cache read tokens vs fresh tokens

  3. Performance:
     • Avg latency per agent (code-reviewer: 1.8s)
     • Token throughput (1,200 tokens/second)
     • API call frequency (50 calls/workflow)

  4. Trends:
     • Daily cost trends (up/down %)
     • Cache hit rate over time
     • Model usage shifts (Haiku vs Sonnet)

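The cache metrics reduce to simple arithmetic. A sketch with illustrative per-token prices (substitute Anthropic's current rates; cache reads are billed at a fraction of the base input price):

# Illustrative prices in USD per million tokens -- check current Anthropic pricing
INPUT_PRICE_PER_MTOK = 1.00
CACHE_READ_PRICE_PER_MTOK = 0.10

def cache_hit_rate(cache_read_tokens: int, total_input_tokens: int) -> float:
    return cache_read_tokens / total_input_tokens

def cache_savings_usd(cache_read_tokens: int) -> float:
    """What the cached tokens would have cost fresh, minus the cache-read cost."""
    return cache_read_tokens * (INPUT_PRICE_PER_MTOK - CACHE_READ_PRICE_PER_MTOK) / 1_000_000

# The generation from the architecture diagram: 800 of 1200 input tokens cached
print(cache_hit_rate(800, 1200))  # 0.666... -> the ~66% shown in the diagram
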
Consequences

Positive

  1. Cost Reduction Visibility: Tracked and confirmed the 60-70% savings from Haiku adoption
  2. Cache Hit Rate Monitoring: Achieved 72% cache hit rate (>70% target)
  3. Agent-Level Attribution: Identified code-reviewer as 35% of costs
  4. Data-Driven Optimization: Used metrics to guide model selection strategy
  5. Budget Confidence: Real-time alerts prevent overruns
  6. Performance Insights: Correlated cost with latency and quality
  7. $4,380/Year Savings: Calculated from cache effectiveness and Haiku adoption

Negative

  1. Self-Hosting Overhead: Requires running LangFuse server (Docker)
  2. Network Dependency: Tracking fails if LangFuse server is down (graceful degradation)
  3. Storage Growth: Traces accumulate (~100MB/day for heavy usage)
  4. Learning Curve: Team needs to learn LangFuse dashboard and concepts
  5. Privacy Considerations: Prompts/responses sent to tracking server (use self-hosted)

Trade-offs

Considered Alternatives:

  1. Manual CSV Logging
     • ❌ No real-time visibility
     • ❌ Manual analysis required
     • ✅ Simple, no dependencies
     • ❌ No prompt caching metrics

  2. Weave (Weights & Biases)
     • ✅ Real-time dashboards
     • ✅ Multi-agent support
     • ❌ Requires W&B account (vendor lock-in)
     • ❌ Higher pricing for production use

  3. AgentOps
     • ✅ Agent-first platform
     • ✅ Good visualizations
     • ❌ Closed source
     • ❌ Limited prompt caching support

  4. LangFuse (Chosen)
     • ✅ Open source, self-hostable
     • ✅ Excellent prompt caching metrics
     • ✅ Real-time dashboards
     • ✅ Low overhead (<10ms)
     • ⚠️ Self-hosting required

  5. LangSmith
     • ✅ LangChain ecosystem integration
     • ✅ Good tracing
     • ❌ Vendor lock-in (LangChain)
     • ❌ Limited Anthropic support
Why We Chose LangFuse:

  • Open source enables self-hosting (data privacy)
  • Best-in-class prompt caching metrics (critical for cost optimization)
  • Low overhead suitable for production
  • Active development and community

Implementation Timeline

  • October 12: Evaluated LangFuse, Weave, AgentOps
  • October 14: Self-hosted LangFuse via Docker Compose
  • October 16: Implemented CostTracker class
  • October 18: Integrated tracking into ClaudeSDKExecutor
  • October 20: Added cache hit rate metrics
  • October 22: Configured real-time alerts (>$50/day)
  • October 25: Validated $4,380/year savings calculation

Validation

Tested LangFuse integration with:

  • ✅ Multi-agent workflows (10+ agents tracked)
  • ✅ Cache hit rate >70% (achieved 72%)
  • ✅ Per-agent cost attribution (identified top 3 consumers)
  • ✅ Real-time dashboard (sub-second updates)
  • ✅ Graceful degradation (tracking failure doesn't break workflows)
  • ✅ Production CI/CD (no performance impact)
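
The graceful-degradation property, in particular, is worth pinning down in test_cost_tracking.py. A hypothetical sketch, assuming the CostTracker and safe_track helpers from the implementation section and a fake_response() stand-in for a recorded Claude response:

from unittest.mock import patch

def test_tracking_failure_does_not_break_workflow():
    """The workflow must succeed even when the LangFuse server is unreachable."""
    tracker = CostTracker()
    with patch.object(tracker.langfuse, "generation",
                      side_effect=ConnectionError("server down")):
        result = safe_track(tracker.track_api_call,
                            model="claude-haiku-4-5", response=fake_response())
    assert result is None  # tracking degraded gracefully instead of raising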

Cost Savings Evidence

Measured Savings (October 2025):

Optimization                     Daily Savings   Annual Savings
Prompt caching (72% hit rate)    $4.50           $1,642
Haiku for workers (60% usage)    $6.20           $2,263
SDK MCP (faster execution)       $1.30           $475
Total                            $12.00          $4,380
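
The annual column is simply the daily column times 365. A quick check, in integer cents to avoid float drift:

daily_cents = {"prompt_caching": 450, "haiku_workers": 620, "sdk_mcp": 130}
total_daily = sum(daily_cents.values())   # 1200 cents = $12.00/day
total_annual = total_daily * 365 / 100    # $4,380.00/year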

Per-Developer Impact:

  • Before: ~$30/day (untracked, unoptimized)
  • After: ~$18/day (tracked, optimized)
  • Savings: 40% cost reduction with same quality

References

  • Implementation: src/orchestrator/cost_tracker.py
  • Executor Integration: src/orchestrator/claude_sdk_executor.py
  • LangFuse Setup: docs/langfuse-setup.md
  • Dashboard: http://localhost:3000 (local deployment)
  • Test Suite: test_cost_tracking.py
  • Documentation: COST_TRACKING_GUIDE.md