This guide covers CoAgent's comprehensive monitoring and observability capabilities, helping you track performance, detect anomalies, optimize costs, and maintain reliable AI agent operations.
Overview
CoAgent provides a complete observability platform that includes:
- Real-time Monitoring: Live performance tracking and dashboards 
- Structured Logging: Comprehensive event tracking with structured data 
- Performance Analytics: Response times, token usage, and cost analysis 
- Anomaly Detection: Automatic detection of unusual patterns and issues 
- Multi-Profile Management: Organized monitoring across different environments 
- Drill-down Analysis: From high-level metrics to detailed execution traces 
Monitoring Architecture
Core Components
CoAgent's monitoring system consists of several interconnected layers:
┌─────────────────────────────────────────────────────────────┐
│                    Web UI Dashboard                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  Overview   │  │    Runs     │  │    Comparisons      │  │
│  │  Dashboard  │  │   Viewer    │  │     & Analysis      │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────┐
│                  REST API Layer                             │
│  /api/v1/logs  •  /api/v1/runs  •  /api/v1/monitoring     │
└─────────────────────────────────────────────────────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────┐
│                 Storage & Analytics                         │
│  Structured Logs  •  Metrics Store  •  Anomaly Detection  │
└─────────────────────────────────────────────────────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────┐
│                   Data Sources                              │
│   Python Client   •   Rust Client   •   Test Studio      │
│      Sandbox      •   External APIs  •   Manual Logs      │
└─────────────────────────────────────────────────────────────┘
Monitoring Profiles
CoAgent organizes monitoring data into profiles:
- Sandbox Profile: Automatically monitors sandbox interactions 
- Test Studio Profile: Tracks test execution and results 
- External Profiles: Monitor external systems via API integration 
- Aggregate View: Combined view across all profiles 
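As a quick illustration, the sketch below pulls the overview for a single profile via the REST API. It assumes the profile-scoped overview endpoint follows the /api/v1/monitoring/profiles/<profile>/overview pattern used in the CI/CD example later in this guide, and that "sandbox" is a valid profile name in your deployment:
import requests

def get_profile_overview(profile: str) -> dict:
    # Assumed endpoint pattern; adjust the profile name to match your setup
    url = f"http://localhost:3000/api/v1/monitoring/profiles/{profile}/overview"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

overview = get_profile_overview("sandbox")
print(overview)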
Getting Started with Monitoring
Accessing the Monitoring Dashboard
- Navigate to Monitoring: Open your browser to http://localhost:3000/monitoring 
- Default View: You'll see the aggregate dashboard showing data across all profiles 
- Profile Selection: Use the profile selector to focus on specific monitoring contexts 
Understanding the Interface
Navigation Structure
Home > Monitoring > [Profile] > [Section]
Key Sections
- Overview: High-level metrics and recent activity 
- Runs: Detailed execution logs and filtering 
- Performance: Response times and efficiency metrics 
- Costs: Token usage and spending analysis 
- Anomalies: Automatically detected issues 
Structured Logging System
Log Entry Types
CoAgent captures comprehensive event data through structured logging:
Session Events
{
  "event_type": "session_start",
  "session_id": "run-12345",
  "timestamp": "2025-01-16T17:30:00Z",
  "meta": {
    "agent_config": "customer-support-gpt4",
    "user_context": "web_chat"
  }
}
LLM Interactions
{
  "event_type": "llm_call",
  "session_id": "run-12345",
  "prompt": "Help me return a product",
  "system_prompt": "You are a helpful customer support agent...",
  "model": "gpt-4",
  "timestamp": "2025-01-16T17:30:05Z"
}
{
  "event_type": "llm_response", 
  "session_id": "run-12345",
  "response": "I'd be happy to help you with your return...",
  "input_tokens": 245,
  "output_tokens": 156,
  "total_tokens": 401,
  "timestamp": "2025-01-16T17:30:08Z"
}
Tool Execution
{
  "event_type": "tool_call",
  "session_id": "run-12345",
  "tool_name": "order_lookup",
  "parameters": {"order_id": "ORD-789"},
  "timestamp": "2025-01-16T17:30:09Z"
}
{
  "event_type": "tool_response",
  "session_id": "run-12345", 
  "tool_name": "order_lookup",
  "result": {"status": "shipped", "tracking": "TRK-456"},
  "execution_time_ms": 245,
  "success": true,
  "timestamp": "2025-01-16T17:30:10Z"
}
Error Events
{
  "event_type": "error",
  "session_id": "run-12345",
  "error_info": {
    "severity": "medium",
    "message": "Rate limit exceeded",
    "error_code": "RATE_LIMIT_429",
    "recovery_attempted": true
  },
  "timestamp": "2025-01-16T17:30:15Z"
}
Logging from Applications
Python Client Integration
from coagent import Coagent
from coagent_types import CoagentConfig, LoggerConfig
# Point the logger at the CoAgent server to enable structured logging
config = CoagentConfig(
    model_name="gpt-4",
    logger_config=LoggerConfig(
        base_url="http://localhost:3000",
        enabled=True
    )
)

# Every call made through the agent is now logged automatically
agent = Coagent(config)
response = agent.process_prompt("What's the weather like today?")
Rust Client Integration
use coagent_client::{CoaClient, LogEntry, LogEntryHeader, UserInputLog};
use serde_json::json;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = CoaClient::new("http://localhost:3000/api")?;

    // Build a structured user-input entry tied to a custom run ID
    let log_entry = LogEntry::UserInput(UserInputLog {
        hdr: LogEntryHeader {
            run_id: "custom-run-456".to_string(),
            timestamp: chrono::Utc::now().to_rfc3339(),
            meta: json!({
                "source": "external_api",
                "user_id": "user_123"
            }),
        },
        content: "Customer inquiry about order status".to_string(),
    });

    // Ship the entry to the CoAgent logging API
    client.log_entry(log_entry).await?;
    Ok(())
}
Performance Monitoring
Key Metrics
CoAgent tracks comprehensive performance metrics:
Response Time Metrics
- Average Response Time: Mean time from prompt to response 
- 95th Percentile: Response time below which 95% of requests complete 
- Response Time Distribution: Histogram of response times 
- Trend Analysis: Response time changes over time 
Token Usage Metrics
- Input Tokens: Tokens consumed by prompts and context 
- Output Tokens: Tokens generated in responses 
- Token Efficiency: Output/Input token ratio 
- Model-specific Usage: Token consumption by model type 
Success Rate Metrics
- Overall Success Rate: Percentage of successful requests 
- Error Rate by Type: Breakdown of error categories 
- Tool Call Success: Success rate of tool executions 
- Recovery Rate: Successful error recovery attempts 
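The sketch below shows one way to derive a few of these metrics yourself from the runs API. It assumes each run record exposes total_time_ms and status fields, as in the debugging examples later in this guide; treat the field names as assumptions to verify against your deployment:
import requests

def summarize_runs(limit: int = 1000) -> dict:
    runs = requests.get(
        f"http://localhost:3000/api/v1/runs?limit={limit}", timeout=30
    ).json()
    if not runs:
        return {}

    durations = sorted(r["total_time_ms"] for r in runs)
    successes = sum(1 for r in runs if r["status"] == "success")

    # 95th percentile: the duration below which 95% of requests complete
    p95_index = max(int(len(durations) * 0.95) - 1, 0)

    return {
        "avg_response_ms": sum(durations) / len(durations),
        "p95_response_ms": durations[p95_index],
        "success_rate": successes / len(runs),
    }

print(summarize_runs())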
Performance Analysis Dashboard
Overview Cards
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Total Requests  │ │  Avg Response   │ │   Total Tokens  │ │ Estimated Cost  │
│     12,847      │ │     1.8s        │ │   2.4M tokens   │ │     $145.23     │
│  ↑ 15% vs prev  │ │  ↓ 0.2s vs prev │ │ ↑ 12% vs prev   │ │  ↑ 8% vs prev   │
└─────────────────┘ └─────────────────┘ └─────────────────┘ └─────────────────┘
Performance Charts
- Requests Over Time: Line chart showing request volume 
- Response Time Trends: Response time evolution 
- Token Usage Patterns: Daily/hourly token consumption 
- Error Rate Monitoring: Error frequency and types 
Performance Optimization
Response Time Optimization
Identify Slow Requests:
curl "http://localhost:3000/api/v1/logs?filter=duration_gt:5000&sort=duration_desc"
Common Causes & Solutions:
- Large Context Windows: Reduce prompt length, implement summarization 
- Complex Tool Calls: Optimize tool execution, implement caching 
- Model Selection: Use faster models for simple tasks 
- Token Limits: Reduce max_tokens for quicker responses 
Monitor Tool Performance:
{
  "tool_name": "web_search",
  "avg_execution_time": 2500,
  "success_rate": 0.95,
  "calls_per_hour": 45
}
Token Efficiency Improvements
Track Token Patterns:
- Monitor input/output ratios by agent type 
- Identify prompts with high token consumption 
- Analyze tool call overhead 
- Track model-specific efficiency 
Optimization Strategies:
- Prompt Engineering: Reduce unnecessary verbosity 
- Context Management: Clear context between conversations 
- Model Selection: Choose appropriate models for task complexity 
- Response Length Control: Set optimal max_tokens limits 
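To act on these patterns, a small script can flag the sessions with the heaviest token usage so their prompts can be reviewed first. The sketch below assumes llm_response log entries carry the session_id and total_tokens fields shown earlier; verify the field names against your logs:
import requests
from collections import defaultdict

def top_token_sessions(limit: int = 1000, top_n: int = 10) -> list:
    entries = requests.get(
        f"http://localhost:3000/api/v1/logs?event_type=llm_response&limit={limit}",
        timeout=30,
    ).json()

    # Sum token usage per session and surface the heaviest consumers
    usage = defaultdict(int)
    for entry in entries:
        usage[entry["session_id"]] += entry.get("total_tokens", 0)

    return sorted(usage.items(), key=lambda item: item[1], reverse=True)[:top_n]

for session_id, tokens in top_token_sessions():
    print(f"{session_id}: {tokens} tokens")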
Cost Monitoring and Analysis
Cost Tracking Features
Real-time Cost Monitoring
- Current Spending: Today's costs across all agents 
- Budget Tracking: Compare against set budgets 
- Cost Projections: Predicted monthly spending based on current trends (see the sketch after this list) 
- Model Cost Breakdown: Spending by model type 
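A simple way to project monthly spend is to extrapolate the month-to-date total by the number of days elapsed. The sketch below is a minimal version that takes the month-to-date total as an input; how you retrieve that number from the cost data is up to you:
from datetime import date
import calendar

def project_monthly_cost(month_to_date_spend, today=None):
    # Extrapolate linearly: average daily spend so far times days in the month
    today = today or date.today()
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    daily_average = month_to_date_spend / today.day
    return daily_average * days_in_month

# Example: $145.23 spent by January 16th projects to roughly $281 for the month
print(round(project_monthly_cost(145.23, date(2025, 1, 16)), 2))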
Detailed Cost Analysis
{
  "cost_breakdown": {
    "by_model": {
      "gpt-4": {"cost": 89.45, "percentage": 65.2},
      "gpt-3.5-turbo": {"cost": 32.18, "percentage": 23.5},
      "claude-3-sonnet": {"cost": 15.67, "percentage": 11.3}
    },
    "by_agent": {
      "customer-support": {"cost": 78.32, "calls": 1234},
      "technical-docs": {"cost": 45.89, "calls": 567}, 
      "content-writer": {"cost": 23.09, "calls": 890}
    },
    "by_tool": {
      "web_search": {"cost": 12.45, "calls": 234},
      "database_query": {"cost": 8.76, "calls": 156}
    }
  }
}
Cost Optimization Strategies
Model Tiering
def select_model_by_complexity(task_complexity):
    # Route simple requests to cheaper, faster models
    if task_complexity == "simple":
        return "gpt-3.5-turbo"
    elif task_complexity == "moderate":
        return "claude-3-haiku"
    # Reserve the most capable (and most expensive) model for hard tasks
    else:
        return "gpt-4"
Budget Alerts
Set up automatic alerts when spending exceeds thresholds:
def check_daily_budget():
    daily_cost = get_daily_spending()
    if daily_cost > DAILY_BUDGET * 0.8:
        send_alert(f"Daily spending at 80%: ${daily_cost}")
    if daily_cost > DAILY_BUDGET:
        send_alert(f"Daily budget exceeded: ${daily_cost}")Anomaly Detection
Automatic Anomaly Detection
CoAgent automatically identifies unusual patterns:
Performance Anomalies
- Response Time Spikes: Sudden increases in response latency 
- Success Rate Drops: Significant decreases in successful requests 
- Token Usage Anomalies: Unexpected changes in token consumption 
- Tool Call Failures: Unusual tool execution problems 
Usage Pattern Anomalies
- Traffic Spikes: Unusual increases in request volume 
- Model Usage Changes: Unexpected shifts in model selection 
- Error Pattern Changes: New or increased error types 
- Cost Anomalies: Spending significantly above or below trends 
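If you want these detections to feed an external alerting channel, a small poller works well. The sketch below assumes a hypothetical /api/v1/monitoring/anomalies endpoint returning records shaped like the examples in the next section; check the REST API reference for the actual route in your version:
import time
import requests

def poll_anomalies(interval_seconds: int = 60):
    seen = set()
    while True:
        # Hypothetical endpoint; replace with the actual anomalies route
        anomalies = requests.get(
            "http://localhost:3000/api/v1/monitoring/anomalies", timeout=10
        ).json()

        for anomaly in anomalies:
            key = (anomaly["anomaly_type"], anomaly["detected_at"])
            if key not in seen and anomaly["severity"] == "high":
                seen.add(key)
                print(f"ALERT: {anomaly['description']}")

        time.sleep(interval_seconds)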
Anomaly Examples
Performance Degradation Alert
{
  "anomaly_type": "performance_degradation",
  "severity": "high",
  "description": "Average response time increased by 250% in last hour",
  "detected_at": "2025-01-16T17:45:00Z",
  "metrics": {
    "current_avg": 4.2,
    "baseline_avg": 1.7,
    "affected_requests": 156
  },
  "recommended_actions": [
    "Check model provider status",
    "Review recent configuration changes",
    "Monitor tool execution times"
  ]
}
Unusual Error Pattern
{
  "anomaly_type": "error_spike",
  "severity": "medium", 
  "description": "Rate limit errors increased by 500% in last 30 minutes",
  "detected_at": "2025-01-16T17:30:00Z",
  "metrics": {
    "error_count": 45,
    "baseline_count": 9,
    "affected_agents": ["customer-support", "technical-docs"]
  },
  "recommended_actions": [
    "Review API usage patterns",
    "Consider request rate limiting",
    "Check for unusual traffic sources"
  ]
}
Advanced Monitoring Features
Drill-down Analysis
From Dashboard to Details
- Click Metric Card: Navigate to filtered runs view 
- Select Time Range: Focus on specific time periods 
- Filter by Criteria: Agent, model, status, etc. 
- View Individual Runs: Detailed execution traces 
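The same drill-down can be scripted against the REST API. The sketch below reuses the filter and sort parameters from the performance optimization example to pull the slowest requests, which you can then open in the run detail view described next; the session_id and duration_ms field names are assumptions to verify against your log schema:
import requests

def slowest_requests(threshold_ms: int = 5000) -> list:
    # Same filter syntax as the curl example in the performance section
    response = requests.get(
        "http://localhost:3000/api/v1/logs"
        f"?filter=duration_gt:{threshold_ms}&sort=duration_desc",
        timeout=30,
    )
    return response.json()

for entry in slowest_requests()[:10]:
    print(entry.get("session_id"), entry.get("duration_ms"))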
Run Detail View
Run #REQ-5872 • 2025-01-16 17:30:21 • Status: Success
Metrics Summary:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Model     │ │  Duration   │ │    Tokens   │ │   Status    │
│   gpt-4     │ │    1.8s     │ │ 401 (245/156)│ │   Success   │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
Event Timeline:
17:30:21.023  Session Start
17:30:21.045  User Input: "Help me return a product"
17:30:21.067  LLM Call: customer-support context
17:30:22.345  Tool Call: order_lookup(order_id="ORD-123")
17:30:22.590  Tool Response: {"status": "shipped", "eligible": true}
17:30:22.612  LLM Response: "I can help you with that return..."
17:30:22.634  Session End
Comparison Analysis
Run Comparisons
Compare two specific runs side-by-side:
REQ-5872 vs REQ-5871
┌─────────────────────┬─────────────┬─────────────┐
│       Metric        │  REQ-5872   │  REQ-5871   │
├─────────────────────┼─────────────┼─────────────┤
│ Duration            │    1.8s     │    3.2s     │
│ Total Tokens        │     401     │     678     │
│ Tool Calls          │      1      │      3      │
│ Success             │     ✓       │     ✗       │
│ Cost                │   $0.024    │   $0.041    │
└─────────────────────┴─────────────┴─────────────┘
Agent Performance Comparison
comparison_results = {
    "gpt-4-conservative": {
        "avg_response_time": 1.2,
        "success_rate": 0.98,
        "cost_per_request": 0.045
    },
    "gpt-4-balanced": {
        "avg_response_time": 1.8, 
        "success_rate": 0.95,
        "cost_per_request": 0.038
    },
    "claude-3-sonnet": {
        "avg_response_time": 2.1,
        "success_rate": 0.96,
        "cost_per_request": 0.032
    }
}
Real-time Monitoring
Live Dashboard Updates
- WebSocket Integration: Real-time data streaming 
- Auto-refresh: Configurable update intervals (a minimal polling sketch follows this list) 
- Live Activity Feed: Recent requests as they occur 
- Alert Notifications: Real-time anomaly alerts 
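Outside the browser, the live view can be approximated with a polling loop against an overview endpoint at whatever refresh interval suits your terminal or alerting script. The sketch below uses the test-studio profile overview endpoint referenced in the CI/CD example; swap in the profile you care about:
import time
import requests

def watch_overview(profile: str = "test-studio", interval_seconds: int = 30):
    url = f"http://localhost:3000/api/v1/monitoring/profiles/{profile}/overview"
    while True:
        # Pull the latest overview and print it; swap print for your own renderer
        overview = requests.get(url, timeout=10).json()
        print(overview)
        time.sleep(interval_seconds)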
Monitoring External Systems
import requests

def log_external_llm_call(api_key, call_data):
    # Forward an LLM call made outside CoAgent to the logging API
    response = requests.post(
        "http://localhost:3000/api/v1/logs",
        headers={"X-API-Key": api_key},
        json={
            "entry": {
                "event_type": "llm_call",
                "session_id": call_data["session_id"],
                "prompt": call_data["prompt"],
                "model": call_data["model"],
                "timestamp": call_data["timestamp"]
            }
        }
    )
    return response.json()
Integration Patterns
CI/CD Integration
Monitoring Test Results
#!/bin/bash
# Start the test set and capture the run ID from the response
# (the exact response field may differ in your deployment)
TEST_RUN_ID=$(curl -s -X POST "http://localhost:3000/api/v1/testsets/regression-suite/run" | jq -r '.run_id')
echo "Monitoring test run: $TEST_RUN_ID"
while true; do
    # Poll the test run until it leaves the Running state
    STATUS=$(curl -s "http://localhost:3000/api/v1/testruns/$TEST_RUN_ID" | jq -r '.status')
    if [ "$STATUS" != "Running" ]; then break; fi

    # Report interim performance metrics from the test-studio profile
    METRICS=$(curl -s "http://localhost:3000/api/v1/monitoring/profiles/test-studio/overview")
    echo "Current metrics: $(echo "$METRICS" | jq '.performance')"

    sleep 30
done
Performance Regression Detection
def check_deployment_performance():
    current_metrics = get_current_metrics()
    baseline_metrics = get_baseline_metrics()

    # Flag a regression if average latency exceeds baseline by more than 20%
    performance_degradation = (
        current_metrics["avg_response_time"] >
        baseline_metrics["avg_response_time"] * 1.2
    )

    if performance_degradation:
        raise Exception("Performance regression detected")
Production Monitoring
Health Checks
def health_check():
    try:
        # Exercise the agent end-to-end with a trivial prompt
        response = agent.process_prompt("health check")

        # Flag degraded service if the round trip is unusually slow
        if response.metadata.get("duration_ms", 0) > 5000:
            return {"status": "degraded", "reason": "slow_response"}

        # Check the success rate across the last 100 runs
        recent_runs = get_recent_runs(limit=100)
        success_rate = sum(1 for r in recent_runs if r.status == "success") / len(recent_runs)

        if success_rate < 0.95:
            return {"status": "degraded", "reason": "low_success_rate"}

        return {"status": "healthy"}

    except Exception as e:
        return {"status": "unhealthy", "reason": str(e)}
Capacity Planning
def analyze_capacity_trends():
    metrics = get_monthly_metrics()

    growth_rate = calculate_growth_rate(metrics["request_volume"])
    cost_trend = calculate_cost_trend(metrics["spending"])

    # Project future demand and spend from the observed trends
    projected_volume = project_future_volume(growth_rate)
    projected_cost = project_future_cost(cost_trend)

    return {
        "current_rps": metrics["requests_per_second"],
        "projected_rps": projected_volume["peak_rps"],
        "capacity_needed": projected_volume["peak_rps"] * 1.5,  # 50% headroom over projected peak
        "cost_projection": projected_cost
    }
Best Practices
Monitoring Strategy
1. Establish Baselines
- Performance Baselines: Record typical response times and success rates 
- Cost Baselines: Track normal spending patterns 
- Usage Baselines: Understand typical request volumes and patterns 
2. Define SLAs
- Response Time: 95% of requests under 3 seconds 
- Success Rate: >99% successful completions 
- Availability: >99.9% system availability 
- Cost Control: Stay within monthly budget 
3. Alert Thresholds
- Critical: Service unavailable, success rate <95% 
- Warning: Response time >2x baseline, cost >80% of budget 
- Info: Usage patterns change, new error types appear 
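A minimal sketch of how these thresholds might be encoded is shown below, using the levels listed above; the metric and baseline values are assumed to come from your own monitoring queries:
def classify_alert(metrics: dict, baselines: dict, daily_budget: float) -> str:
    # Critical: service unavailable or success rate below 95%
    if not metrics.get("service_available", True) or metrics["success_rate"] < 0.95:
        return "critical"

    # Warning: response time more than 2x baseline, or spend above 80% of budget
    if (metrics["avg_response_time"] > 2 * baselines["avg_response_time"]
            or metrics["daily_cost"] > 0.8 * daily_budget):
        return "warning"

    # Info-level signals (changed usage patterns, new error types) need their
    # own detection logic; anything else is considered nominal here
    return "ok"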
Data Retention
Log Retention Policies
RETENTION_POLICY = {
    "detailed_logs": "30_days",
    "aggregated_metrics": "1_year",
    "cost_data": "3_years",
    "anomaly_data": "6_months"
}
Archive Strategy
- Hot Data: Last 7 days - immediate access 
- Warm Data: Last 30 days - quick retrieval 
- Cold Data: Older than 30 days - archival storage 
- Cost Data: Retain for compliance and analysis 
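When building your own archiver, a small helper keeps the tiering explicit; the sketch below simply maps a record's age to the tiers defined above and expects a timezone-aware timestamp:
from datetime import datetime, timezone

def storage_tier(record_timestamp: datetime) -> str:
    age_days = (datetime.now(timezone.utc) - record_timestamp).days
    if age_days <= 7:
        return "hot"    # last 7 days: immediate access
    if age_days <= 30:
        return "warm"   # last 30 days: quick retrieval
    return "cold"       # older than 30 days: archival storage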
Privacy and Security
Sensitive Data Handling
def sanitize_log_entry(entry):
    # Redact personally identifiable information from user-supplied text
    if "user_input" in entry:
        entry["user_input"] = sanitize_pii(entry["user_input"])

    # Hash session identifiers so logs cannot be tied back to raw IDs
    if "session_id" in entry:
        entry["session_id"] = hash_session_id(entry["session_id"])

    return entry
Access Control
- Role-based Access: Different access levels for different users 
- API Key Management: Secure external system integration 
- Data Anonymization: Remove or hash PII in logs 
- Compliance: Meet GDPR, HIPAA, or other regulatory requirements 
Troubleshooting
Common Monitoring Issues
Missing Data
# Check that the logging service is reachable
curl http://localhost:3000/api/v1/logs/health

# Confirm the client is configured to send logs
grep -r "logger_config" /path/to/client/code

# Send a test entry and inspect the full request/response
curl -v http://localhost:3000/api/v1/logs -X POST \
  -H "Content-Type: application/json" \
  -d '{"entry": {"event_type": "test", "session_id": "test-123"}}'
Performance Issues
- Slow Dashboard Loading: Check database performance, consider caching 
- High Memory Usage: Review log retention policies, implement archiving 
- API Timeouts: Optimize queries, add request timeouts 
Data Inconsistencies
- Missing Events: Check for client-side errors, network issues 
- Incorrect Metrics: Verify aggregation logic, check for clock drift 
- Cost Discrepancies: Validate token counting, compare with provider bills 
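For cost discrepancies in particular, it helps to reconcile logged token counts against the provider invoice. The sketch below is a minimal version: the per-token prices are placeholders you must replace with your provider's published rates, and the token totals are assumed to come from the llm_response events shown earlier:
def estimate_cost_from_logs(entries, price_per_1k_input, price_per_1k_output):
    # Sum token usage recorded in llm_response events and price it
    input_tokens = sum(e.get("input_tokens", 0) for e in entries)
    output_tokens = sum(e.get("output_tokens", 0) for e in entries)
    return (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output

# Compare the estimate against the provider bill for the same period;
# a persistent gap usually points to missing events or token-count drift.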
Debugging Tools
Log Analysis
curl "http://localhost:3000/api/v1/logs?event_type=error&limit=100" | \
  jq '.[] | .error_info.message' | sort | uniq -c | sort -nr
curl "http://localhost:3000/api/v1/runs?limit=1000" | \
  jq '.[] | .total_time_ms' | sort -n | awk '{print NR, $1}'Custom Dashboards
Create specialized monitoring views for specific needs:
- Agent-specific Dashboards: Focus on individual agent performance 
- Cost Control Dashboards: Detailed spending analysis 
- Error Investigation Dashboards: Deep-dive into failure patterns 
- Capacity Planning Dashboards: Usage trends and projections 
Next Steps
- Agent Configuration Guide: Optimize agents for better monitoring 
- Testing and QA Guide: Integrate testing with monitoring 
- Python Client Tutorial: Implement monitoring in applications 
- Web UI Reference: Complete monitoring interface guide 
- REST API Reference: API endpoints for custom monitoring solutions