Guide

Testing AI Agents

This guide covers comprehensive testing strategies, quality assurance workflows, and best practices for ensuring reliable AI agent performance in CoAgent.

Overview

CoAgent provides a complete testing framework that includes:

  • Test Studio: Web-based interface for creating and managing tests

  • Automated Testing: API-driven test execution and validation

  • Multi-Agent Comparison: Side-by-side performance analysis

  • Comprehensive Validation: Multiple assertion types and criteria

  • Continuous Monitoring: Real-time performance tracking and anomaly detection

Testing Philosophy

The Testing Pyramid for AI Agents

                    🔍 Manual Testing
                 🧪 Integration Testing
            📊 Multi-Agent Comparison Testing  
       🎯 Regression Testing (Test Suites)
    📝 Unit Testing (Individual Test Cases)

Unit Testing: Individual test cases that validate specific agent behaviors
Regression Testing: Test suites that prevent performance degradation
Comparison Testing: Multi-agent tests that identify optimal configurations
Integration Testing: End-to-end workflows with external systems
Manual Testing: Exploratory testing and user acceptance validation

Test Studio Overview

The Test Studio provides a comprehensive web-based testing environment accessible at http://localhost:3000/test-studio.

Key Components

  • Test Suites: Collections of related test cases

  • Test Cases: Individual test scenarios with inputs and validations

  • Assertions: Validation criteria for agent responses

  • Test Runs: Execution records with detailed results

  • Agent Comparisons: Side-by-side performance analysis

Creating Test Suites

Via Web UI

  1. Navigate to Test Studio

  2. Click "Create New Test Suite"

  3. Configure the test suite:

Name: Customer Support Validation
Description: Comprehensive tests for customer support agent performance
Agent Configurations: customer-support-gpt4, customer-support-claude

Via REST API

curl -X POST http://localhost:3000/api/v1/testsets \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Validation",
    "description": "Comprehensive tests for customer support agent performance", 
    "bound_agent_name": "customer-support-gpt4",
    "cases": []
  }'
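
For scripted setup, the same request can be issued from Python. A minimal sketch using the requests library, mirroring the payload above (the host, agent name, and the assumption that the creation response echoes the new testset's ID should be adjusted for your deployment):

import requests

# Create a test suite with the same payload as the curl example above
payload = {
    "name": "Customer Support Validation",
    "description": "Comprehensive tests for customer support agent performance",
    "bound_agent_name": "customer-support-gpt4",
    "cases": []
}

resp = requests.post("http://localhost:3000/api/v1/testsets", json=payload)
resp.raise_for_status()
# Assumption: the creation response returns the new testset, including its ID
print("Created test suite:", resp.json().get("id"))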

Creating Test Cases

Basic Test Case Structure

Each test case consists of:

  • Input: The prompt or scenario to test

  • Validations: Criteria for evaluating the response

  • Metadata: Additional context and configuration

Example Test Cases

1. Content Validation Test

{
  "input": {
    "human_prompt": "I want to return a product I bought last week"
  },
  "validations": [
    {
      "kind": {
        "ContentMatch": {
          "pattern": "(return policy|refund|exchange)"
        }
      }
    }
  ]
}

2. Tool Call Validation Test

{
  "input": {
    "human_prompt": "What's the status of order #12345?"
  },
  "validations": [
    {
      "kind": {
        "ToolCall": {
          "tool_name": "order_lookup"
        }
      }
    }
  ]
}

3. Response Schema Validation

{
  "input": {
    "human_prompt": "Generate a customer support ticket summary"
  },
  "validations": [
    {
      "kind": {
        "ResponseSchema": {
          "schema": {
            "type": "object",
            "properties": {
              "ticket_id": {"type": "string"},
              "priority": {"type": "string", "enum": ["low", "medium", "high"]},
              "summary": {"type": "string"}
            },
            "required": ["ticket_id", "priority", "summary"]
          }
        }
      }
    }
  ]
}

Validation Types

CoAgent supports multiple validation types to thoroughly test agent behavior.

1. Content Match Validation

Tests whether responses contain expected content patterns.

Substring Matching

{
  "ContentMatch": {
    "pattern": "thank you for contacting"
  }
}

Regular Expression Matching

{
  "ContentMatch": {
    "pattern": "order #\\d{5,8}"
  }
}
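
Since ContentMatch patterns run against free-form model output, it is worth sanity-checking a regular expression against known good and bad responses before adding it to a test case. A quick local check with Python's re module (the sample strings are illustrative):

import re

pattern = r"order #\d{5,8}"

# One response that should match the pattern and one that should not
assert re.search(pattern, "I found order #123456 in our system.")
assert not re.search(pattern, "I could not find any matching order.")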

2. Tool Call Validation

Verifies that agents call appropriate tools during execution.

{
  "ToolCall": {
    "tool_name": "search_knowledge_base"
  }
}

3. Response Schema Validation

Ensures structured outputs match expected JSON schemas.

{
  "ResponseSchema": {
    "schema": {
      "type": "object",
      "properties": {
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
      }
    }
  }
}
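
To debug a ResponseSchema assertion, you can validate a sample structured response against the same schema locally. A minimal sketch using the third-party jsonschema package (an assumption for offline prototyping, not part of CoAgent itself):

from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
    }
}

sample_response = {
    "confidence": 0.92,
    "answer": "Returns are accepted within 30 days.",
    "sources": ["kb:returns-policy"]
}

try:
    validate(instance=sample_response, schema=schema)
    print("Sample response conforms to the schema")
except ValidationError as exc:
    print("Schema violation:", exc.message)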

4. Response Time Validation

Validates that responses are generated within acceptable time limits.

{
  "ResponseTime": {
    "max_seconds": 5
  }
}

5. Semantic Similarity Validation

Compares response meaning to expected content using embedding similarity.

{
  "SemanticSimilarity": {
    "sentence": "I apologize for the inconvenience and will help resolve this issue",
    "threshold": 0.8
  }
}
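
When a similarity threshold feels too strict or too loose, prototyping the comparison offline can help you pick a value. A sketch using the sentence-transformers package and cosine similarity (the library and model are assumptions for experimentation, not necessarily the embeddings CoAgent uses internally):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "I apologize for the inconvenience and will help resolve this issue"
candidate = "Sorry about the trouble, let me help you get this resolved"

# Embed both sentences and compare with cosine similarity
embeddings = model.encode([expected, candidate])
score = float(util.cos_sim(embeddings[0], embeddings[1]))
print(f"cosine similarity: {score:.2f}")  # compare against your threshold, e.g. 0.8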

6. LLM-Based Validation

Uses another LLM to evaluate response quality against specific criteria.

{
  "LlmV0": {
    "llm0": {
      "model_ref": {
        "provider_id": "openai-eval",
        "model_name": "gpt-4"
      },
      "criteria": "Rate whether the response is helpful, accurate, and professional on a scale of 1-10. A score of 8 or higher passes."
    }
  }
}
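
Validation kinds can be combined within a single test case. A minimal sketch that assembles a case mixing ContentMatch, ToolCall, and ResponseTime checks and includes it in the testset creation payload shown earlier (field names follow the examples above; names and endpoints are placeholders for your deployment):

import requests

# One test case with several validation kinds attached
order_status_case = {
    "input": {"human_prompt": "What's the status of order #12345?"},
    "validations": [
        {"kind": {"ContentMatch": {"pattern": "order #\\d{5,8}"}}},
        {"kind": {"ToolCall": {"tool_name": "order_lookup"}}},
        {"kind": {"ResponseTime": {"max_seconds": 5}}}
    ]
}

payload = {
    "name": "Order Status Checks",
    "description": "Mixed-validation checks for the order status flow",
    "bound_agent_name": "customer-support-gpt4",
    "cases": [order_status_case]
}

resp = requests.post("http://localhost:3000/api/v1/testsets", json=payload)
resp.raise_for_status()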

Running Tests

Single Test Suite Execution

Via Web UI

  1. Navigate to your test suite in Test Studio

  2. Click "Run Test Suite"

  3. Select sandbox configurations to test against

  4. Monitor execution progress in real-time

  5. Review detailed results when complete

Via REST API

curl -X POST "http://localhost:3000/api/v1/testsets/{testset_id}/run" \
  -H "Content-Type: application/json" \
  -d '{
    "selected_configs": ["sandbox-config-1", "sandbox-config-2"]
  }'

Multi-Agent Comparison

Compare multiple agent configurations simultaneously:

curl -X POST "http://localhost:3000/api/v1/testsets/{testset_id}/run" \
  -H "Content-Type: application/json" \
  -d '{
    "selected_configs": [
      "customer-support-gpt4-config",
      "customer-support-claude-config", 
      "customer-support-mistral-config"
    ]
  }'
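
Comparison runs execute asynchronously, so scripts usually start the run and then poll the test-run endpoint until it leaves the Running state. A minimal Python sketch, assuming the run response carries an id and a status field as used by the shell script later in this guide:

import time
import requests

BASE = "http://localhost:3000/api/v1"
testset_id = "your-testset-id"  # placeholder

# Start a comparison run across several sandbox configurations
run = requests.post(
    f"{BASE}/testsets/{testset_id}/run",
    json={"selected_configs": [
        "customer-support-gpt4-config",
        "customer-support-claude-config",
        "customer-support-mistral-config"
    ]}
).json()

# Poll the test run until it finishes
while requests.get(f"{BASE}/testruns/{run['id']}").json()["status"] == "Running":
    time.sleep(30)

print("Comparison run finished:", run["id"])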

Analyzing Test Results

Test Run Summary

Each test run provides comprehensive metrics:

  • Overall Status: Passed/Failed/Warning

  • Case Statistics: Total, passed, failed, warnings

  • Performance Metrics: Average response time, token usage

  • Agent Comparison: Side-by-side performance data

Individual Case Results

Drill down into specific test cases to see:

  • Input/Output: Original prompt and agent response

  • Assertion Results: Pass/fail status for each validation

  • Execution Details: Tool calls, timing, token usage

  • Agent Comparison: How different agents performed on the same test

Performance Analysis

Key metrics to monitor (a sketch for computing several of them from run results follows these lists):

Response Quality Metrics

  • Pass Rate: Percentage of assertions that passed

  • Consistency: Variation in responses across multiple runs

  • Semantic Accuracy: How well responses match expected meaning

Performance Metrics

  • Response Time: Average and 95th percentile latency

  • Token Efficiency: Input/output token ratio

  • Tool Usage: Frequency and appropriateness of tool calls

Cost Metrics

  • Token Cost: Total spending per test run

  • Cost per Test Case: Average cost across test cases

  • Model Efficiency: Cost-to-quality ratio
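
Several of these metrics can be computed directly from a finished run's results. A minimal sketch, assuming the result fields used by the monitoring example later in this guide (passed_cases, total_cases, total_time_ms, total_tokens) and an illustrative per-token price:

import requests

run_id = "your-test-run-id"  # placeholder
results = requests.get(f"http://localhost:3000/api/v1/testruns/{run_id}").json()

pass_rate = results["passed_cases"] / results["total_cases"]
avg_response_time_ms = results["total_time_ms"] / results["total_cases"]
# Assumed pricing of $0.01 per 1K tokens; substitute your provider's rates
cost_per_case = results["total_tokens"] / 1000 * 0.01 / results["total_cases"]

print(f"pass rate: {pass_rate:.1%}")
print(f"avg response time: {avg_response_time_ms:.0f} ms")
print(f"cost per test case: ${cost_per_case:.4f}")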

Advanced Testing Strategies

1. Progressive Testing

Start with basic tests and gradually increase complexity:

Phase 1: Basic functionality tests
Phase 2: Edge case and error handling  
Phase 3: Performance and stress tests
Phase 4: Multi-agent comparison tests
Phase 5: Integration and end-to-end tests

2. Test Data Management

Synthetic Test Data Generation

# Generate test cases programmatically
# Reusable validation definitions (shapes follow the examples above)
content_validation = {"kind": {"ContentMatch": {"pattern": "(refund|shipping|product)"}}}
tool_validation = {"kind": {"ToolCall": {"tool_name": "search_knowledge_base"}}}

test_cases = [
    {
        "input": {"human_prompt": f"Handle {scenario} for customer support"},
        "validations": [content_validation, tool_validation]
    }
    for scenario in ["refund request", "shipping inquiry", "product question"]
]

Real Data Integration

Use anonymized real user interactions:

{
  "input": {
    "human_prompt": "[anonymized real user query]"
  },
  "validations": [
    {
      "kind": {
        "LlmV0": {
          "criteria": "Evaluate if this response would satisfy the original user intent"
        }
      }
    }
  ]
}

3. Regression Testing

Maintain test suites that prevent performance degradation:

Version Comparison Tests

  • Compare current agent performance to baseline versions

  • Track metrics over time to identify trends

  • Set up automated alerts for significant performance drops, as sketched below
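
A minimal alerting sketch, assuming the current pass rate is computed as shown earlier and the baseline value is recorded from a previously accepted run:

# Alert when the current pass rate drops noticeably below a stored baseline
BASELINE_PASS_RATE = 0.95   # recorded from a previous, accepted run
ALERT_THRESHOLD = 0.05      # alert on a drop of more than five percentage points

def check_regression(current_pass_rate: float) -> None:
    drop = BASELINE_PASS_RATE - current_pass_rate
    if drop > ALERT_THRESHOLD:
        # Hook in your alerting channel here (email, chat webhook, pager)
        print(f"Regression alert: pass rate dropped by {drop:.1%}")
    else:
        print("Pass rate is within the expected range of the baseline")

check_regression(0.88)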

Feature Regression Prevention

  • Test core functionality after each configuration change

  • Validate that new features don't break existing capabilities

  • Maintain comprehensive test coverage for critical paths

Continuous Testing Integration

Automated Test Execution

Set up automated testing workflows:

#!/bin/bash
# Daily regression test script

# Run core functionality tests
TEST_RUN_ID=$(curl -X POST "http://localhost:3000/api/v1/testsets/core-tests/run" \
  -H "Content-Type: application/json" \
  -d '{"selected_configs": ["production-config"]}' \
  | jq -r '.id')

# Wait for completion
while [ "$(curl -s "http://localhost:3000/api/v1/testruns/$TEST_RUN_ID" | jq -r '.status')" = "Running" ]; do
  sleep 30
done

# Check results and alert if failures
FAILED_CASES=$(curl -s "http://localhost:3000/api/v1/testruns/$TEST_RUN_ID" | jq '.failed_cases')
if [ "$FAILED_CASES" -gt 0 ]; then
  echo "❌ Test failures detected: $FAILED_CASES failed cases"
  # Send alert notification
else
  echo "✅ All tests passed"
fi

Performance Monitoring Integration

Connect test results to monitoring systems:

# Example: Send test metrics to monitoring
import requests

# Placeholder cost model: substitute your provider's actual per-token pricing
def calculate_token_cost(total_tokens, cost_per_1k_tokens=0.01):
    return total_tokens / 1000 * cost_per_1k_tokens

def send_test_metrics(test_run_results):
    metrics = {
        'test_pass_rate': test_run_results['passed_cases'] / test_run_results['total_cases'],
        'avg_response_time': test_run_results['total_time_ms'] / test_run_results['total_cases'],
        'total_cost': calculate_token_cost(test_run_results['total_tokens'])
    }

    # Send to the monitoring system's ingestion endpoint
    requests.post('http://monitoring:8080/metrics', json=metrics)

Quality Assurance Best Practices

1. Test Design Principles

Comprehensive Coverage

  • Test happy paths and edge cases

  • Include error scenarios and boundary conditions

  • Validate both functional and non-functional requirements

Realistic Test Data

  • Use representative real-world scenarios

  • Include diverse input types and formats

  • Test with different user personas and contexts

Clear Expectations

  • Define specific, measurable success criteria

  • Use appropriate validation types for each test goal

  • Document test intent and expected outcomes

2. Test Maintenance

Regular Review and Updates

  • Review test cases monthly for relevance

  • Update validations based on agent improvements

  • Remove obsolete tests and add new scenarios

Test Data Freshness

  • Refresh test datasets regularly

  • Incorporate new real-world scenarios

  • Update expected outcomes based on changing requirements

3. Result Interpretation

Understanding Metrics

  • Focus on trends rather than individual failures

  • Consider context when interpreting results

  • Use multiple validation types for comprehensive assessment

Action on Results

  • Investigate consistent failures promptly

  • Use comparison results to guide optimization

  • Document and share insights across the team

Troubleshooting Common Issues

Test Execution Problems

Tests Failing to Start

# Check test suite configuration
curl http://localhost:3000/api/v1/testsets/{testset_id}

# Verify sandbox configurations exist
# (check them in Test Studio or via your deployment's configuration endpoints)

Slow Test Execution

  • Check agent response times in monitoring

  • Reduce max_tokens if responses are too long

  • Verify tool providers are responding quickly

  • Consider using faster models for testing

Validation Issues

False Positives/Negatives

  • Review and refine validation criteria

  • Use multiple validation types for better accuracy

  • Consider semantic similarity for content validation

  • Test validation logic with known good/bad examples

Inconsistent Results

  • Check for non-deterministic agent behavior

  • Review temperature and other sampling parameters

  • Ensure test environment consistency

  • Consider multiple test runs for statistical significance, as sketched below
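
A minimal sketch of repeating a suite several times and inspecting the spread of pass rates, reusing the run-and-poll pattern and result fields shown earlier (the testset ID and configuration name are placeholders):

import statistics
import time
import requests

BASE = "http://localhost:3000/api/v1"
testset_id = "core-tests"  # placeholder

pass_rates = []
for _ in range(5):
    run = requests.post(f"{BASE}/testsets/{testset_id}/run",
                        json={"selected_configs": ["production-config"]}).json()
    while requests.get(f"{BASE}/testruns/{run['id']}").json()["status"] == "Running":
        time.sleep(30)
    results = requests.get(f"{BASE}/testruns/{run['id']}").json()
    pass_rates.append(results["passed_cases"] / results["total_cases"])

# A large spread suggests non-deterministic behavior worth investigating
print("mean pass rate:", statistics.mean(pass_rates))
print("std deviation:", statistics.pstdev(pass_rates))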

Integration with Development Workflow

Pre-deployment Testing

# Example CI/CD integration (GitHub Actions)
name: Agent Testing
on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start CoAgent
        run: docker-compose up -d

      - name: Run Core Tests
        run: |
          TEST_RUN_ID=$(curl -s -X POST "http://localhost:3000/api/v1/testsets/core-suite/run" \
            -H "Content-Type: application/json" \
            -d '{"selected_configs": ["production-config"]}' | jq -r '.id')
          # Wait for completion and check results (see the polling loop in the daily regression script above)

      - name: Run Regression Tests
        run: |
          # Execute the regression test suite the same way, using its testset ID

Performance Benchmarking

Establish baseline performance metrics:

# Benchmark different configurations
# (run_test_suite is a helper you would implement around the
#  POST /api/v1/testsets/{id}/run endpoint and the polling pattern shown above)
configs_to_test = [
    "gpt-4-conservative",
    "gpt-4-balanced",
    "gpt-3.5-fast",
    "claude-3-sonnet"
]

benchmark_results = {}
for config in configs_to_test:
    result = run_test_suite("benchmark-suite", [config])
    benchmark_results[config] = {
        'pass_rate': result.pass_rate,
        'avg_response_time': result.avg_response_time,
        'cost_per_test': result.cost_per_test
    }

# Compare the results and choose the optimal configuration

Next Steps