Guide

Testing AI Agents

This guide covers comprehensive testing strategies, quality assurance workflows, and best practices for ensuring reliable AI agent performance in CoAgent.

Overview

CoAgent provides a complete testing framework that includes:

  • Test Studio: Web-based interface for creating and managing tests

  • Automated Testing: API-driven test execution and validation

  • Multi-Agent Comparison: Side-by-side performance analysis

  • Comprehensive Validation: Multiple assertion types and criteria

  • Continuous Monitoring: Real-time performance tracking and anomaly detection

Testing Philosophy

The Testing Pyramid for AI Agents

                    🔍 Manual Testing
                 🧪 Integration Testing
            📊 Multi-Agent Comparison Testing  
       🎯 Regression Testing (Test Suites)
    📝 Unit Testing (Individual Test Cases)

Unit Testing: Individual test cases that validate specific agent behaviors
Regression Testing: Test suites that prevent performance degradation
Comparison Testing: Multi-agent tests that identify optimal configurations
Integration Testing: End-to-end workflows with external systems
Manual Testing: Exploratory testing and user acceptance validation

Test Studio Overview

The Test Studio provides a comprehensive web-based testing environment accessible at http://localhost:3000/test-studio.

Key Components

  • Test Suites: Collections of related test cases

  • Test Cases: Individual test scenarios with inputs and validations

  • Assertions: Validation criteria for agent responses

  • Test Runs: Execution records with detailed results

  • Agent Comparisons: Side-by-side performance analysis

Creating Test Suites

Via Web UI

  1. Navigate to Test Studio

  2. Click "Create New Test Suite"

  3. Configure the test suite:

Name: Customer Support Validation
Description: Comprehensive tests for customer support agent performance
Agent Configurations: customer-support-gpt4, customer-support-claude

Via REST API

curl -X POST http://localhost:3000/api/v1/testsets \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Validation",
    "description": "Comprehensive tests for customer support agent performance", 
    "bound_agent_name": "customer-support-gpt4",
    "cases": []
  }'
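
For scripted setup, the same request can be issued from Python. A minimal sketch using the requests library, mirroring the payload above (the host, agent name, and the assumption that the creation response echoes the new testset's ID should be adjusted for your deployment):

import requests

# Create a test suite with the same payload as the curl example above
payload = {
    "name": "Customer Support Validation",
    "description": "Comprehensive tests for customer support agent performance",
    "bound_agent_name": "customer-support-gpt4",
    "cases": []
}

resp = requests.post("http://localhost:3000/api/v1/testsets", json=payload)
resp.raise_for_status()
# Assumption: the creation response returns the new testset, including its ID
print("Created test suite:", resp.json().get("id"))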

Creating Test Cases

Basic Test Case Structure

Each test case consists of:

  • Input: The prompt or scenario to test

  • Validations: Criteria for evaluating the response

  • Metadata: Additional context and configuration

Example Test Cases

1. Content Validation Test

{
  "input": {
    "human_prompt": "I want to return a product I bought last week"
  },
  "validations": [
    {
      "kind": {
        "ContentMatch": {
          "pattern": "(return policy|refund|exchange)"
        }
      }
    }
  ]
}

2. Tool Call Validation Test

{
  "input": {
    "human_prompt": "What's the status of order #12345?"
  },
  "validations": [
    {
      "kind": {
        "ToolCall": {
          "tool_name": "order_lookup"
        }
      }
    }
  ]
}

3. Response Schema Validation

{
  "input": {
    "human_prompt": "Generate a customer support ticket summary"
  },
  "validations": [
    {
      "kind": {
        "ResponseSchema": {
          "schema": {
            "type": "object",
            "properties": {
              "ticket_id": {"type": "string"},
              "priority": {"type": "string", "enum": ["low", "medium", "high"]},
              "summary": {"type": "string"}
            },
            "required": ["ticket_id", "priority", "summary"]
          }
        }
      }
    }
  ]
}

Validation Types

CoAgent supports multiple validation types to thoroughly test agent behavior.

1. Content Match Validation

Tests whether responses contain expected content patterns.

Substring Matching

{
  "ContentMatch": {
    "pattern": "thank you for contacting"
  }
}

Regular Expression Matching

{
  "ContentMatch": {
    "pattern": "order #\\d{5,8}"
  }
}
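
Since ContentMatch patterns run against free-form model output, it is worth sanity-checking a regular expression against known good and bad responses before adding it to a test case. A quick local check with Python's re module (the sample strings are illustrative):

import re

pattern = r"order #\d{5,8}"

# One response that should match the pattern and one that should not
assert re.search(pattern, "I found order #123456 in our system.")
assert not re.search(pattern, "I could not find any matching order.")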

2. Tool Call Validation

Verifies that agents call appropriate tools during execution.

{
  "ToolCall": {
    "tool_name": "search_knowledge_base"
  }
}

3. Response Schema Validation

Ensures structured outputs match expected JSON schemas.

{
  "ResponseSchema": {
    "schema": {
      "type": "object",
      "properties": {
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
      }
    }
  }
}
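
To debug a ResponseSchema assertion, you can validate a sample structured response against the same schema locally. A minimal sketch using the third-party jsonschema package (an assumption for offline prototyping, not part of CoAgent itself):

from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}}
    }
}

sample_response = {
    "confidence": 0.92,
    "answer": "Returns are accepted within 30 days.",
    "sources": ["kb:returns-policy"]
}

try:
    validate(instance=sample_response, schema=schema)
    print("Sample response conforms to the schema")
except ValidationError as exc:
    print("Schema violation:", exc.message)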

4. Response Time Validation

Validates that responses are generated within acceptable time limits.

{
  "ResponseTime": {
    "max_seconds": 5
  }
}

5. Semantic Similarity Validation

Compares response meaning to expected content using embedding similarity.

{
  "SemanticSimilarity": {
    "sentence": "I apologize for the inconvenience and will help resolve this issue",
    "threshold": 0.8
  }
}
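
When a similarity threshold feels too strict or too loose, prototyping the comparison offline can help you pick a value. A sketch using the sentence-transformers package and cosine similarity (the library and model are assumptions for experimentation, not necessarily the embeddings CoAgent uses internally):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "I apologize for the inconvenience and will help resolve this issue"
candidate = "Sorry about the trouble, let me help you get this resolved"

# Embed both sentences and compare with cosine similarity
embeddings = model.encode([expected, candidate])
score = float(util.cos_sim(embeddings[0], embeddings[1]))
print(f"cosine similarity: {score:.2f}")  # compare against your threshold, e.g. 0.8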

6. LLM-Based Validation

Uses another LLM to evaluate response quality against specific criteria.

{
  "LlmV0": {
    "llm0": {
      "model_ref": {
        "provider_id": "openai-eval",
        "model_name": "gpt-4"
      },
      "criteria": "Rate whether the response is helpful, accurate, and professional on a scale of 1-10. A score of 8 or higher passes."
    }
  }
}
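
Validation kinds can be combined within a single test case. A minimal sketch that assembles a case mixing ContentMatch, ToolCall, and ResponseTime checks and includes it in the testset creation payload shown earlier (field names follow the examples above; names and endpoints are placeholders for your deployment):

import requests

# One test case with several validation kinds attached
order_status_case = {
    "input": {"human_prompt": "What's the status of order #12345?"},
    "validations": [
        {"kind": {"ContentMatch": {"pattern": "order #\\d{5,8}"}}},
        {"kind": {"ToolCall": {"tool_name": "order_lookup"}}},
        {"kind": {"ResponseTime": {"max_seconds": 5}}}
    ]
}

payload = {
    "name": "Order Status Checks",
    "description": "Mixed-validation checks for the order status flow",
    "bound_agent_name": "customer-support-gpt4",
    "cases": [order_status_case]
}

resp = requests.post("http://localhost:3000/api/v1/testsets", json=payload)
resp.raise_for_status()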

Running Tests

Single Test Suite Execution

Via Web UI

  1. Navigate to your test suite in Test Studio

  2. Click "Run Test Suite"

  3. Select sandbox configurations to test against

  4. Monitor execution progress in real-time

  5. Review detailed results when complete

Via REST API

curl -X POST "http://localhost:3000/api/v1/testsets/{testset_id}/run" \
  -H "Content-Type: application/json" \
  -d '{
    "selected_configs": ["sandbox-config-1", "sandbox-config-2"]
  }'

Multi-Agent Comparison

Compare multiple agent configurations simultaneously:

curl -X POST "http://localhost:3000/api/v1/testsets/{testset_id}/run" \
  -H "Content-Type: application/json" \
  -d '{
    "selected_configs": [
      "customer-support-gpt4-config",
      "customer-support-claude-config", 
      "customer-support-mistral-config"
    ]
  }'
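
Comparison runs execute asynchronously, so scripts usually start the run and then poll the test-run endpoint until it leaves the Running state. A minimal Python sketch, assuming the run response carries an id and a status field as used by the shell script later in this guide:

import time
import requests

BASE = "http://localhost:3000/api/v1"
testset_id = "your-testset-id"  # placeholder

# Start a comparison run across several sandbox configurations
run = requests.post(
    f"{BASE}/testsets/{testset_id}/run",
    json={"selected_configs": [
        "customer-support-gpt4-config",
        "customer-support-claude-config",
        "customer-support-mistral-config"
    ]}
).json()

# Poll the test run until it finishes
while requests.get(f"{BASE}/testruns/{run['id']}").json()["status"] == "Running":
    time.sleep(30)

print("Comparison run finished:", run["id"])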

Analyzing Test Results

Test Run Summary

Each test run provides comprehensive metrics:

  • Overall Status: Passed/Failed/Warning

  • Case Statistics: Total, passed, failed, warnings

  • Performance Metrics: Average response time, token usage

  • Agent Comparison: Side-by-side performance data

Individual Case Results

Drill down into specific test cases to see:

  • Input/Output: Original prompt and agent response

  • Assertion Results: Pass/fail status for each validation

  • Execution Details: Tool calls, timing, token usage

  • Agent Comparison: How different agents performed on the same test

Performance Analysis

Key metrics to monitor (a sketch for computing several of them from run results follows these lists):

Response Quality Metrics

  • Pass Rate: Percentage of assertions that passed

  • Consistency: Variation in responses across multiple runs

  • Semantic Accuracy: How well responses match expected meaning

Performance Metrics

  • Response Time: Average and 95th percentile latency

  • Token Efficiency: Input/output token ratio

  • Tool Usage: Frequency and appropriateness of tool calls

Cost Metrics

  • Token Cost: Total spending per test run

  • Cost per Test Case: Average cost across test cases

  • Model Efficiency: Cost-to-quality ratio
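
Several of these metrics can be computed directly from a finished run's results. A minimal sketch, assuming the result fields used by the monitoring example later in this guide (passed_cases, total_cases, total_time_ms, total_tokens) and an illustrative per-token price:

import requests

run_id = "your-test-run-id"  # placeholder
results = requests.get(f"http://localhost:3000/api/v1/testruns/{run_id}").json()

pass_rate = results["passed_cases"] / results["total_cases"]
avg_response_time_ms = results["total_time_ms"] / results["total_cases"]
# Assumed pricing of $0.01 per 1K tokens; substitute your provider's rates
cost_per_case = results["total_tokens"] / 1000 * 0.01 / results["total_cases"]

print(f"pass rate: {pass_rate:.1%}")
print(f"avg response time: {avg_response_time_ms:.0f} ms")
print(f"cost per test case: ${cost_per_case:.4f}")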

Advanced Testing Strategies

1. Progressive Testing

Start with basic tests and gradually increase complexity:

Phase 1: Basic functionality tests
Phase 2: Edge case and error handling  
Phase 3: Performance and stress tests
Phase 4: Multi-agent comparison tests
Phase 5: Integration and end-to-end tests

2. Test Data Management

Synthetic Test Data Generation

# Generate test cases programmatically
# Reusable validation definitions (shapes follow the examples above)
content_validation = {"kind": {"ContentMatch": {"pattern": "(refund|shipping|product)"}}}
tool_validation = {"kind": {"ToolCall": {"tool_name": "search_knowledge_base"}}}

test_cases = [
    {
        "input": {"human_prompt": f"Handle {scenario} for customer support"},
        "validations": [content_validation, tool_validation]
    }
    for scenario in ["refund request", "shipping inquiry", "product question"]
]

Real Data Integration

Use anonymized real user interactions:

{
  "input": {
    "human_prompt": "[anonymized real user query]"
  },
  "validations": [
    {
      "kind": {
        "LlmV0": {
          "criteria": "Evaluate if this response would satisfy the original user intent"
        }
      }
    }
  ]
}

3. Regression Testing

Maintain test suites that prevent performance degradation:

Version Comparison Tests

  • Compare current agent performance to baseline versions

  • Track metrics over time to identify trends

  • Set up automated alerts for significant performance drops, as sketched below
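
A minimal alerting sketch, assuming the current pass rate is computed as shown earlier and the baseline value is recorded from a previously accepted run:

# Alert when the current pass rate drops noticeably below a stored baseline
BASELINE_PASS_RATE = 0.95   # recorded from a previous, accepted run
ALERT_THRESHOLD = 0.05      # alert on a drop of more than five percentage points

def check_regression(current_pass_rate: float) -> None:
    drop = BASELINE_PASS_RATE - current_pass_rate
    if drop > ALERT_THRESHOLD:
        # Hook in your alerting channel here (email, chat webhook, pager)
        print(f"Regression alert: pass rate dropped by {drop:.1%}")
    else:
        print("Pass rate is within the expected range of the baseline")

check_regression(0.88)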

Feature Regression Prevention

  • Test core functionality after each configuration change

  • Validate that new features don't break existing capabilities

  • Maintain comprehensive test coverage for critical paths

Continuous Testing Integration

Automated Test Execution

Set up automated testing workflows:

#!/bin/bash
# Daily regression test script

# Run core functionality tests
TEST_RUN_ID=$(curl -X POST "http://localhost:3000/api/v1/testsets/core-tests/run" \
  -H "Content-Type: application/json" \
  -d '{"selected_configs": ["production-config"]}' \
  | jq -r '.id')

# Wait for completion
while [ "$(curl -s "http://localhost:3000/api/v1/testruns/$TEST_RUN_ID" | jq -r '.status')" = "Running" ]; do
  sleep 30
done

# Check results and alert if failures
FAILED_CASES=$(curl -s "http://localhost:3000/api/v1/testruns/$TEST_RUN_ID" | jq '.failed_cases')
if [ "$FAILED_CASES" -gt 0 ]; then
  echo "❌ Test failures detected: $FAILED_CASES failed cases"
  # Send alert notification
else
  echo "✅ All tests passed"
fi

Performance Monitoring Integration

Connect test results to monitoring systems:

# Example: Send test metrics to monitoring
import requests

# Placeholder cost model: substitute your provider's actual per-token pricing
def calculate_token_cost(total_tokens, cost_per_1k_tokens=0.01):
    return total_tokens / 1000 * cost_per_1k_tokens

def send_test_metrics(test_run_results):
    metrics = {
        'test_pass_rate': test_run_results['passed_cases'] / test_run_results['total_cases'],
        'avg_response_time': test_run_results['total_time_ms'] / test_run_results['total_cases'],
        'total_cost': calculate_token_cost(test_run_results['total_tokens'])
    }

    # Send to the monitoring system's ingestion endpoint
    requests.post('http://monitoring:8080/metrics', json=metrics)

Quality Assurance Best Practices

1. Test Design Principles

Comprehensive Coverage

  • Test happy paths and edge cases

  • Include error scenarios and boundary conditions

  • Validate both functional and non-functional requirements

Realistic Test Data

  • Use representative real-world scenarios

  • Include diverse input types and formats

  • Test with different user personas and contexts

Clear Expectations

  • Define specific, measurable success criteria

  • Use appropriate validation types for each test goal

  • Document test intent and expected outcomes

2. Test Maintenance

Regular Review and Updates

  • Review test cases monthly for relevance

  • Update validations based on agent improvements

  • Remove obsolete tests and add new scenarios

Test Data Freshness

  • Refresh test datasets regularly

  • Incorporate new real-world scenarios

  • Update expected outcomes based on changing requirements

3. Result Interpretation

Understanding Metrics

  • Focus on trends rather than individual failures

  • Consider context when interpreting results

  • Use multiple validation types for comprehensive assessment

Action on Results

  • Investigate consistent failures promptly

  • Use comparison results to guide optimization

  • Document and share insights across the team

Troubleshooting Common Issues

Test Execution Problems

Tests Failing to Start

# Check test suite configuration
curl http://localhost:3000/api/v1/testsets/{testset_id}

# Verify sandbox configurations exist
# (check them in Test Studio or via your deployment's configuration endpoints)

Slow Test Execution

  • Check agent response times in monitoring

  • Reduce max_tokens if responses are too long

  • Verify tool providers are responding quickly

  • Consider using faster models for testing

Validation Issues

False Positives/Negatives

  • Review and refine validation criteria

  • Use multiple validation types for better accuracy

  • Consider semantic similarity for content validation

  • Test validation logic with known good/bad examples

Inconsistent Results

  • Check for non-deterministic agent behavior

  • Review temperature and other sampling parameters

  • Ensure test environment consistency

  • Consider multiple test runs for statistical significance, as sketched below
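
A minimal sketch of repeating a suite several times and inspecting the spread of pass rates, reusing the run-and-poll pattern and result fields shown earlier (the testset ID and configuration name are placeholders):

import statistics
import time
import requests

BASE = "http://localhost:3000/api/v1"
testset_id = "core-tests"  # placeholder

pass_rates = []
for _ in range(5):
    run = requests.post(f"{BASE}/testsets/{testset_id}/run",
                        json={"selected_configs": ["production-config"]}).json()
    while requests.get(f"{BASE}/testruns/{run['id']}").json()["status"] == "Running":
        time.sleep(30)
    results = requests.get(f"{BASE}/testruns/{run['id']}").json()
    pass_rates.append(results["passed_cases"] / results["total_cases"])

# A large spread suggests non-deterministic behavior worth investigating
print("mean pass rate:", statistics.mean(pass_rates))
print("std deviation:", statistics.pstdev(pass_rates))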

Integration with Development Workflow

Pre-deployment Testing

# Example CI/CD integration (GitHub Actions)
name: Agent Testing
on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start CoAgent
        run: docker-compose up -d

      - name: Run Core Tests
        run: |
          TEST_RUN_ID=$(curl -s -X POST "http://localhost:3000/api/v1/testsets/core-suite/run" \
            -H "Content-Type: application/json" \
            -d '{"selected_configs": ["production-config"]}' | jq -r '.id')
          # Wait for completion and check results (see the polling loop in the daily regression script above)

      - name: Run Regression Tests
        run: |
          # Execute the regression test suite the same way, using its testset ID

Performance Benchmarking

Establish baseline performance metrics:

# Benchmark different configurations
# (run_test_suite is a helper you would implement around the
#  POST /api/v1/testsets/{id}/run endpoint and the polling pattern shown above)
configs_to_test = [
    "gpt-4-conservative",
    "gpt-4-balanced",
    "gpt-3.5-fast",
    "claude-3-sonnet"
]

benchmark_results = {}
for config in configs_to_test:
    result = run_test_suite("benchmark-suite", [config])
    benchmark_results[config] = {
        'pass_rate': result.pass_rate,
        'avg_response_time': result.avg_response_time,
        'cost_per_test': result.cost_per_test
    }

# Compare the results and choose the optimal configuration

Next Steps