Multi-agent testing using Python
This tutorial guides you through creating a comprehensive multi-agent testing pipeline using CoAgent's Test Studio. You'll learn to compare different agent configurations, analyze performance metrics, and build automated testing workflows.
What You'll Build
By the end of this tutorial, you'll have:
- A complete test suite with multiple validation types 
- Automated testing pipeline comparing different agent configurations 
- Performance analysis dashboard and reporting 
- Continuous integration testing setup 
- Best practices for multi-agent evaluation 
Prerequisites
- CoAgent running locally (docker-compose up); a quick connectivity check follows this list
- Completed Getting Started Guide 
- Basic understanding of agent configurations 
- Familiarity with Test Studio interface (recommended) 
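Before moving on, it helps to confirm the local instance is actually reachable. The same /health endpoint that the CI workflow in Phase 6 polls works as a quick manual check:

```bash
# Quick sanity check that CoAgent is up (same /health endpoint the CI workflow polls)
curl -f http://localhost:3000/health
```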
Tutorial Overview
- Phase 1: Test Suite Design & Creation
- Phase 2: Multi-Agent Configuration Setup
- Phase 3: Comprehensive Test Cases
- Phase 4: Automated Test Execution
- Phase 5: Performance Analysis & Comparison
- Phase 6: CI/CD Integration
- Phase 7: Advanced Testing Strategies
Phase 1: Test Suite Design & Creation
1.1 Define Testing Objectives
For this tutorial, we'll create a customer support agent testing pipeline that evaluates:
- Response Quality: How helpful and accurate are the responses? 
- Consistency: Do agents give similar quality across different prompts? 
- Performance: Response times and token efficiency 
- Tool Usage: Appropriate use of available tools 
- Error Handling: Graceful handling of edge cases 
1.2 Create the Test Suite Structure
Navigate to the Test Studio at http://localhost:3000/test-studio and create a new test suite:
{ "name": "Customer Support Agent Evaluation", "description": "Comprehensive testing suite for customer support agent configurations across different scenarios and performance metrics", "test_categories": [ "basic_functionality", "edge_cases", "performance", "tool_integration", "error_handling" ] }
1.3 Design Test Case Templates
Create a structured approach to test case design:
Template 1: Basic Functionality Tests
{ "category": "basic_functionality", "input_template": "[SCENARIO] for customer support", "validations": [ {"type": "content_match", "criteria": "helpful_response_patterns"}, {"type": "response_time", "max_seconds": 5}, {"type": "semantic_similarity", "benchmark": "ideal_response"} ] }
Template 2: Tool Integration Tests
{ "category": "tool_integration", "input_template": "[SCENARIO] requiring [TOOL_NAME]", "validations": [ {"type": "tool_call", "expected_tool": "[TOOL_NAME]"}, {"type": "response_schema", "schema": "tool_response_format"}, {"type": "content_match", "criteria": "tool_result_usage"} ] }
Phase 2: Multi-Agent Configuration Setup
2.1 Create Test Agent Configurations
We'll create multiple agent configurations to compare:
Agent 1: Conservative Support (GPT-4)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "conservative-support",
    "description": "Conservative customer support agent focused on accuracy",
    "preamble": "You are a careful, professional customer support agent. Always verify information before responding. Prioritize accuracy over speed. When in doubt, escalate to human support."
  }'
```
Agent 2: Fast Support (GPT-3.5 Turbo)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fast-support",
    "description": "Quick-response customer support agent for high volume",
    "preamble": "You are an efficient customer support agent. Provide quick, helpful responses. Be concise but friendly. Handle common issues directly."
  }'
```
Agent 3: Empathetic Support (Claude-3 Sonnet)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "empathetic-support",
    "description": "Empathetic customer support agent focused on customer satisfaction",
    "preamble": "You are a warm, empathetic customer support agent. Focus on understanding the customer'\''s feelings and concerns. Provide emotional support along with practical solutions."
  }'
```
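To double-check that all three configurations exist before binding them to providers, listing the agents is a reasonable verification step. This assumes the agents endpoint also supports GET, which the tutorial does not show explicitly:

```bash
# Assumption: a GET on the same path lists the agent configurations created above
curl http://localhost:3000/api/v1/agents
```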
2.2 Set Up Model Providers
Create model providers for different LLM services:
```bash
# OpenAI Provider
curl -X POST http://localhost:3000/api/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "OpenAI Production",
    "provider_type": "openai",
    "api_key": "your-openai-key",
    "available_models": ["gpt-4", "gpt-3.5-turbo"]
  }'

# Anthropic Provider
curl -X POST http://localhost:3000/api/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Anthropic Claude",
    "provider_type": "anthropic",
    "api_key": "your-anthropic-key",
    "available_models": ["claude-3-sonnet", "claude-3-haiku"]
  }'
```
2.3 Create Bound Agents
Link agent configurations with model providers:
```bash
# Conservative Support with GPT-4
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "conservative-support-gpt4",
    "description": "Conservative support agent using GPT-4",
    "agent_config_name": "conservative-support",
    "model_provider_name": "OpenAI Production"
  }'

# Fast Support with GPT-3.5
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fast-support-gpt35",
    "description": "Fast support agent using GPT-3.5 Turbo",
    "agent_config_name": "fast-support",
    "model_provider_name": "OpenAI Production"
  }'

# Empathetic Support with Claude
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "empathetic-support-claude",
    "description": "Empathetic support agent using Claude-3 Sonnet",
    "agent_config_name": "empathetic-support",
    "model_provider_name": "Anthropic Claude"
  }'
```
2.4 Create Sandbox Configurations
Set up different runtime environments for testing:
```python
#!/usr/bin/env python3
"""
Create sandbox configurations for multi-agent testing
"""
import requests
import json


def create_sandbox_configs():
    """Create sandbox configurations for different testing scenarios."""
    base_url = "http://localhost:3000/api/v1/sandbox-configs"

    configs = [
        {
            "name": "Conservative Testing Environment",
            "description": "High accuracy, longer response times acceptable",
            "system_prompt": "Focus on providing accurate, well-researched responses. Take time to verify information.",
            "parameters": {
                "temperature": 0.2,
                "max_tokens": 1024,
                "top_p": 0.8
            },
            "tools": [],
            "category": "accuracy_focused"
        },
        {
            "name": "Fast Response Environment",
            "description": "Quick responses, optimized for speed",
            "system_prompt": "Provide quick, efficient responses. Be concise and direct.",
            "parameters": {
                "temperature": 0.3,
                "max_tokens": 512,
                "top_p": 0.9
            },
            "tools": [],
            "category": "speed_focused"
        },
        {
            "name": "Balanced Environment",
            "description": "Balanced approach between accuracy and speed",
            "system_prompt": "Balance accuracy with efficiency. Provide helpful responses in reasonable time.",
            "parameters": {
                "temperature": 0.4,
                "max_tokens": 768,
                "top_p": 0.95
            },
            "tools": [
                {
                    "id": "knowledge-base-tools",
                    "tool_names": ["search_kb", "get_policy"]
                }
            ],
            "category": "balanced"
        }
    ]

    created_configs = []
    for config in configs:
        response = requests.post(base_url, json=config)
        if response.status_code == 200:
            created_configs.append(response.json())
            print(f"✅ Created config: {config['name']}")
        else:
            print(f"❌ Failed to create config: {config['name']}")
            print(f"   Error: {response.text}")

    return created_configs


if __name__ == "__main__":
    create_sandbox_configs()
```
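The Phase 4 test runner needs the IDs of these sandbox configurations. The script above prints each creation response; if you need to look them up later, listing the configs should work, assuming the endpoint supports GET (only POST is shown in this tutorial):

```bash
# Assumption: a GET returns the sandbox configs created above, including their IDs
curl http://localhost:3000/api/v1/sandbox-configs
```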
Phase 3: Comprehensive Test Cases
3.1 Create Basic Functionality Tests
#!/usr/bin/env python3 """ Create comprehensive test cases for customer support agent evaluation """ import requests import json from typing import List, Dict class TestCaseBuilder: """Builder for creating structured test cases.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def create_test_suite(self, name: str, description: str) -> str: """Create a new test suite and return its ID.""" response = requests.post(f"{self.base_url}/testsets", json={ "name": name, "description": description, "cases": [] }) if response.status_code == 200: suite_id = response.json()["id_testset"] print(f"ā Created test suite: {name} (ID: {suite_id})") return suite_id else: raise Exception(f"Failed to create test suite: {response.text}") def add_basic_functionality_tests(self, suite_id: str) -> List[Dict]: """Add basic functionality test cases.""" test_cases = [ { "name": "Product Return Inquiry", "input": "I want to return a product I bought last week", "validations": [ { "type": "ContentMatch", "pattern": "(return policy|return process|refund|exchange)" }, { "type": "ResponseTime", "max_seconds": 5 }, { "type": "SemanticSimilarity", "sentence": "I can help you with your return. Let me guide you through our return process.", "threshold": 0.7 } ] }, { "name": "Order Status Check", "input": "Can you check the status of my order #12345?", "validations": [ { "type": "ToolCall", "tool_name": "order_lookup" }, { "type": "ContentMatch", "pattern": "(order|status|tracking|shipped|delivered)" } ] }, { "name": "Billing Question", "input": "I was charged twice for the same order", "validations": [ { "type": "ContentMatch", "pattern": "(billing|charge|refund|investigate|resolve)" }, { "type": "LlmV0", "criteria": "Rate the empathy and helpfulness of the response on a scale of 1-10. A score of 7 or higher passes." } ] }, { "name": "Product Information Request", "input": "What are the technical specifications of the XZ-100 model?", "validations": [ { "type": "ToolCall", "tool_name": "product_lookup" }, { "type": "ResponseSchema", "schema": { "type": "object", "properties": { "product_name": {"type": "string"}, "specifications": {"type": "object"} } } } ] } ] return self._add_test_cases(suite_id, test_cases) def add_edge_case_tests(self, suite_id: str) -> List[Dict]: """Add edge case and error handling test cases.""" edge_cases = [ { "name": "Angry Customer", "input": "This is ridiculous! Your product is broken and your service is terrible!", "validations": [ { "type": "LlmV0", "criteria": "Rate how well the response de-escalates the situation and shows empathy (1-10). Score 7+ passes." }, { "type": "ContentMatch", "pattern": "(understand|sorry|apologize|help|resolve)" } ] }, { "name": "Ambiguous Request", "input": "I have a problem with my thing", "validations": [ { "type": "ContentMatch", "pattern": "(clarify|more information|specific|help understand)" }, { "type": "LlmV0", "criteria": "Does the response appropriately ask for clarification? (Yes/No)" } ] }, { "name": "Out of Scope Request", "input": "Can you help me with my tax returns?", "validations": [ { "type": "ContentMatch", "pattern": "(outside|scope|not able|cannot help|tax professional)" } ] }, { "name": "Multiple Issues", "input": "I need to return a product, update my address, and cancel my subscription", "validations": [ { "type": "ContentMatch", "pattern": "(return|address|subscription)" }, { "type": "LlmV0", "criteria": "Does the response address all three issues mentioned? 
(Yes/No)" } ] } ] return self._add_test_cases(suite_id, edge_cases) def add_performance_tests(self, suite_id: str) -> List[Dict]: """Add performance-focused test cases.""" performance_tests = [ { "name": "Quick Response Test", "input": "What are your business hours?", "validations": [ { "type": "ResponseTime", "max_seconds": 2 }, { "type": "ContentMatch", "pattern": "(hours|open|close|Monday|business)" } ] }, { "name": "Complex Query Efficiency", "input": "I need to return a product that was a gift, but I don't have the receipt, and it was purchased with store credit from a previous return. Can you help?", "validations": [ { "type": "ResponseTime", "max_seconds": 8 }, { "type": "LlmV0", "criteria": "Does the response efficiently address the complex return scenario? Rate 1-10, need 7+." } ] } ] return self._add_test_cases(suite_id, performance_tests) def _add_test_cases(self, suite_id: str, test_cases: List[Dict]) -> List[Dict]: """Helper method to add test cases to a suite.""" added_cases = [] for test_case in test_cases: # Format for CoAgent API formatted_case = { "input": { "human_prompt": test_case["input"] }, "validations": [] } # Convert validation types to CoAgent format for validation in test_case["validations"]: if validation["type"] == "ContentMatch": formatted_case["validations"].append({ "kind": { "ContentMatch": { "pattern": validation["pattern"] } } }) elif validation["type"] == "ResponseTime": formatted_case["validations"].append({ "kind": { "ResponseTime": { "max_seconds": validation["max_seconds"] } } }) elif validation["type"] == "ToolCall": formatted_case["validations"].append({ "kind": { "ToolCall": { "tool_name": validation["tool_name"] } } }) elif validation["type"] == "SemanticSimilarity": formatted_case["validations"].append({ "kind": { "SemanticSimilarity": { "sentence": validation["sentence"], "threshold": validation["threshold"] } } }) elif validation["type"] == "ResponseSchema": formatted_case["validations"].append({ "kind": { "ResponseSchema": { "schema": validation["schema"] } } }) elif validation["type"] == "LlmV0": formatted_case["validations"].append({ "kind": { "LlmV0": { "llm0": { "model_ref": { "provider_id": "openai-eval", "model_name": "gpt-4" }, "criteria": validation["criteria"] } } } }) # Add to existing test suite response = requests.get(f"{self.base_url}/testsets/{suite_id}") if response.status_code == 200: suite_data = response.json() suite_data["cases"].append(formatted_case) # Update the suite update_response = requests.put(f"{self.base_url}/testsets/{suite_id}", json=suite_data) if update_response.status_code == 200: added_cases.append(test_case) print(f"ā Added test case: {test_case['name']}") else: print(f"ā Failed to add test case: {test_case['name']}") return added_cases def main(): """Create the complete test suite.""" builder = TestCaseBuilder() # Create main test suite suite_id = builder.create_test_suite( "Customer Support Multi-Agent Evaluation", "Comprehensive test suite for comparing customer support agent configurations across quality, performance, and reliability metrics" ) # Add different categories of test cases basic_cases = builder.add_basic_functionality_tests(suite_id) edge_cases = builder.add_edge_case_tests(suite_id) performance_cases = builder.add_performance_tests(suite_id) total_cases = len(basic_cases) + len(edge_cases) + len(performance_cases) print(f"\nš Test suite created successfully!") print(f" Suite ID: {suite_id}") print(f" Total test cases: {total_cases}") print(f" Basic functionality: {len(basic_cases)}") 
print(f" Edge cases: {len(edge_cases)}") print(f" Performance tests: {len(performance_cases)}") if __name__ == "__main__": main()
3.2 Run the Test Case Creation Script
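Save the script above (for example as create_test_cases.py; the filename is only a suggestion) and run it against your local instance. Its only dependency is requests:

```bash
pip install requests
python create_test_cases.py   # prints the new suite ID; note it for Phase 4
```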
Phase 4: Automated Test Execution
4.1 Create Test Execution Pipeline
#!/usr/bin/env python3 """ Automated multi-agent testing pipeline """ import requests import time import json from typing import Dict, List from dataclasses import dataclass from datetime import datetime @dataclass class TestConfiguration: """Configuration for a test run.""" suite_id: str sandbox_configs: List[str] description: str expected_agents: List[str] @dataclass class TestResults: """Results from a test execution.""" run_id: str suite_id: str status: str total_cases: int passed_cases: int failed_cases: int warning_cases: int execution_time_ms: int agent_results: List[Dict] class MultiAgentTestRunner: """Automated test runner for multi-agent comparison.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def run_multi_agent_test(self, config: TestConfiguration) -> TestResults: """Execute tests across multiple agent configurations.""" print(f"š Starting multi-agent test: {config.description}") print(f" Suite ID: {config.suite_id}") print(f" Sandbox configs: {config.sandbox_configs}") # Start the test run start_time = time.time() response = requests.post( f"{self.base_url}/testsets/{config.suite_id}/run", json={"selected_configs": config.sandbox_configs} ) if response.status_code != 200: raise Exception(f"Failed to start test run: {response.text}") run_data = response.json() run_id = run_data.get("id") or run_data.get("run_id") print(f" Run ID: {run_id}") # Monitor test execution print("ā³ Monitoring test execution...") results = self._monitor_test_execution(run_id) execution_time = time.time() - start_time print(f"ā Test execution completed in {execution_time:.2f}s") return results def _monitor_test_execution(self, run_id: str) -> TestResults: """Monitor test execution until completion.""" last_status = None start_time = time.time() while True: response = requests.get(f"{self.base_url}/testruns/{run_id}") if response.status_code != 200: raise Exception(f"Failed to get test run status: {response.text}") run_data = response.json() status = run_data.get("status") if status != last_status: print(f" Status: {status}") last_status = status if status in ["Passed", "Failed", "Warning"]: # Test completed return TestResults( run_id=run_id, suite_id=run_data.get("suite_id"), status=status, total_cases=run_data.get("total_cases", 0), passed_cases=run_data.get("passed_cases", 0), failed_cases=run_data.get("failed_cases", 0), warning_cases=run_data.get("warning_cases", 0), execution_time_ms=run_data.get("total_time_ms", 0), agent_results=run_data.get("agent_results", []) ) elif status == "Running": # Still running, wait and check again time.sleep(10) # Print progress update elapsed = time.time() - start_time if elapsed > 0 and run_data.get("total_cases", 0) > 0: completed = run_data.get("passed_cases", 0) + run_data.get("failed_cases", 0) progress = completed / run_data.get("total_cases") * 100 print(f" Progress: {progress:.1f}% ({completed}/{run_data.get('total_cases')}) - {elapsed:.1f}s elapsed") else: raise Exception(f"Unexpected test status: {status}") def generate_comparison_report(self, results: TestResults) -> Dict: """Generate a detailed comparison report.""" report = { "summary": { "run_id": results.run_id, "status": results.status, "total_cases": results.total_cases, "execution_time_seconds": results.execution_time_ms / 1000, "overall_success_rate": results.passed_cases / results.total_cases if results.total_cases > 0 else 0 }, "agent_comparison": [], "performance_metrics": {}, "recommendations": [] } # Analyze each agent's performance for 
agent_result in results.agent_results: agent_analysis = self._analyze_agent_performance(agent_result) report["agent_comparison"].append(agent_analysis) # Generate performance insights report["performance_metrics"] = self._calculate_performance_metrics(results) # Generate recommendations report["recommendations"] = self._generate_recommendations(results) return report def _analyze_agent_performance(self, agent_result: Dict) -> Dict: """Analyze individual agent performance.""" total_cases = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total_cases if total_cases > 0 else 0 # Analyze case results for more detailed metrics case_results = agent_result.get("case_results", []) response_times = [case.get("response_time_ms", 0) for case in case_results] token_usage = [case.get("total_tokens", 0) for case in case_results] return { "sandbox_config_id": agent_result.get("sandbox_config_id"), "success_rate": success_rate, "total_cases": total_cases, "passed": agent_result.get("passed", 0), "failed": agent_result.get("failed", 0), "warnings": agent_result.get("warnings", 0), "performance": { "avg_response_time_ms": sum(response_times) / len(response_times) if response_times else 0, "max_response_time_ms": max(response_times) if response_times else 0, "avg_tokens": sum(token_usage) / len(token_usage) if token_usage else 0, "total_tokens": sum(token_usage) } } def _calculate_performance_metrics(self, results: TestResults) -> Dict: """Calculate cross-agent performance metrics.""" all_response_times = [] all_token_usage = [] for agent_result in results.agent_results: case_results = agent_result.get("case_results", []) all_response_times.extend([case.get("response_time_ms", 0) for case in case_results]) all_token_usage.extend([case.get("total_tokens", 0) for case in case_results]) return { "response_time": { "average_ms": sum(all_response_times) / len(all_response_times) if all_response_times else 0, "median_ms": sorted(all_response_times)[len(all_response_times)//2] if all_response_times else 0, "p95_ms": sorted(all_response_times)[int(len(all_response_times)*0.95)] if all_response_times else 0 }, "token_usage": { "average_per_request": sum(all_token_usage) / len(all_token_usage) if all_token_usage else 0, "total_tokens": sum(all_token_usage) } } def _generate_recommendations(self, results: TestResults) -> List[str]: """Generate recommendations based on test results.""" recommendations = [] # Find best performing agent best_agent = None best_success_rate = 0 for agent_result in results.agent_results: total = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total if total > 0 else 0 if success_rate > best_success_rate: best_success_rate = success_rate best_agent = agent_result.get("sandbox_config_id") if best_agent: recommendations.append(f"Best performing configuration: {best_agent} with {best_success_rate:.1%} success rate") # Performance recommendations if results.execution_time_ms > 300000: # 5 minutes recommendations.append("Consider optimizing response times - tests took longer than expected") if results.failed_cases > results.total_cases * 0.1: # >10% failure rate recommendations.append("High failure rate detected - review failed test cases and agent configurations") return recommendations def main(): """Run the multi-agent testing pipeline.""" runner = MultiAgentTestRunner() # Define test configuration 
test_config = TestConfiguration( suite_id="your-test-suite-id", # Replace with actual suite ID sandbox_configs=[ "conservative-config-id", "fast-config-id", "balanced-config-id" ], description="Customer Support Agent Comparison Test", expected_agents=["conservative-support-gpt4", "fast-support-gpt35", "empathetic-support-claude"] ) try: # Run the tests results = runner.run_multi_agent_test(test_config) # Generate and display report report = runner.generate_comparison_report(results) print("\n" + "="*60) print("šÆ MULTI-AGENT TEST RESULTS") print("="*60) print(f"\nš Summary:") print(f" Status: {results.status}") print(f" Total cases: {results.total_cases}") print(f" Passed: {results.passed_cases}") print(f" Failed: {results.failed_cases}") print(f" Warnings: {results.warning_cases}") print(f" Execution time: {results.execution_time_ms/1000:.2f}s") print(f"\nš¤ Agent Comparison:") for agent_analysis in report["agent_comparison"]: print(f" {agent_analysis['sandbox_config_id']}:") print(f" Success rate: {agent_analysis['success_rate']:.1%}") print(f" Avg response time: {agent_analysis['performance']['avg_response_time_ms']:.0f}ms") print(f" Avg tokens: {agent_analysis['performance']['avg_tokens']:.0f}") print(f"\nš” Recommendations:") for rec in report["recommendations"]: print(f" ⢠{rec}") # Save detailed report with open(f"test_report_{results.run_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f: json.dump(report, f, indent=2) print(f"\nš¾ Detailed report saved to: test_report_{results.run_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json") except Exception as e: print(f"ā Test execution failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
4.2 Execute Multi-Agent Tests
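Before running the pipeline, edit main() so that suite_id and sandbox_configs contain the real IDs from Phases 2 and 3, then execute the runner. The filename below matches the one referenced by the CI workflow in Phase 6:

```bash
# Replace the placeholder IDs in main() first
python multi_agent_test_runner.py
```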
Phase 5: Performance Analysis & Comparison
5.1 Create Performance Analysis Dashboard
#!/usr/bin/env python3 """ Performance analysis and visualization for multi-agent test results """ import json import matplotlib.pyplot as plt import pandas as pd import seaborn as sns from typing import Dict, List import requests class PerformanceAnalyzer: """Analyze and visualize multi-agent test performance.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def analyze_test_run(self, run_id: str) -> Dict: """Perform comprehensive analysis of a test run.""" # Get test run data response = requests.get(f"{self.base_url}/testruns/{run_id}") if response.status_code != 200: raise Exception(f"Failed to get test run: {response.text}") run_data = response.json() # Generate comprehensive analysis analysis = { "overview": self._analyze_overview(run_data), "agent_comparison": self._compare_agents(run_data), "performance_trends": self._analyze_performance_trends(run_data), "failure_analysis": self._analyze_failures(run_data), "recommendations": self._generate_insights(run_data) } return analysis def _analyze_overview(self, run_data: Dict) -> Dict: """Analyze overall test run metrics.""" total_cases = run_data.get("total_cases", 0) passed = run_data.get("passed_cases", 0) failed = run_data.get("failed_cases", 0) warnings = run_data.get("warning_cases", 0) return { "total_cases": total_cases, "success_rate": passed / total_cases if total_cases > 0 else 0, "failure_rate": failed / total_cases if total_cases > 0 else 0, "warning_rate": warnings / total_cases if total_cases > 0 else 0, "execution_time_minutes": run_data.get("total_time_ms", 0) / 60000, "total_tokens": run_data.get("total_tokens", 0), "agents_tested": len(run_data.get("agent_results", [])) } def _compare_agents(self, run_data: Dict) -> List[Dict]: """Compare performance across different agents.""" agent_comparisons = [] for agent_result in run_data.get("agent_results", []): config_id = agent_result.get("sandbox_config_id") # Calculate metrics total_cases = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total_cases if total_cases > 0 else 0 # Analyze case-level performance case_results = agent_result.get("case_results", []) response_times = [case.get("response_time_ms", 0) for case in case_results if case.get("response_time_ms")] token_counts = [case.get("total_tokens", 0) for case in case_results if case.get("total_tokens")] agent_comparisons.append({ "config_id": config_id, "success_rate": success_rate, "avg_response_time_ms": sum(response_times) / len(response_times) if response_times else 0, "p95_response_time_ms": sorted(response_times)[int(len(response_times) * 0.95)] if response_times else 0, "avg_tokens": sum(token_counts) / len(token_counts) if token_counts else 0, "total_tokens": sum(token_counts), "passed_cases": agent_result.get("passed", 0), "failed_cases": agent_result.get("failed", 0), "warning_cases": agent_result.get("warnings", 0) }) return sorted(agent_comparisons, key=lambda x: x["success_rate"], reverse=True) def _analyze_performance_trends(self, run_data: Dict) -> Dict: """Analyze performance trends and patterns.""" all_case_results = [] for agent_result in run_data.get("agent_results", []): for case in agent_result.get("case_results", []): case["agent_config"] = agent_result.get("sandbox_config_id") all_case_results.append(case) if not all_case_results: return {"error": "No case results available for trend analysis"} # Convert to DataFrame for analysis df = 
pd.DataFrame(all_case_results) trends = { "response_time_distribution": { "mean": df["response_time_ms"].mean() if "response_time_ms" in df else 0, "std": df["response_time_ms"].std() if "response_time_ms" in df else 0, "min": df["response_time_ms"].min() if "response_time_ms" in df else 0, "max": df["response_time_ms"].max() if "response_time_ms" in df else 0 }, "token_usage_distribution": { "mean": df["total_tokens"].mean() if "total_tokens" in df else 0, "std": df["total_tokens"].std() if "total_tokens" in df else 0, "min": df["total_tokens"].min() if "total_tokens" in df else 0, "max": df["total_tokens"].max() if "total_tokens" in df else 0 } } return trends def _analyze_failures(self, run_data: Dict) -> Dict: """Analyze failure patterns and common issues.""" failure_patterns = { "total_failures": 0, "failure_by_agent": {}, "common_failure_reasons": [], "failure_distribution": {} } for agent_result in run_data.get("agent_results", []): config_id = agent_result.get("sandbox_config_id") failed_cases = agent_result.get("failed", 0) failure_patterns["total_failures"] += failed_cases failure_patterns["failure_by_agent"][config_id] = failed_cases # Analyze individual case failures for case in agent_result.get("case_results", []): if case.get("status") == "Failed": failure_reason = case.get("details", "Unknown failure") failure_patterns["common_failure_reasons"].append(failure_reason) return failure_patterns def _generate_insights(self, run_data: Dict) -> List[str]: """Generate actionable insights from test results.""" insights = [] agent_results = run_data.get("agent_results", []) if not agent_results: return ["No agent results available for analysis"] # Find best and worst performing agents best_agent = max(agent_results, key=lambda x: x.get("passed", 0)) worst_agent = min(agent_results, key=lambda x: x.get("passed", 0)) best_config = best_agent.get("sandbox_config_id") worst_config = worst_agent.get("sandbox_config_id") insights.append(f"Best performing configuration: {best_config}") insights.append(f"Lowest performing configuration: {worst_config}") # Performance insights total_cases = run_data.get("total_cases", 0) overall_success_rate = run_data.get("passed_cases", 0) / total_cases if total_cases > 0 else 0 if overall_success_rate < 0.8: insights.append("Overall success rate is below 80% - consider reviewing agent configurations") elif overall_success_rate > 0.95: insights.append("Excellent success rate achieved - current configurations are performing well") # Time-based insights execution_time_ms = run_data.get("total_time_ms", 0) if execution_time_ms > 300000: # 5 minutes insights.append("Test execution time is high - consider optimizing response times or reducing test scope") return insights def create_performance_visualizations(self, analysis: Dict, output_dir: str = "./visualizations"): """Create visualizations for performance analysis.""" import os os.makedirs(output_dir, exist_ok=True) # Set style plt.style.use('default') sns.set_palette("husl") # 1. 
Agent Success Rate Comparison agent_data = analysis["agent_comparison"] if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) configs = [agent["config_id"] for agent in agent_data] success_rates = [agent["success_rate"] * 100 for agent in agent_data] bars = ax.bar(configs, success_rates) ax.set_title("Agent Success Rate Comparison", fontsize=16, fontweight='bold') ax.set_ylabel("Success Rate (%)", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) ax.set_ylim(0, 100) # Add value labels on bars for bar, rate in zip(bars, success_rates): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, f"{rate:.1f}%", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/success_rate_comparison.png", dpi=300, bbox_inches='tight') plt.close() # 2. Response Time Comparison if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) response_times = [agent["avg_response_time_ms"] for agent in agent_data] bars = ax.bar(configs, response_times) ax.set_title("Average Response Time by Agent", fontsize=16, fontweight='bold') ax.set_ylabel("Response Time (ms)", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) # Add value labels for bar, time in zip(bars, response_times): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(response_times)*0.01, f"{time:.0f}ms", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/response_time_comparison.png", dpi=300, bbox_inches='tight') plt.close() # 3. Token Usage Comparison if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) token_usage = [agent["avg_tokens"] for agent in agent_data] bars = ax.bar(configs, token_usage) ax.set_title("Average Token Usage by Agent", fontsize=16, fontweight='bold') ax.set_ylabel("Average Tokens per Request", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) # Add value labels for bar, tokens in zip(bars, token_usage): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(token_usage)*0.01, f"{tokens:.0f}", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/token_usage_comparison.png", dpi=300, bbox_inches='tight') plt.close() print(f"š Visualizations saved to {output_dir}/") def main(): """Run performance analysis on a test run.""" analyzer = PerformanceAnalyzer() # Replace with actual run ID run_id = input("Enter test run ID to analyze: ").strip() try: print(f"š Analyzing test run: {run_id}") analysis = analyzer.analyze_test_run(run_id) # Display results print("\n" + "="*60) print("š PERFORMANCE ANALYSIS RESULTS") print("="*60) overview = analysis["overview"] print(f"\nš Overview:") print(f" Total test cases: {overview['total_cases']}") print(f" Success rate: {overview['success_rate']:.1%}") print(f" Execution time: {overview['execution_time_minutes']:.1f} minutes") print(f" Total tokens used: {overview['total_tokens']:,}") print(f" Agents tested: {overview['agents_tested']}") print(f"\nš¤ Agent Rankings:") for i, agent in enumerate(analysis["agent_comparison"], 1): print(f" {i}. 
{agent['config_id']}") print(f" Success rate: {agent['success_rate']:.1%}") print(f" Avg response time: {agent['avg_response_time_ms']:.0f}ms") print(f" Avg tokens: {agent['avg_tokens']:.0f}") print(f"\nš” Key Insights:") for insight in analysis["recommendations"]: print(f" ⢠{insight}") # Generate visualizations try: analyzer.create_performance_visualizations(analysis) except ImportError: print("\nā ļø Visualization libraries not available (matplotlib, seaborn)") print(" Install with: pip install matplotlib seaborn pandas") # Save detailed analysis with open(f"performance_analysis_{run_id}.json", "w") as f: json.dump(analysis, f, indent=2) print(f"\nš¾ Detailed analysis saved to: performance_analysis_{run_id}.json") except Exception as e: print(f"ā Analysis failed: {str(e)}") if __name__ == "__main__": main()
Phase 6: CI/CD Integration
6.1 Create GitHub Actions Workflow
Create .github/workflows/agent-testing.yml:
```yaml
name: Multi-Agent Testing Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  schedule:
    # Run daily at 2 AM UTC
    - cron: '0 2 * * *'

jobs:
  multi-agent-test:
    runs-on: ubuntu-latest

    services:
      coagent:
        image: coagent:latest
        ports:
          - 3000:3000
        env:
          RUST_LOG: info

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install requests pandas matplotlib seaborn

      - name: Wait for CoAgent to start
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:3000/health; do sleep 2; done'

      - name: Set up test environment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python create_test_environment.py

      - name: Run multi-agent tests
        run: |
          # Don't abort the step on a non-zero exit; capture the code for the final gate instead
          set +e
          python multi_agent_test_runner.py > test_results.txt 2>&1
          echo "TEST_EXIT_CODE=$?" >> $GITHUB_ENV

      - name: Generate performance report
        if: always()
        run: |
          python performance_analyzer.py --automated

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: |
            test_results.txt
            test_report_*.json
            performance_analysis_*.json
            visualizations/

      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            try {
              const results = fs.readFileSync('test_results.txt', 'utf8');
              const lines = results.split('\n');
              const summary = lines.filter(line =>
                line.includes('Status:') ||
                line.includes('Success rate:') ||
                line.includes('Execution time:')
              ).join('\n');

              await github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: `## 🤖 Multi-Agent Test Results\n\n\`\`\`\n${summary}\n\`\`\`\n\nFull results available in the workflow artifacts.`
              });
            } catch (error) {
              console.log('Could not post results:', error);
            }

      - name: Fail if tests failed
        if: env.TEST_EXIT_CODE != '0'
        # Explicitly fail the job when the test runner returned a non-zero exit code
        run: exit 1
```
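The workflow reads the provider keys from repository secrets. One way to add them, assuming you use the GitHub CLI:

```bash
# Store provider API keys as repository secrets (values are prompted for)
gh secret set OPENAI_API_KEY
gh secret set ANTHROPIC_API_KEY
```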
6.2 Create Automated Environment Setup
Create create_test_environment.py:
```python
#!/usr/bin/env python3
"""
Automated test environment setup for CI/CD
"""
import os
import requests
import json
import time


def wait_for_coagent(url: str = "http://localhost:3000", timeout: int = 60):
    """Wait for CoAgent to be ready."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            response = requests.get(f"{url}/health")
            if response.status_code == 200:
                print("✅ CoAgent is ready")
                return True
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(2)
    raise Exception("CoAgent failed to start within timeout")


def setup_providers():
    """Set up model providers from environment variables."""
    providers = []

    # OpenAI
    openai_key = os.environ.get("OPENAI_API_KEY")
    if openai_key:
        providers.append({
            "name": "OpenAI CI",
            "provider_type": "openai",
            "api_key": openai_key,
            "available_models": ["gpt-4", "gpt-3.5-turbo"]
        })

    # Anthropic
    anthropic_key = os.environ.get("ANTHROPIC_API_KEY")
    if anthropic_key:
        providers.append({
            "name": "Anthropic CI",
            "provider_type": "anthropic",
            "api_key": anthropic_key,
            "available_models": ["claude-3-sonnet", "claude-3-haiku"]
        })

    # Create providers
    for provider in providers:
        response = requests.post("http://localhost:3000/api/v1/providers", json=provider)
        if response.status_code == 200:
            print(f"✅ Created provider: {provider['name']}")
        else:
            print(f"❌ Failed to create provider: {provider['name']}")
            print(f"   Error: {response.text}")


def setup_agents_and_configs():
    """Set up test agents and configurations."""
    # This would contain the same setup logic as previous phases,
    # simplified for the CI/CD environment
    agents = [
        {
            "name": "ci-test-agent-conservative",
            "description": "Conservative test agent for CI",
            "preamble": "You are a careful, accurate customer support agent."
        },
        {
            "name": "ci-test-agent-fast",
            "description": "Fast response test agent for CI",
            "preamble": "You are a quick, efficient customer support agent."
        }
    ]

    for agent in agents:
        response = requests.post("http://localhost:3000/api/v1/agents", json=agent)
        if response.status_code == 200:
            print(f"✅ Created agent: {agent['name']}")
        else:
            print(f"❌ Failed to create agent: {agent['name']}")


def main():
    """Set up complete test environment for CI/CD."""
    print("🚀 Setting up test environment for CI/CD...")

    # Wait for CoAgent to be ready
    wait_for_coagent()

    # Set up providers
    setup_providers()

    # Set up agents and configurations
    setup_agents_and_configs()

    print("✅ Test environment setup complete!")


if __name__ == "__main__":
    main()
```
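The same script can also be run locally to rebuild a clean test environment, as long as the API keys are exported first:

```bash
# Run outside CI; the script reads the keys from the environment
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
python create_test_environment.py
```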
Phase 7: Advanced Testing Strategies
7.1 Load Testing for Multi-Agent Scenarios
#!/usr/bin/env python3 """ Load testing for multi-agent scenarios """ import asyncio import aiohttp import time import json from typing import List, Dict from dataclasses import dataclass import statistics @dataclass class LoadTestConfig: """Configuration for load testing.""" concurrent_requests: int total_requests: int ramp_up_seconds: int test_prompts: List[str] agent_configs: List[str] class MultiAgentLoadTester: """Load testing framework for multi-agent scenarios.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url self.results = [] async def run_load_test(self, config: LoadTestConfig) -> Dict: """Execute load test across multiple agents.""" print(f"š Starting load test:") print(f" Concurrent requests: {config.concurrent_requests}") print(f" Total requests: {config.total_requests}") print(f" Agent configurations: {len(config.agent_configs)}") start_time = time.time() # Create semaphore to control concurrency semaphore = asyncio.Semaphore(config.concurrent_requests) # Generate request tasks tasks = [] for i in range(config.total_requests): # Round-robin through agents and prompts agent_config = config.agent_configs[i % len(config.agent_configs)] prompt = config.test_prompts[i % len(config.test_prompts)] # Add ramp-up delay delay = (i / config.total_requests) * config.ramp_up_seconds task = asyncio.create_task( self._make_request(semaphore, agent_config, prompt, delay, i) ) tasks.append(task) # Execute all tasks results = await asyncio.gather(*tasks, return_exceptions=True) total_time = time.time() - start_time # Analyze results analysis = self._analyze_load_test_results(results, total_time) return analysis async def _make_request(self, semaphore: asyncio.Semaphore, agent_config: str, prompt: str, delay: float, request_id: int) -> Dict: """Make a single request with timing and error tracking.""" # Wait for ramp-up delay if delay > 0: await asyncio.sleep(delay) async with semaphore: request_start = time.time() try: async with aiohttp.ClientSession() as session: # Create a simple evaluation request payload = { "run_id": f"load-test-{request_id}", "preamble": "You are a helpful customer support agent.", "prompt": prompt, "model_ref": { "provider_id": "openai-ci", "model_name": "gpt-3.5-turbo" }, "log_meta": {"load_test": True, "agent_config": agent_config} } async with session.post(f"{self.base_url}/evals", json=payload) as response: response_time = time.time() - request_start if response.status == 200: data = await response.json() return { "success": True, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "token_usage": data.get("token_usage", {}), "response_length": len(data.get("response", "")) } else: return { "success": False, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "error": f"HTTP {response.status}", "error_detail": await response.text() } except Exception as e: response_time = time.time() - request_start return { "success": False, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "error": str(e) } def _analyze_load_test_results(self, results: List, total_time: float) -> Dict: """Analyze load test results and generate metrics.""" successful_requests = [r for r in results if isinstance(r, dict) and r.get("success")] failed_requests = [r for r in results if isinstance(r, dict) and not r.get("success")] exceptions = [r for r in results if isinstance(r, Exception)] analysis = { "summary": { "total_requests": len(results), 
"successful_requests": len(successful_requests), "failed_requests": len(failed_requests) + len(exceptions), "success_rate": len(successful_requests) / len(results), "total_test_time": total_time, "requests_per_second": len(results) / total_time }, "performance_metrics": {}, "error_analysis": {}, "agent_comparison": {} } if successful_requests: response_times = [r["response_time"] for r in successful_requests] analysis["performance_metrics"] = { "avg_response_time": statistics.mean(response_times), "median_response_time": statistics.median(response_times), "p95_response_time": statistics.quantiles(response_times, n=20)[18] if len(response_times) > 20 else max(response_times), "min_response_time": min(response_times), "max_response_time": max(response_times), "response_time_std": statistics.stdev(response_times) if len(response_times) > 1 else 0 } # Analyze errors error_types = {} for failed_request in failed_requests + exceptions: if isinstance(failed_request, Exception): error_type = type(failed_request).__name__ else: error_type = failed_request.get("error", "Unknown") error_types[error_type] = error_types.get(error_type, 0) + 1 analysis["error_analysis"] = { "error_types": error_types, "most_common_error": max(error_types.items(), key=lambda x: x[1])[0] if error_types else None } # Agent-specific analysis agent_performance = {} for request in successful_requests: agent_config = request["agent_config"] if agent_config not in agent_performance: agent_performance[agent_config] = [] agent_performance[agent_config].append(request["response_time"]) for agent, response_times in agent_performance.items(): analysis["agent_comparison"][agent] = { "request_count": len(response_times), "avg_response_time": statistics.mean(response_times), "success_rate": len(response_times) / len([r for r in results if r.get("agent_config") == agent]) } return analysis def main(): """Run multi-agent load test.""" config = LoadTestConfig( concurrent_requests=10, total_requests=100, ramp_up_seconds=30, test_prompts=[ "I need help with returning a product", "What's the status of my order?", "I was charged incorrectly", "Can you help me find a product?", "I need to update my account information" ], agent_configs=[ "conservative-config", "fast-config", "balanced-config" ] ) tester = MultiAgentLoadTester() # Run the load test try: analysis = asyncio.run(tester.run_load_test(config)) print("\n" + "="*60) print("ā” LOAD TEST RESULTS") print("="*60) summary = analysis["summary"] print(f"\nš Summary:") print(f" Total requests: {summary['total_requests']}") print(f" Successful: {summary['successful_requests']}") print(f" Failed: {summary['failed_requests']}") print(f" Success rate: {summary['success_rate']:.1%}") print(f" Requests/second: {summary['requests_per_second']:.2f}") print(f" Total time: {summary['total_test_time']:.2f}s") if analysis["performance_metrics"]: perf = analysis["performance_metrics"] print(f"\nā±ļø Performance:") print(f" Avg response time: {perf['avg_response_time']:.2f}s") print(f" Median response time: {perf['median_response_time']:.2f}s") print(f" 95th percentile: {perf['p95_response_time']:.2f}s") print(f" Min/Max: {perf['min_response_time']:.2f}s / {perf['max_response_time']:.2f}s") if analysis["agent_comparison"]: print(f"\nš¤ Agent Performance:") for agent, metrics in analysis["agent_comparison"].items(): print(f" {agent}:") print(f" Requests: {metrics['request_count']}") print(f" Avg time: {metrics['avg_response_time']:.2f}s") print(f" Success rate: {metrics['success_rate']:.1%}") if 
analysis["error_analysis"]["error_types"]: print(f"\nā Errors:") for error_type, count in analysis["error_analysis"]["error_types"].items(): print(f" {error_type}: {count}") # Save detailed results with open(f"load_test_results_{int(time.time())}.json", "w") as f: json.dump(analysis, f, indent=2) except Exception as e: print(f"ā Load test failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
7.2 A/B Testing Framework
#!/usr/bin/env python3 """ A/B Testing framework for comparing agent configurations """ import random import requests import json import time from typing import Dict, List, Tuple from dataclasses import dataclass import statistics @dataclass class ABTestConfig: """Configuration for A/B testing.""" variant_a_config: str variant_b_config: str test_prompts: List[str] sample_size: int significance_level: float = 0.05 class ABTester: """A/B testing framework for agent configurations.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def run_ab_test(self, config: ABTestConfig) -> Dict: """Run A/B test comparing two agent configurations.""" print(f"š¬ Starting A/B Test:") print(f" Variant A: {config.variant_a_config}") print(f" Variant B: {config.variant_b_config}") print(f" Sample size: {config.sample_size} per variant") results_a = [] results_b = [] # Run tests for both variants total_tests = config.sample_size * 2 for i in range(total_tests): # Randomize assignment to variants if random.random() < 0.5 and len(results_a) < config.sample_size: variant = "A" variant_config = config.variant_a_config results_list = results_a elif len(results_b) < config.sample_size: variant = "B" variant_config = config.variant_b_config results_list = results_b else: variant = "A" variant_config = config.variant_a_config results_list = results_a # Random prompt selection prompt = random.choice(config.test_prompts) # Execute test result = self._execute_single_test(variant_config, prompt, variant, i) results_list.append(result) # Progress update if (i + 1) % 10 == 0: progress = (i + 1) / total_tests * 100 print(f" Progress: {progress:.1f}% ({len(results_a)} A, {len(results_b)} B)") # Analyze results analysis = self._analyze_ab_results(results_a, results_b, config) return analysis def _execute_single_test(self, config_id: str, prompt: str, variant: str, test_id: int) -> Dict: """Execute a single test case.""" start_time = time.time() try: # Create evaluation request payload = { "run_id": f"ab-test-{variant.lower()}-{test_id}", "preamble": "You are a helpful customer support agent.", "prompt": prompt, "model_ref": { "provider_id": "openai-ci", "model_name": "gpt-3.5-turbo" }, "log_meta": { "ab_test": True, "variant": variant, "config_id": config_id } } response = requests.post(f"{self.base_url}/evals", json=payload) response_time = time.time() - start_time if response.status_code == 200: data = response.json() return { "success": True, "variant": variant, "config_id": config_id, "prompt": prompt, "response": data.get("response", ""), "response_time": response_time, "token_usage": data.get("token_usage", {}), "test_id": test_id } else: return { "success": False, "variant": variant, "config_id": config_id, "error": f"HTTP {response.status_code}", "response_time": response_time, "test_id": test_id } except Exception as e: response_time = time.time() - start_time return { "success": False, "variant": variant, "config_id": config_id, "error": str(e), "response_time": response_time, "test_id": test_id } def _analyze_ab_results(self, results_a: List[Dict], results_b: List[Dict], config: ABTestConfig) -> Dict: """Analyze A/B test results with statistical significance testing.""" # Calculate success rates success_a = sum(1 for r in results_a if r["success"]) success_b = sum(1 for r in results_b if r["success"]) success_rate_a = success_a / len(results_a) if results_a else 0 success_rate_b = success_b / len(results_b) if results_b else 0 # Calculate response times for successful 
requests times_a = [r["response_time"] for r in results_a if r["success"]] times_b = [r["response_time"] for r in results_b if r["success"]] # Statistical significance testing (simplified) significance_test = self._simple_significance_test( success_a, len(results_a), success_b, len(results_b), config.significance_level ) analysis = { "test_config": { "variant_a": config.variant_a_config, "variant_b": config.variant_b_config, "sample_size_per_variant": config.sample_size, "significance_level": config.significance_level }, "results": { "variant_a": { "total_tests": len(results_a), "successful_tests": success_a, "success_rate": success_rate_a, "avg_response_time": statistics.mean(times_a) if times_a else 0, "median_response_time": statistics.median(times_a) if times_a else 0 }, "variant_b": { "total_tests": len(results_b), "successful_tests": success_b, "success_rate": success_rate_b, "avg_response_time": statistics.mean(times_b) if times_b else 0, "median_response_time": statistics.median(times_b) if times_b else 0 } }, "comparison": { "success_rate_difference": success_rate_b - success_rate_a, "response_time_difference": (statistics.mean(times_b) if times_b else 0) - (statistics.mean(times_a) if times_a else 0), "statistically_significant": significance_test["significant"], "confidence_level": 1 - config.significance_level, "winner": self._determine_winner(success_rate_a, success_rate_b, times_a, times_b, significance_test["significant"]) }, "recommendations": self._generate_ab_recommendations(success_rate_a, success_rate_b, times_a, times_b, significance_test) } return analysis def _simple_significance_test(self, successes_a: int, total_a: int, successes_b: int, total_b: int, alpha: float) -> Dict: """Simplified significance test for success rates.""" # This is a simplified implementation # In production, you'd want to use proper statistical libraries p_a = successes_a / total_a if total_a > 0 else 0 p_b = successes_b / total_b if total_b > 0 else 0 # Pool proportion for standard error calculation p_pool = (successes_a + successes_b) / (total_a + total_b) if (total_a + total_b) > 0 else 0 # Standard error se = (p_pool * (1 - p_pool) * (1/total_a + 1/total_b)) ** 0.5 if p_pool > 0 and total_a > 0 and total_b > 0 else 0 # Z-score z_score = (p_b - p_a) / se if se > 0 else 0 # Critical value for two-tailed test (simplified) critical_value = 1.96 if alpha == 0.05 else 2.58 # approximation significant = abs(z_score) > critical_value return { "significant": significant, "z_score": z_score, "p_value_approx": 2 * (1 - abs(z_score)/2) if abs(z_score) < 2 else 0.05, # very rough approximation "critical_value": critical_value } def _determine_winner(self, success_rate_a: float, success_rate_b: float, times_a: List[float], times_b: List[float], significant: bool) -> str: """Determine the winning variant based on multiple criteria.""" if not significant: return "No significant difference" # Primary criterion: success rate if success_rate_b > success_rate_a: primary_winner = "B" elif success_rate_a > success_rate_b: primary_winner = "A" else: primary_winner = "Tie" # Secondary criterion: response time (only if success rates are close) if abs(success_rate_b - success_rate_a) < 0.05: # Less than 5% difference avg_time_a = statistics.mean(times_a) if times_a else float('inf') avg_time_b = statistics.mean(times_b) if times_b else float('inf') if avg_time_a < avg_time_b: return f"{primary_winner} (A faster)" if primary_winner == "Tie" else f"{primary_winner} (also faster)" elif avg_time_b < avg_time_a: return 
f"{primary_winner} (B faster)" if primary_winner == "Tie" else f"{primary_winner} (also faster)" return primary_winner def _generate_ab_recommendations(self, success_rate_a: float, success_rate_b: float, times_a: List[float], times_b: List[float], significance_test: Dict) -> List[str]: """Generate actionable recommendations from A/B test results.""" recommendations = [] if significance_test["significant"]: if success_rate_b > success_rate_a: improvement = ((success_rate_b - success_rate_a) / success_rate_a) * 100 recommendations.append(f"Variant B shows {improvement:.1f}% improvement in success rate - recommend deployment") elif success_rate_a > success_rate_b: improvement = ((success_rate_a - success_rate_b) / success_rate_b) * 100 recommendations.append(f"Variant A shows {improvement:.1f}% improvement in success rate - recommend keeping current config") else: recommendations.append("No statistically significant difference found - consider longer test or different metrics") # Response time recommendations if times_a and times_b: avg_time_a = statistics.mean(times_a) avg_time_b = statistics.mean(times_b) time_diff = abs(avg_time_b - avg_time_a) if time_diff > 0.5: # More than 0.5 second difference faster_variant = "A" if avg_time_a < avg_time_b else "B" recommendations.append(f"Variant {faster_variant} is significantly faster ({time_diff:.2f}s difference)") # Sample size recommendations if not significance_test["significant"]: recommendations.append("Consider increasing sample size for more reliable results") return recommendations def main(): """Run A/B test example.""" config = ABTestConfig( variant_a_config="conservative-support-config", variant_b_config="fast-support-config", test_prompts=[ "I need help with a product return", "What's my order status?", "I have a billing question", "Can you help me find a product?", "I need technical support" ], sample_size=50, significance_level=0.05 ) tester = ABTester() try: analysis = tester.run_ab_test(config) print("\n" + "="*60) print("š¬ A/B TEST RESULTS") print("="*60) results = analysis["results"] comparison = analysis["comparison"] print(f"\nš Results Summary:") print(f" Variant A ({config.variant_a_config}):") print(f" Success rate: {results['variant_a']['success_rate']:.1%}") print(f" Avg response time: {results['variant_a']['avg_response_time']:.2f}s") print(f" Variant B ({config.variant_b_config}):") print(f" Success rate: {results['variant_b']['success_rate']:.1%}") print(f" Avg response time: {results['variant_b']['avg_response_time']:.2f}s") print(f"\nšÆ Comparison:") print(f" Success rate difference: {comparison['success_rate_difference']:+.1%}") print(f" Response time difference: {comparison['response_time_difference']:+.2f}s") print(f" Statistically significant: {comparison['statistically_significant']}") print(f" Winner: {comparison['winner']}") print(f"\nš” Recommendations:") for rec in analysis["recommendations"]: print(f" ⢠{rec}") # Save results with open(f"ab_test_results_{int(time.time())}.json", "w") as f: json.dump(analysis, f, indent=2) print(f"\nš¾ Results saved to ab_test_results_{int(time.time())}.json") except Exception as e: print(f"ā A/B test failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
Summary
🎉 Congratulations! You've built a comprehensive multi-agent testing pipeline that includes:
- ✅ Complete Test Suite Design - Structured test cases with multiple validation types
- ✅ Multi-Agent Comparison - Automated testing across different configurations
- ✅ Performance Analysis - Detailed metrics and visualizations
- ✅ CI/CD Integration - Automated testing in development workflows
- ✅ Advanced Testing Strategies - Load testing and A/B testing frameworks
Key Benefits Achieved
- Systematic Comparison - Objective evaluation of agent performance 
- Automated Quality Assurance - Continuous testing prevents regressions 
- Data-Driven Decisions - Statistical significance testing for configuration choices 
- Performance Optimization - Identification of bottlenecks and optimization opportunities 
- Production Readiness - Comprehensive testing before deployment 
Next Steps
- Rust Client Integration Tutorial - High-performance production integrations 
- Testing & QA Guide - Advanced testing strategies and best practices 
- Monitoring Guide - Production monitoring and alerting 
- REST API Reference - Complete API documentation for custom testing tools