Multi-agent testing using Python
This tutorial guides you through creating a comprehensive multi-agent testing pipeline using CoAgent's Test Studio. You'll learn to compare different agent configurations, analyze performance metrics, and build automated testing workflows.
What You'll Build
By the end of this tutorial, you'll have:
- A complete test suite with multiple validation types 
- Automated testing pipeline comparing different agent configurations 
- Performance analysis dashboard and reporting 
- Continuous integration testing setup 
- Best practices for multi-agent evaluation 
Prerequisites
- CoAgent running locally (docker-compose up); a quick connectivity check follows this list
- Completed Getting Started Guide 
- Basic understanding of agent configurations 
- Familiarity with Test Studio interface (recommended) 
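Before moving on, it helps to confirm the local instance is actually reachable. The same /health endpoint that the CI workflow in Phase 6 polls works as a quick manual check:

```bash
# Quick sanity check that CoAgent is up (same /health endpoint the CI workflow polls)
curl -f http://localhost:3000/health
```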
Tutorial Overview
- Phase 1: Test Suite Design & Creation
- Phase 2: Multi-Agent Configuration Setup
- Phase 3: Comprehensive Test Cases
- Phase 4: Automated Test Execution
- Phase 5: Performance Analysis & Comparison
- Phase 6: CI/CD Integration
- Phase 7: Advanced Testing Strategies
Phase 1: Test Suite Design & Creation
1.1 Define Testing Objectives
For this tutorial, we'll create a customer support agent testing pipeline that evaluates:
- Response Quality: How helpful and accurate are the responses? 
- Consistency: Do agents give similar quality across different prompts? 
- Performance: Response times and token efficiency 
- Tool Usage: Appropriate use of available tools 
- Error Handling: Graceful handling of edge cases 
1.2 Create the Test Suite Structure
Navigate to the Test Studio at http://localhost:3000/test-studio and create a new test suite:
{ "name": "Customer Support Agent Evaluation", "description": "Comprehensive testing suite for customer support agent configurations across different scenarios and performance metrics", "test_categories": [ "basic_functionality", "edge_cases", "performance", "tool_integration", "error_handling" ] }
1.3 Design Test Case Templates
Create a structured approach to test case design:
Template 1: Basic Functionality Tests
{ "category": "basic_functionality", "input_template": "[SCENARIO] for customer support", "validations": [ {"type": "content_match", "criteria": "helpful_response_patterns"}, {"type": "response_time", "max_seconds": 5}, {"type": "semantic_similarity", "benchmark": "ideal_response"} ] }
Template 2: Tool Integration Tests
{ "category": "tool_integration", "input_template": "[SCENARIO] requiring [TOOL_NAME]", "validations": [ {"type": "tool_call", "expected_tool": "[TOOL_NAME]"}, {"type": "response_schema", "schema": "tool_response_format"}, {"type": "content_match", "criteria": "tool_result_usage"} ] }
Phase 2: Multi-Agent Configuration Setup
2.1 Create Test Agent Configurations
We'll create multiple agent configurations to compare:
Agent 1: Conservative Support (GPT-4)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "conservative-support",
    "description": "Conservative customer support agent focused on accuracy",
    "preamble": "You are a careful, professional customer support agent. Always verify information before responding. Prioritize accuracy over speed. When in doubt, escalate to human support."
  }'
```
Agent 2: Fast Support (GPT-3.5 Turbo)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fast-support",
    "description": "Quick-response customer support agent for high volume",
    "preamble": "You are an efficient customer support agent. Provide quick, helpful responses. Be concise but friendly. Handle common issues directly."
  }'
```
Agent 3: Empathetic Support (Claude-3 Sonnet)
```bash
curl -X POST http://localhost:3000/api/v1/agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "empathetic-support",
    "description": "Empathetic customer support agent focused on customer satisfaction",
    "preamble": "You are a warm, empathetic customer support agent. Focus on understanding the customer'\''s feelings and concerns. Provide emotional support along with practical solutions."
  }'
```
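To double-check that all three configurations exist before binding them to providers, listing the agents is a reasonable verification step. This assumes the agents endpoint also supports GET, which the tutorial does not show explicitly:

```bash
# Assumption: a GET on the same path lists the agent configurations created above
curl http://localhost:3000/api/v1/agents
```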
2.2 Set Up Model Providers
Create model providers for different LLM services:
```bash
# OpenAI Provider
curl -X POST http://localhost:3000/api/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "OpenAI Production",
    "provider_type": "openai",
    "api_key": "your-openai-key",
    "available_models": ["gpt-4", "gpt-3.5-turbo"]
  }'

# Anthropic Provider
curl -X POST http://localhost:3000/api/v1/providers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Anthropic Claude",
    "provider_type": "anthropic",
    "api_key": "your-anthropic-key",
    "available_models": ["claude-3-sonnet", "claude-3-haiku"]
  }'
```
2.3 Create Bound Agents
Link agent configurations with model providers:
```bash
# Conservative Support with GPT-4
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "conservative-support-gpt4",
    "description": "Conservative support agent using GPT-4",
    "agent_config_name": "conservative-support",
    "model_provider_name": "OpenAI Production"
  }'

# Fast Support with GPT-3.5
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "fast-support-gpt35",
    "description": "Fast support agent using GPT-3.5 Turbo",
    "agent_config_name": "fast-support",
    "model_provider_name": "OpenAI Production"
  }'

# Empathetic Support with Claude
curl -X POST http://localhost:3000/api/v1/bound_agents \
  -H "Content-Type: application/json" \
  -d '{
    "name": "empathetic-support-claude",
    "description": "Empathetic support agent using Claude-3 Sonnet",
    "agent_config_name": "empathetic-support",
    "model_provider_name": "Anthropic Claude"
  }'
```
2.4 Create Sandbox Configurations
Set up different runtime environments for testing:
```python
#!/usr/bin/env python3
"""
Create sandbox configurations for multi-agent testing
"""
import requests
import json


def create_sandbox_configs():
    """Create sandbox configurations for different testing scenarios."""
    base_url = "http://localhost:3000/api/v1/sandbox-configs"

    configs = [
        {
            "name": "Conservative Testing Environment",
            "description": "High accuracy, longer response times acceptable",
            "system_prompt": "Focus on providing accurate, well-researched responses. Take time to verify information.",
            "parameters": {
                "temperature": 0.2,
                "max_tokens": 1024,
                "top_p": 0.8
            },
            "tools": [],
            "category": "accuracy_focused"
        },
        {
            "name": "Fast Response Environment",
            "description": "Quick responses, optimized for speed",
            "system_prompt": "Provide quick, efficient responses. Be concise and direct.",
            "parameters": {
                "temperature": 0.3,
                "max_tokens": 512,
                "top_p": 0.9
            },
            "tools": [],
            "category": "speed_focused"
        },
        {
            "name": "Balanced Environment",
            "description": "Balanced approach between accuracy and speed",
            "system_prompt": "Balance accuracy with efficiency. Provide helpful responses in reasonable time.",
            "parameters": {
                "temperature": 0.4,
                "max_tokens": 768,
                "top_p": 0.95
            },
            "tools": [
                {
                    "id": "knowledge-base-tools",
                    "tool_names": ["search_kb", "get_policy"]
                }
            ],
            "category": "balanced"
        }
    ]

    created_configs = []
    for config in configs:
        response = requests.post(base_url, json=config)
        if response.status_code == 200:
            created_configs.append(response.json())
            print(f"✅ Created config: {config['name']}")
        else:
            print(f"❌ Failed to create config: {config['name']}")
            print(f"   Error: {response.text}")

    return created_configs


if __name__ == "__main__":
    create_sandbox_configs()
```
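The Phase 4 test runner needs the IDs of these sandbox configurations. The script above prints each creation response; if you need to look them up later, listing the configs should work, assuming the endpoint supports GET (only POST is shown in this tutorial):

```bash
# Assumption: a GET returns the sandbox configs created above, including their IDs
curl http://localhost:3000/api/v1/sandbox-configs
```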
Phase 3: Comprehensive Test Cases
3.1 Create Basic Functionality Tests
#!/usr/bin/env python3 """ Create comprehensive test cases for customer support agent evaluation """ import requests import json from typing import List, Dict class TestCaseBuilder: """Builder for creating structured test cases.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def create_test_suite(self, name: str, description: str) -> str: """Create a new test suite and return its ID.""" response = requests.post(f"{self.base_url}/testsets", json={ "name": name, "description": description, "cases": [] }) if response.status_code == 200: suite_id = response.json()["id_testset"] print(f"ā Created test suite: {name} (ID: {suite_id})") return suite_id else: raise Exception(f"Failed to create test suite: {response.text}") def add_basic_functionality_tests(self, suite_id: str) -> List[Dict]: """Add basic functionality test cases.""" test_cases = [ { "name": "Product Return Inquiry", "input": "I want to return a product I bought last week", "validations": [ { "type": "ContentMatch", "pattern": "(return policy|return process|refund|exchange)" }, { "type": "ResponseTime", "max_seconds": 5 }, { "type": "SemanticSimilarity", "sentence": "I can help you with your return. Let me guide you through our return process.", "threshold": 0.7 } ] }, { "name": "Order Status Check", "input": "Can you check the status of my order #12345?", "validations": [ { "type": "ToolCall", "tool_name": "order_lookup" }, { "type": "ContentMatch", "pattern": "(order|status|tracking|shipped|delivered)" } ] }, { "name": "Billing Question", "input": "I was charged twice for the same order", "validations": [ { "type": "ContentMatch", "pattern": "(billing|charge|refund|investigate|resolve)" }, { "type": "LlmV0", "criteria": "Rate the empathy and helpfulness of the response on a scale of 1-10. A score of 7 or higher passes." } ] }, { "name": "Product Information Request", "input": "What are the technical specifications of the XZ-100 model?", "validations": [ { "type": "ToolCall", "tool_name": "product_lookup" }, { "type": "ResponseSchema", "schema": { "type": "object", "properties": { "product_name": {"type": "string"}, "specifications": {"type": "object"} } } } ] } ] return self._add_test_cases(suite_id, test_cases) def add_edge_case_tests(self, suite_id: str) -> List[Dict]: """Add edge case and error handling test cases.""" edge_cases = [ { "name": "Angry Customer", "input": "This is ridiculous! Your product is broken and your service is terrible!", "validations": [ { "type": "LlmV0", "criteria": "Rate how well the response de-escalates the situation and shows empathy (1-10). Score 7+ passes." }, { "type": "ContentMatch", "pattern": "(understand|sorry|apologize|help|resolve)" } ] }, { "name": "Ambiguous Request", "input": "I have a problem with my thing", "validations": [ { "type": "ContentMatch", "pattern": "(clarify|more information|specific|help understand)" }, { "type": "LlmV0", "criteria": "Does the response appropriately ask for clarification? (Yes/No)" } ] }, { "name": "Out of Scope Request", "input": "Can you help me with my tax returns?", "validations": [ { "type": "ContentMatch", "pattern": "(outside|scope|not able|cannot help|tax professional)" } ] }, { "name": "Multiple Issues", "input": "I need to return a product, update my address, and cancel my subscription", "validations": [ { "type": "ContentMatch", "pattern": "(return|address|subscription)" }, { "type": "LlmV0", "criteria": "Does the response address all three issues mentioned? 
(Yes/No)" } ] } ] return self._add_test_cases(suite_id, edge_cases) def add_performance_tests(self, suite_id: str) -> List[Dict]: """Add performance-focused test cases.""" performance_tests = [ { "name": "Quick Response Test", "input": "What are your business hours?", "validations": [ { "type": "ResponseTime", "max_seconds": 2 }, { "type": "ContentMatch", "pattern": "(hours|open|close|Monday|business)" } ] }, { "name": "Complex Query Efficiency", "input": "I need to return a product that was a gift, but I don't have the receipt, and it was purchased with store credit from a previous return. Can you help?", "validations": [ { "type": "ResponseTime", "max_seconds": 8 }, { "type": "LlmV0", "criteria": "Does the response efficiently address the complex return scenario? Rate 1-10, need 7+." } ] } ] return self._add_test_cases(suite_id, performance_tests) def _add_test_cases(self, suite_id: str, test_cases: List[Dict]) -> List[Dict]: """Helper method to add test cases to a suite.""" added_cases = [] for test_case in test_cases: # Format for CoAgent API formatted_case = { "input": { "human_prompt": test_case["input"] }, "validations": [] } # Convert validation types to CoAgent format for validation in test_case["validations"]: if validation["type"] == "ContentMatch": formatted_case["validations"].append({ "kind": { "ContentMatch": { "pattern": validation["pattern"] } } }) elif validation["type"] == "ResponseTime": formatted_case["validations"].append({ "kind": { "ResponseTime": { "max_seconds": validation["max_seconds"] } } }) elif validation["type"] == "ToolCall": formatted_case["validations"].append({ "kind": { "ToolCall": { "tool_name": validation["tool_name"] } } }) elif validation["type"] == "SemanticSimilarity": formatted_case["validations"].append({ "kind": { "SemanticSimilarity": { "sentence": validation["sentence"], "threshold": validation["threshold"] } } }) elif validation["type"] == "ResponseSchema": formatted_case["validations"].append({ "kind": { "ResponseSchema": { "schema": validation["schema"] } } }) elif validation["type"] == "LlmV0": formatted_case["validations"].append({ "kind": { "LlmV0": { "llm0": { "model_ref": { "provider_id": "openai-eval", "model_name": "gpt-4" }, "criteria": validation["criteria"] } } } }) # Add to existing test suite response = requests.get(f"{self.base_url}/testsets/{suite_id}") if response.status_code == 200: suite_data = response.json() suite_data["cases"].append(formatted_case) # Update the suite update_response = requests.put(f"{self.base_url}/testsets/{suite_id}", json=suite_data) if update_response.status_code == 200: added_cases.append(test_case) print(f"ā Added test case: {test_case['name']}") else: print(f"ā Failed to add test case: {test_case['name']}") return added_cases def main(): """Create the complete test suite.""" builder = TestCaseBuilder() # Create main test suite suite_id = builder.create_test_suite( "Customer Support Multi-Agent Evaluation", "Comprehensive test suite for comparing customer support agent configurations across quality, performance, and reliability metrics" ) # Add different categories of test cases basic_cases = builder.add_basic_functionality_tests(suite_id) edge_cases = builder.add_edge_case_tests(suite_id) performance_cases = builder.add_performance_tests(suite_id) total_cases = len(basic_cases) + len(edge_cases) + len(performance_cases) print(f"\nš Test suite created successfully!") print(f" Suite ID: {suite_id}") print(f" Total test cases: {total_cases}") print(f" Basic functionality: {len(basic_cases)}") 
print(f" Edge cases: {len(edge_cases)}") print(f" Performance tests: {len(performance_cases)}") if __name__ == "__main__": main()
3.2 Run the Test Case Creation Script
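Save the script above (for example as create_test_cases.py; the filename is only a suggestion) and run it against your local instance. Its only dependency is requests:

```bash
pip install requests
python create_test_cases.py   # prints the new suite ID; note it for Phase 4
```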
Phase 4: Automated Test Execution
4.1 Create Test Execution Pipeline
#!/usr/bin/env python3 """ Automated multi-agent testing pipeline """ import requests import time import json from typing import Dict, List from dataclasses import dataclass from datetime import datetime @dataclass class TestConfiguration: """Configuration for a test run.""" suite_id: str sandbox_configs: List[str] description: str expected_agents: List[str] @dataclass class TestResults: """Results from a test execution.""" run_id: str suite_id: str status: str total_cases: int passed_cases: int failed_cases: int warning_cases: int execution_time_ms: int agent_results: List[Dict] class MultiAgentTestRunner: """Automated test runner for multi-agent comparison.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def run_multi_agent_test(self, config: TestConfiguration) -> TestResults: """Execute tests across multiple agent configurations.""" print(f"š Starting multi-agent test: {config.description}") print(f" Suite ID: {config.suite_id}") print(f" Sandbox configs: {config.sandbox_configs}") # Start the test run start_time = time.time() response = requests.post( f"{self.base_url}/testsets/{config.suite_id}/run", json={"selected_configs": config.sandbox_configs} ) if response.status_code != 200: raise Exception(f"Failed to start test run: {response.text}") run_data = response.json() run_id = run_data.get("id") or run_data.get("run_id") print(f" Run ID: {run_id}") # Monitor test execution print("ā³ Monitoring test execution...") results = self._monitor_test_execution(run_id) execution_time = time.time() - start_time print(f"ā Test execution completed in {execution_time:.2f}s") return results def _monitor_test_execution(self, run_id: str) -> TestResults: """Monitor test execution until completion.""" last_status = None start_time = time.time() while True: response = requests.get(f"{self.base_url}/testruns/{run_id}") if response.status_code != 200: raise Exception(f"Failed to get test run status: {response.text}") run_data = response.json() status = run_data.get("status") if status != last_status: print(f" Status: {status}") last_status = status if status in ["Passed", "Failed", "Warning"]: # Test completed return TestResults( run_id=run_id, suite_id=run_data.get("suite_id"), status=status, total_cases=run_data.get("total_cases", 0), passed_cases=run_data.get("passed_cases", 0), failed_cases=run_data.get("failed_cases", 0), warning_cases=run_data.get("warning_cases", 0), execution_time_ms=run_data.get("total_time_ms", 0), agent_results=run_data.get("agent_results", []) ) elif status == "Running": # Still running, wait and check again time.sleep(10) # Print progress update elapsed = time.time() - start_time if elapsed > 0 and run_data.get("total_cases", 0) > 0: completed = run_data.get("passed_cases", 0) + run_data.get("failed_cases", 0) progress = completed / run_data.get("total_cases") * 100 print(f" Progress: {progress:.1f}% ({completed}/{run_data.get('total_cases')}) - {elapsed:.1f}s elapsed") else: raise Exception(f"Unexpected test status: {status}") def generate_comparison_report(self, results: TestResults) -> Dict: """Generate a detailed comparison report.""" report = { "summary": { "run_id": results.run_id, "status": results.status, "total_cases": results.total_cases, "execution_time_seconds": results.execution_time_ms / 1000, "overall_success_rate": results.passed_cases / results.total_cases if results.total_cases > 0 else 0 }, "agent_comparison": [], "performance_metrics": {}, "recommendations": [] } # Analyze each agent's performance for 
agent_result in results.agent_results: agent_analysis = self._analyze_agent_performance(agent_result) report["agent_comparison"].append(agent_analysis) # Generate performance insights report["performance_metrics"] = self._calculate_performance_metrics(results) # Generate recommendations report["recommendations"] = self._generate_recommendations(results) return report def _analyze_agent_performance(self, agent_result: Dict) -> Dict: """Analyze individual agent performance.""" total_cases = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total_cases if total_cases > 0 else 0 # Analyze case results for more detailed metrics case_results = agent_result.get("case_results", []) response_times = [case.get("response_time_ms", 0) for case in case_results] token_usage = [case.get("total_tokens", 0) for case in case_results] return { "sandbox_config_id": agent_result.get("sandbox_config_id"), "success_rate": success_rate, "total_cases": total_cases, "passed": agent_result.get("passed", 0), "failed": agent_result.get("failed", 0), "warnings": agent_result.get("warnings", 0), "performance": { "avg_response_time_ms": sum(response_times) / len(response_times) if response_times else 0, "max_response_time_ms": max(response_times) if response_times else 0, "avg_tokens": sum(token_usage) / len(token_usage) if token_usage else 0, "total_tokens": sum(token_usage) } } def _calculate_performance_metrics(self, results: TestResults) -> Dict: """Calculate cross-agent performance metrics.""" all_response_times = [] all_token_usage = [] for agent_result in results.agent_results: case_results = agent_result.get("case_results", []) all_response_times.extend([case.get("response_time_ms", 0) for case in case_results]) all_token_usage.extend([case.get("total_tokens", 0) for case in case_results]) return { "response_time": { "average_ms": sum(all_response_times) / len(all_response_times) if all_response_times else 0, "median_ms": sorted(all_response_times)[len(all_response_times)//2] if all_response_times else 0, "p95_ms": sorted(all_response_times)[int(len(all_response_times)*0.95)] if all_response_times else 0 }, "token_usage": { "average_per_request": sum(all_token_usage) / len(all_token_usage) if all_token_usage else 0, "total_tokens": sum(all_token_usage) } } def _generate_recommendations(self, results: TestResults) -> List[str]: """Generate recommendations based on test results.""" recommendations = [] # Find best performing agent best_agent = None best_success_rate = 0 for agent_result in results.agent_results: total = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total if total > 0 else 0 if success_rate > best_success_rate: best_success_rate = success_rate best_agent = agent_result.get("sandbox_config_id") if best_agent: recommendations.append(f"Best performing configuration: {best_agent} with {best_success_rate:.1%} success rate") # Performance recommendations if results.execution_time_ms > 300000: # 5 minutes recommendations.append("Consider optimizing response times - tests took longer than expected") if results.failed_cases > results.total_cases * 0.1: # >10% failure rate recommendations.append("High failure rate detected - review failed test cases and agent configurations") return recommendations def main(): """Run the multi-agent testing pipeline.""" runner = MultiAgentTestRunner() # Define test configuration 
test_config = TestConfiguration( suite_id="your-test-suite-id", # Replace with actual suite ID sandbox_configs=[ "conservative-config-id", "fast-config-id", "balanced-config-id" ], description="Customer Support Agent Comparison Test", expected_agents=["conservative-support-gpt4", "fast-support-gpt35", "empathetic-support-claude"] ) try: # Run the tests results = runner.run_multi_agent_test(test_config) # Generate and display report report = runner.generate_comparison_report(results) print("\n" + "="*60) print("šÆ MULTI-AGENT TEST RESULTS") print("="*60) print(f"\nš Summary:") print(f" Status: {results.status}") print(f" Total cases: {results.total_cases}") print(f" Passed: {results.passed_cases}") print(f" Failed: {results.failed_cases}") print(f" Warnings: {results.warning_cases}") print(f" Execution time: {results.execution_time_ms/1000:.2f}s") print(f"\nš¤ Agent Comparison:") for agent_analysis in report["agent_comparison"]: print(f" {agent_analysis['sandbox_config_id']}:") print(f" Success rate: {agent_analysis['success_rate']:.1%}") print(f" Avg response time: {agent_analysis['performance']['avg_response_time_ms']:.0f}ms") print(f" Avg tokens: {agent_analysis['performance']['avg_tokens']:.0f}") print(f"\nš” Recommendations:") for rec in report["recommendations"]: print(f" ⢠{rec}") # Save detailed report with open(f"test_report_{results.run_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json", "w") as f: json.dump(report, f, indent=2) print(f"\nš¾ Detailed report saved to: test_report_{results.run_id}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json") except Exception as e: print(f"ā Test execution failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
4.2 Execute Multi-Agent Tests
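Before running the pipeline, edit main() so that suite_id and sandbox_configs contain the real IDs from Phases 2 and 3, then execute the runner. The filename below matches the one referenced by the CI workflow in Phase 6:

```bash
# Replace the placeholder IDs in main() first
python multi_agent_test_runner.py
```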
Phase 5: Performance Analysis & Comparison
5.1 Create Performance Analysis Dashboard
#!/usr/bin/env python3 """ Performance analysis and visualization for multi-agent test results """ import json import matplotlib.pyplot as plt import pandas as pd import seaborn as sns from typing import Dict, List import requests class PerformanceAnalyzer: """Analyze and visualize multi-agent test performance.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def analyze_test_run(self, run_id: str) -> Dict: """Perform comprehensive analysis of a test run.""" # Get test run data response = requests.get(f"{self.base_url}/testruns/{run_id}") if response.status_code != 200: raise Exception(f"Failed to get test run: {response.text}") run_data = response.json() # Generate comprehensive analysis analysis = { "overview": self._analyze_overview(run_data), "agent_comparison": self._compare_agents(run_data), "performance_trends": self._analyze_performance_trends(run_data), "failure_analysis": self._analyze_failures(run_data), "recommendations": self._generate_insights(run_data) } return analysis def _analyze_overview(self, run_data: Dict) -> Dict: """Analyze overall test run metrics.""" total_cases = run_data.get("total_cases", 0) passed = run_data.get("passed_cases", 0) failed = run_data.get("failed_cases", 0) warnings = run_data.get("warning_cases", 0) return { "total_cases": total_cases, "success_rate": passed / total_cases if total_cases > 0 else 0, "failure_rate": failed / total_cases if total_cases > 0 else 0, "warning_rate": warnings / total_cases if total_cases > 0 else 0, "execution_time_minutes": run_data.get("total_time_ms", 0) / 60000, "total_tokens": run_data.get("total_tokens", 0), "agents_tested": len(run_data.get("agent_results", [])) } def _compare_agents(self, run_data: Dict) -> List[Dict]: """Compare performance across different agents.""" agent_comparisons = [] for agent_result in run_data.get("agent_results", []): config_id = agent_result.get("sandbox_config_id") # Calculate metrics total_cases = agent_result.get("passed", 0) + agent_result.get("failed", 0) + agent_result.get("warnings", 0) success_rate = agent_result.get("passed", 0) / total_cases if total_cases > 0 else 0 # Analyze case-level performance case_results = agent_result.get("case_results", []) response_times = [case.get("response_time_ms", 0) for case in case_results if case.get("response_time_ms")] token_counts = [case.get("total_tokens", 0) for case in case_results if case.get("total_tokens")] agent_comparisons.append({ "config_id": config_id, "success_rate": success_rate, "avg_response_time_ms": sum(response_times) / len(response_times) if response_times else 0, "p95_response_time_ms": sorted(response_times)[int(len(response_times) * 0.95)] if response_times else 0, "avg_tokens": sum(token_counts) / len(token_counts) if token_counts else 0, "total_tokens": sum(token_counts), "passed_cases": agent_result.get("passed", 0), "failed_cases": agent_result.get("failed", 0), "warning_cases": agent_result.get("warnings", 0) }) return sorted(agent_comparisons, key=lambda x: x["success_rate"], reverse=True) def _analyze_performance_trends(self, run_data: Dict) -> Dict: """Analyze performance trends and patterns.""" all_case_results = [] for agent_result in run_data.get("agent_results", []): for case in agent_result.get("case_results", []): case["agent_config"] = agent_result.get("sandbox_config_id") all_case_results.append(case) if not all_case_results: return {"error": "No case results available for trend analysis"} # Convert to DataFrame for analysis df = 
pd.DataFrame(all_case_results) trends = { "response_time_distribution": { "mean": df["response_time_ms"].mean() if "response_time_ms" in df else 0, "std": df["response_time_ms"].std() if "response_time_ms" in df else 0, "min": df["response_time_ms"].min() if "response_time_ms" in df else 0, "max": df["response_time_ms"].max() if "response_time_ms" in df else 0 }, "token_usage_distribution": { "mean": df["total_tokens"].mean() if "total_tokens" in df else 0, "std": df["total_tokens"].std() if "total_tokens" in df else 0, "min": df["total_tokens"].min() if "total_tokens" in df else 0, "max": df["total_tokens"].max() if "total_tokens" in df else 0 } } return trends def _analyze_failures(self, run_data: Dict) -> Dict: """Analyze failure patterns and common issues.""" failure_patterns = { "total_failures": 0, "failure_by_agent": {}, "common_failure_reasons": [], "failure_distribution": {} } for agent_result in run_data.get("agent_results", []): config_id = agent_result.get("sandbox_config_id") failed_cases = agent_result.get("failed", 0) failure_patterns["total_failures"] += failed_cases failure_patterns["failure_by_agent"][config_id] = failed_cases # Analyze individual case failures for case in agent_result.get("case_results", []): if case.get("status") == "Failed": failure_reason = case.get("details", "Unknown failure") failure_patterns["common_failure_reasons"].append(failure_reason) return failure_patterns def _generate_insights(self, run_data: Dict) -> List[str]: """Generate actionable insights from test results.""" insights = [] agent_results = run_data.get("agent_results", []) if not agent_results: return ["No agent results available for analysis"] # Find best and worst performing agents best_agent = max(agent_results, key=lambda x: x.get("passed", 0)) worst_agent = min(agent_results, key=lambda x: x.get("passed", 0)) best_config = best_agent.get("sandbox_config_id") worst_config = worst_agent.get("sandbox_config_id") insights.append(f"Best performing configuration: {best_config}") insights.append(f"Lowest performing configuration: {worst_config}") # Performance insights total_cases = run_data.get("total_cases", 0) overall_success_rate = run_data.get("passed_cases", 0) / total_cases if total_cases > 0 else 0 if overall_success_rate < 0.8: insights.append("Overall success rate is below 80% - consider reviewing agent configurations") elif overall_success_rate > 0.95: insights.append("Excellent success rate achieved - current configurations are performing well") # Time-based insights execution_time_ms = run_data.get("total_time_ms", 0) if execution_time_ms > 300000: # 5 minutes insights.append("Test execution time is high - consider optimizing response times or reducing test scope") return insights def create_performance_visualizations(self, analysis: Dict, output_dir: str = "./visualizations"): """Create visualizations for performance analysis.""" import os os.makedirs(output_dir, exist_ok=True) # Set style plt.style.use('default') sns.set_palette("husl") # 1. 
Agent Success Rate Comparison agent_data = analysis["agent_comparison"] if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) configs = [agent["config_id"] for agent in agent_data] success_rates = [agent["success_rate"] * 100 for agent in agent_data] bars = ax.bar(configs, success_rates) ax.set_title("Agent Success Rate Comparison", fontsize=16, fontweight='bold') ax.set_ylabel("Success Rate (%)", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) ax.set_ylim(0, 100) # Add value labels on bars for bar, rate in zip(bars, success_rates): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, f"{rate:.1f}%", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/success_rate_comparison.png", dpi=300, bbox_inches='tight') plt.close() # 2. Response Time Comparison if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) response_times = [agent["avg_response_time_ms"] for agent in agent_data] bars = ax.bar(configs, response_times) ax.set_title("Average Response Time by Agent", fontsize=16, fontweight='bold') ax.set_ylabel("Response Time (ms)", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) # Add value labels for bar, time in zip(bars, response_times): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(response_times)*0.01, f"{time:.0f}ms", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/response_time_comparison.png", dpi=300, bbox_inches='tight') plt.close() # 3. Token Usage Comparison if agent_data: fig, ax = plt.subplots(figsize=(12, 6)) token_usage = [agent["avg_tokens"] for agent in agent_data] bars = ax.bar(configs, token_usage) ax.set_title("Average Token Usage by Agent", fontsize=16, fontweight='bold') ax.set_ylabel("Average Tokens per Request", fontsize=12) ax.set_xlabel("Agent Configuration", fontsize=12) # Add value labels for bar, tokens in zip(bars, token_usage): ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + max(token_usage)*0.01, f"{tokens:.0f}", ha='center', va='bottom', fontweight='bold') plt.xticks(rotation=45, ha='right') plt.tight_layout() plt.savefig(f"{output_dir}/token_usage_comparison.png", dpi=300, bbox_inches='tight') plt.close() print(f"š Visualizations saved to {output_dir}/") def main(): """Run performance analysis on a test run.""" analyzer = PerformanceAnalyzer() # Replace with actual run ID run_id = input("Enter test run ID to analyze: ").strip() try: print(f"š Analyzing test run: {run_id}") analysis = analyzer.analyze_test_run(run_id) # Display results print("\n" + "="*60) print("š PERFORMANCE ANALYSIS RESULTS") print("="*60) overview = analysis["overview"] print(f"\nš Overview:") print(f" Total test cases: {overview['total_cases']}") print(f" Success rate: {overview['success_rate']:.1%}") print(f" Execution time: {overview['execution_time_minutes']:.1f} minutes") print(f" Total tokens used: {overview['total_tokens']:,}") print(f" Agents tested: {overview['agents_tested']}") print(f"\nš¤ Agent Rankings:") for i, agent in enumerate(analysis["agent_comparison"], 1): print(f" {i}. 
{agent['config_id']}") print(f" Success rate: {agent['success_rate']:.1%}") print(f" Avg response time: {agent['avg_response_time_ms']:.0f}ms") print(f" Avg tokens: {agent['avg_tokens']:.0f}") print(f"\nš” Key Insights:") for insight in analysis["recommendations"]: print(f" ⢠{insight}") # Generate visualizations try: analyzer.create_performance_visualizations(analysis) except ImportError: print("\nā ļø Visualization libraries not available (matplotlib, seaborn)") print(" Install with: pip install matplotlib seaborn pandas") # Save detailed analysis with open(f"performance_analysis_{run_id}.json", "w") as f: json.dump(analysis, f, indent=2) print(f"\nš¾ Detailed analysis saved to: performance_analysis_{run_id}.json") except Exception as e: print(f"ā Analysis failed: {str(e)}") if __name__ == "__main__": main()
Phase 6: CI/CD Integration
6.1 Create GitHub Actions Workflow
Create .github/workflows/agent-testing.yml:
```yaml
name: Multi-Agent Testing Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]
  schedule:
    # Run daily at 2 AM UTC
    - cron: '0 2 * * *'

jobs:
  multi-agent-test:
    runs-on: ubuntu-latest

    services:
      coagent:
        image: coagent:latest
        ports:
          - 3000:3000
        env:
          RUST_LOG: info

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install dependencies
        run: |
          pip install requests pandas matplotlib seaborn

      - name: Wait for CoAgent to start
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:3000/health; do sleep 2; done'

      - name: Set up test environment
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python create_test_environment.py

      - name: Run multi-agent tests
        run: |
          # Don't abort the step on a non-zero exit; capture the code for the final gate instead
          set +e
          python multi_agent_test_runner.py > test_results.txt 2>&1
          echo "TEST_EXIT_CODE=$?" >> $GITHUB_ENV

      - name: Generate performance report
        if: always()
        run: |
          python performance_analyzer.py --automated

      - name: Upload test artifacts
        if: always()
        uses: actions/upload-artifact@v3
        with:
          name: test-results
          path: |
            test_results.txt
            test_report_*.json
            performance_analysis_*.json
            visualizations/

      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            const fs = require('fs');
            try {
              const results = fs.readFileSync('test_results.txt', 'utf8');
              const lines = results.split('\n');
              const summary = lines.filter(line =>
                line.includes('Status:') ||
                line.includes('Success rate:') ||
                line.includes('Execution time:')
              ).join('\n');

              await github.rest.issues.createComment({
                issue_number: context.issue.number,
                owner: context.repo.owner,
                repo: context.repo.repo,
                body: `## 🤖 Multi-Agent Test Results\n\n\`\`\`\n${summary}\n\`\`\`\n\nFull results available in the workflow artifacts.`
              });
            } catch (error) {
              console.log('Could not post results:', error);
            }

      - name: Fail if tests failed
        if: env.TEST_EXIT_CODE != '0'
        # Explicitly fail the job when the test runner returned a non-zero exit code
        run: exit 1
```
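The workflow reads the provider keys from repository secrets. One way to add them, assuming you use the GitHub CLI:

```bash
# Store provider API keys as repository secrets (values are prompted for)
gh secret set OPENAI_API_KEY
gh secret set ANTHROPIC_API_KEY
```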
6.2 Create Automated Environment Setup
Create create_test_environment.py:
```python
#!/usr/bin/env python3
"""
Automated test environment setup for CI/CD
"""
import os
import requests
import json
import time


def wait_for_coagent(url: str = "http://localhost:3000", timeout: int = 60):
    """Wait for CoAgent to be ready."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            response = requests.get(f"{url}/health")
            if response.status_code == 200:
                print("✅ CoAgent is ready")
                return True
        except requests.exceptions.ConnectionError:
            pass
        time.sleep(2)
    raise Exception("CoAgent failed to start within timeout")


def setup_providers():
    """Set up model providers from environment variables."""
    providers = []

    # OpenAI
    openai_key = os.environ.get("OPENAI_API_KEY")
    if openai_key:
        providers.append({
            "name": "OpenAI CI",
            "provider_type": "openai",
            "api_key": openai_key,
            "available_models": ["gpt-4", "gpt-3.5-turbo"]
        })

    # Anthropic
    anthropic_key = os.environ.get("ANTHROPIC_API_KEY")
    if anthropic_key:
        providers.append({
            "name": "Anthropic CI",
            "provider_type": "anthropic",
            "api_key": anthropic_key,
            "available_models": ["claude-3-sonnet", "claude-3-haiku"]
        })

    # Create providers
    for provider in providers:
        response = requests.post("http://localhost:3000/api/v1/providers", json=provider)
        if response.status_code == 200:
            print(f"✅ Created provider: {provider['name']}")
        else:
            print(f"❌ Failed to create provider: {provider['name']}")
            print(f"   Error: {response.text}")


def setup_agents_and_configs():
    """Set up test agents and configurations."""
    # This would contain the same setup logic as previous phases,
    # simplified for the CI/CD environment
    agents = [
        {
            "name": "ci-test-agent-conservative",
            "description": "Conservative test agent for CI",
            "preamble": "You are a careful, accurate customer support agent."
        },
        {
            "name": "ci-test-agent-fast",
            "description": "Fast response test agent for CI",
            "preamble": "You are a quick, efficient customer support agent."
        }
    ]

    for agent in agents:
        response = requests.post("http://localhost:3000/api/v1/agents", json=agent)
        if response.status_code == 200:
            print(f"✅ Created agent: {agent['name']}")
        else:
            print(f"❌ Failed to create agent: {agent['name']}")


def main():
    """Set up complete test environment for CI/CD."""
    print("🚀 Setting up test environment for CI/CD...")

    # Wait for CoAgent to be ready
    wait_for_coagent()

    # Set up providers
    setup_providers()

    # Set up agents and configurations
    setup_agents_and_configs()

    print("✅ Test environment setup complete!")


if __name__ == "__main__":
    main()
```
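The same script can also be run locally to rebuild a clean test environment, as long as the API keys are exported first:

```bash
# Run outside CI; the script reads the keys from the environment
export OPENAI_API_KEY="your-openai-key"
export ANTHROPIC_API_KEY="your-anthropic-key"
python create_test_environment.py
```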
Phase 7: Advanced Testing Strategies
7.1 Load Testing for Multi-Agent Scenarios
#!/usr/bin/env python3 """ Load testing for multi-agent scenarios """ import asyncio import aiohttp import time import json from typing import List, Dict from dataclasses import dataclass import statistics @dataclass class LoadTestConfig: """Configuration for load testing.""" concurrent_requests: int total_requests: int ramp_up_seconds: int test_prompts: List[str] agent_configs: List[str] class MultiAgentLoadTester: """Load testing framework for multi-agent scenarios.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url self.results = [] async def run_load_test(self, config: LoadTestConfig) -> Dict: """Execute load test across multiple agents.""" print(f"š Starting load test:") print(f" Concurrent requests: {config.concurrent_requests}") print(f" Total requests: {config.total_requests}") print(f" Agent configurations: {len(config.agent_configs)}") start_time = time.time() # Create semaphore to control concurrency semaphore = asyncio.Semaphore(config.concurrent_requests) # Generate request tasks tasks = [] for i in range(config.total_requests): # Round-robin through agents and prompts agent_config = config.agent_configs[i % len(config.agent_configs)] prompt = config.test_prompts[i % len(config.test_prompts)] # Add ramp-up delay delay = (i / config.total_requests) * config.ramp_up_seconds task = asyncio.create_task( self._make_request(semaphore, agent_config, prompt, delay, i) ) tasks.append(task) # Execute all tasks results = await asyncio.gather(*tasks, return_exceptions=True) total_time = time.time() - start_time # Analyze results analysis = self._analyze_load_test_results(results, total_time) return analysis async def _make_request(self, semaphore: asyncio.Semaphore, agent_config: str, prompt: str, delay: float, request_id: int) -> Dict: """Make a single request with timing and error tracking.""" # Wait for ramp-up delay if delay > 0: await asyncio.sleep(delay) async with semaphore: request_start = time.time() try: async with aiohttp.ClientSession() as session: # Create a simple evaluation request payload = { "run_id": f"load-test-{request_id}", "preamble": "You are a helpful customer support agent.", "prompt": prompt, "model_ref": { "provider_id": "openai-ci", "model_name": "gpt-3.5-turbo" }, "log_meta": {"load_test": True, "agent_config": agent_config} } async with session.post(f"{self.base_url}/evals", json=payload) as response: response_time = time.time() - request_start if response.status == 200: data = await response.json() return { "success": True, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "token_usage": data.get("token_usage", {}), "response_length": len(data.get("response", "")) } else: return { "success": False, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "error": f"HTTP {response.status}", "error_detail": await response.text() } except Exception as e: response_time = time.time() - request_start return { "success": False, "response_time": response_time, "agent_config": agent_config, "request_id": request_id, "error": str(e) } def _analyze_load_test_results(self, results: List, total_time: float) -> Dict: """Analyze load test results and generate metrics.""" successful_requests = [r for r in results if isinstance(r, dict) and r.get("success")] failed_requests = [r for r in results if isinstance(r, dict) and not r.get("success")] exceptions = [r for r in results if isinstance(r, Exception)] analysis = { "summary": { "total_requests": len(results), 
"successful_requests": len(successful_requests), "failed_requests": len(failed_requests) + len(exceptions), "success_rate": len(successful_requests) / len(results), "total_test_time": total_time, "requests_per_second": len(results) / total_time }, "performance_metrics": {}, "error_analysis": {}, "agent_comparison": {} } if successful_requests: response_times = [r["response_time"] for r in successful_requests] analysis["performance_metrics"] = { "avg_response_time": statistics.mean(response_times), "median_response_time": statistics.median(response_times), "p95_response_time": statistics.quantiles(response_times, n=20)[18] if len(response_times) > 20 else max(response_times), "min_response_time": min(response_times), "max_response_time": max(response_times), "response_time_std": statistics.stdev(response_times) if len(response_times) > 1 else 0 } # Analyze errors error_types = {} for failed_request in failed_requests + exceptions: if isinstance(failed_request, Exception): error_type = type(failed_request).__name__ else: error_type = failed_request.get("error", "Unknown") error_types[error_type] = error_types.get(error_type, 0) + 1 analysis["error_analysis"] = { "error_types": error_types, "most_common_error": max(error_types.items(), key=lambda x: x[1])[0] if error_types else None } # Agent-specific analysis agent_performance = {} for request in successful_requests: agent_config = request["agent_config"] if agent_config not in agent_performance: agent_performance[agent_config] = [] agent_performance[agent_config].append(request["response_time"]) for agent, response_times in agent_performance.items(): analysis["agent_comparison"][agent] = { "request_count": len(response_times), "avg_response_time": statistics.mean(response_times), "success_rate": len(response_times) / len([r for r in results if r.get("agent_config") == agent]) } return analysis def main(): """Run multi-agent load test.""" config = LoadTestConfig( concurrent_requests=10, total_requests=100, ramp_up_seconds=30, test_prompts=[ "I need help with returning a product", "What's the status of my order?", "I was charged incorrectly", "Can you help me find a product?", "I need to update my account information" ], agent_configs=[ "conservative-config", "fast-config", "balanced-config" ] ) tester = MultiAgentLoadTester() # Run the load test try: analysis = asyncio.run(tester.run_load_test(config)) print("\n" + "="*60) print("ā” LOAD TEST RESULTS") print("="*60) summary = analysis["summary"] print(f"\nš Summary:") print(f" Total requests: {summary['total_requests']}") print(f" Successful: {summary['successful_requests']}") print(f" Failed: {summary['failed_requests']}") print(f" Success rate: {summary['success_rate']:.1%}") print(f" Requests/second: {summary['requests_per_second']:.2f}") print(f" Total time: {summary['total_test_time']:.2f}s") if analysis["performance_metrics"]: perf = analysis["performance_metrics"] print(f"\nā±ļø Performance:") print(f" Avg response time: {perf['avg_response_time']:.2f}s") print(f" Median response time: {perf['median_response_time']:.2f}s") print(f" 95th percentile: {perf['p95_response_time']:.2f}s") print(f" Min/Max: {perf['min_response_time']:.2f}s / {perf['max_response_time']:.2f}s") if analysis["agent_comparison"]: print(f"\nš¤ Agent Performance:") for agent, metrics in analysis["agent_comparison"].items(): print(f" {agent}:") print(f" Requests: {metrics['request_count']}") print(f" Avg time: {metrics['avg_response_time']:.2f}s") print(f" Success rate: {metrics['success_rate']:.1%}") if 
analysis["error_analysis"]["error_types"]: print(f"\nā Errors:") for error_type, count in analysis["error_analysis"]["error_types"].items(): print(f" {error_type}: {count}") # Save detailed results with open(f"load_test_results_{int(time.time())}.json", "w") as f: json.dump(analysis, f, indent=2) except Exception as e: print(f"ā Load test failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
7.2 A/B Testing Framework
#!/usr/bin/env python3 """ A/B Testing framework for comparing agent configurations """ import random import requests import json import time from typing import Dict, List, Tuple from dataclasses import dataclass import statistics @dataclass class ABTestConfig: """Configuration for A/B testing.""" variant_a_config: str variant_b_config: str test_prompts: List[str] sample_size: int significance_level: float = 0.05 class ABTester: """A/B testing framework for agent configurations.""" def __init__(self, base_url: str = "http://localhost:3000/api/v1"): self.base_url = base_url def run_ab_test(self, config: ABTestConfig) -> Dict: """Run A/B test comparing two agent configurations.""" print(f"š¬ Starting A/B Test:") print(f" Variant A: {config.variant_a_config}") print(f" Variant B: {config.variant_b_config}") print(f" Sample size: {config.sample_size} per variant") results_a = [] results_b = [] # Run tests for both variants total_tests = config.sample_size * 2 for i in range(total_tests): # Randomize assignment to variants if random.random() < 0.5 and len(results_a) < config.sample_size: variant = "A" variant_config = config.variant_a_config results_list = results_a elif len(results_b) < config.sample_size: variant = "B" variant_config = config.variant_b_config results_list = results_b else: variant = "A" variant_config = config.variant_a_config results_list = results_a # Random prompt selection prompt = random.choice(config.test_prompts) # Execute test result = self._execute_single_test(variant_config, prompt, variant, i) results_list.append(result) # Progress update if (i + 1) % 10 == 0: progress = (i + 1) / total_tests * 100 print(f" Progress: {progress:.1f}% ({len(results_a)} A, {len(results_b)} B)") # Analyze results analysis = self._analyze_ab_results(results_a, results_b, config) return analysis def _execute_single_test(self, config_id: str, prompt: str, variant: str, test_id: int) -> Dict: """Execute a single test case.""" start_time = time.time() try: # Create evaluation request payload = { "run_id": f"ab-test-{variant.lower()}-{test_id}", "preamble": "You are a helpful customer support agent.", "prompt": prompt, "model_ref": { "provider_id": "openai-ci", "model_name": "gpt-3.5-turbo" }, "log_meta": { "ab_test": True, "variant": variant, "config_id": config_id } } response = requests.post(f"{self.base_url}/evals", json=payload) response_time = time.time() - start_time if response.status_code == 200: data = response.json() return { "success": True, "variant": variant, "config_id": config_id, "prompt": prompt, "response": data.get("response", ""), "response_time": response_time, "token_usage": data.get("token_usage", {}), "test_id": test_id } else: return { "success": False, "variant": variant, "config_id": config_id, "error": f"HTTP {response.status_code}", "response_time": response_time, "test_id": test_id } except Exception as e: response_time = time.time() - start_time return { "success": False, "variant": variant, "config_id": config_id, "error": str(e), "response_time": response_time, "test_id": test_id } def _analyze_ab_results(self, results_a: List[Dict], results_b: List[Dict], config: ABTestConfig) -> Dict: """Analyze A/B test results with statistical significance testing.""" # Calculate success rates success_a = sum(1 for r in results_a if r["success"]) success_b = sum(1 for r in results_b if r["success"]) success_rate_a = success_a / len(results_a) if results_a else 0 success_rate_b = success_b / len(results_b) if results_b else 0 # Calculate response times for successful 
requests times_a = [r["response_time"] for r in results_a if r["success"]] times_b = [r["response_time"] for r in results_b if r["success"]] # Statistical significance testing (simplified) significance_test = self._simple_significance_test( success_a, len(results_a), success_b, len(results_b), config.significance_level ) analysis = { "test_config": { "variant_a": config.variant_a_config, "variant_b": config.variant_b_config, "sample_size_per_variant": config.sample_size, "significance_level": config.significance_level }, "results": { "variant_a": { "total_tests": len(results_a), "successful_tests": success_a, "success_rate": success_rate_a, "avg_response_time": statistics.mean(times_a) if times_a else 0, "median_response_time": statistics.median(times_a) if times_a else 0 }, "variant_b": { "total_tests": len(results_b), "successful_tests": success_b, "success_rate": success_rate_b, "avg_response_time": statistics.mean(times_b) if times_b else 0, "median_response_time": statistics.median(times_b) if times_b else 0 } }, "comparison": { "success_rate_difference": success_rate_b - success_rate_a, "response_time_difference": (statistics.mean(times_b) if times_b else 0) - (statistics.mean(times_a) if times_a else 0), "statistically_significant": significance_test["significant"], "confidence_level": 1 - config.significance_level, "winner": self._determine_winner(success_rate_a, success_rate_b, times_a, times_b, significance_test["significant"]) }, "recommendations": self._generate_ab_recommendations(success_rate_a, success_rate_b, times_a, times_b, significance_test) } return analysis def _simple_significance_test(self, successes_a: int, total_a: int, successes_b: int, total_b: int, alpha: float) -> Dict: """Simplified significance test for success rates.""" # This is a simplified implementation # In production, you'd want to use proper statistical libraries p_a = successes_a / total_a if total_a > 0 else 0 p_b = successes_b / total_b if total_b > 0 else 0 # Pool proportion for standard error calculation p_pool = (successes_a + successes_b) / (total_a + total_b) if (total_a + total_b) > 0 else 0 # Standard error se = (p_pool * (1 - p_pool) * (1/total_a + 1/total_b)) ** 0.5 if p_pool > 0 and total_a > 0 and total_b > 0 else 0 # Z-score z_score = (p_b - p_a) / se if se > 0 else 0 # Critical value for two-tailed test (simplified) critical_value = 1.96 if alpha == 0.05 else 2.58 # approximation significant = abs(z_score) > critical_value return { "significant": significant, "z_score": z_score, "p_value_approx": 2 * (1 - abs(z_score)/2) if abs(z_score) < 2 else 0.05, # very rough approximation "critical_value": critical_value } def _determine_winner(self, success_rate_a: float, success_rate_b: float, times_a: List[float], times_b: List[float], significant: bool) -> str: """Determine the winning variant based on multiple criteria.""" if not significant: return "No significant difference" # Primary criterion: success rate if success_rate_b > success_rate_a: primary_winner = "B" elif success_rate_a > success_rate_b: primary_winner = "A" else: primary_winner = "Tie" # Secondary criterion: response time (only if success rates are close) if abs(success_rate_b - success_rate_a) < 0.05: # Less than 5% difference avg_time_a = statistics.mean(times_a) if times_a else float('inf') avg_time_b = statistics.mean(times_b) if times_b else float('inf') if avg_time_a < avg_time_b: return f"{primary_winner} (A faster)" if primary_winner == "Tie" else f"{primary_winner} (also faster)" elif avg_time_b < avg_time_a: return 
f"{primary_winner} (B faster)" if primary_winner == "Tie" else f"{primary_winner} (also faster)" return primary_winner def _generate_ab_recommendations(self, success_rate_a: float, success_rate_b: float, times_a: List[float], times_b: List[float], significance_test: Dict) -> List[str]: """Generate actionable recommendations from A/B test results.""" recommendations = [] if significance_test["significant"]: if success_rate_b > success_rate_a: improvement = ((success_rate_b - success_rate_a) / success_rate_a) * 100 recommendations.append(f"Variant B shows {improvement:.1f}% improvement in success rate - recommend deployment") elif success_rate_a > success_rate_b: improvement = ((success_rate_a - success_rate_b) / success_rate_b) * 100 recommendations.append(f"Variant A shows {improvement:.1f}% improvement in success rate - recommend keeping current config") else: recommendations.append("No statistically significant difference found - consider longer test or different metrics") # Response time recommendations if times_a and times_b: avg_time_a = statistics.mean(times_a) avg_time_b = statistics.mean(times_b) time_diff = abs(avg_time_b - avg_time_a) if time_diff > 0.5: # More than 0.5 second difference faster_variant = "A" if avg_time_a < avg_time_b else "B" recommendations.append(f"Variant {faster_variant} is significantly faster ({time_diff:.2f}s difference)") # Sample size recommendations if not significance_test["significant"]: recommendations.append("Consider increasing sample size for more reliable results") return recommendations def main(): """Run A/B test example.""" config = ABTestConfig( variant_a_config="conservative-support-config", variant_b_config="fast-support-config", test_prompts=[ "I need help with a product return", "What's my order status?", "I have a billing question", "Can you help me find a product?", "I need technical support" ], sample_size=50, significance_level=0.05 ) tester = ABTester() try: analysis = tester.run_ab_test(config) print("\n" + "="*60) print("š¬ A/B TEST RESULTS") print("="*60) results = analysis["results"] comparison = analysis["comparison"] print(f"\nš Results Summary:") print(f" Variant A ({config.variant_a_config}):") print(f" Success rate: {results['variant_a']['success_rate']:.1%}") print(f" Avg response time: {results['variant_a']['avg_response_time']:.2f}s") print(f" Variant B ({config.variant_b_config}):") print(f" Success rate: {results['variant_b']['success_rate']:.1%}") print(f" Avg response time: {results['variant_b']['avg_response_time']:.2f}s") print(f"\nšÆ Comparison:") print(f" Success rate difference: {comparison['success_rate_difference']:+.1%}") print(f" Response time difference: {comparison['response_time_difference']:+.2f}s") print(f" Statistically significant: {comparison['statistically_significant']}") print(f" Winner: {comparison['winner']}") print(f"\nš” Recommendations:") for rec in analysis["recommendations"]: print(f" ⢠{rec}") # Save results with open(f"ab_test_results_{int(time.time())}.json", "w") as f: json.dump(analysis, f, indent=2) print(f"\nš¾ Results saved to ab_test_results_{int(time.time())}.json") except Exception as e: print(f"ā A/B test failed: {str(e)}") return False return True if __name__ == "__main__": success = main() exit(0 if success else 1)
Summary
🎉 Congratulations! You've built a comprehensive multi-agent testing pipeline that includes:
- ✅ Complete Test Suite Design - Structured test cases with multiple validation types
- ✅ Multi-Agent Comparison - Automated testing across different configurations
- ✅ Performance Analysis - Detailed metrics and visualizations
- ✅ CI/CD Integration - Automated testing in development workflows
- ✅ Advanced Testing Strategies - Load testing and A/B testing frameworks
Key Benefits Achieved
- Systematic Comparison - Objective evaluation of agent performance 
- Automated Quality Assurance - Continuous testing prevents regressions 
- Data-Driven Decisions - Statistical significance testing for configuration choices 
- Performance Optimization - Identification of bottlenecks and optimization opportunities 
- Production Readiness - Comprehensive testing before deployment 
Next Steps
- Rust Client Integration Tutorial - High-performance production integrations 
- Testing & QA Guide - Advanced testing strategies and best practices 
- Monitoring Guide - Production monitoring and alerting 
- REST API Reference - Complete API documentation for custom testing tools