This document provides a comprehensive reference for the validation types supported in CoAgent's Test Studio, including schemas, examples, and best practices for test case creation.
Overview
CoAgent Test Studio supports multiple validation types so that agent responses can be evaluated from several angles. Each validation type serves a specific purpose in ensuring agent quality and performance.
Semantic Similarity Validation
Compares the agent's response against a reference sentence and passes when the similarity score meets or exceeds the configured threshold.
Customer Support Quality
{
  "id_validation": "val-support-quality",
  "kind": {
    "semantic_similarity": {
      "sentence": "I understand your concern and I'm here to help you resolve this issue quickly and efficiently",
      "threshold": 0.7
    }
  }
}
Technical Accuracy
{
  "id_validation": "val-tech-accuracy",
  "kind": {
    "semantic_similarity": {
      "sentence": "To troubleshoot this issue, first check your network connection and then restart the application",
      "threshold": 0.8
    }
  }
}
Empathy Check
{
  "id_validation": "val-empathy",
  "kind": {
    "semantic_similarity": {
      "sentence": "I'm sorry you're experiencing this problem and I want to make sure we get this resolved for you",
      "threshold": 0.6
    }
  }
}
Threshold Guidelines
0.9-1.0: Nearly identical meaning
0.8-0.9: Very similar meaning, minor variations
0.7-0.8: Similar meaning, some differences in phrasing
0.6-0.7: Related concepts, different expression
Below 0.6: Different meanings
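To make the threshold semantics concrete, here is a minimal Python sketch of an embedding-based similarity check. The toy vectors stand in for real sentence embeddings (a production system would use a sentence-embedding model); only the cosine-and-threshold logic mirrors the guidelines above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def passes_semantic_similarity(candidate_vec, reference_vec, threshold):
    """Pass when similarity meets or exceeds the configured threshold."""
    return cosine_similarity(candidate_vec, reference_vec) >= threshold

# Toy vectors standing in for real sentence embeddings.
reference = [0.9, 0.1, 0.3]
close = [0.85, 0.15, 0.35]   # near-paraphrase of the reference
far = [0.1, 0.9, 0.2]        # unrelated content

print(passes_semantic_similarity(close, reference, 0.7))  # True
print(passes_semantic_similarity(far, reference, 0.7))    # False
```

Note that the same candidate can pass a 0.6 threshold but fail a 0.8 one, which is why the threshold should match how strictly the reference wording must be preserved.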
Use Cases
Validate response quality and appropriateness
Check for empathy and tone alignment
Verify technical accuracy across different phrasings
Ensure brand voice consistency
Tool Call Validation
Validates that the agent calls a specific tool during response generation. For example, to require a call to an order_lookup tool (the same schema used in the complete test case later in this document):
{
  "id_validation": "val-tool-usage",
  "kind": {
    "tool_call": {
      "tool_name": "order_lookup"
    }
  }
}
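A minimal sketch of how a tool-call validation could be checked. The transcript format here is an assumption (a flat list of called tool names); a real harness may record richer call records with arguments and results.

```python
def passes_tool_call(validation, called_tools):
    """True when the expected tool name appears in the recorded tool calls.

    `called_tools` is an assumed transcript format: a list of tool names
    the agent invoked while producing its response.
    """
    expected = validation["kind"]["tool_call"]["tool_name"]
    return expected in called_tools

validation = {"id_validation": "val-tool-usage",
              "kind": {"tool_call": {"tool_name": "order_lookup"}}}

print(passes_tool_call(validation, ["order_lookup", "send_email"]))  # True
print(passes_tool_call(validation, ["send_email"]))                  # False
```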
LLM V0 Validation
Uses an evaluator LLM to judge the agent's response against natural-language criteria, enabling nuanced, subjective assessments.
Customer Satisfaction Evaluation
{
  "id_validation": "val-customer-satisfaction",
  "kind": {
    "llm_v0": {
      "llm0": {
        "llm_criteria": "Rate this customer support response on a scale of 1-10 for helpfulness, empathy, and clarity. A response scores 7 or higher if it: 1) Directly addresses the customer's concern, 2) Shows understanding and empathy, 3) Provides clear next steps or solutions, 4) Maintains a professional and friendly tone. Return only the numeric score.",
        "model_reference": {
          "provider_id": "openai-eval",
          "provider_name": "OpenAI Evaluation",
          "model_name": "gpt-4"
        }
      }
    }
  }
}
Technical Accuracy Check
{
  "id_validation": "val-technical-accuracy",
  "kind": {
    "llm_v0": {
      "llm0": {
        "llm_criteria": "Evaluate whether this technical response is factually accurate and follows best practices. Consider: 1) Technical correctness of information provided, 2) Completeness of the solution, 3) Safety of recommended steps, 4) Clarity of instructions. Rate 1-10, where 8+ means the response is technically sound and safe to follow.",
        "model_reference": {
          "provider_id": "anthropic-eval",
          "provider_name": "Anthropic Evaluation",
          "model_name": "claude-3-sonnet"
        }
      }
    }
  }
}
Compliance Verification
{
  "id_validation": "val-compliance-check",
  "kind": {
    "llm_v0": {
      "llm0": {
        "llm_criteria": "Does this response comply with customer service guidelines? Check for: 1) Professional language use, 2) Appropriate data handling mentions, 3) Correct escalation procedures, 4) Brand voice alignment. Respond 'PASS' if compliant, 'FAIL' if not compliant, followed by specific reasons.",
        "model_reference": {
          "provider_id": "openai-eval",
          "provider_name": "OpenAI Evaluation",
          "model_name": "gpt-4"
        }
      }
    }
  }
}
Emotional Intelligence Assessment
{
  "id_validation": "val-emotional-intelligence",
  "kind": {
    "llm_v0": {
      "llm0": {
        "llm_criteria": "Assess the emotional intelligence of this response. Rate 1-10 based on: 1) Recognition of customer emotions, 2) Appropriate empathetic response, 3) De-escalation techniques if needed, 4) Building rapport and trust. Explain your rating with specific examples from the response.",
        "model_reference": {
          "provider_id": "anthropic-eval",
          "provider_name": "Anthropic Evaluation",
          "model_name": "claude-3-sonnet"
        }
      }
    }
  }
}
Best Practices for LLM Criteria
Clear Scoring Instructions
Define specific scoring scales (1-10, Pass/Fail, etc.)
Provide clear success criteria
Explain what each score level means
Specific Evaluation Points
Break down evaluation into specific aspects
Provide concrete examples of what to look for
Include both positive and negative indicators
Output Format Specification
Specify exactly how the evaluator should respond
Request structured output when needed
Ask for explanations to make evaluations auditable
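Specifying the output format pays off when the harness parses the evaluator's reply. The sketch below shows one way to parse the two output formats requested in the examples above (a numeric 1-10 score, and a PASS/FAIL verdict); the function names and passing threshold are illustrative, not part of Test Studio's API.

```python
import re

def parse_numeric_score(judge_output, passing_score=8):
    """Extract the first integer 1-10 from a judge response and compare it
    to the passing threshold. Returns (score, passed)."""
    match = re.search(r"\b(10|[1-9])\b", judge_output)
    if match is None:
        raise ValueError(f"No score found in judge output: {judge_output!r}")
    score = int(match.group(1))
    return score, score >= passing_score

def parse_pass_fail(judge_output):
    """Interpret a leading PASS/FAIL verdict; reasons may follow it."""
    verdict = judge_output.strip().split()[0].upper().rstrip(",.:")
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unexpected verdict: {verdict!r}")
    return verdict == "PASS"

print(parse_numeric_score("9 - clear, empathetic, actionable"))  # (9, True)
print(parse_pass_fail("FAIL: response shares internal data"))    # False
```

Strict output instructions ("Return only the numeric score", "Respond 'PASS' or 'FAIL'") keep parsers like these simple; free-form replies force fragile heuristics.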
Use Cases
Subjective quality assessment
Complex reasoning evaluation
Brand voice and tone compliance
Context-aware appropriateness testing
Creative content evaluation
Complete Test Case Example
Here's a comprehensive test case using multiple validation types:
{
  "id_case": "comprehensive-support-test",
  "input": {
    "human_prompt": "I bought a laptop last week but it won't turn on. I need to return it urgently as I have an important presentation tomorrow."
  },
  "validations": [
    {
      "id_validation": "val-empathy-check",
      "kind": {
        "semantic_similarity": {
          "sentence": "I understand this is urgent and frustrating, especially with your important presentation coming up",
          "threshold": 0.7
        }
      }
    },
    {
      "id_validation": "val-contains-solution",
      "kind": {
        "content_match": {
          "pattern": "(return|replacement|expedited|rush|priority)"
        }
      }
    },
    {
      "id_validation": "val-tool-usage",
      "kind": {
        "tool_call": {
          "tool_name": "order_lookup"
        }
      }
    },
    {
      "id_validation": "val-response-time",
      "kind": {
        "response_time": {
          "max_seconds": 5
        }
      }
    },
    {
      "id_validation": "val-overall-quality",
      "kind": {
        "llm_v0": {
          "llm0": {
            "llm_criteria": "Rate this customer support response 1-10 for: 1) Acknowledging urgency, 2) Showing empathy, 3) Providing clear next steps, 4) Offering appropriate solutions for time-sensitive issue. Score 8+ if response handles the urgent situation professionally and helpfully.",
            "model_reference": {
              "provider_id": "openai-eval",
              "provider_name": "OpenAI",
              "model_name": "gpt-4"
            }
          }
        }
      }
    }
  ],
  "bound_agent_name": "customer-support-agent"
}
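A test case like this is evaluated by dispatching on each validation's "kind" key. The sketch below illustrates that dispatch for the two objective kinds that need no external services (content_match and response_time); semantic_similarity, tool_call, and llm_v0 would need an embedding model, a tool-call transcript, and a judge LLM respectively. The harness function and transcript shapes are assumptions, not Test Studio internals.

```python
import re

def run_validation(validation, response_text, response_seconds):
    """Dispatch a single validation dict (as in the test case above)
    against a captured agent response. Only a subset of kinds is sketched."""
    kind = validation["kind"]
    if "content_match" in kind:
        return re.search(kind["content_match"]["pattern"], response_text) is not None
    if "response_time" in kind:
        return response_seconds <= kind["response_time"]["max_seconds"]
    # semantic_similarity, tool_call, and llm_v0 need external inputs
    # (embedding model, tool transcript, judge LLM) and are omitted here.
    raise NotImplementedError(f"Unhandled kind: {sorted(kind)}")

validations = [
    {"id_validation": "val-contains-solution",
     "kind": {"content_match": {"pattern": "(return|replacement|expedited|rush|priority)"}}},
    {"id_validation": "val-response-time",
     "kind": {"response_time": {"max_seconds": 5}}},
]

response = "I can start an expedited replacement for your laptop right away."
results = {v["id_validation"]: run_validation(v, response, response_seconds=2.4)
           for v in validations}
print(results)  # both validations pass for this response
```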
Testing Strategy Recommendations
Layered Validation Approach
Basic Structure: Use Response Schema validation
Content Quality: Apply Content Match and Semantic Similarity
Performance: Include Response Time validation
Tool Integration: Add Tool Call validation where applicable
Subjective Quality: Use LLM V0 for nuanced evaluation
Test Case Complexity Levels
Simple Tests
Single validation type
Clear pass/fail criteria
Basic functionality verification
Medium Tests
2-3 validation types
Mix of objective and subjective criteria
Scenario-based testing
Complex Tests
4+ validation types
Multi-step workflows
Edge case handling
Integration testing
Performance Considerations
Response Time: Set realistic thresholds based on use case
LLM V0: Slower and more costly than the other validation types; use it judiciously