Custom Evaluators
Define your quality standards with precision. Custom evaluators let you create evaluation systems tailored to your specific business requirements, ensuring your AI meets the exact quality criteria that matter most to your users.
Transform generic quality assessment into domain-specific evaluation that understands your unique requirements and standards.
Custom evaluators use your configured model tokens to automatically assess AI responses against criteria you define, providing quality insights specific to your use case.
Why Custom Evaluators Matter
While Handit provides powerful general-purpose evaluators, every AI application has unique quality requirements. A medical AI needs to assess clinical accuracy, a customer service AI must evaluate empathy and resolution effectiveness, and an educational AI should focus on clarity and pedagogical value.
Generic evaluation falls short when your AI operates in specialized domains or serves specific user needs. Custom evaluators bridge this gap by incorporating your domain expertise, business requirements, and user expectations into automated quality assessment.
The result: Quality evaluation that truly reflects what matters for your application, enabling your autonomous engineer to generate improvements that align with your specific goals and standards.
Creating Effective Custom Evaluators
The key to successful custom evaluation is focus and clarity. The most effective evaluators assess one specific quality dimension with clear, unambiguous criteria.
Start with Clear Objectives
Before creating an evaluator, define exactly what you want to measure. Instead of “overall quality,” focus on specific aspects like “medical accuracy,” “brand voice consistency,” or “regulatory compliance.”
Example objectives:
- Medical AI: “Does the response provide medically accurate information without making diagnostic claims?”
- Legal AI: “Does the response comply with legal disclaimers while providing helpful guidance?”
- Educational AI: “Does the explanation use age-appropriate language and build on foundational concepts?”
Design Focused Evaluation Criteria
Each evaluator should have 3-5 specific criteria that define quality for that dimension. More than that makes evaluation inconsistent and the results harder to interpret; the sketch after the example below shows one way a focused set of criteria might be captured in code.
Example: Customer Support Empathy Evaluator
- Does the response acknowledge the customer’s frustration or concern?
- Does it use empathetic language appropriate to the situation?
- Does it avoid dismissive or robotic phrasing?
- Does it show understanding of the customer’s perspective?
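As a rough sketch only (not a Handit API), the empathy evaluator above could be captured as a small structure that keeps the objective, criteria, and scale in one place. The field names are purely illustrative.

```python
# Hypothetical structure for a focused evaluator definition.
# Field names are illustrative, not part of any Handit API.
empathy_evaluator = {
    "name": "customer_support_empathy",
    "objective": "Assess empathy and emotional intelligence in support responses",
    "criteria": [
        "Acknowledges the customer's frustration or concern",
        "Uses empathetic language appropriate to the situation",
        "Avoids dismissive or robotic phrasing",
        "Shows understanding of the customer's perspective",
    ],
    "scale": {"min": 1, "max": 10},
}
```

Keeping the criteria in a structure like this makes them easy to reuse when writing the evaluation prompt in the next section.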
Create Clear Evaluation Prompts
Your evaluation prompt is the instruction you give to the judge model (like GPT-4) to assess quality. Effective prompts are specific, provide examples, and define scoring clearly.
Structure of effective evaluation prompts:
Role Definition: “You are an expert customer service quality assessor evaluating AI responses for empathy and emotional intelligence.”
Context Provision: Include the original user question, the AI’s response, and any relevant background context the evaluator needs.
Specific Criteria: List exactly what to evaluate with clear definitions and examples of good vs. poor performance.
Scoring Instructions: Define your scoring scale (typically 1-10) with specific anchors for different score ranges.
Output Format: Specify exactly how you want results formatted, typically as structured JSON for consistency.
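Putting these five elements together, a minimal evaluation prompt and judge call might look like the sketch below. It uses the OpenAI Python client purely as an example judge; the model name, score anchors, and JSON fields are illustrative choices, not requirements of Handit.

```python
import json
from openai import OpenAI  # example judge client; any LLM client works

EVALUATION_PROMPT = """\
You are an expert customer service quality assessor evaluating AI responses
for empathy and emotional intelligence.

Original customer message:
{user_message}

AI response to evaluate:
{ai_response}

Evaluate the response against these criteria:
1. Does it acknowledge the customer's frustration or concern?
2. Does it use empathetic language appropriate to the situation?
3. Does it avoid dismissive or robotic phrasing?
4. Does it show understanding of the customer's perspective?

Scoring anchors: 1-3 dismissive or tone-deaf, 4-6 neutral but impersonal,
7-8 empathetic with minor gaps, 9-10 consistently empathetic and specific.

Return only JSON: {{"score": <1-10>, "reasoning": "<one or two sentences>"}}
"""

def evaluate_empathy(user_message: str, ai_response: str) -> dict:
    """Ask the judge model to score one response; returns the parsed JSON."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        temperature=0,   # deterministic judging for more consistent scores
        messages=[{
            "role": "user",
            "content": EVALUATION_PROMPT.format(
                user_message=user_message, ai_response=ai_response),
        }],
    )
    return json.loads(completion.choices[0].message.content)
```

Note how the prompt names the role, provides context, lists the criteria, anchors the scores, and pins the output format; each element maps directly to the structure described above.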
Real-World Examples
Here are examples of custom evaluators that work well in practice:
Medical Information Accuracy
Purpose: Ensure medical AI provides accurate information without overstepping into diagnosis.
Key Criteria: Factual accuracy, appropriate disclaimers, avoiding diagnostic language, recommending professional consultation when needed.
Scoring Focus: Accuracy of medical facts, appropriateness of advice level, presence of necessary disclaimers.
Brand Voice Consistency
Purpose: Maintain a consistent brand personality across all AI interactions.
Key Criteria: Tone alignment with brand guidelines, appropriate formality level, consistent terminology usage, brand value reflection.
Scoring Focus: Alignment with the brand voice guide, consistency with previous interactions, appropriate personality expression.
Regulatory Compliance
Purpose: Ensure AI responses meet industry-specific regulatory requirements.
Key Criteria: Required disclosures present, prohibited claims avoided, appropriate risk warnings, compliance with industry standards.
Scoring Focus: Presence of required elements, absence of prohibited content, overall compliance risk level.
Implementation Best Practices
Start Simple: Begin with one clear quality dimension and 3-4 specific criteria. You can always add complexity later as you understand what works.
Test Thoroughly: Before deploying custom evaluators, test them on a variety of your AI’s responses to ensure they provide consistent, meaningful scores.
Iterate Based on Data: Review evaluation results regularly and refine your criteria based on what you learn about your AI’s performance patterns.
Maintain Consistency: Use clear, specific language in your evaluation prompts to ensure consistent scoring across different interactions.
Avoid Common Pitfalls: Don’t try to evaluate too many things at once, don’t use vague criteria like “good quality,” and don’t create evaluators without testing them on real examples first.
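As a hedged illustration of the "Test Thoroughly" advice, you might run a new evaluator over a handful of representative responses and compare the scores before trusting it. `evaluate_empathy` here is the illustrative judge function sketched earlier, not a built-in.

```python
# Quick consistency check before deploying a new evaluator.
# `evaluate_empathy` is the illustrative judge function sketched earlier.
samples = [
    ("My order is two weeks late and nobody replies!",
     "I'm sorry about the delay and the silence on our side; that's frustrating. "
     "I've escalated your order and will follow up personally by tomorrow."),
    ("My order is two weeks late and nobody replies!",
     "Please check the tracking page for updates."),
]

for user_message, ai_response in samples:
    result = evaluate_empathy(user_message, ai_response)
    print(f"score={result['score']:>2}  reasoning={result['reasoning']}")

# Re-running the same samples a few times and comparing scores is a cheap way
# to spot criteria that are too vague to score consistently.
```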
Advanced Configuration
For sophisticated evaluation needs, custom evaluators support advanced features:
Weighted Scoring: If some criteria are more important than others, you can weight them accordingly in your evaluation logic.
Conditional Evaluation: Create evaluators that apply different criteria based on the type of interaction or user context.
Multi-Model Evaluation: Use different judge models for different types of evaluation—GPT-4 for complex reasoning, specialized models for domain-specific assessment.
Confidence Scoring: Include confidence levels in evaluation results to identify cases where the evaluator is uncertain about quality assessment.
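For example, weighted scoring and confidence reporting could be combined as in the sketch below; the per-criterion scores, weights, and the 0.7 review threshold are illustrative assumptions rather than built-in behavior.

```python
# Illustrative only: combine per-criterion scores into a weighted total and
# flag low-confidence judgments for human review.
criterion_scores = {"acknowledges_concern": 9, "empathetic_language": 7,
                    "avoids_dismissive_phrasing": 8, "shows_understanding": 6}
weights = {"acknowledges_concern": 0.4, "empathetic_language": 0.3,
           "avoids_dismissive_phrasing": 0.2, "shows_understanding": 0.1}
judge_confidence = 0.62  # hypothetical confidence reported by the judge prompt

weighted_score = sum(score * weights[name] for name, score in criterion_scores.items())
print(f"weighted empathy score: {weighted_score:.1f}/10")

if judge_confidence < 0.7:  # threshold is an arbitrary example
    print("low judge confidence: route this evaluation for human review")
```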
Monitoring Evaluator Performance
Track how well your custom evaluators work in practice:
Score Distribution: Monitor the range and distribution of scores your evaluators produce. If everything scores very high or very low, you might need to adjust criteria or scoring scales.
Correlation with Business Metrics: Check whether evaluator scores correlate with business outcomes like user satisfaction, conversion rates, or support ticket volume.
Evaluator Agreement: If you have multiple evaluators for similar quality dimensions, check whether they provide consistent insights about your AI’s performance.
Edge Case Handling: Review how evaluators perform on unusual or challenging interactions to ensure they provide meaningful assessment across all scenarios.
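A rough sketch of the score-distribution check: export recent scores for one evaluator (however you store them) and look for saturation at either end of the scale.

```python
from collections import Counter
from statistics import mean, stdev

# Stand-in data: scores exported for one evaluator; how you fetch them
# depends on your own storage.
recent_scores = [9, 9, 8, 10, 9, 9, 10, 8, 9, 10, 9, 9]

histogram = Counter(recent_scores)
print("distribution:", dict(sorted(histogram.items())))
print(f"mean={mean(recent_scores):.1f}  stdev={stdev(recent_scores):.2f}")

# If almost everything lands in the top (or bottom) buckets, the criteria or
# score anchors probably need tightening.
top_heavy = sum(1 for s in recent_scores if s >= 9) / len(recent_scores)
if top_heavy > 0.8:
    print("scores are saturating near the top: consider stricter anchors")
```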
Integration with Autonomous Improvement
Custom evaluators become particularly powerful when integrated with autonomous improvement:
Targeted Fixes: Your autonomous engineer uses custom evaluation results to generate improvements specific to your quality requirements. If brand voice consistency scores drop, it creates fixes that address brand voice specifically.
Domain Expertise: Custom evaluators give your autonomous engineer domain-specific knowledge about what constitutes quality in your application, leading to more relevant and effective improvements.
Business Alignment: Improvements generated by your autonomous engineer align with your business goals and quality standards rather than generic AI quality metrics.
Getting Started
Ready to create custom evaluators that understand your specific quality requirements?
Define Your Quality Dimensions: Start by identifying the 2-3 most important quality aspects specific to your AI application. What makes a response truly valuable for your users?
Create Your First Evaluator: Begin with one focused evaluator that assesses a single quality dimension clearly and specifically. Test it thoroughly before deploying.
Monitor and Iterate: Use evaluation results to understand your AI’s performance patterns and refine your evaluators based on what you learn.
Ready to build? Navigate to Evaluation Suite in your dashboard and click Create New Evaluator to start building quality assessment tailored to your specific needs.
Custom evaluators transform generic quality monitoring into precise, actionable insights that drive meaningful improvements in your AI’s performance. Start building evaluation systems that truly understand what quality means for your application.