Custom Evaluators
Define your quality standards with precision. Custom evaluators allow you to create sophisticated evaluation systems tailored to your specific business requirements, ensuring AI responses meet your exact quality criteria.
This guide covers creating evaluation systems that automatically assess AI responses against your defined standards using LLM-as-Judge methodology.
Custom evaluators use your configured model tokens to run these assessments automatically.
Evaluator Components
A custom evaluator consists of:
Evaluation Criteria
Define what constitutes quality for your specific use case
- Identify key quality dimensions
- Set clear scoring guidelines
- Define minimum acceptable thresholds
Evaluation Prompt
Instruction template that guides the judge model
- Clear role definition
- Specific evaluation criteria
- Context requirements
- Output format specifications
Scoring Framework
How to interpret and score responses consistently (see the sketch after this list)
- Numerical scoring scale (e.g., 1-10)
- Weighted criteria if needed
- Confidence scoring
Output Format
Structured format for evaluation results and insights
- JSON schema for consistency
- Required fields and data types
- Optional metadata fields
- Error handling format
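To make the scoring framework and output format concrete, here is a minimal sketch of how weighted criterion scores and a confidence value could be combined into an overall result. The criterion names mirror the example schema later in this guide; the weights and pass threshold are illustrative values, not platform defaults.

```python
# Minimal sketch: combining per-criterion scores into a weighted overall score.
# Criteria, weights, and threshold are illustrative, not platform defaults.

CRITERIA_WEIGHTS = {
    "helpfulness": 0.4,
    "accuracy": 0.3,
    "professionalism": 0.15,
    "completeness": 0.15,
}

PASS_THRESHOLD = 7.0  # example minimum acceptable overall score (1-10 scale)


def weighted_overall(scores: dict) -> float:
    """Compute a weighted average of per-criterion scores (1-10 each)."""
    return sum(scores[name] * weight for name, weight in CRITERIA_WEIGHTS.items())


def summarize(scores: dict, confidence: float) -> dict:
    """Build a structured result similar to the JSON schema described in this guide."""
    overall = weighted_overall(scores)
    return {
        "scores": {**scores, "overall": round(overall, 2)},
        "confidence": confidence,  # judge model's self-reported confidence
        "passed": overall >= PASS_THRESHOLD,
    }


if __name__ == "__main__":
    example = {"helpfulness": 8, "accuracy": 9, "professionalism": 7, "completeness": 6}
    print(summarize(example, confidence=0.85))
```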
Pro tip: Start with a simple evaluator focusing on 3-5 key criteria, then expand as you gather more data and insights about your AI’s performance.
Creating Your First Evaluator
1. Navigate to Evaluators
- Go to Evaluation Suite
- Click Create New Evaluator
- Choose between template-based and custom creation
- Consider your evaluation goals and requirements
2. Basic Configuration
Evaluator Name
Example: Customer Support Quality
- Use descriptive, purpose-specific names
- Include version number if applicable
- Consider environment prefix (dev/prod)
Model Token
- Select your configured judge model from the dropdown
- Choose based on evaluation complexity and cost requirements
- Consider model capabilities and limitations
- Factor in response time requirements
3. Create Evaluation Prompt
This is where you define your evaluation criteria and instructions for the judge model. For example:
You are evaluating customer service AI responses for quality.
Rate the response on these criteria (1-10 scale, 10 = excellent):
1. **Helpfulness**: Does the response directly address and solve the customer's issue?
2. **Accuracy**: Is all provided information correct and up-to-date?
3. **Professionalism**: Is the tone and language appropriate for business communication?
4. **Completeness**: Are all aspects of the customer's question thoroughly addressed?
**Context:**
- Customer Question: {input}
- AI Response: {output}
- Customer History: {context}
**Output Format (JSON):**
{
  "scores": {
    "helpfulness": <1-10>,
    "accuracy": <1-10>,
    "professionalism": <1-10>,
    "completeness": <1-10>,
    "overall": <average>
  },
  "reasoning": "Brief explanation of scoring decisions",
  "strengths": ["What the response did well"],
  "improvements": ["Specific areas for improvement"],
  "escalation_needed": <true/false>
}
Provide detailed, constructive feedback to help improve response quality.
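Before the judge model is called, the {input}, {output}, and {context} placeholders are filled with the actual interaction data. The exact substitution is handled by the platform; the snippet below is only a rough illustration of the idea using plain string formatting.

```python
# Illustrative only: how the evaluation prompt's placeholders might be filled.
# Placeholder names match the prompt above ({input}, {output}, {context}).

EVAL_PROMPT_TEMPLATE = """You are evaluating customer service AI responses for quality.
...
**Context:**
- Customer Question: {input}
- AI Response: {output}
- Customer History: {context}
"""


def render_prompt(user_input: str, ai_output: str, context: str) -> str:
    """Fill the template placeholders with a single interaction's data."""
    return EVAL_PROMPT_TEMPLATE.format(input=user_input, output=ai_output, context=context)


rendered = render_prompt(
    user_input="My order hasn't arrived yet.",
    ai_output="I'm sorry for the delay. Your package is out for delivery today.",
    context="Customer has contacted support once before about shipping.",
)
print(rendered)
```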
4. Save Evaluator
- Review your configuration
- Click Save to create the evaluator
Prompt Engineering Best Practices
Essential Prompt Structure
1. Clear Role Definition
You are an expert evaluator specializing in [domain].
Your goal is to assess [specific aspect] of AI responses.
Your evaluation should be:
- Objective and consistent
- Based on defined criteria
- Focused on actionable feedback
- Considerate of business context
2. Specific Evaluation Criteria
Rate the response on these dimensions:
1. **Criterion Name** (1-10): Detailed description of what this measures
2. **Second Criterion** (1-10): Clear explanation of quality indicators
3. Context Provision
**Context:**
- Original Question: {input}
- AI Response: {output}
- Additional Context: {context}
- User Profile: {user_metadata}
4. Output Format Specification
**Output Format (JSON):**
{
  "scores": { ... },
  "reasoning": "...",
  "confidence": 0.85
}
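Because the judge model returns free text, it is worth checking that its output is well-formed JSON and contains the fields your schema requires before acting on the scores. The sketch below is a hypothetical validation helper (the required field names follow the example schema above; it is not a built-in platform API).

```python
import json

REQUIRED_FIELDS = {"scores", "reasoning", "confidence"}  # mirrors the example schema above


def parse_evaluation(raw_response: str) -> dict:
    """Parse and validate a judge model's JSON output.

    Returns a structured error instead of raising, so failed evaluations
    can be logged and retried rather than breaking the pipeline.
    """
    try:
        result = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return {"error": "invalid_json", "detail": str(exc)}

    if not isinstance(result, dict):
        return {"error": "not_an_object", "detail": type(result).__name__}

    missing = REQUIRED_FIELDS - result.keys()
    if missing:
        return {"error": "missing_fields", "detail": sorted(missing)}

    return result


print(parse_evaluation('{"scores": {"overall": 8.5}, "reasoning": "...", "confidence": 0.85}'))
print(parse_evaluation("not json at all"))
```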
Domain-Specific Examples
Real-world evaluation templates that you can use today! These examples demonstrate how to create specialized evaluators for different industries, each with carefully crafted criteria and scoring frameworks. Use these as starting points and customize them for your specific needs.
These examples showcase how to adapt evaluation criteria for different domains:
- E-commerce: Focus on product recommendations, customer satisfaction, and business impact
- Healthcare: Prioritize medical accuracy, safety, and appropriate scope
- Education: Emphasize learning effectiveness, clarity, and age-appropriate content
- Technical Support: Concentrate on problem-solving accuracy and user safety
Each example includes:
- Domain-specific evaluation criteria
- Industry-standard scoring guidelines
- Practical implementation tips
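As a rough illustration of how criteria emphasis shifts between these domains, the hypothetical weights below (names and values are examples only, drawn from the focus areas listed above) show one way to encode domain-specific criteria:

```python
# Hypothetical per-domain criteria weights, illustrating how emphasis shifts
# between industries. Names and weights are examples, not prescribed values.
DOMAIN_CRITERIA = {
    "ecommerce": {"recommendation_relevance": 0.40, "customer_satisfaction": 0.35, "business_impact": 0.25},
    "healthcare": {"medical_accuracy": 0.50, "safety": 0.35, "appropriate_scope": 0.15},
    "education": {"learning_effectiveness": 0.40, "clarity": 0.35, "age_appropriateness": 0.25},
    "technical_support": {"problem_solving_accuracy": 0.60, "user_safety": 0.40},
}
```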
Next Steps
Ready to deploy your custom evaluator?
- Assign to LLM nodes for automated quality assessment
- Configure sampling percentages and quality thresholds (see the sketch below)
- Monitor evaluation results and optimize performance
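How sampling and thresholds behave is specific to your deployment; as a conceptual sketch only (not the platform's actual implementation), the snippet below evaluates a configurable fraction of responses and flags any whose overall score falls below a quality threshold.

```python
import random

SAMPLE_RATE = 0.10        # evaluate ~10% of responses (example value)
QUALITY_THRESHOLD = 7.0   # flag evaluations scoring below this (example value)


def should_evaluate() -> bool:
    """Randomly decide whether this response is sampled for evaluation."""
    return random.random() < SAMPLE_RATE


def needs_attention(overall_score: float) -> bool:
    """Flag responses whose overall score falls below the quality threshold."""
    return overall_score < QUALITY_THRESHOLD
```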
Your custom evaluator is ready! Next, learn how to assign it to your LLM nodes to start automatic quality assessment.