Custom Evaluators
Define your quality standards with precision. Custom evaluators let you create evaluation systems tailored to your specific business requirements, ensuring your AI meets the exact quality criteria that matter most to your users.
Transform generic quality assessment into domain-specific evaluation that understands your unique requirements and standards.
Custom evaluators use your configured model tokens to automatically assess AI responses against criteria you define, providing quality insights specific to your use case.
Why Custom Evaluators Matter
While Handit provides powerful general-purpose evaluators, every AI application has unique quality requirements. A medical AI needs to assess clinical accuracy, a customer service AI must evaluate empathy and resolution effectiveness, and an educational AI should focus on clarity and pedagogical value.
Generic evaluation falls short when your AI operates in specialized domains or serves specific user needs. Custom evaluators bridge this gap by incorporating your domain expertise, business requirements, and user expectations into automated quality assessment.
The result: Quality evaluation that truly reflects what matters for your application, enabling your autonomous engineer to generate improvements that align with your specific goals and standards.
Creating Effective Custom Evaluators
The key to successful custom evaluation is focus and clarity. The most effective evaluators assess one specific quality dimension with clear, unambiguous criteria.
Start with Clear Objectives
Before creating an evaluator, define exactly what you want to measure. Instead of “overall quality,” focus on specific aspects like “medical accuracy,” “brand voice consistency,” or “regulatory compliance.”
Example objectives:
- Medical AI: “Does the response provide medically accurate information without making diagnostic claims?”
- Legal AI: “Does the response comply with legal disclaimers while providing helpful guidance?”
- Educational AI: “Does the explanation use age-appropriate language and build on foundational concepts?”
Design Focused Evaluation Criteria
Each evaluator should have 3-5 specific criteria that define quality for that dimension. More than that makes evaluation inconsistent and the results harder to interpret; the sketch after the example below shows one way a focused set of criteria might be captured in code.
Example: Customer Support Empathy Evaluator
- Does the response acknowledge the customer’s frustration or concern?
- Does it use empathetic language appropriate to the situation?
- Does it avoid dismissive or robotic phrasing?
- Does it show understanding of the customer’s perspective?
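As a rough sketch only (not a Handit API), the empathy evaluator above could be captured as a small structure that keeps the objective, criteria, and scale in one place. The field names are purely illustrative.

```python
# Hypothetical structure for a focused evaluator definition.
# Field names are illustrative, not part of any Handit API.
empathy_evaluator = {
    "name": "customer_support_empathy",
    "objective": "Assess empathy and emotional intelligence in support responses",
    "criteria": [
        "Acknowledges the customer's frustration or concern",
        "Uses empathetic language appropriate to the situation",
        "Avoids dismissive or robotic phrasing",
        "Shows understanding of the customer's perspective",
    ],
    "scale": {"min": 1, "max": 10},
}
```

Keeping the criteria in a structure like this makes them easy to reuse when writing the evaluation prompt in the next section.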
Create Clear Evaluation Prompts
Your evaluation prompt is the instruction you give to the judge model (like GPT-4) to assess quality. Effective prompts are specific, provide examples, and define scoring clearly.
Structure of effective evaluation prompts:
Role Definition: “You are an expert customer service quality assessor evaluating AI responses for empathy and emotional intelligence.”
Context Provision: Include the original user question, the AI’s response, and any relevant background context the evaluator needs.
Specific Criteria: List exactly what to evaluate with clear definitions and examples of good vs. poor performance.
Scoring Instructions: Define your scoring scale (typically 1-10) with specific anchors for different score ranges.
Output Format: Specify exactly how you want results formatted, typically as structured JSON for consistency.
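Putting these five elements together, a minimal evaluation prompt and judge call might look like the sketch below. It uses the OpenAI Python client purely as an example judge; the model name, score anchors, and JSON fields are illustrative choices, not requirements of Handit.

```python
import json
from openai import OpenAI  # example judge client; any LLM client works

EVALUATION_PROMPT = """\
You are an expert customer service quality assessor evaluating AI responses
for empathy and emotional intelligence.

Original customer message:
{user_message}

AI response to evaluate:
{ai_response}

Evaluate the response against these criteria:
1. Does it acknowledge the customer's frustration or concern?
2. Does it use empathetic language appropriate to the situation?
3. Does it avoid dismissive or robotic phrasing?
4. Does it show understanding of the customer's perspective?

Scoring anchors: 1-3 dismissive or tone-deaf, 4-6 neutral but impersonal,
7-8 empathetic with minor gaps, 9-10 consistently empathetic and specific.

Return only JSON: {{"score": <1-10>, "reasoning": "<one or two sentences>"}}
"""

def evaluate_empathy(user_message: str, ai_response: str) -> dict:
    """Ask the judge model to score one response; returns the parsed JSON."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model
        temperature=0,   # deterministic judging for more consistent scores
        messages=[{
            "role": "user",
            "content": EVALUATION_PROMPT.format(
                user_message=user_message, ai_response=ai_response),
        }],
    )
    return json.loads(completion.choices[0].message.content)
```

Note how the prompt names the role, provides context, lists the criteria, anchors the scores, and pins the output format; each element maps directly to the structure described above.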
Real-World Examples
Here are examples of custom evaluators that work well in practice:
Medical Information Accuracy
Purpose: Ensure medical AI provides accurate information without overstepping into diagnosis.
Key Criteria: Factual accuracy, appropriate disclaimers, avoiding diagnostic language, recommending professional consultation when needed.
Scoring Focus: Accuracy of medical facts, appropriateness of advice level, presence of necessary disclaimers.
Brand Voice Consistency
Purpose: Maintain a consistent brand personality across all AI interactions.
Key Criteria: Tone alignment with brand guidelines, appropriate formality level, consistent terminology usage, brand value reflection.
Scoring Focus: Alignment with the brand voice guide, consistency with previous interactions, appropriate personality expression.
Regulatory Compliance
Purpose: Ensure AI responses meet industry-specific regulatory requirements.
Key Criteria: Required disclosures present, prohibited claims avoided, appropriate risk warnings, compliance with industry standards.
Scoring Focus: Presence of required elements, absence of prohibited content, overall compliance risk level.
Implementation Best Practices
Start Simple: Begin with one clear quality dimension and 3-4 specific criteria. You can always add complexity later as you understand what works.
Test Thoroughly: Before deploying custom evaluators, test them on a variety of your AI’s responses to ensure they provide consistent, meaningful scores.
Iterate Based on Data: Review evaluation results regularly and refine your criteria based on what you learn about your AI’s performance patterns.
Maintain Consistency: Use clear, specific language in your evaluation prompts to ensure consistent scoring across different interactions.
Avoid Common Pitfalls: Don’t try to evaluate too many things at once, don’t use vague criteria like “good quality,” and don’t create evaluators without testing them on real examples first.
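As a hedged illustration of the "Test Thoroughly" advice, you might run a new evaluator over a handful of representative responses and compare the scores before trusting it. `evaluate_empathy` here is the illustrative judge function sketched earlier, not a built-in.

```python
# Quick consistency check before deploying a new evaluator.
# `evaluate_empathy` is the illustrative judge function sketched earlier.
samples = [
    ("My order is two weeks late and nobody replies!",
     "I'm sorry about the delay and the silence on our side; that's frustrating. "
     "I've escalated your order and will follow up personally by tomorrow."),
    ("My order is two weeks late and nobody replies!",
     "Please check the tracking page for updates."),
]

for user_message, ai_response in samples:
    result = evaluate_empathy(user_message, ai_response)
    print(f"score={result['score']:>2}  reasoning={result['reasoning']}")

# Re-running the same samples a few times and comparing scores is a cheap way
# to spot criteria that are too vague to score consistently.
```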
Advanced Configuration
For sophisticated evaluation needs, custom evaluators support advanced features:
Weighted Scoring: If some criteria are more important than others, you can weight them accordingly in your evaluation logic.
Conditional Evaluation: Create evaluators that apply different criteria based on the type of interaction or user context.
Multi-Model Evaluation: Use different judge models for different types of evaluation—GPT-4 for complex reasoning, specialized models for domain-specific assessment.
Confidence Scoring: Include confidence levels in evaluation results to identify cases where the evaluator is uncertain about quality assessment.
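For example, weighted scoring and confidence reporting could be combined as in the sketch below; the per-criterion scores, weights, and the 0.7 review threshold are illustrative assumptions rather than built-in behavior.

```python
# Illustrative only: combine per-criterion scores into a weighted total and
# flag low-confidence judgments for human review.
criterion_scores = {"acknowledges_concern": 9, "empathetic_language": 7,
                    "avoids_dismissive_phrasing": 8, "shows_understanding": 6}
weights = {"acknowledges_concern": 0.4, "empathetic_language": 0.3,
           "avoids_dismissive_phrasing": 0.2, "shows_understanding": 0.1}
judge_confidence = 0.62  # hypothetical confidence reported by the judge prompt

weighted_score = sum(score * weights[name] for name, score in criterion_scores.items())
print(f"weighted empathy score: {weighted_score:.1f}/10")

if judge_confidence < 0.7:  # threshold is an arbitrary example
    print("low judge confidence: route this evaluation for human review")
```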
Monitoring Evaluator Performance
Track how well your custom evaluators work in practice:
Score Distribution: Monitor the range and distribution of scores your evaluators produce. If everything scores very high or very low, you might need to adjust criteria or scoring scales.
Correlation with Business Metrics: Check whether evaluator scores correlate with business outcomes like user satisfaction, conversion rates, or support ticket volume.
Evaluator Agreement: If you have multiple evaluators for similar quality dimensions, check whether they provide consistent insights about your AI’s performance.
Edge Case Handling: Review how evaluators perform on unusual or challenging interactions to ensure they provide meaningful assessment across all scenarios.
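A rough sketch of the score-distribution check: export recent scores for one evaluator (however you store them) and look for saturation at either end of the scale.

```python
from collections import Counter
from statistics import mean, stdev

# Stand-in data: scores exported for one evaluator; how you fetch them
# depends on your own storage.
recent_scores = [9, 9, 8, 10, 9, 9, 10, 8, 9, 10, 9, 9]

histogram = Counter(recent_scores)
print("distribution:", dict(sorted(histogram.items())))
print(f"mean={mean(recent_scores):.1f}  stdev={stdev(recent_scores):.2f}")

# If almost everything lands in the top (or bottom) buckets, the criteria or
# score anchors probably need tightening.
top_heavy = sum(1 for s in recent_scores if s >= 9) / len(recent_scores)
if top_heavy > 0.8:
    print("scores are saturating near the top: consider stricter anchors")
```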
Integration with Autonomous Improvement
Custom evaluators become particularly powerful when integrated with autonomous improvement:
Targeted Fixes: Your autonomous engineer uses custom evaluation results to generate improvements specific to your quality requirements. If brand voice consistency scores drop, it creates fixes that address brand voice specifically.
Domain Expertise: Custom evaluators give your autonomous engineer domain-specific knowledge about what constitutes quality in your application, leading to more relevant and effective improvements.
Business Alignment: Improvements generated by your autonomous engineer align with your business goals and quality standards rather than generic AI quality metrics.
Getting Started
Ready to create custom evaluators that understand your specific quality requirements?
Define Your Quality Dimensions: Start by identifying the 2-3 most important quality aspects specific to your AI application. What makes a response truly valuable for your users?
Create Your First Evaluator: Begin with one focused evaluator that assesses a single quality dimension clearly and specifically. Test it thoroughly before deploying.
Monitor and Iterate: Use evaluation results to understand your AI’s performance patterns and refine your evaluators based on what you learn.
Ready to build? Navigate to Evaluation Suite in your dashboard and click Create New Evaluator to start building quality assessment tailored to your specific needs.
Custom evaluators transform generic quality monitoring into precise, actionable insights that drive meaningful improvements in your AI’s performance. Start building evaluation systems that truly understand what quality means for your application.