Custom Evaluators

Define your quality standards with precision. Custom evaluators allow you to create sophisticated evaluation systems tailored to your specific business requirements, ensuring AI responses meet your exact quality criteria.

This guide covers creating evaluation systems that automatically assess AI responses against your defined standards using LLM-as-Judge methodology.

Custom evaluators use your configured model tokens to automatically assess AI responses against your defined criteria.

Evaluator Components

A custom evaluator consists of:

Evaluation Criteria

Define what constitutes quality for your specific use case (see the sketch after this list)

  • Identify key quality dimensions
  • Set clear scoring guidelines
  • Define minimum acceptable thresholds
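To make these criteria concrete before writing the prompt, it can help to capture them in a small configuration structure in your own code. The sketch below is illustrative only; the criterion names, descriptions, and thresholds are placeholders, not part of handit.ai's API.

# Illustrative only: one way to capture evaluation criteria before writing the prompt.
criteria = {
    "helpfulness": {
        "description": "Does the response directly address the user's issue?",
        "min_acceptable": 7,   # minimum acceptable threshold on a 1-10 scale
    },
    "accuracy": {
        "description": "Is all provided information correct and up-to-date?",
        "min_acceptable": 8,
    },
    "professionalism": {
        "description": "Is the tone appropriate for business communication?",
        "min_acceptable": 6,
    },
}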

Evaluation Prompt

Instruction template that guides the judge model

  • Clear role definition
  • Specific evaluation criteria
  • Context requirements
  • Output format specifications

Scoring Framework

How to interpret and score responses consistently (a weighted-scoring sketch follows this list)

  • Numerical scoring scale (e.g., 1-10)
  • Weighted criteria if needed
  • Confidence scoring
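If some criteria matter more than others, the overall score is simply a weighted average of the per-criterion scores. A minimal sketch, assuming scores on a 1-10 scale; the criterion names and weights are example values:

# Illustrative weighted-scoring helper; criterion names and weights are example values.
def overall_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-criterion scores (1-10) into a weighted overall score."""
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return round(weighted_sum / total_weight, 2)

scores = {"helpfulness": 8, "accuracy": 9, "professionalism": 7, "completeness": 8}
weights = {"accuracy": 2.0}  # weight accuracy twice as heavily as the other criteria
print(overall_score(scores, weights))  # 8.2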

Output Format

Structured format for evaluation results and insights (an example schema follows this list)

  • JSON schema for consistency
  • Required fields and data types
  • Optional metadata fields
  • Error handling format
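A JSON schema makes the judge's output machine-checkable. The sketch below mirrors the fields used in the example prompt later in this guide; the field names are illustrative, not mandated by handit.ai. Expressed as a Python dict, it can be checked with a validator such as the jsonschema package.

# Illustrative JSON schema for evaluator output; field names mirror the example prompt below.
EVALUATION_SCHEMA = {
    "type": "object",
    "required": ["scores", "reasoning"],          # required fields
    "properties": {
        "scores": {
            "type": "object",
            "additionalProperties": {"type": "number", "minimum": 1, "maximum": 10},
        },
        "reasoning": {"type": "string"},
        "strengths": {"type": "array", "items": {"type": "string"}},      # optional metadata
        "improvements": {"type": "array", "items": {"type": "string"}},   # optional metadata
        "error": {"type": "string"},   # present only when the judge could not evaluate
    },
}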

Pro tip: Start with a simple evaluator focusing on 3-5 key criteria, then expand as you gather more data and insights about your AI’s performance.

Creating Your First Evaluator

1. Navigate to Evaluators

  • Go to Evaluation Suite
  • Click Create New Evaluator
  • Choose between template-based or custom creation
  • Consider your evaluation goals and requirements

2. Basic Configuration

Evaluator Name

Example: Customer Support Quality
  • Use descriptive, purpose-specific names
  • Include version number if applicable
  • Consider environment prefix (dev/prod)

Model Token

  • Select your configured judge model from the dropdown
  • Choose based on evaluation complexity and cost requirements
  • Consider model capabilities and limitations
  • Factor in response time requirements

3. Create Evaluation Prompt

This is where you define your evaluation criteria and instructions for the judge model. For example:

You are evaluating customer service AI responses for quality.

Rate the response on these criteria (1-10 scale, 10 = excellent):

1. **Helpfulness**: Does the response directly address and solve the customer's issue?
2. **Accuracy**: Is all provided information correct and up-to-date?
3. **Professionalism**: Is the tone and language appropriate for business communication?
4. **Completeness**: Are all aspects of the customer's question thoroughly addressed?

**Context:**
- Customer Question: {input}
- AI Response: {output}
- Customer History: {context}

**Output Format (JSON):**
{
  "scores": {
    "helpfulness": <1-10>,
    "accuracy": <1-10>,
    "professionalism": <1-10>,
    "completeness": <1-10>,
    "overall": <average>
  },
  "reasoning": "Brief explanation of scoring decisions",
  "strengths": ["What the response did well"],
  "improvements": ["Specific areas for improvement"],
  "escalation_needed": <true/false>
}

Provide detailed, constructive feedback to help improve response quality.
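At run time, the {input}, {output}, and {context} placeholders are filled with the actual interaction before the prompt is sent to the judge model. The sketch below is hypothetical: EVALUATION_PROMPT stands in for the prompt text above (truncated here), and call_judge_model is a stub for whichever client you use to reach your configured judge model; neither name is part of handit.ai.

import json

# Hypothetical end-to-end sketch of running the evaluator prompt above.
EVALUATION_PROMPT = (
    "You are evaluating customer service AI responses for quality.\n"
    "...\n"  # the criteria and output-format sections from the example above
    "- Customer Question: {input}\n"
    "- AI Response: {output}\n"
    "- Customer History: {context}\n"
)

def call_judge_model(prompt: str) -> str:
    # Stub: replace with a real call to your judge model.
    return '{"scores": {"overall": 8.5}, "escalation_needed": false}'

def evaluate_response(question: str, answer: str, history: str) -> dict:
    filled = EVALUATION_PROMPT.format(input=question, output=answer, context=history)
    return json.loads(call_judge_model(filled))

result = evaluate_response(
    question="Where is my order?",
    answer="Your order shipped yesterday and should arrive within 2-3 business days.",
    history="First contact about this order.",
)
print(result["scores"]["overall"], result["escalation_needed"])

Note that if you fill the template with Python's str.format as above, any literal braces in the prompt's JSON output section must be escaped by doubling them ({{ and }}), or filled with a different templating mechanism.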

4. Save Evaluator

  • Review your configuration
  • Click Save to create the evaluator

Prompt Engineering Best Practices

Essential Prompt Structure

1. Clear Role Definition

You are an expert evaluator specializing in [domain]. Your goal is to assess [specific aspect] of AI responses.

Your evaluation should be:
- Objective and consistent
- Based on defined criteria
- Focused on actionable feedback
- Considerate of business context

2. Specific Evaluation Criteria

Rate the response on these dimensions:

1. **Criterion Name** (1-10): Detailed description of what this measures
2. **Second Criterion** (1-10): Clear explanation of quality indicators
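If you maintain your criteria in code (as in the earlier configuration sketch), you can render this numbered section automatically so the prompt and your criteria never drift apart. A small illustrative helper, not part of handit.ai:

def render_criteria(criteria: dict[str, str]) -> str:
    """Render criterion names and descriptions into the numbered prompt format above."""
    lines = ["Rate the response on these dimensions:", ""]
    for i, (name, description) in enumerate(criteria.items(), start=1):
        lines.append(f"{i}. **{name.title()}** (1-10): {description}")
    return "\n".join(lines)

print(render_criteria({
    "helpfulness": "Does the response directly address the user's issue?",
    "accuracy": "Is all provided information correct and up-to-date?",
}))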

3. Context Provision

**Context:**
- Original Question: {input}
- AI Response: {output}
- Additional Context: {context}
- User Profile: {user_metadata}
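Not every trace will have every context field (for example, {user_metadata} may be empty for anonymous users). A minimal sketch of filling the placeholders safely; the SafeDict fallback and the example values are assumptions for illustration, not handit.ai behavior.

class SafeDict(dict):
    def __missing__(self, key):
        return "(not provided)"   # fall back gracefully when an optional field is absent

template = (
    "**Context:**\n"
    "- Original Question: {input}\n"
    "- AI Response: {output}\n"
    "- Additional Context: {context}\n"
    "- User Profile: {user_metadata}\n"
)

prompt_section = template.format_map(SafeDict(
    input="How do I reset my password?",
    output="Click 'Forgot password' on the login page and follow the emailed link.",
    context="Customer has contacted support twice this week.",
))
print(prompt_section)  # the missing {user_metadata} is rendered as "(not provided)"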

4. Output Format Specification

**Output Format (JSON):**
{
  "scores": { ... },
  "reasoning": "...",
  "confidence": 0.85
}
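Because the judge model returns free-form text, it helps to parse and sanity-check the reply before trusting it. A sketch under assumed conventions; the 0.7 confidence cutoff is an arbitrary example, not a handit.ai default.

import json

def parse_evaluation(raw_reply: str, min_confidence: float = 0.7) -> dict | None:
    """Parse the judge's JSON reply, discarding malformed or low-confidence results."""
    try:
        result = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None                    # malformed output: treat as a failed evaluation
    if result.get("confidence", 1.0) < min_confidence:
        return None                    # judge was unsure: discard or route for human review
    return result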

Domain-Specific Examples

Real-world evaluation templates that you can use today! These examples demonstrate how to create specialized evaluators for different industries, each with carefully crafted criteria and scoring frameworks. Use these as starting points and customize them for your specific needs.

These examples showcase how to adapt evaluation criteria for different domains:

  • E-commerce: Focus on product recommendations, customer satisfaction, and business impact
  • Healthcare: Prioritize medical accuracy, safety, and appropriate scope
  • Education: Emphasize learning effectiveness, clarity, and age-appropriate content
  • Technical Support: Concentrate on problem-solving accuracy and user safety

Each example includes:

  • Domain-specific evaluation criteria
  • Industry-standard scoring guidelines
  • Practical implementation tips

Next Steps

Ready to deploy your custom evaluator?

Your custom evaluator is ready! Next, learn how to assign it to your LLM nodes to start automatic quality assessment.
