Evaluation Quickstart
Get automated AI evaluation running in under 10 minutes. This guide walks you through setting up comprehensive quality assessment for your AI models using the Handit.ai platform.
Prerequisites: You need an active Handit.ai account with tracing configured. If you haven’t set up tracing yet, start with our Tracing Quickstart.
What You’ll Accomplish
Connect Model Tokens
Add your evaluation models (GPT-4, Llama, etc.) to the platform
Create Single-Purpose Evaluators
Design focused evaluators that check ONE specific quality aspect
Associate Evaluators with LLM Nodes
Connect your evaluators to the AI functions you want to assess
Monitor Results
View evaluation results and quality insights in real-time
Critical Best Practice: Create a separate evaluator for each quality aspect (completeness, accuracy, format, etc.). Don't try to evaluate multiple aspects in one prompt; combined checks produce ambiguous scores that are harder to interpret and act on.
Step 1: Connect Model Tokens
Connect the AI models that will act as “judges” to evaluate your LLM responses.
Navigate to Model Tokens
- Go to your Handit.ai dashboard
- Click Settings → Model Tokens
- Click Add New Token
Add your OpenAI or Together AI credentials to connect evaluation models.
Recommended Models: GPT-4o (highest accuracy), GPT-3.5-turbo (cost-effective), Llama 4 Scout (open-source alternative)
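Before you paste a key into the dashboard, it can be worth sanity-checking it locally. A minimal sketch, assuming an OpenAI key and the official openai Python client (this step is optional and not part of any Handit.ai SDK):

```python
# Quick sanity check that a key is valid before adding it as a model token.
# Assumes the official `openai` Python client (>= 1.0) with OPENAI_API_KEY set
# in your environment; Handit.ai itself does not require this step.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
print([model.id for model in client.models.list()][:5])  # a valid key can list models
```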
Step 2: Create Single-Purpose Evaluators
Create focused evaluators in the Evaluation Suite. Remember: one evaluator = one quality aspect.
Navigate to Evaluation Suite
- Go to Evaluation → Evaluation Suite
- Click Create New Evaluator
Example 1: Completeness Evaluator
Evaluation Prompt:
You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Missing major parts of the question
3-4 = Addresses some parts but incomplete
5-6 = Addresses most parts adequately
7-8 = Addresses all parts well
9-10 = Thoroughly addresses every aspect
Provide your score and brief reasoning.
Output format:
Score: [1-10]
Reasoning: [Brief explanation]
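Under the hood, an evaluator like this is an LLM-as-judge call: the prompt template is filled with the traced input and output and sent to your connected judge model. Handit.ai runs this for you on the platform side; the sketch below only illustrates the mechanic, assuming the official openai Python client and a GPT-4o judge.

```python
# Illustration only: Handit.ai executes evaluators server-side once you
# associate them with a node. This shows what a completeness judgement
# boils down to, using the official `openai` client (>= 1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COMPLETENESS_PROMPT = """You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.

User Question: {input}
AI Response: {output}

Provide your score and brief reasoning in the format:
Score: [1-10]
Reasoning: [Brief explanation]"""

def evaluate_completeness(user_question: str, ai_response: str) -> str:
    """Send the filled-in evaluation prompt to a judge model and return its raw reply."""
    prompt = COMPLETENESS_PROMPT.format(input=user_question, output=ai_response)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content

print(evaluate_completeness("What are your support hours?", "We are open 9-5 on weekdays."))
```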
Example 2: Hallucination Detection Evaluator
Evaluation Prompt:
You are checking if an AI response contains hallucinations or made-up information.
Focus ONLY on factual accuracy - ignore other aspects like tone or completeness.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Contains obvious false information
3-4 = Contains questionable claims
5-6 = Mostly accurate with minor concerns
7-8 = Accurate information
9-10 = Completely accurate and verifiable
Output format:
Score: [1-10]
Reasoning: [Brief explanation of any concerns]
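The fixed Score/Reasoning output format keeps judge replies machine-readable. As an illustration (not the platform's internal parser), a reply in this format can be split apart like so:

```python
# Illustrative parser for the "Score: / Reasoning:" format used above.
import re

def parse_judgement(reply: str) -> tuple[int, str]:
    """Extract the numeric score and the reasoning text from a judge reply."""
    score_match = re.search(r"Score:\s*(\d+)", reply)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", reply, re.DOTALL)
    if not score_match:
        raise ValueError(f"No score found in reply: {reply!r}")
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return int(score_match.group(1)), reasoning

score, reasoning = parse_judgement("Score: 8\nReasoning: Accurate overall; one claim is unverifiable.")
print(score, reasoning)  # 8 Accurate overall; one claim is unverifiable.
```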
Example 3: Format Compliance Evaluator
Evaluation Prompt:
You are checking if an AI response follows the required format guidelines.
Focus ONLY on format compliance - ignore content quality.
User Question: {input}
AI Response: {output}
Expected Format: [Professional tone, structured paragraphs, proper punctuation]
Rate on a scale of 1-10:
1-2 = Poor formatting, unprofessional
3-4 = Some format issues
5-6 = Acceptable formatting
7-8 = Good formatting
9-10 = Perfect format compliance
Output format:
Score: [1-10]
Reasoning: [Brief format assessment]
Why Separate Evaluators? Each evaluator focuses on one aspect, making evaluation more reliable and providing clear insights. You can see exactly which quality dimensions are performing well or need attention.
Step 3: Associate Evaluators with LLM Nodes
Connect your evaluators to the specific LLM nodes you want to monitor.
Find Your LLM Nodes
- Go to Tracing → Nodes
- Identify the LLM nodes you want to evaluate (e.g., “customer-response-generation”)
Associate Evaluators
- Click on your target LLM node
- Navigate to Evaluation tab
- Click Add Evaluator
Configure Evaluation Settings
For each evaluator you want to associate:
Evaluator: "Response Completeness Check"
Evaluation Percentage: 10% (start small)
Priority: Normal
Evaluator: "Hallucination Detection"
Evaluation Percentage: 15% (higher for critical accuracy)
Priority: High
Evaluator: "Format Compliance"
Evaluation Percentage: 5% (lower for less critical aspects)
Priority: Normal
Sampling Strategy: Start with an evaluation percentage of 5-15%. You can always increase it later. Use higher percentages for critical quality aspects.
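Conceptually, the evaluation percentage is simple random sampling over traced calls. Handit.ai applies it on the platform side; the sketch below only illustrates what a 10% setting means:

```python
# Conceptual illustration of a 10% evaluation percentage; Handit.ai performs
# this sampling for you once an evaluator is associated with a node.
import random

EVALUATION_PERCENTAGE = 0.10  # 10% of traced calls get evaluated

def should_evaluate() -> bool:
    """Return True for roughly one in ten calls."""
    return random.random() < EVALUATION_PERCENTAGE
```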
Save Configuration
- Review your evaluator associations
- Click Save to activate evaluation
Step 4: Monitor Evaluation Results
Your evaluation system is now active! Here’s how to monitor the results.
Real-Time Monitoring
- Go to Tracing tab in your dashboard
- View evaluation scores as they happen
- Monitor quality trends across different aspects
Quality Analytics
- Access Agent Performance tab in your dashboard
- View trends for each agent with their associated evaluators
- Understand which quality aspects are performing well or need attention for each agent
Sample Result Display
Additional Evaluator Examples
Customer Service Evaluators
Empathy Evaluator:
Focus: Does the response show understanding and care for the customer's situation?
Scale: 1-10 (1-2=cold/robotic, 9-10=highly empathetic)
Solution Clarity Evaluator:
Focus: Are the solution steps clear and easy to follow?
Scale: 1-10 (1-2=confusing steps, 9-10=crystal clear instructions)
Policy Compliance Evaluator:
Focus: Does the response follow company policies and guidelines?
Scale: 1-10 (1-2=policy violations, 9-10=perfect compliance)
Technical Support Evaluators
Technical Accuracy Evaluator:
Focus: Is the technical information provided correct?
Scale: 1-10 (1-2=incorrect technical details, 9-10=completely accurate)
Safety Check Evaluator:
Focus: Are the suggested actions safe and unlikely to cause system damage?
Scale: 1-10 (1-2=potentially dangerous, 9-10=completely safe)
Troubleshooting Flow Evaluator:
Focus: Does the response follow logical troubleshooting steps?
Scale: 1-10 (1-2=illogical flow, 9-10=perfect troubleshooting sequence)
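Each of these shorthand specs expands into a full prompt following the Step 2 template. As an illustration (the intermediate scale descriptions here are only a suggestion), the Empathy Evaluator could be written as:

You are evaluating whether an AI response shows understanding and care for the customer's situation.
Focus ONLY on empathy - ignore other quality aspects.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Cold or robotic
3-4 = Minimal acknowledgement of the customer's situation
5-6 = Some empathy, but generic
7-8 = Clear, genuine empathy
9-10 = Highly empathetic and personalized
Output format:
Score: [1-10]
Reasoning: [Brief explanation]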
Best Practices
✅ Do’s
- Create one evaluator per quality aspect
- Use clear, specific evaluation criteria
- Start with lower evaluation percentages
- Focus on business-critical quality dimensions
- Test evaluators before deploying
❌ Don’ts
- Combine multiple quality checks in one evaluator
- Use vague evaluation criteria
- Evaluate 100% of traffic initially
- Ignore evaluation results instead of acting on them
- Create overly complex evaluation prompts
Next Steps
- Set up Optimization to automatically improve your prompts
- Explore Advanced Evaluation Features
- Configure Custom Evaluators
- Visit GitHub Issues for assistance
Congratulations! You now have automated AI evaluation running on your production data. Each evaluator will help you understand specific quality aspects. This evaluation data provides the foundation for AI optimization (covered in our Optimization guides).
Troubleshooting
Evaluations Not Running?
- Check your model tokens are valid and have sufficient credits
- Verify the LLM node is actively receiving traffic
- Ensure evaluation percentage is > 0%
- Confirm evaluators are properly associated
Inconsistent Evaluation Scores?
- Review evaluator prompts for clarity and specificity
- Consider if evaluation criteria are too subjective
- Test evaluators with known good/bad examples (see the sketch below)
- Focus on single quality aspects per evaluator
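A quick calibration check is to run an evaluator against one answer you know is complete and one you know is not, and confirm the scores clearly separate. A minimal sketch, reusing the hypothetical evaluate_completeness and parse_judgement helpers from the Step 2 sketches (not part of any Handit.ai SDK):

```python
# Calibration check: a well-behaved evaluator should clearly separate a known
# good answer from a known bad one. Builds on the Step 2 sketches above.
KNOWN_GOOD = ("What are your support hours?",
              "Our support team is available Monday-Friday, 9am-5pm EST, and by email on weekends.")
KNOWN_BAD = ("What are your support hours?", "Thanks for reaching out!")

good_score, _ = parse_judgement(evaluate_completeness(*KNOWN_GOOD))
bad_score, _ = parse_judgement(evaluate_completeness(*KNOWN_BAD))

assert good_score > bad_score, "Evaluator cannot separate complete from incomplete answers"
print(f"good={good_score}, bad={bad_score}")
```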
Need Help?
- Check our detailed evaluation guides
- Visit Support for assistance
- Join our Discord community