Evaluation Quickstart
Get automated AI evaluation running in under 10 minutes. This guide walks you through setting up comprehensive quality assessment for your AI models using the Handit.ai platform.
Prerequisites: You need an active Handit.ai account with tracing configured. If you haven’t set up tracing yet, start with our Tracing Quickstart.
What You’ll Accomplish
Connect Model Tokens
Add your evaluation models (GPT-4, Llama, etc.) to the platform
Create Single-Purpose Evaluators
Design focused evaluators that check ONE specific quality aspect
Associate Evaluators with LLM Nodes
Connect your evaluators to the AI functions you want to assess
Monitor Results
View evaluation results and quality insights in real-time
Critical Best Practice: Create a separate evaluator for each quality aspect (completeness, accuracy, format, etc.). Don't try to evaluate multiple aspects in one prompt; combined checks produce ambiguous scores that are harder to interpret and act on.
Step 1: Connect Model Tokens
Connect the AI models that will act as “judges” to evaluate your LLM responses.
Navigate to Model Tokens
- Go to your Handit.ai dashboard
- Click Settings → Model Tokens
- Click Add New Token
Add your OpenAI or Together AI credentials to connect evaluation models.
Recommended Models: GPT-4o (highest accuracy), GPT-3.5-turbo (cost-effective), Llama 4 Scout (open-source alternative)
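Before you paste a key into the dashboard, it can be worth sanity-checking it locally. A minimal sketch, assuming an OpenAI key and the official openai Python client (this step is optional and not part of any Handit.ai SDK):

```python
# Quick sanity check that a key is valid before adding it as a model token.
# Assumes the official `openai` Python client (>= 1.0) with OPENAI_API_KEY set
# in your environment; Handit.ai itself does not require this step.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
print([model.id for model in client.models.list()][:5])  # a valid key can list models
```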
Step 2: Create Single-Purpose Evaluators
Create focused evaluators in the Evaluation Suite. Remember: one evaluator = one quality aspect.
Navigate to Evaluation Suite
- Go to Evaluation → Evaluation Suite
- Click Create New Evaluator
Example 1: Completeness Evaluator
Evaluation Prompt:
You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Missing major parts of the question
3-4 = Addresses some parts but incomplete
5-6 = Addresses most parts adequately
7-8 = Addresses all parts well
9-10 = Thoroughly addresses every aspect
Provide your score and brief reasoning.
Output format:
Score: [1-10]
Reasoning: [Brief explanation]
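Under the hood, an evaluator like this is an LLM-as-judge call: the prompt template is filled with the traced input and output and sent to your connected judge model. Handit.ai runs this for you on the platform side; the sketch below only illustrates the mechanic, assuming the official openai Python client and a GPT-4o judge.

```python
# Illustration only: Handit.ai executes evaluators server-side once you
# associate them with a node. This shows what a completeness judgement
# boils down to, using the official `openai` client (>= 1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COMPLETENESS_PROMPT = """You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.

User Question: {input}
AI Response: {output}

Provide your score and brief reasoning in the format:
Score: [1-10]
Reasoning: [Brief explanation]"""

def evaluate_completeness(user_question: str, ai_response: str) -> str:
    """Send the filled-in evaluation prompt to a judge model and return its raw reply."""
    prompt = COMPLETENESS_PROMPT.format(input=user_question, output=ai_response)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content

print(evaluate_completeness("What are your support hours?", "We are open 9-5 on weekdays."))
```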
Example 2: Hallucination Detection Evaluator
Evaluation Prompt:
You are checking if an AI response contains hallucinations or made-up information.
Focus ONLY on factual accuracy - ignore other aspects like tone or completeness.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Contains obvious false information
3-4 = Contains questionable claims
5-6 = Mostly accurate with minor concerns
7-8 = Accurate information
9-10 = Completely accurate and verifiable
Output format:
Score: [1-10]
Reasoning: [Brief explanation of any concerns]
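The fixed Score/Reasoning output format keeps judge replies machine-readable. As an illustration (not the platform's internal parser), a reply in this format can be split apart like so:

```python
# Illustrative parser for the "Score: / Reasoning:" format used above.
import re

def parse_judgement(reply: str) -> tuple[int, str]:
    """Extract the numeric score and the reasoning text from a judge reply."""
    score_match = re.search(r"Score:\s*(\d+)", reply)
    reasoning_match = re.search(r"Reasoning:\s*(.+)", reply, re.DOTALL)
    if not score_match:
        raise ValueError(f"No score found in reply: {reply!r}")
    reasoning = reasoning_match.group(1).strip() if reasoning_match else ""
    return int(score_match.group(1)), reasoning

score, reasoning = parse_judgement("Score: 8\nReasoning: Accurate overall; one claim is unverifiable.")
print(score, reasoning)  # 8 Accurate overall; one claim is unverifiable.
```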
Example 3: Format Compliance Evaluator
Evaluation Prompt:
You are checking if an AI response follows the required format guidelines.
Focus ONLY on format compliance - ignore content quality.
User Question: {input}
AI Response: {output}
Expected Format: [Professional tone, structured paragraphs, proper punctuation]
Rate on a scale of 1-10:
1-2 = Poor formatting, unprofessional
3-4 = Some format issues
5-6 = Acceptable formatting
7-8 = Good formatting
9-10 = Perfect format compliance
Output format:
Score: [1-10]
Reasoning: [Brief format assessment]
Why Separate Evaluators? Each evaluator focuses on one aspect, making evaluation more reliable and providing clear insights. You can see exactly which quality dimensions are performing well or need attention.
Step 3: Associate Evaluators with LLM Nodes
Connect your evaluators to the specific LLM nodes you want to monitor.
Find Your LLM Nodes
- Go to Tracing → Nodes
- Identify the LLM nodes you want to evaluate (e.g., “customer-response-generation”)
Associate Evaluators
- Click on your target LLM node
- Navigate to Evaluation tab
- Click Add Evaluator
Configure Evaluation Settings
For each evaluator you want to associate:
Evaluator: "Response Completeness Check"
Evaluation Percentage: 10% (start small)
Priority: Normal
Evaluator: "Hallucination Detection"
Evaluation Percentage: 15% (higher for critical accuracy)
Priority: High
Evaluator: "Format Compliance"
Evaluation Percentage: 5% (lower for less critical aspects)
Priority: Normal
Sampling Strategy: Start with an evaluation percentage of 5-15%. You can always increase it later. Use higher percentages for critical quality aspects.
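Conceptually, the evaluation percentage is simple random sampling over traced calls. Handit.ai applies it on the platform side; the sketch below only illustrates what a 10% setting means:

```python
# Conceptual illustration of a 10% evaluation percentage; Handit.ai performs
# this sampling for you once an evaluator is associated with a node.
import random

EVALUATION_PERCENTAGE = 0.10  # 10% of traced calls get evaluated

def should_evaluate() -> bool:
    """Return True for roughly one in ten calls."""
    return random.random() < EVALUATION_PERCENTAGE
```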
Save Configuration
- Review your evaluator associations
- Click Save to activate evaluation
Step 4: Monitor Evaluation Results
Your evaluation system is now active! Here’s how to monitor the results.
Real-Time Monitoring
- Go to Tracing tab in your dashboard
- View evaluation scores as they happen
- Monitor quality trends across different aspects
Quality Analytics
- Access Agent Performance tab in your dashboard
- View trends for each agent with their associated evaluators
- Understand which quality aspects are performing well or need attention for each agent
Sample Result Display
Additional Evaluator Examples
Customer Service Evaluators
Empathy Evaluator:
Focus: Does the response show understanding and care for the customer's situation?
Scale: 1-10 (1-2=cold/robotic, 9-10=highly empathetic)
Solution Clarity Evaluator:
Focus: Are the solution steps clear and easy to follow?
Scale: 1-10 (1-2=confusing steps, 9-10=crystal clear instructions)
Policy Compliance Evaluator:
Focus: Does the response follow company policies and guidelines?
Scale: 1-10 (1-2=policy violations, 9-10=perfect compliance)
Technical Support Evaluators
Technical Accuracy Evaluator:
Focus: Is the technical information provided correct?
Scale: 1-10 (1-2=incorrect technical details, 9-10=completely accurate)
Safety Check Evaluator:
Focus: Are the suggested actions safe and unlikely to cause system damage?
Scale: 1-10 (1-2=potentially dangerous, 9-10=completely safe)
Troubleshooting Flow Evaluator:
Focus: Does the response follow logical troubleshooting steps?
Scale: 1-10 (1-2=illogical flow, 9-10=perfect troubleshooting sequence)
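Each of these shorthand specs expands into a full prompt following the Step 2 template. As an illustration (the intermediate scale descriptions here are only a suggestion), the Empathy Evaluator could be written as:

You are evaluating whether an AI response shows understanding and care for the customer's situation.
Focus ONLY on empathy - ignore other quality aspects.
User Question: {input}
AI Response: {output}
Rate on a scale of 1-10:
1-2 = Cold or robotic
3-4 = Minimal acknowledgement of the customer's situation
5-6 = Some empathy, but generic
7-8 = Clear, genuine empathy
9-10 = Highly empathetic and personalized
Output format:
Score: [1-10]
Reasoning: [Brief explanation]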
Best Practices
✅ Do’s
- Create one evaluator per quality aspect
- Use clear, specific evaluation criteria
- Start with lower evaluation percentages
- Focus on business-critical quality dimensions
- Test evaluators before deploying
❌ Don’ts
- Combine multiple quality checks in one evaluator
- Use vague evaluation criteria
- Evaluate 100% of traffic initially
- Ignore evaluation results instead of acting on them
- Create overly complex evaluation prompts
Next Steps
- Set up Optimization to automatically improve your prompts
- Explore Advanced Evaluation Features
- Configure Custom Evaluators
- Visit GitHub Issues for assistance
Congratulations! You now have automated AI evaluation running on your production data. Each evaluator will help you understand specific quality aspects. This evaluation data provides the foundation for AI optimization (covered in our Optimization guides).
Troubleshooting
Evaluations Not Running?
- Check your model tokens are valid and have sufficient credits
- Verify the LLM node is actively receiving traffic
- Ensure evaluation percentage is > 0%
- Confirm evaluators are properly associated
Inconsistent Evaluation Scores?
- Review evaluator prompts for clarity and specificity
- Consider if evaluation criteria are too subjective
- Test evaluators with known good/bad examples (see the sketch below)
- Focus on single quality aspects per evaluator
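A quick calibration check is to run an evaluator against one answer you know is complete and one you know is not, and confirm the scores clearly separate. A minimal sketch, reusing the hypothetical evaluate_completeness and parse_judgement helpers from the Step 2 sketches (not part of any Handit.ai SDK):

```python
# Calibration check: a well-behaved evaluator should clearly separate a known
# good answer from a known bad one. Builds on the Step 2 sketches above.
KNOWN_GOOD = ("What are your support hours?",
              "Our support team is available Monday-Friday, 9am-5pm EST, and by email on weekends.")
KNOWN_BAD = ("What are your support hours?", "Thanks for reaching out!")

good_score, _ = parse_judgement(evaluate_completeness(*KNOWN_GOOD))
bad_score, _ = parse_judgement(evaluate_completeness(*KNOWN_BAD))

assert good_score > bad_score, "Evaluator cannot separate complete from incomplete answers"
print(f"good={good_score}, bad={bad_score}")
```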
Need Help?
- Check our detailed evaluation guides
- Visit Support for assistance
- Join our Discord community