
Evaluation Quickstart

Get automated AI evaluation running in under 10 minutes. This guide walks you through setting up comprehensive quality assessment for your AI models using the Handit.ai platform.

Prerequisites: You need an active Handit.ai account with tracing configured. If you haven’t set up tracing yet, start with our Tracing Quickstart.

What You’ll Accomplish

Connect Model Tokens

Add your evaluation models (GPT-4, Llama, etc.) to the platform

Create Single-Purpose Evaluators

Design focused evaluators that check ONE specific quality aspect

Associate Evaluators to LLM Nodes

Connect your evaluators to the AI functions you want to assess

Monitor Results

View evaluation results and quality insights in real-time

⚠️ Critical Best Practice: Create separate evaluators for each quality aspect (completeness, accuracy, format, etc.). Don't try to evaluate multiple things in one prompt; this reduces effectiveness and clarity.

Step 1: Connect Model Tokens

Connect the AI models that will act as “judges” to evaluate your LLM responses.

  • Go to your Handit.ai dashboard
  • Click Settings → Model Tokens
  • Click Add New Token

Add your OpenAI or Together AI credentials to connect evaluation models.

Recommended Models: GPT-4o (highest accuracy), GPT-3.5-turbo (cost-effective), Llama v4 Scout (open source alternative)
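If you are adding an OpenAI key, it can save a debugging round-trip to confirm the key works before pasting it into the dashboard. Below is a minimal local sanity check, assuming the openai Python package and the key exported as OPENAI_API_KEY; it is just a pre-flight check on your side, not part of the Handit.ai setup itself.

```python
# Optional local sanity check for an OpenAI key before adding it as a model token.
# Assumes: pip install openai, and the key exported as OPENAI_API_KEY.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o",  # one of the recommended judge models
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
    max_tokens=5,
)
print(response.choices[0].message.content)  # any reply here means the key is valid
```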

Step 2: Create Single-Purpose Evaluators

Create focused evaluators in the Evaluation Suite. Remember: one evaluator = one quality aspect.

  • Go to Evaluation → Evaluation Suite
  • Click Create New Evaluator

Example 1: Completeness Evaluator

Evaluation Prompt:

You are evaluating whether an AI response completely addresses the user's question.
Focus ONLY on completeness - ignore other quality aspects.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Missing major parts of the question
3-4 = Addresses some parts but incomplete
5-6 = Addresses most parts adequately
7-8 = Addresses all parts well
9-10 = Thoroughly addresses every aspect

Provide your score and brief reasoning.

Output format:
Score: [1-10]
Reasoning: [Brief explanation]

Example 2: Hallucination Detection Evaluator

Evaluation Prompt:

You are checking if an AI response contains hallucinations or made-up information.
Focus ONLY on factual accuracy - ignore other aspects like tone or completeness.

User Question: {input}
AI Response: {output}

Rate on a scale of 1-10:
1-2 = Contains obvious false information
3-4 = Contains questionable claims
5-6 = Mostly accurate with minor concerns
7-8 = Accurate information
9-10 = Completely accurate and verifiable

Output format:
Score: [1-10]
Reasoning: [Brief explanation of any concerns]

Example 3: Format Compliance Evaluator

Evaluation Prompt:

You are checking if an AI response follows the required format guidelines.
Focus ONLY on format compliance - ignore content quality.

User Question: {input}
AI Response: {output}

Expected Format: [Professional tone, structured paragraphs, proper punctuation]

Rate on a scale of 1-10:
1-2 = Poor formatting, unprofessional
3-4 = Some format issues
5-6 = Acceptable formatting
7-8 = Good formatting
9-10 = Perfect format compliance

Output format:
Score: [1-10]
Reasoning: [Brief format assessment]

Why Separate Evaluators? Each evaluator focuses on one aspect, making evaluation more reliable and providing clear insights. You can see exactly which quality dimensions are performing well or need attention.
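Conceptually, each evaluator is an LLM-as-judge call: the {input} and {output} placeholders are filled with the traced question and response, the prompt is sent to your connected judge model, and the Score line is parsed from its reply. The sketch below shows that pattern in Python so you can try a prompt locally before adding it to the Evaluation Suite. It is illustrative only, not Handit.ai's internal implementation, and assumes the openai package with an OPENAI_API_KEY in the environment.

```python
# Illustrative LLM-as-judge loop: fill the template, call the judge model,
# then parse "Score:" and "Reasoning:" from the reply. Not a Handit.ai API.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Shortened copy of the completeness prompt from Example 1 above.
COMPLETENESS_PROMPT = (
    "You are evaluating whether an AI response completely addresses the user's question.\n"
    "Focus ONLY on completeness - ignore other quality aspects.\n\n"
    "User Question: {input}\nAI Response: {output}\n\n"
    "Rate on a scale of 1-10 (1-2 = missing major parts, 9-10 = thoroughly addresses every aspect).\n\n"
    "Output format:\nScore: [1-10]\nReasoning: [Brief explanation]"
)

def run_evaluator(template: str, user_input: str, ai_output: str) -> tuple[int, str]:
    """Fill the prompt template, ask the judge model, and parse its score."""
    prompt = template.format(input=user_input, output=ai_output)
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep judge scoring as stable as possible
    )
    text = completion.choices[0].message.content
    score = int(re.search(r"Score:\s*(\d+)", text).group(1))
    reasoning = re.search(r"Reasoning:\s*(.*)", text, re.S).group(1).strip()
    return score, reasoning

score, why = run_evaluator(
    COMPLETENESS_PROMPT,
    "How do I reset my password?",
    "Click 'Forgot password' on the login page and follow the emailed link.",
)
print(score, why)
```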

Step 3: Associate Evaluators to LLM Nodes

Connect your evaluators to the specific LLM nodes you want to monitor.

Find Your LLM Nodes

  • Go to Tracing → Nodes
  • Identify the LLM nodes you want to evaluate (e.g., “customer-response-generation”)

Associate Evaluators

  • Click on your target LLM node
  • Navigate to Evaluation tab
  • Click Add Evaluator

Configure Evaluation Settings

For each evaluator you want to associate:

Evaluator: "Response Completeness Check" Evaluation Percentage: 10% (start small) Priority: Normal
Evaluator: "Hallucination Detection" Evaluation Percentage: 15% (higher for critical accuracy) Priority: High
Evaluator: "Format Compliance" Evaluation Percentage: 5% (lower for less critical aspects) Priority: Normal

Sampling Strategy: Start with 5-15% evaluation percentage. You can always increase it later. Use higher percentages for critical quality aspects.
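The evaluation percentage behaves like a per-call sampling rate: each traced call to the node has roughly that probability of being sent to a judge, which is what keeps evaluation cost proportional to the rate you choose. A tiny sketch of the idea (illustrative only, not the platform's sampling code):

```python
import random

def should_evaluate(rate_percent: float) -> bool:
    """Sample roughly `rate_percent`% of calls for evaluation (10 -> ~1 in 10)."""
    return random.random() < rate_percent / 100

# At a 10% rate, about 1,000 of 10,000 production calls would be scored.
sampled = sum(should_evaluate(10) for _ in range(10_000))
print(f"Evaluated {sampled} of 10000 calls (~10%)")
```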

Save Configuration

  • Review your evaluator associations
  • Click Save to activate evaluation

Step 4: Monitor Evaluation Results

Your evaluation system is now active! Here’s how to monitor the results.

Real-Time Monitoring

  • Go to Tracing tab in your dashboard
  • View evaluation scores as they happen
  • Monitor quality trends across different aspects

AI Agent Tracing Dashboard

Quality Analytics

  • Access Agent Performance tab in your dashboard
  • View quality trends for each agent and its associated evaluators
  • Understand which quality aspects are performing well or need attention for each agent

Agent Performance Dashboard

Sample Result Display

Additional Evaluator Examples

Customer Service Evaluators

Empathy Evaluator:

Focus: Does the response show understanding and care for the customer's situation?
Scale: 1-10 (1-2 = cold/robotic, 9-10 = highly empathetic)

Solution Clarity Evaluator:

Focus: Are the solution steps clear and easy to follow?
Scale: 1-10 (1-2 = confusing steps, 9-10 = crystal clear instructions)

Policy Compliance Evaluator:

Focus: Does the response follow company policies and guidelines?
Scale: 1-10 (1-2 = policy violations, 9-10 = perfect compliance)

Technical Support Evaluators

Technical Accuracy Evaluator:

Focus: Is the technical information provided correct?
Scale: 1-10 (1-2 = incorrect technical details, 9-10 = completely accurate)

Safety Check Evaluator:

Focus: Are the suggested actions safe, with no risk of causing system damage?
Scale: 1-10 (1-2 = potentially dangerous, 9-10 = completely safe)

Troubleshooting Flow Evaluator:

Focus: Does the response follow logical troubleshooting steps?
Scale: 1-10 (1-2 = illogical flow, 9-10 = perfect troubleshooting sequence)
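All of these specs follow the same single-purpose pattern as the Step 2 examples, so they can be expanded mechanically into full prompts. A hypothetical helper that does this (the function name and wording are illustrative, not a platform feature):

```python
def build_evaluator_prompt(aspect: str, focus: str, low: str, high: str) -> str:
    """Expand a one-line Focus/Scale spec into a full single-purpose judge prompt."""
    return (
        f"You are evaluating ONE aspect of an AI response: {aspect}.\n"
        "Focus ONLY on this aspect - ignore all other quality dimensions.\n"
        f"Focus: {focus}\n\n"
        "User Question: {input}\n"   # left as literal placeholders for later .format()
        "AI Response: {output}\n\n"
        "Rate on a scale of 1-10:\n"
        f"1-2 = {low}\n"
        f"9-10 = {high}\n\n"
        "Output format:\nScore: [1-10]\nReasoning: [Brief explanation]"
    )

# Example: the Empathy evaluator from above.
empathy_prompt = build_evaluator_prompt(
    "Empathy",
    "Does the response show understanding and care for the customer's situation?",
    "cold/robotic",
    "highly empathetic",
)
print(empathy_prompt)
```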

Best Practices

✅ Do’s

  • Create one evaluator per quality aspect
  • Use clear, specific evaluation criteria
  • Start with lower evaluation percentages
  • Focus on business-critical quality dimensions
  • Test evaluators before deploying

❌ Don’ts

  • Combine multiple quality checks in one evaluator
  • Use vague evaluation criteria
  • Evaluate 100% of traffic initially
  • Ignore the results without taking action
  • Create overly complex evaluation prompts

Next Steps

Congratulations! You now have automated AI evaluation running on your production data. Each evaluator will help you understand specific quality aspects. This evaluation data provides the foundation for AI optimization (covered in our Optimization guides).

Troubleshooting

Evaluations Not Running?

  • Check your model tokens are valid and have sufficient credits
  • Verify the LLM node is actively receiving traffic
  • Ensure evaluation percentage is > 0%
  • Confirm evaluators are properly associated

Inconsistent Evaluation Scores?

  • Review evaluator prompts for clarity and specificity
  • Consider if evaluation criteria are too subjective
  • Test evaluators with known good/bad examples (see the sketch after this list)
  • Focus on single quality aspects per evaluator
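For that last check, the run_evaluator sketch from Step 2 can be reused as a quick smoke test: score one response you know is good and one you know is bad, and confirm the evaluator separates them. The helper names come from that earlier illustrative sketch, not from the platform.

```python
# Smoke test for a single-purpose evaluator, reusing run_evaluator and
# COMPLETENESS_PROMPT from the Step 2 sketch. Purely illustrative.
question = "How do I reset my password?"
good = "Open Settings > Security, click 'Reset password', then follow the emailed link."
bad = "Passwords are important for security."

good_score, _ = run_evaluator(COMPLETENESS_PROMPT, question, good)
bad_score, _ = run_evaluator(COMPLETENESS_PROMPT, question, bad)

# A usable completeness evaluator should clearly separate these two responses.
assert good_score > bad_score, "Evaluator cannot tell a complete answer from an incomplete one"
print(good_score, bad_score)
```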
