Viewing Evaluation Results

Monitor your AI's quality in real time. View evaluation results through failed agent runs in tracing and quality metrics in your Agent Performance dashboard.

Access and understand the evaluation results that help you monitor your AI's quality in production.

Evaluation results appear in real-time as your evaluators assess production interactions. All viewing happens through the Handit.ai platform dashboard.

How Evaluation Results Work

Evaluation results are displayed in two key ways:

Failed Agent Runs in Tracing

See which agent runs failed due to evaluator-detected quality issues

Quality Metrics in Agent Performance

View calculated metrics like accuracy, coherence, and completeness per evaluator

Method 1: Failed Agent Runs in Tracing

When an evaluator detects quality issues, the agent run is marked as failed in your tracing dashboard.

1. Navigate to Tracing

  • Go to your Tracing page
  • Look for agent runs marked with failure indicators

[Screenshot: AI Agent Tracing Dashboard]

2. Identify Failed Runs

  • Failed runs are visually highlighted
  • Failure indicators show which runs had quality issues
  • Get a quick overview of why each run failed

3. View Failure Details

  • Click on a failed agent run
  • See detailed tracing information
  • Visual indicators show which LLM node failed
  • View the specific failure detected by the evaluator

Example Failure Detail:

Agent Run: Customer Support Interaction
Status: ❌ Failed
Failed Node: Customer Response Generator
Evaluator: Customer Support Quality
Failure Reason: "Response lacked empathy and did not address the customer's specific concern about delivery delay"
Quality Score: 2.1/5.0
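
If you want to reproduce this kind of check in your own tooling, a minimal sketch in Python is shown below. The EvaluationResult class and the 3.0 pass/fail threshold are illustrative assumptions, not Handit.ai's internal format:

from dataclasses import dataclass

@dataclass
class EvaluationResult:
    """Illustrative record of one evaluator's verdict on one agent run."""
    agent_run: str
    node: str
    evaluator: str
    quality_score: float        # 0.0-5.0, as shown in the dashboard
    failure_reason: str = ""

    def is_failed(self, threshold: float = 3.0) -> bool:
        # Treat the run as failed when the score drops below the (assumed) threshold.
        return self.quality_score < threshold

result = EvaluationResult(
    agent_run="Customer Support Interaction",
    node="Customer Response Generator",
    evaluator="Customer Support Quality",
    quality_score=2.1,
    failure_reason="Response lacked empathy and did not address the "
                   "customer's specific concern about delivery delay",
)

if result.is_failed():
    print(f"❌ {result.agent_run} failed at {result.node}: {result.failure_reason}")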

Method 2: Quality Metrics in Agent Performance

Evaluation results are calculated into metrics that appear in your Agent Performance dashboard.

1. Navigate to Agent Performance

  • Go to Agent Performance dashboard
  • View your agents and their associated LLM nodes

2. View Quality Metrics

Each evaluator generates specific metrics based on its evaluation criteria:

Example Metrics Display:

Customer Support Agent
LLM Node: Response Generator
β”œβ”€β”€ Customer Support Quality Evaluator
β”‚   β”œβ”€β”€ Helpfulness: 4.2/5.0
β”‚   β”œβ”€β”€ Accuracy: 4.5/5.0
β”‚   β”œβ”€β”€ Professionalism: 4.1/5.0
β”‚   └── Overall: 4.3/5.0
β”‚
β”œβ”€β”€ Safety Compliance Evaluator
β”‚   β”œβ”€β”€ Content Safety: 4.8/5.0
β”‚   β”œβ”€β”€ Brand Compliance: 4.6/5.0
β”‚   └── Overall: 4.7/5.0
β”‚
└── Response Completeness Evaluator
    β”œβ”€β”€ Completeness: 4.0/5.0
    β”œβ”€β”€ Coherence: 4.3/5.0
    └── Overall: 4.2/5.0
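
The same hierarchy can be treated as nested data when you want to scan it yourself: an agent contains LLM nodes, each node is scored by one or more evaluators, and each evaluator reports its own metrics. The dictionary layout and the 4.5 target below are illustrative assumptions, not an official export format:

# Hypothetical snapshot of the metrics shown above.
metrics = {
    "Customer Support Agent": {
        "Response Generator": {
            "Customer Support Quality": {
                "Helpfulness": 4.2, "Accuracy": 4.5,
                "Professionalism": 4.1, "Overall": 4.3,
            },
            "Safety Compliance": {
                "Content Safety": 4.8, "Brand Compliance": 4.6, "Overall": 4.7,
            },
            "Response Completeness": {
                "Completeness": 4.0, "Coherence": 4.3, "Overall": 4.2,
            },
        }
    }
}

# Print every metric below an (arbitrary) 4.5/5.0 target to see where attention is needed.
for agent, nodes in metrics.items():
    for node, evaluators in nodes.items():
        for evaluator, scores in evaluators.items():
            for metric, value in scores.items():
                if metric != "Overall" and value < 4.5:
                    print(f"{agent} / {node} / {evaluator}: {metric} = {value}")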

3. Monitor Trends

  • Track quality metrics over time
  • See how different evaluators rate your AI’s performance
  • Identify patterns and areas for improvement

Understanding Quality Metrics

How Metrics Are Generated

Per Evaluator Metrics:

  • Each evaluator calculates its own set of metrics
  • Metrics are based on the evaluation criteria defined in the evaluator prompt
  • Scores are aggregated over time to show trends

Example: Customer Support Quality Evaluator

Evaluation Criteria         β†’  Generated Metrics
β”œβ”€β”€ Helpfulness (1-10)      β†’  Helpfulness: 4.2/5.0
β”œβ”€β”€ Accuracy (1-10)         β†’  Accuracy: 4.5/5.0
β”œβ”€β”€ Professionalism (1-10)  β†’  Professionalism: 4.1/5.0
└── Completeness (1-10)     β†’  Completeness: 4.0/5.0
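
To make the aggregation concrete, the sketch below assumes each evaluation returns one 1-10 score per criterion, converts it to the dashboard's 5-point scale by halving, and averages across runs. Both the conversion and the simple mean are assumptions for illustration, not Handit.ai's exact formula:

from collections import defaultdict
from statistics import mean

# Raw evaluator output for three production runs: criterion -> score on a 1-10 scale.
runs = [
    {"Helpfulness": 9, "Accuracy": 9, "Professionalism": 8, "Completeness": 8},
    {"Helpfulness": 8, "Accuracy": 9, "Professionalism": 9, "Completeness": 8},
    {"Helpfulness": 8, "Accuracy": 9, "Professionalism": 8, "Completeness": 8},
]

scores = defaultdict(list)
for run in runs:
    for criterion, score in run.items():
        # Convert the 1-10 rating to the dashboard's /5.0 scale.
        scores[criterion].append(score / 2)

dashboard_metrics = {c: round(mean(v), 1) for c, v in scores.items()}
print(dashboard_metrics)
# {'Helpfulness': 4.2, 'Accuracy': 4.5, 'Professionalism': 4.2, 'Completeness': 4.0}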

Common Metric Types

Typical Quality Dimensions:

  • Accuracy: How correct and factual the responses are
  • Coherence: How logical and well-structured the responses are
  • Completeness: How thoroughly questions are answered
  • Helpfulness: How useful responses are to users
  • Professionalism: How appropriate the tone and language are
  • Safety: How well responses avoid harmful content

Monitoring and Analysis

Dashboard Views

Agent-Level Overview:

  • Compare performance across different agents
  • Identify which agents need attention
  • View overall quality trends

Node-Level Details:

  • Drill down into specific LLM nodes
  • See metrics from all associated evaluators
  • Track performance trends for individual nodes

Evaluator-Specific Insights:

  • Each evaluator provides its own quality perspective
  • Compare how different evaluators rate the same AI responses
  • Understand which quality dimensions are strongest/weakest

Using Results for Improvement

Failed Runs Analysis:

  • Review failed agent runs to understand quality problems
  • Look for patterns in evaluator failures
  • Identify specific LLM nodes that frequently fail (one way to surface these patterns is sketched below)
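
A rough sketch of that pattern analysis, assuming you have collected your failed runs as (node, evaluator) pairs; the data below is made up for illustration:

from collections import Counter

# Each tuple is one failed run: (LLM node, evaluator that flagged it).
failed_runs = [
    ("Response Generator", "Customer Support Quality"),
    ("Response Generator", "Customer Support Quality"),
    ("Response Generator", "Response Completeness"),
    ("Intent Classifier", "Customer Support Quality"),
    ("Response Generator", "Customer Support Quality"),
]

# Count how often each (node, evaluator) pair fails to expose recurring problems.
for (node, evaluator), count in Counter(failed_runs).most_common():
    print(f"{node} flagged by {evaluator}: {count} failures")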

Metric Trends:

  • Monitor declining quality metrics (see the trend sketch below)
  • Spot quality issues before they become major problems
  • Track improvement after making changes
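
One simple way to watch for decline is to compare a recent window of scores against the previous window, as in this sketch. The window size, the 0.2 drop threshold, and the sample values are all illustrative assumptions:

from statistics import mean

# Daily averages for one quality metric over the last two weeks (sample values).
daily_scores = [4.4, 4.3, 4.4, 4.2, 4.3, 4.1, 4.2, 4.0, 4.1, 3.9, 4.0, 3.8, 3.9, 3.7]

window = 7
recent = mean(daily_scores[-window:])
previous = mean(daily_scores[-2 * window:-window])

# Flag the metric if the last week is noticeably below the week before.
if recent < previous - 0.2:
    print(f"Quality declining: {previous:.2f} -> {recent:.2f} over the last {window} days")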

Common Quality Patterns:

  • Low accuracy scores may indicate knowledge gaps
  • Poor coherence might suggest the prompt needs rework
  • Low helpfulness could mean responses aren’t user-focused

Next Steps

Use your evaluation results to improve AI quality:

  • Review failed agent runs to understand specific quality issues
  • Monitor quality metrics trends in Agent Performance dashboard
  • Use insights to guide optimization efforts (covered in our Optimization guides)

You now understand how to view evaluation results! Use failed runs and quality metrics to monitor and improve your AI’s performance.
