Build a hallucination detection pipeline that combines pre-deployment evaluation with the Evals SDK and continuous production monitoring through LLM Observability enrichments and Evaluator Rules.
Use this cookbook when: you want to monitor your RAG application for hallucinations across both testing and production environments.
Time to complete: ~25 minutes
Prerequisites
- Fiddler account with API access
- LLM credential configured in Settings > LLM Gateway
Install the required packages: `pip install fiddler-evals fiddler-client pandas`
The Two-Layer Approach
Hallucination detection works best as a two-layer pipeline:

| Layer | Tool | Purpose |
|---|---|---|
| Pre-deployment | Evals SDK | Test against known scenarios, validate with golden labels |
| Production | LLM Observability + Evaluator Rules | Continuous monitoring of live traffic |
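Validating with golden labels comes down to measuring how often the evaluator's hallucination flags agree with labels assigned by hand. The sketch below is a plain-Python illustration of that check, not part of the Evals SDK; the function name `agreement_report` and the convention that `True` means "hallucination" are assumptions.

```python
def agreement_report(predicted: list[bool], golden: list[bool]) -> dict[str, float]:
    """Compare evaluator hallucination flags with hand-assigned golden labels.

    Hypothetical helper: True means "hallucination" in both lists.
    """
    if len(predicted) != len(golden):
        raise ValueError("predicted and golden must have the same length")
    pairs = list(zip(predicted, golden))
    tp = sum(p and g for p, g in pairs)        # hallucinations correctly flagged
    fp = sum(p and not g for p, g in pairs)    # grounded answers wrongly flagged
    fn = sum(g and not p for p, g in pairs)    # hallucinations the evaluator missed
    correct = sum(p == g for p, g in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

For example, `agreement_report([True, False, True, False], [True, False, False, True])` gives accuracy, precision, and recall of 0.5 each. Recall is worth watching closely, because a missed hallucination (false negative) is the failure that reaches users.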
Layer 1: Pre-Deployment Evaluation
Set Up and Connect
Use the RAG Health Metrics triad (Faithfulness, Context Relevance, and Answer Relevance) to distinguish hallucinations from other failure modes:
Replace `URL`, `TOKEN`, and the credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.
Create Hallucination-Focused Test Cases
Design test cases that specifically probe for hallucination patterns:
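A minimal sketch of such test cases, assuming each case is a plain dictionary whose keys mirror the evaluator inputs used later in this cookbook (`user_query`, `retrieved_documents`, `rag_response`). The `expected_hallucination` golden label and all example texts are hypothetical and illustrative only.

```python
# Illustrative hallucination probes. Keys mirror the evaluator inputs;
# "expected_hallucination" is a hypothetical golden-label field.
CONTEXT = ["Refunds are accepted within 30 days of purchase."]

test_cases = [
    {
        # Control case: every claim in the response appears in the context.
        "user_query": "What is the refund window?",
        "retrieved_documents": CONTEXT,
        "rag_response": "You can request a refund within 30 days of purchase.",
        "expected_hallucination": False,
    },
    {
        # Fabricated detail: the context says nothing about shipping costs.
        "user_query": "What is the refund window?",
        "retrieved_documents": CONTEXT,
        "rag_response": "Refunds are accepted within 30 days, and return shipping is free.",
        "expected_hallucination": True,
    },
    {
        # Unanswerable question: a faithful response should decline, not invent.
        "user_query": "Who founded the company?",
        "retrieved_documents": CONTEXT,
        "rag_response": "The company was founded in 2012 by Jane Doe.",
        "expected_hallucination": True,
    },
]
```

Because pandas is already a prerequisite, `pandas.DataFrame(test_cases)` turns this list into a table with one column per key if your evaluation workflow expects a DataFrame.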
Interpret Results
Use the diagnostic workflow to classify failures:
Expected output:
Reading the diagnosis: The triad distinguishes why a response failed:
- HALLUCINATION = Faithfulness fails (response fabricates information)
- BAD RETRIEVAL = Context Relevance fails (wrong documents retrieved)
- OFF-TOPIC = Answer Relevance fails (response doesn’t address the question)
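The mapping above can be sketched as a small classifier. This assumes each metric yields a score in [0, 1] and that a single pass/fail threshold applies; the threshold value, the `TriadScores` type, and the `HEALTHY` label are hypothetical choices for illustration, not Fiddler defaults.

```python
from dataclasses import dataclass

# Hypothetical pass/fail cutoff; scores are assumed to lie in [0, 1].
THRESHOLD = 0.5


@dataclass
class TriadScores:
    faithfulness: float
    context_relevance: float
    answer_relevance: float


def diagnose(scores: TriadScores, threshold: float = THRESHOLD) -> list[str]:
    """Return every failure label whose metric falls below the threshold."""
    labels = []
    if scores.faithfulness < threshold:
        labels.append("HALLUCINATION")   # response fabricates information
    if scores.context_relevance < threshold:
        labels.append("BAD RETRIEVAL")   # wrong documents retrieved
    if scores.answer_relevance < threshold:
        labels.append("OFF-TOPIC")       # response misses the question
    return labels or ["HEALTHY"]         # hypothetical label for all-pass
```

For example, `diagnose(TriadScores(faithfulness=0.2, context_relevance=0.9, answer_relevance=0.8))` returns `["HALLUCINATION"]`. The function returns every matching label rather than only the first, so a response that fails more than one metric is not under-reported.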
Layer 2: Production Monitoring
Choose one of two options for production monitoring:
- Option A: Evaluator Rules (Agentic)
- Option B: LLM Observability Enrichments
Option A (Evaluator Rules): for applications using Agentic Monitoring, configure Evaluator Rules to continuously evaluate production spans:
- Navigate to your application’s Evaluator Rules tab
- Add a rule for RAG Faithfulness
- Map evaluator inputs to your span attributes:
  - `user_query` → your query span attribute
  - `rag_response` → your response span attribute
  - `retrieved_documents` → your context span attribute
- Set alert thresholds (e.g., alert when faithfulness drops below 80%)
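Evaluator Rules are configured in the Fiddler UI, so the following is only a local sketch of the same logic, useful for checking a mapping and threshold before setting them. It assumes spans expose a flat dictionary of attributes and that the 80% threshold applies to the share of spans judged faithful within a window. The attribute names (`rag.query`, `rag.response`, `rag.context`) and both helper functions are hypothetical.

```python
# Hypothetical span attribute names: replace them with the keys your
# instrumentation actually emits.
INPUT_MAPPING = {
    "user_query": "rag.query",
    "rag_response": "rag.response",
    "retrieved_documents": "rag.context",
}

ALERT_THRESHOLD = 0.80  # alert when fewer than 80% of spans are faithful


def to_evaluator_inputs(span_attributes: dict) -> dict:
    """Map one span's attributes onto the evaluator's input names."""
    # Raises KeyError if a mapped attribute is missing, surfacing a bad mapping early.
    return {inp: span_attributes[attr] for inp, attr in INPUT_MAPPING.items()}


def should_alert(faithful_flags: list[bool], threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when the share of faithful spans in a window drops below threshold."""
    if not faithful_flags:
        return False  # no traffic in the window, so nothing to alert on
    rate = sum(faithful_flags) / len(faithful_flags)
    return rate < threshold
```

With this sketch, a window of 7 faithful spans out of 10 (70%) triggers an alert, while 9 out of 10 does not.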
Combining Both Layers
The most effective hallucination detection pipeline uses both layers:

| Stage | What to Do | Tool |
|---|---|---|
| Development | Test against known hallucination scenarios | Evals SDK + RAG Faithfulness |
| Pre-release | Run experiments comparing pipeline changes | Evals SDK + full diagnostic triad |
| Production | Continuous monitoring with alerting | Evaluator Rules or LLM Obs enrichments |
| Investigation | Deep-dive into flagged events | Evals SDK `.score()` on specific cases |
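For the pre-release stage, comparing pipeline changes reduces to comparing per-metric pass rates between a baseline run and a candidate run over the same test cases. The sketch below assumes each run's results are stored as lists of pass/fail booleans keyed by metric name; the metric keys and both helper functions are hypothetical and independent of the Evals SDK's own experiment tooling.

```python
def pass_rate(flags: list[bool]) -> float:
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(flags) / len(flags) if flags else 0.0


def compare_runs(
    baseline: dict[str, list[bool]], candidate: dict[str, list[bool]]
) -> dict[str, float]:
    """Per-metric change in pass rate (candidate minus baseline).

    A negative value means the candidate pipeline regressed on that metric.
    """
    return {
        metric: pass_rate(candidate[metric]) - pass_rate(baseline[metric])
        for metric in baseline
    }
```

For example, a baseline of `{"faithfulness": [True, True, False, False]}` and a candidate of `{"faithfulness": [True, True, True, False]}` yields `{"faithfulness": 0.25}`, an improvement of 25 percentage points.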
Next Steps
- RAG Health Diagnostics: conceptual guide to failure mode diagnosis
- RAG Evaluation Fundamentals: direct evaluation with the `.score()` API
- Evaluator Rules: configure production monitoring rules
Source notebooks: