Build a hallucination detection pipeline that combines pre-deployment evaluation with the Evals SDK and continuous production monitoring through LLM Observability enrichments and Evaluator Rules.

Use this cookbook when: You want to monitor your RAG application for hallucinations across both testing and production environments.

Time to complete: ~25 minutes

Prerequisites
  • Fiddler account with API access
  • LLM credential configured in Settings > LLM Gateway
  • pip install fiddler-evals fiddler-client pandas

The Two-Layer Approach

Hallucination detection works best as a two-layer pipeline:
| Layer | Tool | Purpose |
| --- | --- | --- |
| Pre-deployment | Evals SDK | Test against known scenarios, validate with golden labels |
| Production | LLM Observability + Evaluator Rules | Continuous monitoring of live traffic |

Layer 1: Pre-Deployment Evaluation

Step 1: Set Up and Connect

Use the RAG Health Metrics triad (RAG Faithfulness, Answer Relevance, and Context Relevance) to distinguish hallucinations from other failure modes.
Before running the setup code, replace URL, TOKEN, and the credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.
import pandas as pd
from fiddler_evals import init, evaluate, Project, Application, Dataset
from fiddler_evals.evaluators import (
    AnswerRelevance,
    ContextRelevance,
    RAGFaithfulness,
)

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

project = Project.get_or_create(name='hallucination_detection')
app = Application.get_or_create(
    name='rag-hallucination-test',
    project_id=project.id,
)
dataset = Dataset.get_or_create(
    name='hallucination-scenarios',
    application_id=app.id,
)
Step 2: Create Hallucination-Focused Test Cases

Design test cases that specifically probe for hallucination patterns:
hallucination_scenarios = pd.DataFrame(
    [
        {
            'scenario': 'Grounded response',
            'user_query': 'What is the return policy?',
            'retrieved_documents': [
                'Returns accepted within 30 days with receipt.',
            ],
            'rag_response': 'You can return items within 30 days '
                'if you have a receipt.',
        },
        {
            'scenario': 'Fabricated details',
            'user_query': 'What is the return policy?',
            'retrieved_documents': [
                'Returns accepted within 30 days with receipt.',
            ],
            'rag_response': 'You can return items within 60 days. '
                'No receipt needed. We also offer free shipping on returns.',
        },
        {
            'scenario': 'Insufficient context',
            'user_query': 'What are the shipping costs?',
            'retrieved_documents': [
                'We ship to all 50 US states.',
            ],
            'rag_response': 'Standard shipping is $5.99 and express '
                'shipping is $12.99.',
        },
    ]
)

dataset.insert_from_pandas(
    df=hallucination_scenarios,
    input_columns=['user_query', 'retrieved_documents', 'rag_response'],
    metadata_columns=['scenario'],
)
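Malformed rows are easier to catch before the dataset.insert_from_pandas call above than after an evaluation run. The following is a minimal sketch of such a check, run on the list of scenario dictionaries before they are wrapped in a DataFrame. The names `validate_scenarios` and `REQUIRED_FIELDS` are hypothetical helpers introduced for illustration and are not part of the Fiddler SDK.

```python
# Fields this cookbook's dataset uses as inputs and metadata.
REQUIRED_FIELDS = ('scenario', 'user_query', 'retrieved_documents', 'rag_response')


def validate_scenarios(records):
    """Return a list of problems found; an empty list means every record looks usable."""
    problems = []
    for idx, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            # Skip the content checks: they would raise KeyError on this record.
            problems.append(f'record {idx}: missing fields {missing}')
            continue
        docs = record['retrieved_documents']
        if not isinstance(docs, list) or not docs:
            problems.append(f'record {idx}: retrieved_documents must be a non-empty list')
        for field in ('user_query', 'rag_response'):
            value = record[field]
            if not isinstance(value, str) or not value.strip():
                problems.append(f'record {idx}: {field} must be a non-empty string')
    return problems


# Illustrative records: one valid, one with empty values, one missing a field.
records = [
    {
        'scenario': 'Grounded response',
        'user_query': 'What is the return policy?',
        'retrieved_documents': ['Returns accepted within 30 days with receipt.'],
        'rag_response': 'You can return items within 30 days if you have a receipt.',
    },
    {
        'scenario': 'Empty values',
        'user_query': 'What are the shipping costs?',
        'retrieved_documents': [],
        'rag_response': '',
    },
    {
        'scenario': 'Missing field',
        'user_query': 'What are the shipping costs?',
        'retrieved_documents': ['We ship to all 50 US states.'],
    },
]

problems = validate_scenarios(records)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets you fix every bad scenario in a single pass.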
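Step 3: Run the Diagnostic Evaluation

The dataset already contains the responses to score, so the task below simply passes each stored response and its retrieved documents through to the evaluators: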

def passthrough_task(inputs, extras, metadata):
    return {
        'rag_response': inputs['rag_response'],
        'retrieved_documents': inputs['retrieved_documents'],
    }

result = evaluate(
    dataset=dataset,
    task=passthrough_task,
    evaluators=[
        RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
        AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
        ContextRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME),
    ],
    score_fn_kwargs_mapping={
        'user_query': lambda x: x['inputs']['user_query'],
        'retrieved_documents': 'retrieved_documents',
        'rag_response': 'rag_response',
    },
)
Step 4: Interpret Results

Use the diagnostic workflow to classify failures:
for r in result.results:
    scores = {s.evaluator_name: s for s in r.scores}
    scenario = r.dataset_item.metadata.get('scenario', 'unknown')

    faithfulness = scores.get('rag_faithfulness')
    relevance = scores.get('answer_relevance')
    context = scores.get('context_relevance')

    # Classify the failure mode
    if faithfulness and faithfulness.value == 0:
        diagnosis = 'HALLUCINATION'
    elif context and context.value < 0.5:
        diagnosis = 'BAD RETRIEVAL'
    elif relevance and relevance.value < 0.5:
        diagnosis = 'OFF-TOPIC'
    else:
        diagnosis = 'HEALTHY'

    print(f'{scenario}: {diagnosis}')
    if faithfulness:
        print(f'  Faithfulness: {faithfulness.label} - {faithfulness.reasoning}')
Expected output:
Grounded response: HEALTHY
  Faithfulness: yes - The response accurately reflects the return policy
  stated in the retrieved document.

Fabricated details: HALLUCINATION
  Faithfulness: no - The response claims a 60-day return window and no
  receipt requirement, but the source document states 30 days with receipt.

Insufficient context: HALLUCINATION
  Faithfulness: no - The response provides specific prices ($5.99, $12.99)
  that are not supported by the retrieved document.
Reading the diagnosis: The triad distinguishes why a response failed:
  • HALLUCINATION = Faithfulness fails (response fabricates information)
  • BAD RETRIEVAL = Context Relevance fails (wrong documents retrieved)
  • OFF-TOPIC = Answer Relevance fails (response doesn’t address the question)

Layer 2: Production Monitoring

For applications using Agentic Monitoring, configure Evaluator Rules to continuously evaluate production spans:
  1. Navigate to your application’s Evaluator Rules tab
  2. Add a rule for RAG Faithfulness
  3. Map evaluator inputs to your span attributes:
    • user_query → your query span attribute
    • rag_response → your response span attribute
    • retrieved_documents → your context span attribute
  4. Set alert thresholds (e.g., alert when faithfulness drops below 80%)
See Evaluator Rules for step-by-step instructions.
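To make the threshold in step 4 concrete, the sketch below shows the arithmetic behind an "alert when faithfulness drops below 80%" rule, computed over a rolling window of recent yes/no faithfulness labels. It is only an illustration of the idea: in Fiddler the alert is configured through Evaluator Rules, and the class `FaithfulnessWindow`, the window size, and the sample labels here are all hypothetical.

```python
from collections import deque

ALERT_THRESHOLD = 0.80  # the example threshold from the steps above


class FaithfulnessWindow:
    """Tracks the share of faithful responses among the most recent evaluations."""

    def __init__(self, size: int):
        # deque with maxlen drops the oldest entry once the window is full.
        self._passed = deque(maxlen=size)

    def record(self, label: str) -> None:
        # A 'yes' label counts as faithful; anything else counts as a failure.
        self._passed.append(label == 'yes')

    def pass_rate(self) -> float:
        # An empty window is treated as healthy so no alert fires before data arrives.
        if not self._passed:
            return 1.0
        return sum(self._passed) / len(self._passed)

    def should_alert(self, threshold: float = ALERT_THRESHOLD) -> bool:
        return self.pass_rate() < threshold


window = FaithfulnessWindow(size=10)
for label in ['yes'] * 7 + ['no'] * 3:
    window.record(label)

print(f'pass rate: {window.pass_rate():.0%}, alert: {window.should_alert()}')
```

A smaller window reacts faster to a burst of hallucinations but is noisier; a larger one smooths out isolated failures at the cost of slower detection.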

Combining Both Layers

The most effective hallucination detection pipeline uses both layers:
| Stage | What to Do | Tool |
| --- | --- | --- |
| Development | Test against known hallucination scenarios | Evals SDK + RAG Faithfulness |
| Pre-release | Run experiments comparing pipeline changes | Evals SDK + full diagnostic triad |
| Production | Continuous monitoring with alerting | Evaluator Rules or LLM Observability enrichments |
| Investigation | Deep-dive into flagged events | Evals SDK .score() on specific cases |
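For the pre-release stage, comparing pipeline changes can be as simple as comparing how often each diagnosis appears in two evaluation runs. The sketch below assumes you have already collected one diagnosis string per test case for a baseline and a candidate pipeline. The helper names and the sample diagnoses are hypothetical and are not produced by the Evals SDK.

```python
from collections import Counter


def diagnosis_rates(diagnoses):
    """Return the fraction of test cases that received each diagnosis."""
    total = len(diagnoses)
    if total == 0:
        return {}
    counts = Counter(diagnoses)
    return {label: count / total for label, count in counts.items()}


def hallucination_delta(baseline, candidate):
    """Change in hallucination rate from baseline to candidate (negative is better)."""
    before = diagnosis_rates(baseline).get('HALLUCINATION', 0.0)
    after = diagnosis_rates(candidate).get('HALLUCINATION', 0.0)
    return after - before


# Illustrative diagnoses only, one entry per test case in each run.
baseline = ['HEALTHY', 'HALLUCINATION', 'HALLUCINATION', 'BAD RETRIEVAL']
candidate = ['HEALTHY', 'HEALTHY', 'HALLUCINATION', 'BAD RETRIEVAL']

delta = hallucination_delta(baseline, candidate)
print(f'hallucination rate change: {delta:+.0%}')
```

With only a handful of scenarios, a single flipped case moves the rate substantially, so treat small deltas as a prompt to inspect the individual cases rather than as a verdict.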
