RAG Evaluation Fundamentals - Fiddler Documentation

Evaluate your RAG application’s retrieval and generation quality using Fiddler’s built-in evaluators. This cookbook demonstrates the direct .score() API for rapid iteration on test cases before scaling to full experiments.

Use this cookbook when: You have a RAG application and want to quickly assess whether responses are faithful to retrieved documents and relevant to user queries.

Time to complete: ~15 minutes

Prerequisites

Fiddler account with API access
LLM credential configured in Settings > LLM Gateway
pip install fiddler-evals pandas

Connect and Initialize Evaluators

Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import init
from fiddler_evals.evaluators import RAGFaithfulness, AnswerRelevance

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'   # From Settings > LLM Gateway
LLM_MODEL_NAME = 'openai/gpt-4o'              # Or your preferred model

init(url=URL, token=TOKEN)

# Initialize evaluators
faithfulness = RAGFaithfulness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
relevance = AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
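
Hardcoding the access token in a script makes it easy to leak through version control. As a minimal sketch, the connection settings could instead be read from environment variables. The variable names (FIDDLER_URL, FIDDLER_TOKEN, FIDDLER_LLM_CREDENTIAL, FIDDLER_LLM_MODEL) and the helper load_fiddler_config are hypothetical and not part of the Fiddler SDK:

```python
import os


def load_fiddler_config(env=None):
    """Read connection settings from environment variables.

    The variable names are illustrative assumptions, not Fiddler conventions.
    """
    env = os.environ if env is None else env
    required = ['FIDDLER_URL', 'FIDDLER_TOKEN', 'FIDDLER_LLM_CREDENTIAL']
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {
        'url': env['FIDDLER_URL'],
        'token': env['FIDDLER_TOKEN'],
        'credential': env['FIDDLER_LLM_CREDENTIAL'],
        # Fall back to the model used in this cookbook when none is set.
        'model': env.get('FIDDLER_LLM_MODEL', 'openai/gpt-4o'),
    }
```

The returned values can then be passed to init(url=..., token=...) and to the evaluator constructors in place of the constants above.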

Create Test Cases

Define representative test cases that cover both successful and failing RAG scenarios:

test_cases = pd.DataFrame(
    [
        {
            'scenario': 'Perfect Match',
            'user_query': 'What is the capital of France?',
            'retrieved_documents': ['Paris is the capital of France.'],
            'rag_response': 'The capital of France is Paris.',
        },
        {
            'scenario': 'Hallucination',
            'user_query': 'What are the office hours?',
            'retrieved_documents': ['We are closed on weekends.'],
            'rag_response': 'We are open 9 AM to 5 PM every day.',
        },
        {
            'scenario': 'Irrelevant Answer',
            'user_query': 'How do I reset my password?',
            'retrieved_documents': ['To reset, click "Forgot Password".'],
            'rag_response': 'Our system is very secure and uses 256-bit encryption.',
        },
    ]
)

Evaluate Each Test Case

Use the .score() method to evaluate each test case directly. Each evaluator returns a Score object with value, label, and reasoning:

def evaluate_row(row):
    f_score = faithfulness.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
        retrieved_documents=row['retrieved_documents'],
    )

    r_score = relevance.score(
        user_query=row['user_query'],
        rag_response=row['rag_response'],
    )

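    # Healthy only when the response is grounded ('yes') and has at least
    # medium relevance (value of 0.5 or higher).
    is_healthy = f_score.label == 'yes' and r_score.value >= 0.5

    return pd.Series(
        {
            'Faithfulness': f_score.label,
            'Relevance': r_score.label,
            'Status': 'HEALTHY' if is_healthy else 'ISSUE DETECTED',
        }
    )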

results = test_cases.join(test_cases.apply(evaluate_row, axis=1))
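
Labels alone do not say why a case failed; the reasoning field carries that. The sketch below shows one way to collect the reasoning for every score that falls below the 0.5 threshold. It uses a hypothetical StubScore dataclass with the three fields described above so that it runs without a Fiddler connection; the helper failing_reasons and the example values are illustrative, not real evaluator output:

```python
from dataclasses import dataclass


@dataclass
class StubScore:
    """Hypothetical stand-in for a Score object (value, label, reasoning)."""
    value: float
    label: str
    reasoning: str


def failing_reasons(scores, threshold=0.5):
    """Map evaluator name -> reasoning for each score below the threshold."""
    return {
        name: score.reasoning
        for name, score in scores.items()
        if score.value < threshold
    }


# Hand-written example values, not real evaluator output.
example = {
    'Faithfulness': StubScore(0.0, 'no', 'Opening hours are not in the documents.'),
    'Relevance': StubScore(1.0, 'high', 'The response addresses the question.'),
}
print(failing_reasons(example))
```

In a real run, the dictionary would be built from f_score and r_score inside evaluate_row, assuming their value fields follow the scales documented under Understanding the Evaluators.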

View Results

results[['scenario', 'Faithfulness', 'Relevance', 'Status']]

Expected output:

scenario            Faithfulness   Relevance   Status
Perfect Match       yes            high        HEALTHY
Hallucination       no             high        ISSUE DETECTED
Irrelevant Answer   yes            low         ISSUE DETECTED

The hallucination case scores high on relevance (it addresses the question) but fails faithfulness (the response fabricates hours not in the context). The irrelevant answer is faithful to the context but doesn’t actually answer the user’s question.
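
The single ISSUE DETECTED status hides which of the two failure modes occurred. As a rough sketch, the same two signals can be combined into a named diagnosis. The function diagnose and its category names are illustrative assumptions, and the 0.5 threshold mirrors the status rule used earlier:

```python
def diagnose(faithfulness_label, relevance_value, threshold=0.5):
    """Name the failure mode for one test case (category names are illustrative)."""
    grounded = faithfulness_label.lower() == 'yes'
    on_topic = relevance_value >= threshold
    if grounded and on_topic:
        return 'HEALTHY'
    if not grounded and on_topic:
        return 'HALLUCINATION'
    if grounded and not on_topic:
        return 'IRRELEVANT'
    return 'HALLUCINATION_AND_IRRELEVANT'


# The three scenarios from the expected output (high = 1.0, low = 0.0).
for scenario, label, value in [
    ('Perfect Match', 'yes', 1.0),
    ('Hallucination', 'no', 1.0),
    ('Irrelevant Answer', 'yes', 0.0),
]:
    print(scenario, '->', diagnose(label, value))
```

Separating the two modes matters because they point to different fixes: an unfaithful response suggests a generation or grounding problem, while an irrelevant one suggests the response drifted from the query.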

Understanding the Evaluators

RAG Faithfulness

RAG Faithfulness checks whether the response is grounded in the retrieved documents.

Inputs: user_query, rag_response, retrieved_documents
Scoring: Binary — Yes (1.0) / No (0.0)
Use for: Detecting hallucinations where the LLM generates plausible but unsupported claims

Answer Relevance

Answer Relevance measures how well the response addresses the user’s query.

Inputs: user_query, rag_response (+ optional retrieved_documents)
Scoring: Ordinal — High (1.0), Medium (0.5), Low (0.0)
Use for: Detecting off-topic responses where the LLM answers a different question
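
To summarize a batch of test cases with one number per evaluator, the labels can be mapped back onto the numeric scales listed for each evaluator and averaged. A minimal sketch: the helper mean_score and the two dictionaries are illustrative, and lowercasing the labels is an assumption about label casing rather than documented behavior:

```python
# Numeric scales as documented for each evaluator.
FAITHFULNESS_VALUES = {'yes': 1.0, 'no': 0.0}
RELEVANCE_VALUES = {'high': 1.0, 'medium': 0.5, 'low': 0.0}


def mean_score(labels, scale):
    """Average the numeric values for a list of labels on the given scale."""
    if not labels:
        raise ValueError('labels must not be empty')
    # Normalize casing and whitespace before the lookup.
    values = [scale[label.strip().lower()] for label in labels]
    return sum(values) / len(values)


print(mean_score(['yes', 'no', 'yes'], FAITHFULNESS_VALUES))
print(mean_score(['high', 'high', 'low'], RELEVANCE_VALUES))
```

An unknown label raises KeyError, which is deliberate here: silently skipping it would inflate or deflate the average.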

RAG Faithfulness vs. FTL Faithfulness: This cookbook uses RAGFaithfulness, an LLM-as-a-Judge evaluator. Fiddler also offers FTLResponseFaithfulness, a proprietary Fast Trust Model evaluator with different inputs (context, response) and probability-based scoring (faithful_prob 0.0–1.0). These are separate evaluators — see the Evaluators Glossary for details.

Next Steps

Running RAG Experiments at Scale — Use Datasets and Experiments to evaluate systematically across larger test sets
Detecting Hallucinations in RAG — Set up continuous hallucination monitoring in production
RAG Health Diagnostics — Conceptual guide to the diagnostic triad

Source notebook: Fiddler Cookbook: RAG Evaluation Fundamentals
