Evaluate your RAG application using the RAG Health Metrics diagnostic triad to pinpoint whether issues originate in retrieval, generation, or query understanding.

Time to complete: ~30 minutes

What You’ll Learn

  • Set up a RAG experiment pipeline with the Fiddler Evals SDK
  • Use Answer Relevance, Context Relevance, and RAG Faithfulness together
  • Interpret diagnostic results to identify pipeline failures
  • Distinguish between retrieval and generation problems

Prerequisites

  • Fiddler Account: Active account with API access
  • Python 3.10+
  • Fiddler Evals SDK: pip install fiddler-evals
  • Familiarity with: Experiments Getting Started

Step 1: Connect and Set Up

from fiddler_evals import init, Project, Application, Dataset
from fiddler_evals.pydantic_models.dataset import NewDatasetItem

# Initialize connection
init(
    url='https://your-org.fiddler.ai',
    token='your-access-token'
)

# Create organizational structure
project = Project.get_or_create(name='rag_health_experiments')
application = Application.get_or_create(
    name='my_rag_app',
    project_id=project.id
)
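
Hard-coding the access token, as in the example above, is fine for a quick trial, but you may prefer to read it from the environment. The following is a minimal sketch: the variable names FIDDLER_URL and FIDDLER_TOKEN and the helper load_fiddler_settings are assumptions made for illustration, not names defined by the SDK.

```python
import os


def load_fiddler_settings(env=None):
    """Return keyword arguments for init() read from environment variables.

    FIDDLER_URL and FIDDLER_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("FIDDLER_URL", "FIDDLER_TOKEN") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {"url": env["FIDDLER_URL"], "token": env["FIDDLER_TOKEN"]}


# Usage (with init imported from fiddler_evals as in Step 1):
# init(**load_fiddler_settings())
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later in the run.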

Step 2: Create a RAG Experiment Dataset

Create test cases that include user queries and retrieved documents. The quality of your evaluation depends on realistic, representative test cases.
dataset = Dataset.create(
    name='rag_health_test_cases',
    application_id=application.id,
    description='RAG Health Metrics experiment dataset'
)

test_cases = [
    # Scenario 1: Good RAG response
    NewDatasetItem(
        inputs={
            "user_query": "What are the benefits of renewable energy?",
            "retrieved_documents": "Renewable energy sources like solar and wind reduce "
                "greenhouse gas emissions, decrease dependence on fossil fuels, and can "
                "lower long-term energy costs. Solar panel costs have dropped 89% since 2010."
        },
        metadata={"scenario": "good_response"}
    ),
    # Scenario 2: Irrelevant retrieval
    NewDatasetItem(
        inputs={
            "user_query": "What are the benefits of renewable energy?",
            "retrieved_documents": "The history of the automobile dates back to the 15th "
                "century. Karl Benz patented the first true automobile in 1886."
        },
        metadata={"scenario": "bad_retrieval"}
    ),
    # Scenario 3: Hallucination risk
    NewDatasetItem(
        inputs={
            "user_query": "What is the current price of solar panels?",
            "retrieved_documents": "Solar energy adoption has grown significantly worldwide. "
                "Many countries now have solar incentive programs."
        },
        metadata={"scenario": "insufficient_context"}
    ),
]

dataset.insert(test_cases)
print(f"Added {len(test_cases)} test cases")
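
Because the evaluators in later steps read user_query and retrieved_documents from each item, it can help to check the raw inputs before building NewDatasetItem objects. A small sketch follows; validate_inputs is a hypothetical helper (not part of the SDK) that operates on plain dictionaries.

```python
REQUIRED_KEYS = ("user_query", "retrieved_documents")


def validate_inputs(inputs):
    """Return a list of problems found in one test case's inputs dict."""
    problems = []
    for key in REQUIRED_KEYS:
        value = inputs.get(key)
        # Each required field must be present and contain non-whitespace text.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    return problems


example = {
    "user_query": "What are the benefits of renewable energy?",
    "retrieved_documents": "",
}
print(validate_inputs(example))
```

An empty list means the inputs passed these basic checks; it says nothing about whether the test case is realistic.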

Step 3: Define Your RAG Task

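The task function represents your RAG application. It receives each dataset item's inputs (along with extras and metadata) and returns a dictionary containing the generated response and the retrieved documents.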
def my_rag_task(inputs, extras, metadata):
    """Your RAG application logic.

    Replace this with your actual RAG pipeline:
    1. Take the user query
    2. Use the retrieved documents as context
    3. Generate a response
    """
    user_query = inputs["user_query"]
    context = inputs["retrieved_documents"]

    # Call your LLM with the query and retrieved context.
    # generate_rag_response is a placeholder, not an SDK function; replace it
    # with your own call, e.g. response = my_llm.generate(query=user_query, context=context)
    response = generate_rag_response(user_query, context)

    return {
        "rag_response": response,
        "retrieved_documents": context,
    }
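
The task above calls generate_rag_response, which the tutorial leaves to you. To smoke-test the pipeline before wiring in a real model, a stand-in along these lines may be enough. This is a hypothetical stub, not an SDK function: it makes no LLM call and simply returns the first sentence of the context.

```python
def generate_rag_response(user_query: str, context: str) -> str:
    """Hypothetical placeholder for a real LLM call.

    Returns the first sentence of the retrieved context so the pipeline
    can be exercised end to end without a model. The query is ignored.
    """
    if not context.strip():
        return "I don't have enough information to answer that."
    first_sentence = context.strip().split(". ")[0].rstrip(".")
    return first_sentence + "."
```

Because the stub copies text from the context and ignores the query, scores produced with it only confirm that the plumbing works; they are not a measure of RAG quality.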

Step 4: Run the RAG Health Experiment

Use all three evaluators together for comprehensive diagnostics:
from fiddler_evals import evaluate
from fiddler_evals.evaluators import AnswerRelevance, ContextRelevance, RAGFaithfulness

results = evaluate(
    dataset=dataset,
    task=my_rag_task,
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),      # Is the response relevant? (High/Medium/Low)
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),     # Are retrieved docs relevant? (High/Medium/Low)
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential"),      # Is the response grounded? (Yes/No)
    ],
    name_prefix="rag_health_baseline",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["user_query"],
        "rag_response": "rag_response",
        "retrieved_documents": "retrieved_documents",
    }
)

print(f"Evaluated {len(results.results)} test cases")
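
In score_fn_kwargs_mapping, the string values match the keys returned by my_rag_task, while the lambda reaches into the dataset item's inputs. The sketch below is a simplified mental model of that resolution, not the SDK's implementation; resolve_kwargs and the shape of the record dictionary are assumptions for illustration.

```python
def resolve_kwargs(mapping, record):
    """Illustrative only: build evaluator keyword arguments from a mapping.

    Assumes `record` has "inputs" (the dataset item inputs) and
    "outputs" (the dict returned by the task).
    """
    kwargs = {}
    for name, source in mapping.items():
        if callable(source):
            # Callable: receives the whole record and picks what it needs.
            kwargs[name] = source(record)
        else:
            # String: treated as a key in the task output.
            kwargs[name] = record["outputs"][source]
    return kwargs


record = {
    "inputs": {"user_query": "What are the benefits of renewable energy?"},
    "outputs": {"rag_response": "Lower emissions.", "retrieved_documents": "Solar and wind..."},
}
mapping = {
    "user_query": lambda x: x["inputs"]["user_query"],
    "rag_response": "rag_response",
    "retrieved_documents": "retrieved_documents",
}
print(resolve_kwargs(mapping, record))
```

Under this model, a KeyError would indicate that a mapped name does not match what the task returns; consult the SDK reference for the actual behavior.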

Step 5: Analyze Diagnostic Results

Examine the results to identify which pipeline stage is causing issues:
for result in results.results:
    print(f"\nScenario: {result.dataset_item.metadata.get('scenario', 'unknown')}")

    scores = {score.name: score for score in result.scores}

    # Extract scores
    ar_score = scores.get("answer_relevance")
    cr_score = scores.get("context_relevance")
    rf_score = scores.get("rag_faithfulness")

    if ar_score:
        print(f"  Answer Relevance: {ar_score.label} ({ar_score.value})")
        print(f"    Reasoning: {ar_score.reasoning}")

    if cr_score:
        print(f"  Context Relevance: {cr_score.label} ({cr_score.value})")
        print(f"    Reasoning: {cr_score.reasoning}")

    if rf_score:
        print(f"  RAG Faithfulness: {rf_score.label} ({rf_score.value})")
        print(f"    Reasoning: {rf_score.reasoning}")

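    # Diagnostic interpretation. Retrieval is checked first so the output
    # agrees with the Common Diagnostic Patterns table.
    if ar_score and cr_score and rf_score:
        relevant = ar_score.value >= 0.5
        good_context = cr_score.value >= 0.5
        grounded = rf_score.value == 1
        if not good_context and not relevant:
            print("  Diagnosis: BAD RETRIEVAL: retrieved documents are not relevant")
        elif not good_context and relevant and grounded:
            print("  Diagnosis: LUCKY GENERATION: answer is fine but retrieval needs work")
        elif relevant and not grounded:
            print("  Diagnosis: HALLUCINATION: response is relevant but not grounded")
        elif grounded and not relevant:
            print("  Diagnosis: OFF-TOPIC: response is grounded but doesn't answer the query")
        elif relevant and grounded and good_context:
            print("  Diagnosis: HEALTHY: all metrics indicate good RAG performance")
        else:
            print("  Diagnosis: UNCLASSIFIED: review the scores manually")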

Step 6: Compare RAG Configurations

Use experiments to compare different RAG configurations:
# Evaluate with a different retrieval strategy
results_v2 = evaluate(
    dataset=dataset,
    task=my_improved_rag_task,  # Different retrieval or generation config
    evaluators=[
        AnswerRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        ContextRelevance(model="openai/gpt-4o", credential="your-llm-credential"),
        RAGFaithfulness(model="openai/gpt-4o", credential="your-llm-credential"),
    ],
    name_prefix="rag_health_improved",
    score_fn_kwargs_mapping={
        "user_query": lambda x: x["inputs"]["user_query"],
        "rag_response": "rag_response",
        "retrieved_documents": "retrieved_documents",
    }
)

# Compare results side-by-side in the Fiddler UI
print("Compare experiments in your Fiddler dashboard")
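
For a quick check before opening the dashboard, you could also tally labels locally. The sketch below assumes you have already collected each run's labels into plain dictionaries (for example from score.name and score.label, as in Step 5); label_counts is a hypothetical helper, and the sample data is made up purely for illustration.

```python
from collections import Counter


def label_counts(rows, metric):
    """Count how often each label appears for one metric.

    `rows` is a list of dicts mapping metric name to label,
    e.g. {"rag_faithfulness": "Yes"}.
    """
    return Counter(row[metric] for row in rows if metric in row)


# Made-up labels standing in for two experiment runs.
baseline = [{"rag_faithfulness": "Yes"}, {"rag_faithfulness": "No"}, {"rag_faithfulness": "No"}]
improved = [{"rag_faithfulness": "Yes"}, {"rag_faithfulness": "Yes"}, {"rag_faithfulness": "No"}]

for name, rows in (("baseline", baseline), ("improved", improved)):
    print(name, dict(label_counts(rows, "rag_faithfulness")))
```

With only a handful of test cases, differences like these are anecdotal; a larger dataset is needed before drawing conclusions.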

Understanding the Results

Score Interpretation

| Evaluator | High Score | Low Score |
| --- | --- | --- |
| Answer Relevance | Response directly addresses the query | Response misses the point or is off-topic |
| Context Relevance | Retrieved documents support the query | Retrieved documents are irrelevant |
| RAG Faithfulness | Response is grounded in context | Response contains unsupported claims |

Common Diagnostic Patterns

| Answer Relevance | Context Relevance | RAG Faithfulness | Diagnosis |
| --- | --- | --- | --- |
| High | High | Yes | Healthy RAG pipeline |
| High | High | No | Hallucination: fix generation |
| Low | High | Yes | Query misunderstanding: fix prompt |
| Low | Low | - | Bad retrieval: fix retrieval |
| High | Low | Yes | Lucky generation: retrieval needs work |
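
The pattern table can also be encoded as a lookup so that diagnoses are applied consistently across runs. A minimal sketch, assuming the evaluators report the labels shown in the table (High/Low and Yes/No); diagnose is a hypothetical helper, and how Medium labels should be treated is left open here.

```python
# Each pattern is (answer_relevance, context_relevance, rag_faithfulness).
# None in a slot means "any value", matching the "-" in the table.
PATTERNS = [
    (("High", "High", "Yes"), "Healthy RAG pipeline"),
    (("High", "High", "No"), "Hallucination: fix generation"),
    (("Low", "High", "Yes"), "Query misunderstanding: fix prompt"),
    (("Low", "Low", None), "Bad retrieval: fix retrieval"),
    (("High", "Low", "Yes"), "Lucky generation: retrieval needs work"),
]


def diagnose(answer_relevance, context_relevance, rag_faithfulness):
    """Return the diagnosis for a label triple, or None if no pattern matches."""
    observed = (answer_relevance, context_relevance, rag_faithfulness)
    for pattern, diagnosis in PATTERNS:
        if all(p is None or p == o for p, o in zip(pattern, observed)):
            return diagnosis
    return None


print(diagnose("High", "High", "No"))
```

Combinations the table does not list, such as High/Low/No, return None so they can be reviewed by hand instead of being forced into a category.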

Next Steps