Monitoring Agentic Content Generation

Ensure quality, safety, and brand compliance in content generation agents by combining Fiddler’s built-in evaluators for baseline quality with CustomJudge evaluators for domain-specific governance.

Use this cookbook when: You have content generation agents (writing reports, customer communications, marketing copy) and need automated quality gates to replace manual review of every draft.

Time to complete: ~20 minutes

Prerequisites

Fiddler account with API access
LLM credential configured in Settings > LLM Gateway
pip install fiddler-evals pandas

The Content Generation Challenge

Enterprise content generation agents produce volume that exceeds human review capacity. Without automated quality gates, teams face:

Reviewer fatigue — manually reviewing hundreds of drafts per day
Inconsistent quality — different reviewers apply different standards
Brand drift — subtle changes in tone or style go undetected

The solution: combine Fiddler’s built-in evaluators (quality, safety) with custom LLM-as-a-Judge evaluators (brand voice, compliance) for automated governance.

Recommended Evaluators

Built-In Evaluators (Baseline Quality)

Evaluator	What It Measures	Value
Answer Relevance	Does the output address the input instruction?	Instruction adherence
Coherence	Logical flow and clarity	Narrative quality
Conciseness	Brevity without losing meaning	Message clarity
Sentiment	Positive, negative, or neutral tone	Brand alignment
Prompt Safety	11 safety dimensions (toxicity, bias, etc.)	Risk mitigation

Custom Evaluators (Domain-Specific Governance)

Evaluator	What It Measures	Value
Brand Voice Match	Adherence to company style guide	Automated brand governance
Bias Detection	Potential bias across multiple dimensions	Compliance and risk mitigation

Set Up Built-In Evaluators

Replace URL, TOKEN, and credential names with your Fiddler account details. Find your credentials in Settings > Access Tokens and Settings > LLM Gateway.

import pandas as pd
from fiddler_evals import init
from fiddler_evals.evaluators import (
    AnswerRelevance,
    Coherence,
    Conciseness,
    CustomJudge,
)

URL = 'https://your-org.fiddler.ai'
TOKEN = 'your-access-token'
LLM_CREDENTIAL_NAME = 'your-llm-credential'
LLM_MODEL_NAME = 'openai/gpt-4o'

init(url=URL, token=TOKEN)

# Built-in evaluators for baseline quality
relevance = AnswerRelevance(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
coherence = Coherence(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)
conciseness = Conciseness(model=LLM_MODEL_NAME, credential=LLM_CREDENTIAL_NAME)

Create a Brand Voice Match Judge

Use CustomJudge to evaluate content against your company’s style guide:

brand_voice_judge = CustomJudge(
    prompt_template="""
        Determine whether the provided content adheres to the provided
        brand guidelines.

        Content: {{ content }}
        Brand Guidelines: {{ brand_guidelines }}
    """,
    output_fields={
        'voice_match_score': {
            'type': 'string',
            'choices': ['Perfect Match', 'Minor Deviations', 'Off-Brand'],
        },
        'reasoning': {'type': 'string'},
    },
    model=LLM_MODEL_NAME,
    credential=LLM_CREDENTIAL_NAME,
)

See Building Custom Judge Evaluators for a deep-dive into prompt_template, output_fields, and iterative prompt improvement.
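The Recommended Evaluators table also lists a Bias Detection judge, which this cookbook does not define. The sketch below shows how one might be configured with the same prompt_template and output_fields pattern as the brand voice judge. The prompt wording, the field name bias_level, the choice labels, and the placeholders and check_inputs helpers are illustrative assumptions, not Fiddler-provided defaults. The helpers only confirm that the keys you plan to pass as inputs match the {{ ... }} placeholders in the template.

```python
import re

# Hypothetical prompt and output schema for a bias judge.
# The wording and the choice labels are illustrative assumptions.
BIAS_PROMPT_TEMPLATE = """
    Determine whether the provided content shows bias related to
    gender, age, ethnicity, religion, or disability.

    Content: {{ content }}
"""

BIAS_OUTPUT_FIELDS = {
    'bias_level': {
        'type': 'string',
        'choices': ['No Bias', 'Possible Bias', 'Clear Bias'],
    },
    'reasoning': {'type': 'string'},
}


def placeholders(template):
    """Return the set of {{ name }} placeholders used in a template."""
    return set(re.findall(r'\{\{\s*(\w+)\s*\}\}', template))


def check_inputs(template, inputs):
    """Raise ValueError if inputs do not match the template placeholders."""
    expected = placeholders(template)
    missing = expected - set(inputs)
    extra = set(inputs) - expected
    if missing or extra:
        raise ValueError(f'missing={sorted(missing)}, unexpected={sorted(extra)}')


# Passes silently: the only placeholder in the template is 'content'.
check_inputs(BIAS_PROMPT_TEMPLATE, {'content': 'Sample draft text.'})

# With the setup from this cookbook, the judge would then be built the same
# way as brand_voice_judge:
#   bias_judge = CustomJudge(
#       prompt_template=BIAS_PROMPT_TEMPLATE,
#       output_fields=BIAS_OUTPUT_FIELDS,
#       model=LLM_MODEL_NAME,
#       credential=LLM_CREDENTIAL_NAME,
#   )
```

Checking inputs before calling score can surface a misspelled key locally instead of in a judge response that silently ignores a guideline.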

Evaluate Generated Content

# Example: your brand guidelines
brand_guidelines = (
    "Use professional, approachable tone. "
    "Address customers as 'you'. "
    "Avoid jargon, slang, and exclamation marks. "
    "Keep sentences under 25 words."
)

# Sample content from your agent
generated_content = [
    {
        'instruction': 'Write a welcome email for new customers',
        'content': 'Welcome to our platform. We are glad you chose us. '
            'Your account is ready and you can start exploring features '
            'right away.',
    },
    {
        'instruction': 'Write a welcome email for new customers',
        'content': 'OMG WELCOME!!! You are going to LOVE this!! '
            'Our platform is literally the BEST thing ever!!!',
    },
    {
        'instruction': 'Explain the refund process',
        'content': 'To request a refund, navigate to your order history, '
            'select the item, and click Request Refund. Processing takes '
            '3-5 business days.',
    },
]

# Evaluate each piece of content
for item in generated_content:
    # Built-in evaluators
    rel_score = relevance.score(
        user_query=item['instruction'],
        rag_response=item['content'],
    )
    coh_score = coherence.score(prompt=item['instruction'], response=item['content'])
    con_score = conciseness.score(response=item['content'])

    # Custom brand voice judge
    brand_scores = brand_voice_judge.score(inputs={
        'content': item['content'],
        'brand_guidelines': brand_guidelines,
    })
    brand_dict = {s.name: s for s in brand_scores}

    print(f"\nInstruction: {item['instruction'][:50]}...")
    print(f"  Relevance:  {rel_score.label} ({rel_score.value})")
    print(f"  Coherence:  {coh_score.label}")
    print(f"  Conciseness: {con_score.label}")
    print(f"  Brand Voice: {brand_dict['voice_match_score'].label}")
    print(f"    Reason: {brand_dict['reasoning'].label}")

Expected output:

Instruction: Write a welcome email for new customers...
  Relevance:  high (1.0)
  Coherence:  high
  Conciseness: high
  Brand Voice: Perfect Match
    Reason: Professional tone, addresses customer directly, no jargon or
    exclamation marks, sentences are concise.

Instruction: Write a welcome email for new customers...
  Relevance:  medium (0.5)
  Coherence:  low
  Conciseness: low
  Brand Voice: Off-Brand
    Reason: Uses all-caps, multiple exclamation marks, slang ("OMG",
    "literally"), and informal tone — violates all brand guidelines.

Instruction: Explain the refund process...
  Relevance:  high (1.0)
  Coherence:  high
  Conciseness: high
  Brand Voice: Perfect Match
    Reason: Clear, professional instructions with appropriate tone and
    sentence length.
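
Two of the example brand guidelines (no exclamation marks, sentences under 25 words) can be checked mechanically. The sketch below is an optional local pre-check, not part of the Fiddler SDK, that could run before the LLM judge. The function name rule_violations and the naive sentence splitting are assumptions made for illustration; tone, jargon, and slang still need the judge.

```python
import re

MAX_WORDS_PER_SENTENCE = 25  # from the guideline "Keep sentences under 25 words."


def rule_violations(text):
    """Return a list of mechanically checkable guideline violations."""
    violations = []
    if '!' in text:
        violations.append('Contains exclamation marks')
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    longest = max((len(s.split()) for s in sentences), default=0)
    if longest >= MAX_WORDS_PER_SENTENCE:
        violations.append(f'Sentence with {longest} words')
    return violations


print(rule_violations('Welcome to our platform. We are glad you chose us.'))
# → []
print(rule_violations('OMG WELCOME!!! You are going to LOVE this!!'))
# → ['Contains exclamation marks']
```

Drafts that fail a rule like this could skip the judge call entirely, which keeps LLM usage for the cases that need judgment.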

Build a Quality Gate

Combine evaluator scores into an automated quality gate that flags content for human review:

def quality_gate(instruction, content, brand_guidelines):
    """Automated quality gate for content generation agents.

    Returns 'APPROVED', 'REVIEW', or 'REJECTED' with reasons.
    """
    issues = []

    # Check relevance
    rel = relevance.score(user_query=instruction, rag_response=content)
    if rel.value < 0.5:
        issues.append(f'Low relevance ({rel.label})')

    # Check coherence
    coh = coherence.score(prompt=instruction, response=content)
    if coh.value < 0.5:
        issues.append(f'Low coherence ({coh.label})')

    # Check brand voice
    brand = brand_voice_judge.score(inputs={
        'content': content,
        'brand_guidelines': brand_guidelines,
    })
    brand_dict = {s.name: s for s in brand}
    voice = brand_dict['voice_match_score'].label
    if voice == 'Off-Brand':
        issues.append('Off-brand content')
    elif voice == 'Minor Deviations':
        issues.append('Minor brand deviations')

    if not issues:
        return 'APPROVED', []
    # Blocking issues start with 'Low' or 'Off-brand'; match the exact
    # casing used in the messages above so off-brand drafts are rejected.
    elif any(i.startswith(('Low', 'Off-brand')) for i in issues):
        return 'REJECTED', issues
    else:
        return 'REVIEW', issues

# Run the quality gate
for item in generated_content:
    status, issues = quality_gate(
        item['instruction'], item['content'], brand_guidelines,
    )
    print(f"{status}: {item['content'][:60]}...")
    if issues:
        print(f"  Issues: {', '.join(issues)}")

Expected output:

APPROVED: Welcome to our platform. We are glad you chose us. Your ...
REJECTED: OMG WELCOME!!! You are going to LOVE this!! Our platform...
  Issues: Low coherence (low), Off-brand content
APPROVED: To request a refund, navigate to your order history, sel...
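
Once the gate runs over a batch, the decisions can be rolled up into a simple summary. The sketch below uses only the standard library; the summarize helper is a hypothetical name, and the decisions list is copied from the expected output above instead of being produced by live evaluator calls.

```python
from collections import Counter


def summarize(decisions):
    """Count gate statuses and compute the share of approved drafts.

    decisions: list of (status, issues) tuples as returned by quality_gate.
    """
    counts = Counter(status for status, _ in decisions)
    total = sum(counts.values())
    approval_rate = counts['APPROVED'] / total if total else 0.0
    return dict(counts), approval_rate


# Decisions copied from the expected output of the quality gate run.
decisions = [
    ('APPROVED', []),
    ('REJECTED', ['Low coherence (low)', 'Off-brand content']),
    ('APPROVED', []),
]
counts, rate = summarize(decisions)
print(counts, f'{rate:.0%}')
# → {'APPROVED': 2, 'REJECTED': 1} 67%
```

Tracking the approval rate per batch gives a single number to watch when prompts or models change.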

Production Monitoring

To deploy these evaluators in production:

Evaluator Rules: Configure built-in evaluators (Answer Relevance, Coherence, Conciseness) as Evaluator Rules in your Agentic Monitoring application. See Evaluator Rules.
Custom Judges in Experiments: Run the Brand Voice Match judge as a recurring experiment against sampled production outputs to track brand compliance over time.
Alerting: Set up alerts on evaluator score degradation to catch systemic quality drift after model updates or prompt changes.
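
The degradation check behind such an alert can be sketched locally as a comparison of recent scores against a baseline window. This is not Fiddler's alerting feature; the degraded function, the max_drop threshold of 0.1, and the sample score lists are assumptions chosen for illustration.

```python
from statistics import mean


def degraded(baseline_scores, recent_scores, max_drop=0.1):
    """Return True if the recent mean fell more than max_drop below baseline."""
    if not baseline_scores or not recent_scores:
        return False  # not enough data to compare
    return mean(baseline_scores) - mean(recent_scores) > max_drop


baseline = [1.0, 1.0, 0.5, 1.0]  # mean 0.875, e.g. relevance before a prompt change
recent = [0.5, 0.5, 1.0, 0.5]    # mean 0.625, e.g. relevance after the change
print(degraded(baseline, recent))
# → True
```

A mean comparison is the simplest option; a production rule would likely also require a minimum sample size before firing.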

Next Steps

Building Custom Judge Evaluators — Deep-dive into CustomJudge capabilities
Evaluator Rules — Deploy evaluators in production
Evals SDK Integration — Integration patterns for agentic workflows
