Documentation Index
Fetch the complete documentation index at: https://handbook.fiddler.ai/llms.txt
Use this file to discover all available pages before exploring further.
Create a fully customizable LLM-as-a-Judge evaluator with your own prompt and output schema.
The CustomJudge evaluator allows you to define arbitrary evaluation criteria by specifying a custom prompt template and structured output fields. This is the most flexible evaluator in the Fiddler Evals SDK, enabling you to build domain-specific evaluation logic without writing custom code.
Key Features:
- Custom Prompts: Define your own evaluation prompt with
{{ placeholder }} syntax
- Structured Outputs: Specify typed output fields (string, boolean, integer, number)
- Categorical Choices: Constrain string outputs to predefined categories
- Multi-Field Outputs: Return multiple scores/labels from a single evaluation
- Field Descriptions: Guide the LLM with descriptions for each output field
Use Cases:
- Domain-Specific Evaluation: Create evaluators tailored to your industry or use case
- Custom Rubrics: Implement grading rubrics with specific criteria
- Multi-Aspect Scoring: Evaluate multiple dimensions (e.g., tone, accuracy, helpfulness)
- Classification Tasks: Categorize responses into predefined labels
- Compliance Checking: Verify responses meet specific guidelines or policies
Output Field Types:
- string: Free-form text output, or categorical if
choices is specified
- boolean: True/False classification
- integer: Whole number scores (e.g., 1-5 rating scale)
- number: Floating-point scores (e.g., 0.0-1.0 confidence)
Parameters
| Parameter | Type | Required | Default | Description |
|---|
prompt_template | str | ✗ | None | The evaluation prompt with {{ placeholder }} markers for dynamic content. Placeholders are filled from the inputs dict passed to the score() method. |
output_fields | Dict[str, OutputField] | ✗ | None | Schema defining the expected outputs. Each field has: type: One of ‘string’, ‘boolean’, ‘integer’, ‘number’; choices (optional): List of allowed values for categorical string fields; description (optional): Instructions for the LLM about this field |
model | str | ✗ | None | LLM Gateway model name in {provider}/{model} format. E.g., openai/gpt-4o, anthropic/claude-3-sonnet |
credential | str, optional | ✗ | None | Name of the LLM Gateway credential for the provider. |
Returns
A list of Score objects, one for each output field defined. : Each Score contains:
- name: The output field name (e.g., “sentiment”, “confidence”)
- value: The numeric value (for number/integer/boolean fields)
- label: The string label (for string/categorical fields)
- reasoning: Always None for CustomJudge (reasoning is returned as a field)
Return type: list[Score]
Example
Basic sentiment analysis with categorical output:
from fiddler_evals.evaluators import CustomJudge
evaluator = CustomJudge(
model="openai/gpt-4o",
credential="my-openai-key",
prompt_template="""
Analyze the sentiment of the following customer review:
Review: {{ review_text }}
Classify the sentiment and explain your reasoning.
""",
output_fields={
"sentiment": {
"type": "string",
"choices": ["positive", "negative", "neutral"],
},
"confidence": {
"type": "number",
"description": "Confidence score between 0 and 1"
},
"reasoning": {
"type": "string",
}
}
)
scores = evaluator.score(inputs={
"review_text": "The product exceeded my expectations! Fast shipping too."
})
# Access individual scores by index or iterate
for score in scores:
print(f"{score.name}: {score.value or score.label}")
# Output:
# sentiment: positive
# confidence: 0.95
# reasoning: The review expresses satisfaction...
Example
Multi-criteria response quality evaluation:
evaluator = CustomJudge(
model="anthropic/claude-3-sonnet",
credential="my-anthropic-key",
prompt_template="""
Evaluate the quality of this customer support response.
Customer Question: {{ question }}
Support Response: {{ response }}
Rate the response on multiple criteria.
""",
output_fields={
"helpful": {
"type": "boolean",
"description": "Does the response address the customer's question?"
},
"professional_tone": {
"type": "boolean",
"description": "Is the tone professional and courteous?"
},
"quality_score": {
"type": "integer",
"description": "Overall quality rating from 1 (poor) to 5 (excellent)"
}
}
)
scores = evaluator.score(inputs={
"question": "How do I reset my password?",
"response": "Click 'Forgot Password' on the login page and follow the steps."
})
# Convert to dict for easy access
scores_dict = {s.name: s for s in scores}
print(f"Helpful: {scores_dict['helpful'].value}") # True
print(f"Quality: {scores_dict['quality_score'].value}") # 4
Example
Code review evaluator:
evaluator = CustomJudge(
model="openai/gpt-4o",
credential="my-openai-key",
prompt_template="""
Review this code change for potential issues:
```{{ language }}
```
Context: {{ pr_description }}
""",
output_fields={
"has_bugs": {
"type": "boolean",
"description": "Are there any obvious bugs or logic errors?"
},
"severity": {
"type": "string",
"choices": ["critical", "major", "minor", "none"],
"description": "Severity of issues found"
},
"feedback": {
"type": "string",
"description": "Specific feedback for the code author"
}
}
)
```python
{{ code_diff }}
<div data-gb-custom-block data-tag="hint" data-style='info'>
- Placeholder names in `{{ }}` must exactly match keys in the `inputs` dict
- The LLM is instructed to return JSON matching your output schema
- For best results, include clear descriptions for each output field
- Use `choices` for categorical fields to ensure consistent outputs
- This evaluator requires an active connection to the Fiddler API
</div>
#### name *= 'custom_judge'*
#### score()
Score using the Custom Judge.
#### Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `inputs` | `Dict[str, Any]` | ✗ | `None` | Values for the {{ placeholders }} in your prompt_template. Keys must match placeholder names exactly. |
#### Returns
A list of Score objects, one for each output field defined.
**Return type:** list[Score]
#### Raises
**ValueError** -- If inputs is empty.