Evaluate a dataset using a task function and a list of evaluators. This is the main entry point for running evaluation experiments. It creates an experiment, runs the evaluation task on all dataset items, and executes the specified evaluators to generate scores. The function automatically:
- Creates a new experiment with a unique name
- Runs the evaluation task on each dataset item
- Executes all evaluators on the task outputs
- Returns comprehensive results with timing and error information
Key Features
- Automatic Experiment Creation: Creates experiments with unique names
- Task Execution: Runs custom evaluation tasks on dataset items
- Evaluator Orchestration: Executes multiple evaluators on outputs
- Error Handling: Gracefully handles task and evaluator failures
- Result Collection: Returns detailed results with timing information
- Flexible Configuration: Supports custom parameter mapping for evaluators
- Concurrent Processing: Supports concurrent processing of dataset items
Use Cases
- Model Evaluation: Evaluate LLMs on test datasets
- A/B Testing: Compare different model versions or configurations
- Quality Assurance: Validate model performance across different inputs
- Benchmarking: Run standardized evaluations on multiple models
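The flow described above (run the task on each item, score the output with every evaluator, record timing and errors) can be pictured with a minimal, library-free sketch. Everything here is illustrative: `run_evaluation`, the result dictionary layout, and the plain-function evaluators are assumptions made for this example, not the Fiddler implementation or its result types.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical item layout: {"inputs": {...}, "extras": {...}, "metadata": {...}}
Item = Dict[str, Dict[str, Any]]
Task = Callable[[Dict[str, Any], Dict[str, Any], Dict[str, Any]], Dict[str, Any]]


def run_evaluation(
    items: List[Item],
    task: Task,
    evaluators: Dict[str, Callable[[Dict[str, Any]], Any]],
) -> List[Dict[str, Any]]:
    """Illustrative loop only; not how the library is implemented."""
    results = []
    for item in items:
        record: Dict[str, Any] = {"outputs": None, "scores": {}, "errors": []}
        start = time.perf_counter()
        try:
            record["outputs"] = task(item["inputs"], item["extras"], item["metadata"])
        except Exception as exc:  # a failing task must not abort the whole run
            record["errors"].append(f"task: {exc}")
        record["duration_s"] = time.perf_counter() - start

        if record["outputs"] is not None:
            for name, evaluator in evaluators.items():
                try:
                    record["scores"][name] = evaluator(record["outputs"])
                except Exception as exc:  # one bad evaluator should not hide the others
                    record["errors"].append(f"{name}: {exc}")
        results.append(record)
    return results
```

Each record keeps its own error list, so a failure on one item or in one evaluator is reported alongside the scores that did succeed.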
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| dataset | `Dataset` | ✗ | None | The dataset containing test cases to evaluate. |
| task | `Callable[[Dict[str, Any], Dict[str, Any], Dict[str, Any]], Dict[str, Any]]` | ✗ | None | Function that processes dataset items and returns outputs. Must accept (inputs, extras, metadata) and return a dict of outputs. |
| evaluators | `list[Evaluator \| Callable]` | ✗ | None | |
| name_prefix | `str \| None` | ✗ | None | |
| description | `str \| None` | ✗ | None | |
| metadata | `dict \| None` | ✗ | None | |
| score_fn_kwargs_mapping | `Dict[str, str \| Callable[[Dict[str, Any]], Any]] \| None` | ✗ | None | |
| max_workers | `int` | ✗ | 1 | Maximum number of workers to use for concurrent processing. Use more than 1 only if the eval task function is thread-safe. |
Returns
List of ExperimentItemResult objects, each containing the experiment item data and scores for one dataset item. Return type: ExperimentResult
Raises
- ValueError — If dataset is empty or evaluators are invalid.
- RuntimeError — If no connection is available for API calls.
- ApiError — If there’s an error creating the experiment or communicating with the Fiddler API.
Example
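As a minimal sketch of the task-function contract from the parameters table, the function below accepts `(inputs, extras, metadata)` and returns a dict of outputs. The names `answer_question` and `fake_model`, and the `question`, `context`, and `answer` keys, are hypothetical choices for this example. The call to `evaluate` itself is left out because its import path is not given in this section.

```python
from typing import Any, Dict


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return prompt.strip().rstrip("?") + "!"


def answer_question(
    inputs: Dict[str, Any],
    extras: Dict[str, Any],
    metadata: Dict[str, Any],
) -> Dict[str, Any]:
    """Task function: takes (inputs, extras, metadata), returns a dict of outputs."""
    # "question" and "answer" are assumed key names for this example.
    prompt = inputs["question"]
    # Optional context is read from extras when present.
    context = extras.get("context")
    if context:
        prompt = f"{context}\n{prompt}"
    return {"answer": fake_model(prompt)}


outputs = answer_question({"question": "Is this a test?"}, {}, {})
print(outputs)
```

A function of this shape is what would be passed as the `task` argument.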
By default (max_workers=1), the function processes dataset items sequentially. For large datasets, consider raising max_workers to process items concurrently, but only if the task function is thread-safe. The experiment name is automatically made unique by appending a datetime.
Parameter mapping priority, highest first:
- Evaluator-level score_fn_kwargs_mapping (set in evaluator constructor)
- Evaluation-level score_fn_kwargs_mapping (passed to evaluate function)
- Default parameter resolution
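Reading this list as an order of precedence (an assumption here), the lookup can be illustrated with a small resolver. This is a sketch, not the library's code: `resolve_kwargs` is a hypothetical helper, a string mapping value is assumed to name a key in a flat context dict, a callable value is assumed to receive that dict (matching the `Callable[[Dict[str, Any]], Any]` type in the parameters table), and default resolution is assumed to mean a lookup by the parameter's own name.

```python
from typing import Any, Callable, Dict, List, Optional, Union

Mapping = Dict[str, Union[str, Callable[[Dict[str, Any]], Any]]]


def resolve_kwargs(
    param_names: List[str],
    context: Dict[str, Any],
    evaluator_mapping: Optional[Mapping] = None,
    evaluation_mapping: Optional[Mapping] = None,
) -> Dict[str, Any]:
    """Hypothetical resolver: evaluator-level, then evaluation-level, then default."""
    resolved: Dict[str, Any] = {}
    for name in param_names:
        source: Any = name  # default: look the parameter up under its own name
        for mapping in (evaluator_mapping, evaluation_mapping):
            if mapping and name in mapping:
                source = mapping[name]
                break  # the first (higher-priority) mapping wins
        resolved[name] = source(context) if callable(source) else context[source]
    return resolved


context = {"question": "2+2?", "answer": "4", "expected": "4"}
kwargs = resolve_kwargs(
    ["prompt", "response", "expected"],
    context,
    evaluator_mapping={"response": "answer"},
    evaluation_mapping={"prompt": "question", "response": lambda ctx: ctx["answer"] + "!"},
)
```

Here `response` is taken from the evaluator-level mapping even though the evaluation-level mapping also defines it, `prompt` falls through to the evaluation-level mapping, and `expected` is resolved by name.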