Evaluator
The evaluator classes for automatic evaluation.
Evaluator classes
The main entry point for using the evaluator:
evaluate.evaluator
< source >( task: str = None ) → Evaluator
Parameters
- task (str): The task defining which evaluator will be returned. Currently accepted tasks are:
  - "image-classification": will return an ImageClassificationEvaluator.
  - "question-answering": will return a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis" available): will return a TextClassificationEvaluator.
  - "token-classification": will return a TokenClassificationEvaluator.
Returns
An evaluator suitable for the task.
Utility factory method to build an Evaluator.
Evaluators encapsulate a task and a default metric name. They leverage pipeline functionality from transformers
to simplify the evaluation of multiple combinations of models, datasets and metrics for a given task.
The base class for all evaluator classes:
The Evaluator class is the class from which all evaluators inherit. Refer to this class for methods shared across different evaluators. Base class implementing evaluator operations.
check_required_columns
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset] columns_names: typing.Dict[str, str] )
Ensure the columns required for the evaluation are present in the dataset.
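The check described above can be pictured with a small standalone sketch. The helper name `check_columns_sketch` and the error message are assumptions made for illustration, and a plain list of column names stands in for a loaded dataset; this is not the library's implementation.

```python
from typing import Dict, List


def check_columns_sketch(column_names: List[str], required: Dict[str, str]) -> None:
    """Raise ValueError if any required column is missing.

    `column_names` stands in for the columns of a loaded dataset (assumption).
    `required` maps an argument name (e.g. "input_column") to the column name
    that argument points at.
    """
    for arg_name, column in required.items():
        if column not in column_names:
            raise ValueError(
                f"Invalid `{arg_name}` {column!r}. "
                f"Available columns: {column_names}."
            )


# Hypothetical dataset with "text" and "label" columns: the check passes silently.
check_columns_sketch(["text", "label"], {"input_column": "text", "label_column": "label"})
```

Failing early like this surfaces a wrong column name before any model inference is run.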
compute_metric
< source >( metric: EvaluationModule metric_inputs: typing.Dict strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 random_state: typing.Optional[int] = None )
Compute and return metrics.
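To make the "bootstrap" strategy more concrete, here is a minimal standard-library sketch of a confidence interval for a metric. It uses the simple percentile method rather than scipy's implementation, and the names `accuracy` and `bootstrap_ci` are hypothetical helpers introduced only for this illustration.

```python
import random
from typing import Callable, Sequence, Tuple


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of predictions that match the references."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def bootstrap_ci(
    predictions: Sequence[int],
    references: Sequence[int],
    metric: Callable[[Sequence[int], Sequence[int]], float],
    confidence_level: float = 0.95,
    n_resamples: int = 9999,
    random_state: int = None,
) -> Tuple[float, float]:
    """Percentile bootstrap interval (a simplification, not scipy's method)."""
    rng = random.Random(random_state)  # seeded for reproducibility
    n = len(references)
    scores = []
    for _ in range(n_resamples):
        # Resample (prediction, reference) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([predictions[i] for i in idx], [references[i] for i in idx]))
    scores.sort()
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return low, high


preds = [1, 0, 1, 1, 0, 1, 0, 1]
refs = [1, 0, 0, 1, 0, 1, 1, 1]
low, high = bootstrap_ci(preds, refs, accuracy, n_resamples=500, random_state=0)
```

Fixing `random_state` makes the interval reproducible across runs, which is why it is useful for debugging.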
get_dataset_split
< source >( data subset = None split = None ) → split
Infers which split to use if None is given.
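One plausible way to infer a split is to walk a preference list and take the first name the dataset provides. The sketch below is an assumption for illustration: the preference order and the helper name `choose_split_sketch` are hypothetical and may differ from what the library does.

```python
from typing import List

# Assumed preference order (hypothetical): evaluation-style splits first,
# training data last.
PREFERRED_SPLITS = ["test", "validation", "train"]


def choose_split_sketch(available: List[str]) -> str:
    """Return the first preferred split that exists in `available`."""
    for name in PREFERRED_SPLITS:
        if name in available:
            return name
    raise ValueError(f"No known split found among {available}.")


# A dataset that only ships "train" and "validation" resolves to "validation".
chosen = choose_split_sketch(["train", "validation"])
```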
load_data
< source >( data: typing.Union[str, datasets.arrow_dataset.Dataset] subset: str = None split: str = None ) → data (Dataset)
Parameters
- data (Dataset or str, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Specifies the dataset subset to be passed to name in load_dataset. To be used with datasets with several configurations (e.g. glue/sst2).
- split (str, defaults to None): User-defined dataset split by name (e.g. train, validation, test). Supports slice-split (test[:n]). If not defined and data is of type str, the best split is selected automatically via choose_split().
Returns
data (Dataset)
Loaded dataset which will be used for evaluation.
Load dataset with given subset and split.
A core method of the Evaluator class, which processes the pipeline outputs for compatibility with the metric.
prepare_data
< source >( data: Dataset input_column: str label_column: str *args **kwargs ) → dict
Parameters
- data (Dataset): Specifies the dataset we will run evaluation on.
- input_column (str, defaults to "text"): The name of the column containing the text feature in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
Returns
dict: metric inputs.
list: pipeline inputs.
Prepare data.
prepare_metric
< source >( metric: typing.Union[str, evaluate.module.EvaluationModule] )
Prepare metric.
prepare_pipeline
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None device: int = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task. If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- preprocessor (PreTrainedTokenizerBase or FeatureExtractionMixin, optional, defaults to None): Can be used to overwrite a default preprocessor if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
Prepare pipeline.
The task specific evaluators
ImageClassificationEvaluator
class evaluate.ImageClassificationEvaluator
< source >( task = 'image-classification' default_metric_name = None )
Image classification evaluator.
This image classification evaluator can currently be loaded from evaluator() using the default task name
image-classification.
Methods in this class assume a data format compatible with the ImageClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'image' label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case image-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("image-classification")
>>> data = load_dataset("beans", split="test[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="nateraw/vit-base-beans",
...     data=data,
...     label_column="labels",
...     metric="accuracy",
...     label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
...     strategy="bootstrap"
... )
QuestionAnsweringEvaluator
class evaluate.QuestionAnsweringEvaluator
< source >( task = 'question-answering' default_metric_name = None )
Question answering evaluator. This evaluator handles extractive question answering, where the answer to the question is extracted from a context.
This question answering evaluator can currently be loaded from evaluator() using the default task name
question-answering.
Methods in this class assume a data format compatible with the
QuestionAnsweringPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None question_column: str = 'question' context_column: str = 'context' id_column: str = 'id' label_column: str = 'answers' squad_v2_format: typing.Optional[bool] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case question-answering). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
...     data=data,
...     metric="squad",
... )
Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass squad_v2_format=True to the compute() call.
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("question-answering")
>>> data = load_dataset("squad_v2", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="mrm8488/bert-tiny-finetuned-squadv2",
...     data=data,
...     metric="squad_v2",
...     squad_v2_format=True,
... )
TextClassificationEvaluator
class evaluate.TextClassificationEvaluator
< source >( task = 'text-classification' default_metric_name = None )
Text classification evaluator.
This text classification evaluator can currently be loaded from evaluator() using the default task name
text-classification or with a "sentiment-analysis" alias.
Methods in this class assume a data format compatible with the TextClassificationPipeline - a single textual
feature as input and a categorical label as output.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' second_input_column: typing.Optional[str] = None label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text-classification or its alias sentiment-analysis). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
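The label_mapping argument of compute() is not described in the parameter list. Judging from its type and from the examples on this page, it translates the label strings produced by the pipeline into the numeric labels stored in the dataset. A minimal sketch of that idea, assuming the pipeline returns one dict with a "label" key per input; `map_predictions` is a hypothetical helper, not part of the library:

```python
from typing import Dict, List, Optional, Union

Number = Union[int, float]


def map_predictions(
    outputs: List[Dict[str, object]],
    label_mapping: Optional[Dict[str, Number]] = None,
) -> List[object]:
    """Extract labels from pipeline-style outputs and optionally map them."""
    labels = [out["label"] for out in outputs]
    if label_mapping is None:
        return labels  # keep the raw label strings
    return [label_mapping[label] for label in labels]


# Hypothetical pipeline outputs for two inputs.
outputs = [{"label": "LABEL_1", "score": 0.91}, {"label": "LABEL_0", "score": 0.66}]
predictions = map_predictions(outputs, {"LABEL_0": 0.0, "LABEL_1": 1.0})
print(predictions)  # [1.0, 0.0]
```

Without such a mapping, string predictions like "LABEL_1" could not be compared against integer reference labels by a metric such as accuracy.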
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("text-classification")
>>> data = load_dataset("imdb", split="test[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
...     data=data,
...     metric="accuracy",
...     label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
...     strategy="bootstrap",
...     n_resamples=10,
...     random_state=0
... )
TokenClassificationEvaluator
class evaluate.TokenClassificationEvaluator
< source >( task = 'token-classification' default_metric_name = None )
Token classification evaluator.
This token classification evaluator can currently be loaded from evaluator() using the default task name
token-classification.
Methods in this class assume a data format compatible with the TokenClassificationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: str = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: typing.Optional[int] = None random_state: typing.Optional[int] = None input_column: str = 'tokens' label_column: str = 'ner_tags' join_by: typing.Optional[str] = ' ' )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case token-classification). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
The dataset input and label columns are expected to be formatted as a list of words and a list of labels respectively, following the conll2003 dataset format. Datasets whose inputs are single strings and whose labels are a list of offsets are not supported.
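Since a pipeline consumes a single string while the dataset stores a list of words, the words have to be joined (the join_by argument in the signature, a single space by default) and character positions mapped back to words. The sketch below shows one way to compute where each word lands in the joined string; `word_offsets` is a hypothetical helper and is not presented as the library's own code.

```python
from typing import List, Tuple


def word_offsets(words: List[str], join_by: str = " ") -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each word in join_by.join(words).

    `end` is inclusive, so text[start:end + 1] recovers the word.
    """
    offsets = []
    start = 0
    for word in words:
        end = start + len(word) - 1
        offsets.append((start, end))
        # Skip past the separator to reach the start of the next word.
        start = end + len(join_by) + 1
    return offsets


words = ["New", "York", "is", "a", "city"]
text = " ".join(words)
offsets = word_offsets(words)
print(offsets)  # [(0, 2), (4, 7), (9, 10), (12, 12), (14, 17)]
```

With these offsets, a prediction that carries a character start position can be matched to the word that begins at the same position.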
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("token-classification")
>>> data = load_dataset("conll2003", split="validation[:2]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
...     data=data,
...     metric="seqeval",
... )
For example, the following dataset format is accepted by the evaluator:
from datasets import ClassLabel, Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
    }),
)
For example, the following dataset format is not accepted by the evaluator:
from datasets import Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        # A single string per example, matching the Value(dtype="string") feature below.
        "tokens": ["New York is a city and Felix a person."],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)
TextGenerationEvaluator
class evaluate.TextGenerationEvaluator
< source >( task = 'text-generation' default_metric_name = None predictions_prefix: str = 'generated' )
Text generation evaluator.
This text generation evaluator can currently be loaded from evaluator() using the default task name
text-generation.
Methods in this class assume a data format compatible with the TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None )
Text2TextGenerationEvaluator
class evaluate.Text2TextGenerationEvaluator
< source >( task = 'text2text-generation' default_metric_name = None )
Text2Text generation evaluator.
This Text2Text generation evaluator can currently be loaded from evaluator() using the default task name
text2text-generation.
Methods in this class assume a data format compatible with the Text2TextGenerationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case text2text-generation). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
SummarizationEvaluator
class evaluate.SummarizationEvaluator
< source >( task = 'summarization' default_metric_name = None )
Text summarization evaluator.
This text summarization evaluator can currently be loaded from evaluator() using the default task name
summarization.
Methods in this class assume a data format compatible with the SummarizationPipeline.
compute
< source >( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None data: typing.Union[str, datasets.arrow_dataset.Dataset] = None subset: typing.Optional[str] = None split: typing.Optional[str] = None metric: typing.Union[str, evaluate.module.EvaluationModule] = None tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None strategy: typing.Literal['simple', 'bootstrap'] = 'simple' confidence_level: float = 0.95 n_resamples: int = 9999 device: int = None random_state: typing.Optional[int] = None input_column: str = 'text' label_column: str = 'label' generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None): If the argument is not specified, we initialize the default pipeline for the task (in this case summarization). If the argument is of the type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None): Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name, and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None): Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None): Defines which dataset split to load. If None is passed, the split is inferred using the choose_split function.
- metric (str or EvaluationModule, defaults to None): Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name, and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None): Can be used to overwrite a default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, this argument is ignored.
- strategy (Literal["simple", "bootstrap"], defaults to "simple"): Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95): The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999): The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None): Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will use the CPU; a non-negative integer will run the model on the associated CUDA device ID. If None is provided, the device is inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None): The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text"): The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label"): The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None): The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
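To make the "bootstrap" strategy more concrete, here is a minimal sketch of how a confidence interval can be attached to a metric score. It is not the evaluator's implementation and does not call scipy: it is a plain percentile bootstrap written with the standard library, so the interval it produces may differ from what scipy's bootstrap returns. The names exact_match and percentile_bootstrap are hypothetical helpers; only confidence_level, n_resamples and random_state mirror the parameters documented above.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Toy metric (assumption): fraction of predictions equal to their reference."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


def percentile_bootstrap(
    predictions: Sequence[str],
    references: Sequence[str],
    metric_fn: Callable[[Sequence[str], Sequence[str]], float] = exact_match,
    confidence_level: float = 0.95,
    n_resamples: int = 9999,
    random_state: Optional[int] = None,
) -> Tuple[float, Tuple[float, float]]:
    """Return (score, (low, high)) using a simple percentile bootstrap."""
    rng = random.Random(random_state)  # seeded for reproducible resampling
    n = len(references)
    scores: List[float] = []
    for _ in range(n_resamples):
        # Resample (prediction, reference) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            metric_fn([predictions[i] for i in idx], [references[i] for i in idx])
        )
    scores.sort()
    # Cut off (1 - confidence_level) / 2 of the resampled scores on each side.
    alpha = (1.0 - confidence_level) / 2.0
    low = scores[int(alpha * (n_resamples - 1))]
    high = scores[int((1.0 - alpha) * (n_resamples - 1))]
    return metric_fn(predictions, references), (low, high)


refs = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
preds = ["a", "b", "x", "d", "e", "f", "y", "h", "i", "j"]
score, (low, high) = percentile_bootstrap(preds, refs, n_resamples=999, random_state=0)
print(score, low, high)
```

Fixing random_state makes the resampling repeatable, which is why the parameter is described as useful for debugging.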
TranslationEvaluator
Translation evaluator.
This translation evaluator can currently be loaded from evaluator() using the default task name "translation".
Methods in this class assume a data format compatible with the TranslationPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'text', label_column: str = 'label', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case translation). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
- input_column (str, defaults to "text") — The name of the column containing the input text in the dataset specified by data.
- label_column (str, defaults to "label") — The name of the column containing the labels in the dataset specified by data.
- generation_kwargs (Dict, optional, defaults to None) — The generation kwargs are passed to the pipeline and set the text generation strategy.
Compute the metric for a given pipeline and dataset combination.
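The device rule in the parameter list can be read as a small decision function. The sketch below is an illustration only: resolve_device is a hypothetical helper, not part of evaluate or transformers, and the CUDA availability check is passed in as a plain boolean so the example has no dependencies. Treating 0 as the first CUDA device is an assumption based on the usual device-ordinal convention.

```python
from typing import Optional


def resolve_device(device: Optional[int], cuda_available: bool) -> str:
    """Map the documented `device` argument to a device name (hypothetical helper)."""
    if device is None:
        # Inferred: use CUDA:0 when a GPU is available, otherwise fall back to CPU.
        return "cuda:0" if cuda_available else "cpu"
    if device == -1:
        # -1 explicitly requests the CPU.
        return "cpu"
    if device >= 0:
        # Assumption: non-negative ordinals select the CUDA device with that ID.
        return f"cuda:{device}"
    raise ValueError(f"Unsupported device ordinal: {device}")


print(resolve_device(None, cuda_available=False))
print(resolve_device(-1, cuda_available=True))
print(resolve_device(1, cuda_available=True))
```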
AutomaticSpeechRecognitionEvaluator
class evaluate.AutomaticSpeechRecognitionEvaluator
( task = 'automatic-speech-recognition', default_metric_name = None )
Automatic speech recognition evaluator.
This automatic speech recognition evaluator can currently be loaded from evaluator() using the default task name "automatic-speech-recognition".
Methods in this class assume a data format compatible with the AutomaticSpeechRecognitionPipeline.
compute
( model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, data: typing.Union[str, datasets.arrow_dataset.Dataset] = None, subset: typing.Optional[str] = None, split: typing.Optional[str] = None, metric: typing.Union[str, evaluate.module.EvaluationModule] = None, tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None, strategy: typing.Literal['simple', 'bootstrap'] = 'simple', confidence_level: float = 0.95, n_resamples: int = 9999, device: int = None, random_state: typing.Optional[int] = None, input_column: str = 'path', label_column: str = 'sentence', generation_kwargs: dict = None )
Parameters
- model_or_pipeline (str or Pipeline or Callable or PreTrainedModel or TFPreTrainedModel, defaults to None) — If the argument is not specified, we initialize the default pipeline for the task (in this case automatic-speech-recognition). If the argument is of type str or is a model instance, we use it to initialize a new Pipeline with the given model. Otherwise we assume the argument specifies a pre-initialized pipeline.
- data (str or Dataset, defaults to None) — Specifies the dataset we will run evaluation on. If it is of type str, we treat it as the dataset name and load it. Otherwise we assume it represents a pre-loaded dataset.
- subset (str, defaults to None) — Defines which dataset subset to load. If None is passed, the default subset is loaded.
- split (str, defaults to None) — Defines which dataset split to load. If None is passed, the split is inferred by the choose_split function.
- metric (str or EvaluationModule, defaults to None) — Specifies the metric we use in the evaluator. If it is of type str, we treat it as the metric name and load it. Otherwise we assume it represents a pre-loaded metric.
- tokenizer (str or PreTrainedTokenizer, optional, defaults to None) — Can be used to override the default tokenizer if model_or_pipeline represents a model for which we build a pipeline. If model_or_pipeline is None or a pre-initialized pipeline, we ignore this argument.
- strategy (Literal["simple", "bootstrap"], defaults to "simple") — Specifies the evaluation strategy. Possible values are:
  - "simple": we evaluate the metric and return the scores.
  - "bootstrap": on top of computing the metric scores, we calculate the confidence interval for each of the returned metric keys, using scipy's bootstrap method (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html).
- confidence_level (float, defaults to 0.95) — The confidence_level value passed to bootstrap if the "bootstrap" strategy is chosen.
- n_resamples (int, defaults to 9999) — The n_resamples value passed to bootstrap if the "bootstrap" strategy is chosen.
- device (int, defaults to None) — Device ordinal for CPU/GPU support of the pipeline. Setting this to -1 will leverage CPU; a positive integer will run the model on the associated CUDA device ID. If None is provided, it will be inferred: CUDA:0 is used if available, CPU otherwise.
- random_state (int, optional, defaults to None) — The random_state value passed to bootstrap if the "bootstrap" strategy is chosen. Useful for debugging.
Compute the metric for a given pipeline and dataset combination.
Examples:
>>> from evaluate import evaluator
>>> from datasets import load_dataset
>>> task_evaluator = evaluator("automatic-speech-recognition")
>>> data = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="validation[:40]")
>>> results = task_evaluator.compute(
...     model_or_pipeline="openai/whisper-tiny.en",
...     data=data,
...     input_column="path",
...     label_column="sentence",
...     metric="wer",
... )
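The example above scores transcriptions with the "wer" metric. As a rough illustration of what word error rate measures, the sketch below computes it as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function word_error_rate is a hypothetical helper written for this illustration; it is not the implementation behind the "wer" metric and it does no text normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (illustrative sketch)."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(word_error_rate("hello world", "hello there world"))
```

Lower values are better, and because insertions count as errors the value can exceed 1.0 when the hypothesis is much longer than the reference.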