Dataset Card for End-to-End NLG Challenge
Dataset Summary
The E2E dataset is used for training end-to-end, data-driven natural language generation systems in the restaurant domain. It is ten times bigger than existing, frequently used datasets in this area. The E2E dataset poses new challenges: (1) its human reference texts show more lexical richness and syntactic variation, including discourse phenomena; (2) generating from this set requires content selection. As such, learning from this dataset promises more natural, more varied, and less template-like system utterances.
E2E is released in the following paper where you can find more details and baseline results: https://arxiv.org/abs/1706.09254
Supported Tasks and Leaderboards
text2text-generation-other-meaning-representation-to-text
: The dataset can be used to train a model to generate descriptions in the restaurant domain from meaning representations: the model takes some data about a restaurant as input and generates a natural language sentence that presents the different aspects of that data. Success on this task is typically measured by achieving high BLEU, NIST, METEOR, ROUGE-L, and CIDEr scores. The TGen model (Dušek and Jurčíček, 2016a) was used as a baseline and had the following scores:
| | BLEU | NIST | METEOR | ROUGE_L | CIDEr |
|---|---|---|---|---|---|
| BASELINE | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338 |
This task has an inactive leaderboard that ranks models based on the metrics above.
Languages
The dataset is in English (en).
Dataset Structure
Data Instances
Example of one instance:
{'human_reference': 'The Vaults pub near Café Adriatic has a 5 star rating. Prices start at £30.',
'meaning_representation': 'name[The Vaults], eatType[pub], priceRange[more than £30], customer rating[5 out of 5], near[Café Adriatic]'}
Data Fields
human_reference
: string, the natural language text that describes the different characteristics in the meaning representation
meaning_representation
: list of slots and values to generate a description from
Each MR consists of 3–8 attributes (slots), such as name, food or area, and their values.
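
The slot[value] format shown in the example instance can be split into a mapping with a short regular expression. The following is a minimal sketch, not part of the dataset tooling: the `parse_mr` helper and `SLOT_PATTERN` are hypothetical names, and the pattern assumes that values contain no closing bracket, as in the example above.

```python
import re
from typing import Dict

# Matches one "slot[value]" pair, e.g. "eatType[pub]".
# Assumption: slot names contain no brackets or commas, values contain no "]".
SLOT_PATTERN = re.compile(r"\s*([^\[\],]+?)\s*\[([^\]]*)\]")


def parse_mr(mr: str) -> Dict[str, str]:
    """Split a meaning representation string into a slot -> value mapping."""
    return {slot: value for slot, value in SLOT_PATTERN.findall(mr)}


# The meaning representation from the example instance above.
example = (
    "name[The Vaults], eatType[pub], priceRange[more than £30], "
    "customer rating[5 out of 5], near[Café Adriatic]"
)
slots = parse_mr(example)
for slot, value in slots.items():
    print(f"{slot}: {value}")
```

Slot names that contain a space, such as `customer rating`, are kept whole because the pattern only splits on brackets and commas.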
Data Splits
The dataset is split into training, validation and testing sets (in a 76.5-8.5-15 ratio), keeping a similar distribution of MR and reference text lengths and ensuring that MRs in different sets are distinct.
| | train | validation | test |
|---|---|---|---|
| N. Instances | 42061 | 4672 | 4693 |
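
The property that MRs in different sets are distinct can be checked with plain set intersections. The sketch below is illustrative only: `overlapping_mrs` is a hypothetical helper, and the toy MR strings are made up for the example rather than taken from the real splits. It compares sets of MRs, so an MR that is repeated inside a single split is not counted as an overlap.

```python
from typing import Dict, Iterable, Set, Tuple


def overlapping_mrs(
    splits: Dict[str, Iterable[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the MRs shared by each pair of splits (empty dict if none)."""
    # Deduplicate within each split first; only cross-split overlap matters.
    unique = {name: set(mrs) for name, mrs in splits.items()}
    names = sorted(unique)
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = unique[first] & unique[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Toy data (made up): the train MR is repeated, as it would be with several references.
toy_splits = {
    "train": ["name[A], eatType[pub]", "name[A], eatType[pub]"],
    "validation": ["name[B], eatType[coffee shop]"],
    "test": ["name[C], eatType[pub]"],
}
print(overlapping_mrs(toy_splits))
```

With real data, the lists would hold the `meaning_representation` field of each split.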
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
[More Information Needed]
Initial Data Collection and Normalization
The data was collected using the CrowdFlower platform and quality-controlled following Novikova et al. (2016).
Who are the source language producers?
[More Information Needed]
Annotations
Following Novikova et al. (2016), the E2E data was collected using pictures as stimuli, which was shown to elicit significantly more natural, more informative, and better phrased human references than textual MRs.
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
[More Information Needed]
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
[More Information Needed]
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
[More Information Needed]
Citation Information
@article{dusek.etal2020:csl,
title = {Evaluating the {{State}}-of-the-{{Art}} of {{End}}-to-{{End Natural Language Generation}}: {{The E2E NLG Challenge}}},
author = {Du{\v{s}}ek, Ond\v{r}ej and Novikova, Jekaterina and Rieser, Verena},
year = {2020},
month = jan,
volume = {59},
pages = {123--156},
doi = {10.1016/j.csl.2019.06.009},
archivePrefix = {arXiv},
eprint = {1901.11528},
eprinttype = {arxiv},
journal = {Computer Speech \& Language}
}
Contributions
Thanks to @lhoestq for adding this dataset.