GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

Supported Tasks and Leaderboards

The leaderboard for the GLUE benchmark can be found at https://gluebenchmark.com/. The benchmark comprises the following tasks:

ax

A manually curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena. This dataset evaluates sentence understanding through Natural Language Inference (NLI) problems. Use a model trained on MultiNLI to produce predictions for this dataset.

cola

The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

mnli

The Multi-Genre Natural Language Inference Corpus is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The authors of the benchmark use the standard test set, for which they obtained private labels from the corpus authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) sections. They also use and recommend the SNLI corpus as 550k examples of auxiliary training data.

mnli_matched

The matched validation and test splits from MNLI. See the "mnli" BuilderConfig for additional information.

mnli_mismatched

The mismatched validation and test splits from MNLI. See the "mnli" BuilderConfig for additional information.

mrpc

The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

qnli

The Stanford Question Answering Dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The authors of the benchmark convert the task into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.
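The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark authors' actual procedure: the function names, the word-set overlap measure, the `min_overlap` threshold, and the label strings are all assumptions, and the context is assumed to be already split into sentences.

```python
import re


def word_set(text):
    """Lowercased word tokens as a set (a crude stand-in for lexical overlap)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_qnli_pairs(question, context_sentences, answer_index, min_overlap=1):
    """Pair the question with each context sentence, dropping low-overlap pairs.

    answer_index is the position of the sentence that contains the answer.
    min_overlap is a hypothetical threshold; the source does not specify one.
    """
    question_words = word_set(question)
    pairs = []
    for i, sentence in enumerate(context_sentences):
        overlap = len(question_words & word_set(sentence))
        if overlap < min_overlap:
            continue  # filtered out: too little lexical overlap with the question
        label = "entailment" if i == answer_index else "not_entailment"
        pairs.append({"question": question, "sentence": sentence, "label": label})
    return pairs


example_pairs = make_qnli_pairs(
    "Where is the Eiffel Tower located?",
    [
        "The Eiffel Tower is located in Paris.",
        "It was completed in 1889.",
        "Many tourists visit the tower every year.",
    ],
    answer_index=0,
)
```

In this toy input, the second sentence shares no words with the question and is dropped, while the third survives the filter without containing the answer. This is the sense in which lexical overlap is not a reliable cue.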

qqp

The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions is semantically equivalent.

rte

The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. The authors of the benchmark combined the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Examples are constructed based on news and Wikipedia text. The authors of the benchmark convert all datasets to a two-class split, where for three-class datasets they collapse neutral and contradiction into not entailment, for consistency.
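The two-class conversion amounts to a simple label mapping. The sketch below assumes string label names (`entailment`, `neutral`, `contradiction`, `not_entailment`); these spellings and the function name are illustrative and not taken from the source.

```python
# Hypothetical label spellings; the source only states that neutral and
# contradiction are merged into a single "not entailment" class.
THREE_TO_TWO = {
    "entailment": "entailment",
    "neutral": "not_entailment",
    "contradiction": "not_entailment",
}


def collapse_label(label):
    """Map a three-class NLI label to the two-class scheme."""
    return THREE_TO_TWO[label]


collapsed = [collapse_label(l) for l in ["entailment", "neutral", "contradiction"]]
```

Datasets that are already two-class pass through unchanged, since `entailment` maps to itself.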

sst2

The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels.

stsb

The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.

wnli

The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict whether the sentence with the pronoun substituted is entailed by the original sentence. They use a small evaluation set consisting of new examples derived from fiction books, which was shared privately by the authors of the original corpus. While the included training set is balanced between the two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so a model that memorizes the training examples will predict the wrong label on the corresponding development set example. As with QNLI, each example is evaluated separately, so there is no systematic correspondence between a model's score on this task and its score on the unconverted original task. The authors of the benchmark call the converted dataset WNLI (Winograd NLI).
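The pronoun-substitution step can be illustrated with a small sketch. It is a simplification under stated assumptions: the function name, the whole-word regular-expression substitution, the label strings, and the example sentence are illustrative, and the hypotheses in the actual dataset may be shorter clauses rather than the full rewritten sentence.

```python
import re


def make_wnli_pairs(sentence, pronoun, candidates, correct):
    """Build one (premise, hypothesis, label) triple per candidate referent.

    The first whole-word occurrence of `pronoun` is replaced by each candidate.
    """
    pattern = re.compile(rf"\b{re.escape(pronoun)}\b")
    pairs = []
    for candidate in candidates:
        # A function replacement avoids backslash handling in the candidate text.
        hypothesis = pattern.sub(lambda m: candidate, sentence, count=1)
        label = "entailment" if candidate == correct else "not_entailment"
        pairs.append((sentence, hypothesis, label))
    return pairs


wnli_pairs = make_wnli_pairs(
    "The trophy doesn't fit in the suitcase because it is too big.",
    pronoun="it",
    candidates=["the trophy", "the suitcase"],
    correct="the trophy",
)
```

Each candidate yields its own example, which is why a system is scored per pair rather than per original schema.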

Languages

The language data in GLUE is in English (BCP-47 en).

Dataset Structure

Data Instances

ax

Size of downloaded dataset files: 0.22 MB
Size of the generated dataset: 0.24 MB
Total amount of disk used: 0.46 MB

An example of 'test' looks as follows.

{
  "premise": "The cat sat on the mat.",
  "hypothesis": "The cat did not sit on the mat.",
  "label": -1,
  "idx": 0
}
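Records like this one can typically be obtained through the Hugging Face `datasets` library. The sketch below assumes that library is installed and that the dataset is published under the repository identifier `glue`; the identifier and the helper name `load_glue_config` are assumptions, while the configuration names are those listed in this card.

```python
GLUE_CONFIGS = (
    "ax", "cola", "mnli", "mnli_matched", "mnli_mismatched", "mrpc",
    "qnli", "qqp", "rte", "sst2", "stsb", "wnli",
)


def load_glue_config(name, split=None, dataset_id="glue"):
    """Load one GLUE configuration, validating the name first.

    dataset_id is an assumed Hub identifier; adjust it if the dataset
    lives under a different repository name.
    """
    if name not in GLUE_CONFIGS:
        raise ValueError(f"unknown GLUE config {name!r}; expected one of {GLUE_CONFIGS}")
    # Imported lazily so that name validation works without the library installed.
    from datasets import load_dataset
    return load_dataset(dataset_id, name, split=split)
```

A call such as `load_glue_config("ax", split="test")` would download the data on first use, so it requires network access.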

cola

Size of downloaded dataset files: 0.38 MB
Size of the generated dataset: 0.61 MB
Total amount of disk used: 0.99 MB

An example of 'train' looks as follows.

{
  "sentence": "Our friends won't buy this analysis, let alone the next one we propose.",
  "label": 1,
  "idx": 0
}

mnli

Size of downloaded dataset files: 312.78 MB
Size of the generated dataset: 82.47 MB
Total amount of disk used: 395.26 MB

An example of 'train' looks as follows.

{
  "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
  "hypothesis": "Product and geography are what make cream skimming work.",
  "label": 1,
  "idx": 0
}
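The integer `label` can be decoded into a class name. The sketch below assumes the ordering entailment (0), neutral (1), contradiction (2), which is the convention used by the Hugging Face loader for this configuration but is not stated in this section; treat it as an assumption and confirm it against the dataset's feature metadata.

```python
# Assumed class ordering for the mnli configuration (not stated in this section).
MNLI_LABEL_NAMES = ("entailment", "neutral", "contradiction")


def decode_mnli_label(label_id):
    """Return the class name for an integer MNLI label in the range 0..2."""
    if not 0 <= label_id < len(MNLI_LABEL_NAMES):
        raise ValueError(f"label id out of range: {label_id}")
    return MNLI_LABEL_NAMES[label_id]


example = {
    "premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
    "hypothesis": "Product and geography are what make cream skimming work.",
    "label": 1,
    "idx": 0,
}
decoded = decode_mnli_label(example["label"])
```

Under that assumed ordering, the training example above would be a neutral pair.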

mnli_matched

Size of downloaded dataset files: 312.78 MB
Size of the generated dataset: 3.69 MB
Total amount of disk used: 316.48 MB

An example of 'test' looks as follows.

{
  "premise": "Hierbas, ans seco, ans dulce, and frigola are just a few names worth keeping a look-out for.",
  "hypothesis": "Hierbas is a name worth looking out for.",
  "label": -1,
  "idx": 0
}

mnli_mismatched

Size of downloaded dataset files: 312.78 MB
Size of the generated dataset: 3.91 MB
Total amount of disk used: 316.69 MB

An example of 'test' looks as follows.

{
  "premise": "What have you decided, what are you going to do?",
  "hypothesis": "So what's your decision?",
  "label": -1,
  "idx": 0
}
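The test examples shown in this section carry a `label` of -1, which appears to act as a placeholder for labels that are withheld. A minimal sketch for separating such records before computing any local metric follows; the constant and function names are illustrative.

```python
UNLABELED = -1  # placeholder value seen in the test examples in this section


def split_by_label_availability(examples):
    """Return (labeled, unlabeled) lists based on the placeholder label."""
    labeled = [ex for ex in examples if ex["label"] != UNLABELED]
    unlabeled = [ex for ex in examples if ex["label"] == UNLABELED]
    return labeled, unlabeled


records = [
    {"premise": "What have you decided, what are you going to do?",
     "hypothesis": "So what's your decision?", "label": -1, "idx": 0},
    {"premise": "Conceptually cream skimming has two basic dimensions - product and geography.",
     "hypothesis": "Product and geography are what make cream skimming work.",
     "label": 1, "idx": 0},
]
labeled, unlabeled = split_by_label_availability(records)
```

Scoring unlabeled records against the placeholder would silently count every prediction as wrong, so filtering them out first is the safer default.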
