Name: quora
Creator: BEIR
License: https://choosealicense.com/licenses/cc-by-sa-4.0/

Tasks:

Text Retrieval

Sub-tasks: entity-linking-retrieval, fact-checking-retrieval

Languages: English

Multilinguality: monolingual

Dataset Viewer

_id (string)	title (string)	text (string)
"1"	""	"What is the step by step guide to invest in share market in india?"
"2"	""	"What is the step by step guide to invest in share market?"
"3"	""	"What is the story of Kohinoor (Koh-i-Noor) Diamond?"
"4"	""	"What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?"
"5"	""	"How can I increase the speed of my internet connection while using a VPN?"
"6"	""	"How can Internet speed be increased by hacking through DNS?"
"7"	""	"Why am I mentally very lonely? How can I solve it?"
"8"	""	"Find the remainder when [math]23^{24}[/math] is divided by 24,23?"
"9"	""	"Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?"
"10"	""	"Which fish would survive in salt water?"
"11"	""	"Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
"12"	""	"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"
"13"	""	"Should I buy tiago?"
"14"	""	"What keeps childern active and far from phone and video games?"
"15"	""	"How can I be a good geologist?"
"16"	""	"What should I do to be a great geologist?"
"17"	""	"When do you use シ instead of し?"
"18"	""	"When do you use "&" instead of "and"?"
"19"	""	"Motorola (company): Can I hack my Charter Motorolla DCX3400?"
"20"	""	"How do I hack Motorola DCX3400 for free internet?"
"21"	""	"Method to find separation of slits using fresnel biprism?"
"22"	""	"What are some of the things technicians can tell about the durability and reliability of Laptops and its components?"
"23"	""	"How do I read and find my YouTube comments?"
"24"	""	"How can I see all my Youtube comments?"
"25"	""	"What can make Physics easy to learn?"
"26"	""	"How can you make physics easy to learn?"
"27"	""	"What was your first sexual experience like?"
"28"	""	"What was your first sexual experience?"
"29"	""	"What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?"
"30"	""	"What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?"
"31"	""	"What would a Trump presidency mean for current international master’s students on an F1 visa?"
"32"	""	"How will a Trump presidency affect the students presently in US or planning to study in US?"
"33"	""	"What does manipulation mean?"
"34"	""	"What does manipulation means?"
"35"	""	"Why do girls want to be friends with the guy they reject?"
"36"	""	"How do guys feel after rejecting a girl?"
"37"	""	"Why are so many Quora users posting questions that are readily answered on Google?"
"38"	""	"Why do people ask Quora questions which can be answered easily by Google?"
"39"	""	"Which is the best digital marketing institution in banglore?"
"40"	""	"Which is the best digital marketing institute in Pune?"
"41"	""	"Why do rockets look white?"
"42"	""	"Why are rockets and boosters painted white?"
"43"	""	"What's causing someone to be jealous?"
"44"	""	"What can I do to avoid being jealous of someone?"
"45"	""	"What are the questions should not ask on Quora?"
"47"	""	"How much is 30 kV in HP?"
"48"	""	"Where can I find a conversion chart for CC to horsepower?"
"49"	""	"What does it mean that every time I look at the clock the numbers are the same?"
"50"	""	"How many times a day do a clock’s hands overlap?"
"51"	""	"What are some tips on making it through the job interview process at Medicines?"
"52"	""	"What are some tips on making it through the job interview process at Foundation Medicine?"
"53"	""	"What is web application?"
"54"	""	"What is the web application framework?"
"55"	""	"Does society place too much importance on sports?"
"56"	""	"How do sports contribute to the society?"
"57"	""	"What is best way to make money online?"
"58"	""	"What is best way to ask for money online?"
"59"	""	"How should I prepare for CA final law?"
"60"	""	"How one should know that he/she completely prepare for CA final exam?"
"61"	""	"What's one thing you would like to do better?"
"62"	""	"What's one thing you do despite knowing better?"
"63"	""	"What are some special cares for someone with a nose that gets stuffy during the night?"
"64"	""	"How can I keep my nose from getting stuffy at night?"
"65"	""	"What Game of Thrones villain would be the most likely to give you mercy?"
"66"	""	"What Game of Thrones villain would you most like to be at the mercy of?"
"67"	""	"Does the United States government still blacklist (employment, etc.) some United States citizens because their political views?"
"68"	""	"How is the average speed of gas molecules determined?"
"69"	""	"What is the best travel website in spain?"
"70"	""	"What is the best travel website?"
"71"	""	"Why do some people think Obama will try to take their guns away?"
"72"	""	"Has there been a gun control initiative to take away guns people already own?"
"73"	""	"I'm a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years?"
"74"	""	"I am a 19 year old guy. How can I become a billionaire in the next 10 years?"
"75"	""	"When a girlfriend asks her boyfriend "Why did you choose me? What makes you want to be with me?", what should one reply to her?"
"76"	""	"My girlfriend said that we should end this because she is confused about her feelings for me. I wished her well and disconnected. Should I call her and ask her if she wants to get back together?"
"77"	""	"How do we prepare for UPSC?"
"78"	""	"How do I prepare for civil service?"
"79"	""	"What is the stall speed and AOA of an f-14 with wings fully swept back?"
"80"	""	"Why did aircraft stop using variable-sweep wings, like those on an F-14?"
"81"	""	"Why do Slavs squat?"
"82"	""	"Will squats make my legs thicker?"
"83"	""	"When can I expect my Cognizant confirmation mail?"
"84"	""	"When can I expect Cognizant confirmation mail?"
"85"	""	"Can I make 50,000 a month by day trading?"
"86"	""	"Can I make 30,000 a month by day trading?"
"87"	""	"Is being a good kid and not being a rebel worth it in the long run?"
"88"	""	"Is being bored good for a kid?"
"89"	""	"What universities does Rexnord recruit new grads from? What majors are they looking for?"
"90"	""	"What universities does B&G Foods recruit new grads from? What majors are they looking for?"
"91"	""	"What is the quickest way to increase Instagram followers?"
"92"	""	"How can we increase our number of Instagram followers?"
"93"	""	"How did Darth Vader fought Darth Maul in Star Wars Legends?"
"94"	""	"Does Quora have a character limit for profile descriptions?"
"95"	""	"What are the stages of breaking up between couple? I mean, what happens after the breaking up emotionally whether its a male or female?"
"96"	""	"Who is affected more by a breakup, the boy or the girl?"
"97"	""	"What are some examples of products that can be make from crude oil?"
"98"	""	"What are some of the products made from crude oil?"
"99"	""	"How do I make friends."
"100"	""	"How to make friends ?"
"101"	""	"Is Career Launcher good for RBI Grade B preparation?"

Dataset Card for BEIR Benchmark

Dataset Summary

BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CQADupstack
Citation-Prediction: SCIDOCS
Tweet Retrieval: Signal-1M
Entity Retrieval: DBPedia

All these datasets have been preprocessed and can be used for your experiments.

Supported Tasks and Leaderboards

The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia.

The current best performing models can be found here.

Languages

All tasks are in English (en).

Dataset Structure

All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format:

corpus file: a .jsonl (JSON Lines) file with one dictionary per line, each with three fields: _id, a unique document identifier; title, the document title (optional); and text, the document paragraph or passage. For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries file: a .jsonl (JSON Lines) file with one dictionary per line, each with two fields: _id, a unique query identifier; and text, the query text. For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels file: a .tsv (tab-separated) file with three columns, in this order: query-id, corpus-id and score. Keep the first row as a header. For example: q1 doc1 1
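
The three file formats above can be read with the Python standard library alone. The following is a minimal sketch, not the official BEIR loader: the function names and the file layout (`corpus.jsonl`, `queries.jsonl`, `qrels/<split>.tsv`) are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path


def load_jsonl(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_beir_folder(folder, split="test"):
    """Load corpus, queries and qrels into dicts keyed by id.

    The file names used here are assumptions, not taken from the card.
    """
    folder = Path(folder)
    corpus = {
        doc["_id"]: {"title": doc.get("title", ""), "text": doc["text"]}
        for doc in load_jsonl(folder / "corpus.jsonl")
    }
    queries = {q["_id"]: q["text"] for q in load_jsonl(folder / "queries.jsonl")}

    qrels = {}
    qrels_path = folder / "qrels" / f"{split}.tsv"
    with open(qrels_path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # the first row is the header
        for row in reader:
            if not row:
                continue  # tolerate trailing blank lines
            query_id, corpus_id, score = row
            qrels.setdefault(query_id, {})[corpus_id] = int(score)
    return corpus, queries, qrels
```

Calling `load_beir_folder("path/to/dataset")` would return three dictionaries in the same nested shape as the in-memory example in the Data Instances section.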

Data Instances

A high-level example of any BEIR dataset:

corpus = {
    "doc1" : {
        "title": "Albert Einstein", 
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
        },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
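
To illustrate how the `qrels` dictionary is used, the sketch below scores a set of retrieval results against it. Recall@k is chosen here only because it is simple to follow; it is not stated in this card to be the benchmark's reported metric, and the `results` scores are made up for the example.

```python
def recall_at_k(qrels, results, k=10):
    """Mean fraction of each query's relevant documents found in its top-k results."""
    per_query = []
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, score in judged.items() if score > 0}
        if not relevant:
            continue  # queries without positive judgments are skipped
        scored = results.get(query_id, {})
        ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
        top_k = {doc_id for doc_id, _ in ranked[:k]}
        per_query.append(len(relevant & top_k) / len(relevant))
    return sum(per_query) / len(per_query) if per_query else 0.0


qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

# Hypothetical retrieval scores: q1 ranks its relevant document first, q2 does not.
results = {
    "q1": {"doc1": 0.9, "doc2": 0.1},
    "q2": {"doc1": 0.8, "doc2": 0.3},
}

print(recall_at_k(qrels, results, k=1))  # 0.5
print(recall_at_k(qrels, results, k=2))  # 1.0
```

The `results` structure mirrors `qrels` (query id, then document id, then a number), which keeps the comparison a plain dictionary lookup.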

Data Fields

Examples from all configurations have the following features:

Corpus

corpus: a dict feature representing the document title and passage text, made up of:
- _id: a string feature representing the unique document id
  - title: a string feature, denoting the title of the document.
  - text: a string feature, denoting the text of the document.

Queries

queries: a dict feature representing the query, made up of:
- _id: a string feature representing the unique query id
- text: a string feature, denoting the text of the query.

Qrels

qrels: a dict feature representing the query document relevance judgments, made up of:
- _id: a string feature representing the query id
  - _id: a string feature, denoting the document id.
  - score: an int32 feature, denoting the relevance judgment between query and document.
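
The field descriptions above can be turned into a small sanity check for individual records. This is a sketch under assumptions: the helper name `check_fields` and the three schema constants are hypothetical, `title` is treated as a required string that may be empty, and the qrels column names follow the on-disk header (`query-id`, `corpus-id`, `score`).

```python
# Expected field names and Python types, following the descriptions above.
CORPUS_FIELDS = {"_id": str, "title": str, "text": str}
QUERY_FIELDS = {"_id": str, "text": str}
QREL_FIELDS = {"query-id": str, "corpus-id": str, "score": int}


def check_fields(record, required):
    """Return a list of problems found in one record (an empty list means it passed)."""
    problems = []
    for name, expected_type in required.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    return problems


print(check_fields({"_id": "1", "title": "", "text": "Which fish would survive in salt water?"}, CORPUS_FIELDS))
print(check_fields({"_id": 1, "text": "What is web application?"}, QUERY_FIELDS))
```

The first call prints an empty list; the second reports that `_id` should be a string, which is the kind of mistake that tends to appear when ids are parsed as integers.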

Data Splits

Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
MSMARCO	Homepage	`msmarco`	`train` `dev` `test`	6,980	8.84M	1.1	Link	`444067daf65d982533ea17ebd59501e4`
TREC-COVID	Homepage	`trec-covid`	`test`	50	171K	493.5	Link	`ce62140cb23feb9becf6270d0d1fe6d1`
NFCorpus	Homepage	`nfcorpus`	`train` `dev` `test`	323	3.6K	38.2	Link	`a89dba18a62ef92f7d323ec890a0d38d`
BioASQ	Homepage	`bioasq`	`train` `test`	500	14.91M	8.05	No	How to Reproduce?
NQ	Homepage	`nq`	`train` `test`	3,452	2.68M	1.2	Link	`d4d3d2e48787a744b6f6e691ff534307`
HotpotQA	Homepage	`hotpotqa`	`train` `dev` `test`	7,405	5.23M	2.0	Link	`f412724f78b0d91183a0e86805e16114`
FiQA-2018	Homepage	`fiqa`	`train` `dev` `test`	648	57K	2.6	Link	`17918ed23cd04fb15047f73e6c3bd9d9`
Signal-1M(RT)	Homepage	`signal1m`	`test`	97	2.86M	19.6	No	How to Reproduce?
TREC-NEWS	Homepage	`trec-news`	`test`	57	595K	19.6	No	How to Reproduce?
ArguAna	Homepage	`arguana`	`test`	1,406	8.67K	1.0	Link	`8ad3e3c2a5867cdced806d6503f29b99`
Touche-2020	Homepage	`webis-touche2020`	`test`	49	382K	19.0	Link	`46f650ba5a527fc69e0a6521c5a23563`
CQADupstack	Homepage	`cqadupstack`	`test`	13,145	457K	1.4	Link	`4e41456d7df8ee7760a7f866133bda78`
Quora	Homepage	`quora`	`dev` `test`	10,000	523K	1.6	Link	`18fb154900ba42a600f84b839c173167`
DBPedia	Homepage	`dbpedia-entity`	`dev` `test`	400	4.63M	38.2	Link	`c2a39eb420a3164af735795df012ac2c`
SCIDOCS	Homepage	`scidocs`	`test`	1,000	25K	4.9	Link	`38121350fc3a4d2f48850f6aff52e4a9`
FEVER	Homepage	`fever`	`train` `dev` `test`	6,666	5.42M	1.2	Link	`5a818580227bfb4b35bb6fa46d9b6c03`
Climate-FEVER	Homepage	`climate-fever`	`test`	1,535	5.42M	3.0	Link	`8b66f0a9126c521bae2bde127b4dc99d`
SciFact	Homepage	`scifact`	`train` `test`	300	5K	1.1	Link	`5f7d1de60b170fc8027bb7898e2efca1`
Robust04	Homepage	`robust04`	`test`	249	528K	69.9	No	How to Reproduce?
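
The md5 column can be used to check a download before unpacking it. The sketch below assumes that each checksum refers to the downloaded archive file; the function names and the local file name in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file's md5 matches the expected value from the table."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the Quora row, a call such as `verify_download("quora.zip", "18fb154900ba42a600f84b839c173167")` would return `True` only if the local file matches the listed checksum.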

Dataset Creation

Curation Rationale

[Needs More Information]

Source Data

Initial Data Collection and Normalization

[Needs More Information]

Who are the source language producers?

[Needs More Information]

Annotations

Annotation process

[Needs More Information]

Who are the annotators?

[Needs More Information]

Personal and Sensitive Information

[Needs More Information]

Considerations for Using the Data

Social Impact of Dataset

[Needs More Information]

Discussion of Biases

[Needs More Information]

Other Known Limitations

[Needs More Information]

Additional Information

Dataset Curators

[Needs More Information]

Licensing Information

[Needs More Information]

Citation Information

Cite as:

@inproceedings{
thakur2021beir,
title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}