_id (string) | title (string) | text (string) |
---|---|---|
"1" | "" | "What is the step by step guide to invest in share market in india?" |
"2" | "" | "What is the step by step guide to invest in share market?" |
"3" | "" | "What is the story of Kohinoor (Koh-i-Noor) Diamond?" |
"4" | "" | "What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?" |
"5" | "" | "How can I increase the speed of my internet connection while using a VPN?" |
"6" | "" | "How can Internet speed be increased by hacking through DNS?" |
"7" | "" | "Why am I mentally very lonely? How can I solve it?" |
"8" | "" | "Find the remainder when [math]23^{24}[/math] is divided by 24,23?" |
"9" | "" | "Which one dissolve in water quikly sugar, salt, methane and carbon di oxide?" |
"10" | "" | "Which fish would survive in salt water?" |
"11" | "" | "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?" |
"12" | "" | "I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?" |
"13" | "" | "Should I buy tiago?" |
"14" | "" | "What keeps childern active and far from phone and video games?" |
"15" | "" | "How can I be a good geologist?" |
"16" | "" | "What should I do to be a great geologist?" |
"17" | "" | "When do you use シ instead of し?" |
"18" | "" | "When do you use "&" instead of "and"?" |
"19" | "" | "Motorola (company): Can I hack my Charter Motorolla DCX3400?" |
"20" | "" | "How do I hack Motorola DCX3400 for free internet?" |
"21" | "" | "Method to find separation of slits using fresnel biprism?" |
"22" | "" | "What are some of the things technicians can tell about the durability and reliability of Laptops and its components?" |
"23" | "" | "How do I read and find my YouTube comments?" |
"24" | "" | "How can I see all my Youtube comments?" |
"25" | "" | "What can make Physics easy to learn?" |
"26" | "" | "How can you make physics easy to learn?" |
"27" | "" | "What was your first sexual experience like?" |
"28" | "" | "What was your first sexual experience?" |
"29" | "" | "What are the laws to change your status from a student visa to a green card in the US, how do they compare to the immigration laws in Canada?" |
"30" | "" | "What are the laws to change your status from a student visa to a green card in the US? How do they compare to the immigration laws in Japan?" |
"31" | "" | "What would a Trump presidency mean for current international master’s students on an F1 visa?" |
"32" | "" | "How will a Trump presidency affect the students presently in US or planning to study in US?" |
"33" | "" | "What does manipulation mean?" |
"34" | "" | "What does manipulation means?" |
"35" | "" | "Why do girls want to be friends with the guy they reject?" |
"36" | "" | "How do guys feel after rejecting a girl?" |
"37" | "" | "Why are so many Quora users posting questions that are readily answered on Google?" |
"38" | "" | "Why do people ask Quora questions which can be answered easily by Google?" |
"39" | "" | "Which is the best digital marketing institution in banglore?" |
"40" | "" | "Which is the best digital marketing institute in Pune?" |
"41" | "" | "Why do rockets look white?" |
"42" | "" | "Why are rockets and boosters painted white?" |
"43" | "" | "What's causing someone to be jealous?" |
"44" | "" | "What can I do to avoid being jealous of someone?" |
"45" | "" | "What are the questions should not ask on Quora?" |
"47" | "" | "How much is 30 kV in HP?" |
"48" | "" | "Where can I find a conversion chart for CC to horsepower?" |
"49" | "" | "What does it mean that every time I look at the clock the numbers are the same?" |
"50" | "" | "How many times a day do a clock’s hands overlap?" |
"51" | "" | "What are some tips on making it through the job interview process at Medicines?" |
"52" | "" | "What are some tips on making it through the job interview process at Foundation Medicine?" |
"53" | "" | "What is web application?" |
"54" | "" | "What is the web application framework?" |
"55" | "" | "Does society place too much importance on sports?" |
"56" | "" | "How do sports contribute to the society?" |
"57" | "" | "What is best way to make money online?" |
"58" | "" | "What is best way to ask for money online?" |
"59" | "" | "How should I prepare for CA final law?" |
"60" | "" | "How one should know that he/she completely prepare for CA final exam?" |
"61" | "" | "What's one thing you would like to do better?" |
"62" | "" | "What's one thing you do despite knowing better?" |
"63" | "" | "What are some special cares for someone with a nose that gets stuffy during the night?" |
"64" | "" | "How can I keep my nose from getting stuffy at night?" |
"65" | "" | "What Game of Thrones villain would be the most likely to give you mercy?" |
"66" | "" | "What Game of Thrones villain would you most like to be at the mercy of?" |
"67" | "" | "Does the United States government still blacklist (employment, etc.) some United States citizens because their political views?" |
"68" | "" | "How is the average speed of gas molecules determined?" |
"69" | "" | "What is the best travel website in spain?" |
"70" | "" | "What is the best travel website?" |
"71" | "" | "Why do some people think Obama will try to take their guns away?" |
"72" | "" | "Has there been a gun control initiative to take away guns people already own?" |
"73" | "" | "I'm a 19-year-old. How can I improve my skills or what should I do to become an entrepreneur in the next few years?" |
"74" | "" | "I am a 19 year old guy. How can I become a billionaire in the next 10 years?" |
"75" | "" | "When a girlfriend asks her boyfriend "Why did you choose me? What makes you want to be with me?", what should one reply to her?" |
"76" | "" | "My girlfriend said that we should end this because she is confused about her feelings for me. I wished her well and disconnected. Should I call her and ask her if she wants to get back together?" |
"77" | "" | "How do we prepare for UPSC?" |
"78" | "" | "How do I prepare for civil service?" |
"79" | "" | "What is the stall speed and AOA of an f-14 with wings fully swept back?" |
"80" | "" | "Why did aircraft stop using variable-sweep wings, like those on an F-14?" |
"81" | "" | "Why do Slavs squat?" |
"82" | "" | "Will squats make my legs thicker?" |
"83" | "" | "When can I expect my Cognizant confirmation mail?" |
"84" | "" | "When can I expect Cognizant confirmation mail?" |
"85" | "" | "Can I make 50,000 a month by day trading?" |
"86" | "" | "Can I make 30,000 a month by day trading?" |
"87" | "" | "Is being a good kid and not being a rebel worth it in the long run?" |
"88" | "" | "Is being bored good for a kid?" |
"89" | "" | "What universities does Rexnord recruit new grads from? What majors are they looking for?" |
"90" | "" | "What universities does B&G Foods recruit new grads from? What majors are they looking for?" |
"91" | "" | "What is the quickest way to increase Instagram followers?" |
"92" | "" | "How can we increase our number of Instagram followers?" |
"93" | "" | "How did Darth Vader fought Darth Maul in Star Wars Legends?" |
"94" | "" | "Does Quora have a character limit for profile descriptions?" |
"95" | "" | "What are the stages of breaking up between couple? I mean, what happens after the breaking up emotionally whether its a male or female?" |
"96" | "" | "Who is affected more by a breakup, the boy or the girl?" |
"97" | "" | "What are some examples of products that can be make from crude oil?" |
"98" | "" | "What are some of the products made from crude oil?" |
"99" | "" | "How do I make friends." |
"100" | "" | "How to make friends ?" |
"101" | "" | "Is Career Launcher good for RBI Grade B preparation?" |
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
- Fact-checking: FEVER, Climate-FEVER, SciFact
- Question-Answering: NQ, HotpotQA, FiQA-2018
- Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
- News Retrieval: TREC-NEWS, Robust04
- Argument Retrieval: Touche-2020, ArguAna
- Duplicate Question Retrieval: Quora, CqaDupstack
- Citation-Prediction: SCIDOCS
- Tweet Retrieval: Signal-1M
- Entity Retrieval: DBPedia
All these datasets have been preprocessed and can be used for your experiments.
Supported Tasks and Leaderboards
The dataset supports a leaderboard that evaluates models against task-specific metrics such as F1 or EM, as well as their ability to retrieve supporting information from Wikipedia.
The current best performing models can be found here.
Languages
All tasks are in English (en).
Dataset Structure
All BEIR datasets must contain a corpus, queries and qrels (relevance judgments file). They must be in the following format:
- corpus file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with three fields: _id with unique document identifier, title with document title (optional) and text with document paragraph or passage. For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
- queries file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with two fields: _id with unique query identifier and text with query text. For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
- qrels file: a .tsv file (tab-separated) that contains three columns, i.e. the query-id, corpus-id and score, in this order. Keep the first row as a header. For example: q1 doc1 1
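The three files described above can be read with the Python standard library alone. The following is a minimal sketch, not the official BEIR loader; the function names `load_jsonl` and `load_qrels` are assumptions made for illustration.

```python
import csv
import json


def load_jsonl(path):
    """Read a jsonlines file into a dict keyed by each record's `_id`."""
    records = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            records[record["_id"]] = record
    return records


def load_qrels(path):
    """Read a tab-separated qrels file into {query_id: {corpus_id: score}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # the first row is the header
        for row in reader:
            if not row:  # tolerate blank lines
                continue
            query_id, corpus_id, score = row[0], row[1], int(row[2])
            qrels.setdefault(query_id, {})[corpus_id] = score
    return qrels
```

With hypothetical file names, `load_jsonl("corpus.jsonl")` and `load_jsonl("queries.jsonl")` would return the documents and queries keyed by `_id`, and `load_qrels("qrels.tsv")` the relevance judgments grouped by query.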
Data Instances
A high-level example of any BEIR dataset:
corpus = {
    "doc1": {
        "title": "Albert Einstein",
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
its influence on the philosophy of science. He is best known to the general public for his mass–energy \
equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
of the photoelectric effect', a pivotal step in the development of quantum theory."
    },
    "doc2": {
        "title": "",  # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made \
with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?"
}
qrels = {
    "q1": {"doc1": 1},
    "q2": {"doc2": 1},
}
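Because qrels refers to queries and documents only by id, it can be useful to cross-check the three dictionaries before running an evaluation. The sketch below assumes the dictionary layout shown above; the helper names `check_consistency` and `mean_relevant_per_query` are hypothetical, and the document texts are abbreviated. The second helper computes the average number of positively judged documents per query, which may be what the Rel D/Q column in the Data Splits table reports, although this card does not define that column.

```python
def check_consistency(corpus, queries, qrels):
    """Return a list of ids that qrels references but the other dicts lack."""
    problems = []
    for query_id, judgements in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels query id {query_id!r} is missing from queries")
        for doc_id in judgements:
            if doc_id not in corpus:
                problems.append(f"qrels document id {doc_id!r} is missing from corpus")
    return problems


def mean_relevant_per_query(qrels):
    """Average number of documents with a positive score per judged query."""
    if not qrels:
        return 0.0
    counts = [
        sum(1 for score in judgements.values() if score > 0)
        for judgements in qrels.values()
    ]
    return sum(counts) / len(counts)


# Abbreviated copies of the example data (texts shortened for illustration).
example_corpus = {
    "doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born..."},
    "doc2": {"title": "", "text": "Wheat beer is a top-fermented beer..."},
}
example_queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
example_qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

print(check_consistency(example_corpus, example_queries, example_qrels))  # prints []
print(mean_relevant_per_query(example_qrels))  # prints 1.0
```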
Data Fields
Examples from all configurations have the following features:
Corpus
- corpus: a dict feature representing the document title and passage text, made up of:
  - _id: a string feature representing the unique document id.
  - title: a string feature, denoting the title of the document.
  - text: a string feature, denoting the text of the document.
Queries
- queries: a dict feature representing the query, made up of:
  - _id: a string feature representing the unique query id.
  - text: a string feature, denoting the text of the query.
Qrels
- qrels: a dict feature representing the query-document relevance judgements, made up of:
  - _id: a string feature representing the query id.
  - _id: a string feature, denoting the document id.
  - score: an int32 feature, denoting the relevance judgement between query and document.
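The field types listed above can be checked record by record before a dataset is used. This is a rough sketch under the assumption that records are plain Python dictionaries; the function names `validate_text_record` and `validate_qrel` are hypothetical and not part of any BEIR tooling. The title is treated as optional, since query records carry only an id and text.

```python
def validate_text_record(record):
    """Return a list of problems for one corpus or query record."""
    problems = []
    for field in ("_id", "text"):  # required in both corpus and queries
        if not isinstance(record.get(field), str):
            problems.append(f"{field} must be a string")
    # title is optional, but must be a string when present
    if "title" in record and not isinstance(record["title"], str):
        problems.append("title must be a string")
    return problems


def validate_qrel(query_id, corpus_id, score):
    """Return a list of problems for one relevance judgement."""
    problems = []
    if not isinstance(query_id, str):
        problems.append("query id must be a string")
    if not isinstance(corpus_id, str):
        problems.append("document id must be a string")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int):
        problems.append("score must be an integer")
    elif not -(2**31) <= score <= 2**31 - 1:
        problems.append("score must fit in int32")
    return problems
```

An empty list means the record matches the types described in this section; a non-empty list names each field that does not.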
Data Splits
Dataset | Website | BEIR-Name | Type | Queries | Corpus | Rel D/Q | Download | md5 |
---|---|---|---|---|---|---|---|---|
MSMARCO | Homepage | msmarco | train dev test | 6,980 | 8.84M | 1.1 | Link | 444067daf65d982533ea17ebd59501e4 |
TREC-COVID | Homepage | trec-covid | test | 50 | 171K | 493.5 | Link | ce62140cb23feb9becf6270d0d1fe6d1 |
NFCorpus | Homepage | nfcorpus | train dev test | 323 | 3.6K | 38.2 | Link | a89dba18a62ef92f7d323ec890a0d38d |
BioASQ | Homepage | bioasq | train test | 500 | 14.91M | 8.05 | No | How to Reproduce? |
NQ | Homepage | nq | train test | 3,452 | 2.68M | 1.2 | Link | d4d3d2e48787a744b6f6e691ff534307 |
HotpotQA | Homepage | hotpotqa | train dev test | 7,405 | 5.23M | 2.0 | Link | f412724f78b0d91183a0e86805e16114 |
FiQA-2018 | Homepage | fiqa | train dev test | 648 | 57K | 2.6 | Link | 17918ed23cd04fb15047f73e6c3bd9d9 |
Signal-1M(RT) | Homepage | signal1m | test | 97 | 2.86M | 19.6 | No | How to Reproduce? |
TREC-NEWS | Homepage | trec-news | test | 57 | 595K | 19.6 | No | How to Reproduce? |
ArguAna | Homepage | arguana | test | 1,406 | 8.67K | 1.0 | Link | 8ad3e3c2a5867cdced806d6503f29b99 |
Touche-2020 | Homepage | webis-touche2020 | test | 49 | 382K | 19.0 | Link | 46f650ba5a527fc69e0a6521c5a23563 |
CQADupstack | Homepage | cqadupstack | test | 13,145 | 457K | 1.4 | Link | 4e41456d7df8ee7760a7f866133bda78 |
Quora | Homepage | quora | dev test | 10,000 | 523K | 1.6 | Link | 18fb154900ba42a600f84b839c173167 |
DBPedia | Homepage | dbpedia-entity | dev test | 400 | 4.63M | 38.2 | Link | c2a39eb420a3164af735795df012ac2c |
SCIDOCS | Homepage | scidocs | test | 1,000 | 25K | 4.9 | Link | 38121350fc3a4d2f48850f6aff52e4a9 |
FEVER | Homepage | fever | train dev test | 6,666 | 5.42M | 1.2 | Link | 5a818580227bfb4b35bb6fa46d9b6c03 |
Climate-FEVER | Homepage | climate-fever | test | 1,535 | 5.42M | 3.0 | Link | 8b66f0a9126c521bae2bde127b4dc99d |
SciFact | Homepage | scifact | train test | 300 | 5K | 1.1 | Link | 5f7d1de60b170fc8027bb7898e2efca1 |
Robust04 | Homepage | robust04 | test | 249 | 528K | 69.9 | No | How to Reproduce? |
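The md5 column above can be used to confirm that a downloaded archive arrived intact. The following is a minimal sketch using Python's `hashlib`; it assumes each checksum refers to the file obtained from the corresponding download link, and the helper names and the file name `quora.zip` in the note below are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_md5("quora.zip", "18fb154900ba42a600f84b839c173167")` would compare a locally saved archive against the Quora row of the table. Reading in chunks keeps memory use flat even for multi-gigabyte files.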
Dataset Creation
Curation Rationale
[Needs More Information]
Source Data
Initial Data Collection and Normalization
[Needs More Information]
Who are the source language producers?
[Needs More Information]
Annotations
Annotation process
[Needs More Information]
Who are the annotators?
[Needs More Information]
Personal and Sensitive Information
[Needs More Information]
Considerations for Using the Data
Social Impact of Dataset
[Needs More Information]
Discussion of Biases
[Needs More Information]
Other Known Limitations
[Needs More Information]
Additional Information
Dataset Curators
[Needs More Information]
Licensing Information
[Needs More Information]
Citation Information
Cite as:
@inproceedings{
thakur2021beir,
title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}
Contributions
Thanks to @Nthakur20 for adding this dataset.