You need to agree to share your contact information to access this dataset

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Disclaimers and Terms

  • This dataset contains conversations that may be considered unsafe, offensive, or upsetting. It is not intended for training dialogue agents without applying appropriate filtering measures. We are not responsible for any outputs of the models trained on this dataset.
  • Statements or opinions made in this dataset do not reflect the views of researchers or institutions involved in the data collection effort.
  • Users of this data are responsible for ensuring its appropriate use, which includes abiding by any applicable laws and regulations.
  • Users of this data should adhere to the terms of use for a specific model when using its direct outputs.
  • Users of this data agree to not attempt to determine the identity of individuals in this dataset.

Log in or Sign Up to review the conditions and access this dataset content.

Chatbot Arena Conversations Dataset

This dataset contains 33K cleaned conversations with pairwise human preferences. It is collected from 13K unique IP addresses on the Chatbot Arena from April to June 2023. Each sample includes a question ID, two model names, their full conversation text in OpenAI API JSON format, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, the additional toxic tag, and the timestamp.

To ensure the safe release of data, we have made our best efforts to remove all conversations that contain personally identifiable information (PII). User consent is obtained through the "Terms of use" section on the data collection website. In addition, we have included the OpenAI moderation API output to flag inappropriate conversations. However, we have chosen to keep unsafe conversations intact so that researchers can study the safety-related questions associated with LLM usage in real-world scenarios as well as the OpenAI moderation process. As an example, we included additional toxic tags that are generated by our own toxic tagger, which are trained by fine-tuning T5 and RoBERTa on manually labeled data.

Basic Statistics

Key Value
# Conversations 33,000
# Models 20
# Users 13,383
# Languages 96
Avg. # Turns per Sample 1.2
Avg. # Tokens per Prompt 52.3
Avg. # Tokens per Response 189.5

Uniqueness and Potential Usage

Compared to existing human preference datasets like Anthropic/hh-rlhf, and OpenAssistant/oasst1. This dataset

  • Contains the outputs of 20 LLMs including stronger LLMs such as GPT-4 and Claude-v1. It also contains many failure cases of these state-of-the-art models.
  • Contains unrestricted conversations from over 13K users in the wild.

We believe it will help the AI research community answer important questions around topics like:

  • Characteristics and distributions of real-world user prompts
  • Training instruction-following models
  • Improve and evaluate LLM evaluation methods
  • Model selection and request dispatching algorithms
  • AI safety and content moderation

Disclaimers and Terms

  • This dataset contains conversations that may be considered unsafe, offensive, or upsetting. It is not intended for training dialogue agents without applying appropriate filtering measures. We are not responsible for any outputs of the models trained on this dataset.
  • Statements or opinions made in this dataset do not reflect the views of researchers or institutions involved in the data collection effort.
  • Users of this data are responsible for ensuring its appropriate use, which includes abiding by any applicable laws and regulations.
  • Users of this data should adhere to the terms of use for a specific model when using its direct outputs.
  • Users of this data agree to not attempt to determine the identity of individuals in this dataset.

Visualization and Elo Rating Calculation

This Colab notebook provides some visualizations and shows how to compute Elo ratings with the dataset.

License

The user prompts are licensed under CC-BY-4.0, while the model outputs are licensed under CC-BY-NC-4.0.

Citation

@misc{zheng2023judging,
      title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena}, 
      author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric. P Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
      year={2023},
      eprint={2306.05685},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Downloads last month
20,400

Models trained or fine-tuned on lmsys/chatbot_arena_conversations

Space using lmsys/chatbot_arena_conversations 1