Dolma

Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It is openly released under AI2’s ImpACT license as a medium risk artifact.

Summary Statistics

| Source | Type | Gzip files (GB) | Documents (millions) | GPT-NeoX Tokens (billions) |
|---|---|---|---|---|
| CommonCrawl | web | 4,197 | 4,600 | 2,415 |
| C4 | web | 302 | 364 | 175 |
| peS2o | academic | 150 | 38.8 | 57 |
| The Stack | code | 675 | 236 | 430 |
| Project Gutenberg | books | 6.6 | 0.052 | 4.8 |
| Wikipedia | encyclopedic | 5.8 | 6.1 | 3.6 |
| Total | | 5,334 | 5,245 | 3,084 |
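
To make the mix of sources easier to read, the token counts from the table above can be turned into percentage shares. The following is a minimal sketch, not part of the Dolma tooling: the dictionary `tokens_billions` and the helper `token_shares` are hypothetical names introduced here for illustration, and the numbers are copied from the GPT-NeoX token column. Note that the individual rows sum to roughly 3,085 billion, slightly more than the listed total of 3,084, presumably because of rounding in the published figures, so the shares below are computed against the sum of the rows.

```python
# Token counts in billions, copied from the "GPT-NeoX Tokens" column of the
# summary table. The variable and function names are illustrative only.
tokens_billions = {
    "CommonCrawl": 2415,
    "C4": 175,
    "peS2o": 57,
    "The Stack": 430,
    "Project Gutenberg": 4.8,
    "Wikipedia": 3.6,
}


def token_shares(counts):
    """Return each source's percentage share of the summed token count."""
    total = sum(counts.values())
    return {name: 100 * value / total for name, value in counts.items()}


shares = token_shares(tokens_billions)

# Print sources from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {pct:5.1f}%")
```

By this calculation CommonCrawl accounts for a little over three quarters of all tokens, with The Stack the next largest source.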