You need to agree to share your contact information to access this dataset

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Dolma

Name: dolma
Creator: Allen Institute for AI
License: https://choosealicense.com/licenses/other/

Dolma is a dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It is openly released under AI2’s ImpACT license as a medium-risk artifact.

More information:

Read the Dolma announcement blog post on Medium;
Learn more about Dolma on its Data Sheet;
Review Dolma's ImpACT license for medium-risk artifacts;
Explore the open-source tools we created to curate Dolma.
Want to request removal of personal data? Use this form to notify us of documents containing PII about a specific user.

Summary Statistics

Source	Type	Gzip files (GB)	Documents (millions)	GPT-NeoX Tokens (billions)
CommonCrawl	web	4,197	4,600	2,415
C4	web	302	364	175
peS2o	academic	150	38.8	57
The Stack	code	675	236	430
Project Gutenberg	books	6.6	0.052	4.8
Wikipedia	encyclopedic	5.8	6.1	3.6
	Total	5,334	5,245	3,084
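
To make the mix easier to read, the token column above can be turned into per-source shares. The snippet below is a minimal sketch, not part of the Dolma tooling: the dictionary simply copies the GPT-NeoX token counts from the table, and the `token_shares` helper is a hypothetical name introduced for illustration. Shares are computed against the sum of the listed rows, which comes to 3,085.4 billion and so differs slightly from the printed total of 3,084 (presumably because of rounding in the table).

```python
# Token counts in billions of GPT-NeoX tokens, copied from the
# Summary Statistics table (not read from the dataset itself).
TOKENS_BILLIONS = {
    "CommonCrawl": 2415.0,
    "C4": 175.0,
    "peS2o": 57.0,
    "The Stack": 430.0,
    "Project Gutenberg": 4.8,
    "Wikipedia": 3.6,
}


def token_shares(tokens):
    """Return each source's share of the summed token count, in percent.

    Hypothetical helper for illustration; the denominator is the sum of
    the rows passed in, not the total printed in the table.
    """
    total = sum(tokens.values())
    return {source: 100.0 * count / total for source, count in tokens.items()}


if __name__ == "__main__":
    shares = token_shares(TOKENS_BILLIONS)
    # Print sources from largest to smallest share.
    for source, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{source:<18} {share:6.2f}%")
```

Run as a script, this prints one line per source, ordered from the largest share to the smallest; web data (CommonCrawl plus C4) accounts for the large majority of tokens.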
