Getting Started

The dataset consists of 2084 JSONL files. You can download it with the Hugging Face datasets library:

from datasets import load_dataset
ds = load_dataset("togethercomputer/RedPajama-Data-1T")

Or you can directly download the files using the following command:

wget 'https://data.together.xyz/redpajama-data-1T/v1.0.0/urls.txt'
while read -r line; do
    # Strip the base URL to get the relative path the file is saved under
    dload_loc=${line#https://data.together.xyz/redpajama-data-1T/v1.0.0/}
    mkdir -p "$(dirname "$dload_loc")"
    wget "$line" -O "$dload_loc"
done < urls.txt
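
The download loop derives each local path by stripping the base URL from the line it reads. A minimal Python sketch of that same mapping is shown here for readers who prefer to script the download themselves. The function name local_path_for and the example file name are illustrative assumptions, not part of the dataset tooling.

```python
from pathlib import PurePosixPath

BASE_URL = "https://data.together.xyz/redpajama-data-1T/v1.0.0/"


def local_path_for(url: str, base_url: str = BASE_URL) -> PurePosixPath:
    """Return the relative path a URL is saved under, mirroring ${line#prefix}."""
    # Like the shell expansion, leave the string unchanged if the prefix is absent.
    relative = url[len(base_url):] if url.startswith(base_url) else url
    return PurePosixPath(relative)


# Hypothetical file name, used only to show the mapping.
example_url = BASE_URL + "arxiv/example_file.jsonl"
example_path = local_path_for(example_url)
print(example_path.parent, example_path.name)
```

The parent of the returned path is the directory that mkdir -p creates in the shell version.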

After downloading the files, you can load the dataset from disk by setting the RED_PAJAMA_DATA_DIR environment variable to the directory containing the files:

import os
from datasets import load_dataset
os.environ["RED_PAJAMA_DATA_DIR"] = "/path/to/download"
ds = load_dataset("togethercomputer/RedPajama-Data-1T")

A smaller 1B-token sample of the dataset can be found here.

A full set of scripts to recreate the dataset from scratch can be found here.

Dataset Summary

RedPajama is a clean-room, fully open-source implementation of the LLaMA dataset.

Dataset          Token Count
Commoncrawl      878 Billion
C4               175 Billion
GitHub           59 Billion
Books            26 Billion
ArXiv            28 Billion
Wikipedia        24 Billion
StackExchange    20 Billion
Total            1.2 Trillion
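
As a quick arithmetic check, the per-source counts above can be summed to confirm that they match the stated total after rounding. The snippet below is only a sanity check on the numbers in the table. The variable names are illustrative.

```python
# Token counts in billions, copied from the table above.
token_counts_billions = {
    "Commoncrawl": 878,
    "C4": 175,
    "GitHub": 59,
    "Books": 26,
    "ArXiv": 28,
    "Wikipedia": 24,
    "StackExchange": 20,
}

total_billions = sum(token_counts_billions.values())
total_trillions = total_billions / 1000

print(f"{total_billions} billion tokens, about {total_trillions:.1f} trillion")

# Share of each source, largest first.
for name, count in sorted(token_counts_billions.items(), key=lambda kv: -kv[1]):
    print(f"{name:<14}{count / total_billions:6.1%}")
```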

Languages

Primarily English, though the Wikipedia slice contains multiple languages.

Dataset Structure

The dataset structure is as follows:

{
    "text": ...,
    "meta": {"url": "...", "timestamp": "...", "source": "...", "language": "...", ...},
    "red_pajama_subset": "common_crawl" | "c4" | "github" | "books" | "arxiv" | "wikipedia" | "stackexchange"
}
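
Since each file is JSONL, every line holds one JSON object. The sketch below shows one way to iterate over such lines and count records per subset. It assumes the fields shown above; the sample records are made up for illustration, and because raw files may not carry every field, the code falls back to "unknown" when red_pajama_subset is missing.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_subset(records: Iterable[dict]) -> Counter:
    """Count records per subset, tolerating records without the field."""
    return Counter(r.get("red_pajama_subset", "unknown") for r in records)


# Made-up sample lines standing in for the contents of a downloaded file.
sample_lines = [
    json.dumps({"text": "A paper.", "meta": {"language": "en"}, "red_pajama_subset": "arxiv"}),
    json.dumps({"text": "A page.", "meta": {"url": "..."}, "red_pajama_subset": "c4"}),
    json.dumps({"text": "Another paper.", "meta": {}, "red_pajama_subset": "arxiv"}),
    "",
]

counts = count_by_subset(iter_records(sample_lines))
print(counts)
```

To read a real file, pass an open file handle instead of sample_lines, since iterating over a text file yields its lines.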

Dataset Creation

This dataset was created to follow the LLaMA paper as closely as possible to try to reproduce its recipe.

Source Data

Commoncrawl

We download five dumps from Commoncrawl and run them through the official cc_net pipeline. We then deduplicate at the paragraph level and filter out low-quality text using a linear classifier trained to classify paragraphs as Wikipedia references or random Commoncrawl samples.
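
To illustrate what paragraph-level deduplication means, here is a minimal sketch that drops any paragraph whose normalized text has already been seen in the corpus. This is not the cc_net implementation: the normalization (lowercasing and collapsing whitespace), the use of SHA-1, and the function names are all assumptions chosen for brevity.

```python
import hashlib
from typing import Iterable, List


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())


def dedup_paragraphs(documents: Iterable[str]) -> List[str]:
    """Keep only the first occurrence of each paragraph across all documents."""
    seen = set()
    result = []
    for doc in documents:
        kept = []
        for para in doc.split("\n"):
            norm = normalize(para)
            if not norm:
                continue  # skip empty paragraphs
            key = hashlib.sha1(norm.encode("utf-8")).digest()
            if key in seen:
                continue  # duplicate of an earlier paragraph
            seen.add(key)
            kept.append(para)
        result.append("\n".join(kept))
    return result


docs = ["Hello world\nUnique one", "hello   WORLD\nUnique two"]
deduped = dedup_paragraphs(docs)
print(deduped)
```

Storing a fixed-size hash instead of the paragraph text keeps memory use bounded per paragraph, which matters at crawl scale.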

C4

C4 is downloaded from Hugging Face. The only preprocessing step is to bring the data into our own format.

GitHub

The raw GitHub data is downloaded from Google BigQuery. We deduplicate at the file level, filter out low-quality files, and keep only projects that are distributed under the MIT, BSD, or Apache license.

Wikipedia

We use the Wikipedia dataset available on Hugging Face, which is based on the Wikipedia dump from 2023-03-20 and contains text in 20 different languages. The dataset comes in a preprocessed format, so hyperlinks, comments, and other formatting boilerplate have been removed.

Gutenberg and Books3

The PG19 subset of the Gutenberg Project and the Books3 dataset are downloaded from Hugging Face. After downloading, we use simhash to remove near duplicates.
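
For readers unfamiliar with simhash, the following is a minimal sketch of the general technique: each token votes on every bit of a 64-bit fingerprint, and two texts are treated as near duplicates when their fingerprints differ in only a few bits. It is not the code used to build this dataset. The whitespace tokenization, the MD5-based token hash, and the idea of a distance threshold are assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 bytes taken from each token hash


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


book_a = "It was the best of times, it was the worst of times"
book_b = "It was the best of times, it was the worst of days"
print(hamming_distance(simhash(book_a), simhash(book_b)))
```

In practice a pair would be flagged as near duplicates when the distance falls below a chosen threshold; the right threshold depends on the data and is not stated in the source.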

ArXiv

ArXiv data is downloaded from Amazon S3 in the arXiv requester-pays bucket. We keep only LaTeX source files and remove preambles, comments, macros, and bibliographies.
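
As a rough illustration of two of these cleaning steps, the sketch below drops everything before \begin{document} and strips % comments while leaving escaped \% intact. It is a simplified stand-in, not the pipeline's actual code: macro expansion and bibliography removal are not handled, and edge cases such as a % inside a verbatim environment are ignored. The function name strip_latex is an assumption.

```python
import re

# An unescaped % starts a comment; the lookbehind skips the escaped form \%.
COMMENT_RE = re.compile(r"(?<!\\)%.*")
DOCUMENT_START = r"\begin{document}"


def strip_latex(source: str) -> str:
    """Remove the preamble and line comments from LaTeX source."""
    start = source.find(DOCUMENT_START)
    if start != -1:
        # Keep only what follows \begin{document}.
        source = source[start + len(DOCUMENT_START):]
    lines = [COMMENT_RE.sub("", line).rstrip() for line in source.splitlines()]
    return "\n".join(lines).strip()


sample = "\n".join([
    r"\documentclass{article}",
    r"% preamble comment",
    r"\begin{document}",
    r"Accuracy is 95\% % measured on the test set",
    r"Done.",
    r"\end{document}",
])
cleaned = strip_latex(sample)
print(cleaned)
```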

Stackexchange

The Stack Exchange split of the dataset is downloaded from the Internet Archive. Here we keep only the posts from the 28 largest sites, remove HTML tags, group the posts into question-answer pairs, and order answers by their score.
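
The grouping and ordering step can be sketched as follows. The post fields used here (id, type, parent_id, score, body) are hypothetical names rather than the schema of the Internet Archive dump, and sorting from highest to lowest score is an assumption, since the text only says that answers are ordered by score.

```python
from collections import defaultdict
from typing import Dict, List


def build_qa_pairs(posts: List[dict]) -> List[dict]:
    """Group answers under their question, highest-scoring answer first."""
    questions: Dict[int, dict] = {p["id"]: p for p in posts if p["type"] == "question"}
    answers = defaultdict(list)
    for post in posts:
        # Ignore answers whose question is not in the batch.
        if post["type"] == "answer" and post["parent_id"] in questions:
            answers[post["parent_id"]].append(post)
    pairs = []
    for question_id, question in questions.items():
        ranked = sorted(answers[question_id], key=lambda a: a["score"], reverse=True)
        pairs.append({
            "question": question["body"],
            "answers": [a["body"] for a in ranked],
        })
    return pairs


# Made-up posts for illustration.
posts = [
    {"id": 1, "type": "question", "body": "How?", "score": 3},
    {"id": 2, "type": "answer", "parent_id": 1, "body": "low", "score": 2},
    {"id": 3, "type": "answer", "parent_id": 1, "body": "high", "score": 10},
    {"id": 4, "type": "answer", "parent_id": 99, "body": "orphan", "score": 5},
]
pairs = build_qa_pairs(posts)
print(pairs)
```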

SHA256 Checksums

SHA256 checksums for the dataset files for each data source are available here:

https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/arxiv_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/book_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/c4_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/common_crawl_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/github_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/stackexchange_SHA256SUMS.txt
https://data.together.xyz/redpajama-data-1T/v1.0.0/sha256/wikipedia_SHA256SUMS.txt
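
A downloaded file can be checked against these lists by hashing it locally and comparing digests. The sketch below assumes the checksum files follow the usual sha256sum layout of a hex digest followed by a file name on each line; that layout is an assumption and has not been verified against the files themselves. The helper names are illustrative, and the demo uses a temporary file rather than real dataset files.

```python
import hashlib
import os
import tempfile
from typing import Dict


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha256sums(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            # sha256sum marks binary mode with a leading '*' on the name.
            expected[parts[1].strip().lstrip("*")] = parts[0].lower()
    return expected


# Demo with a temporary file standing in for a downloaded shard.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    actual_digest = sha256_of_file(demo_path)
    sums_text = f"{actual_digest}  demo.jsonl\n"
    checksum_ok = parse_sha256sums(sums_text)["demo.jsonl"] == actual_digest

print("checksum matches:", checksum_ok)
```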

To cite RedPajama, please use:

@software{together2023redpajama,
  author = {Together Computer},
  title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
  month = apr,
  year = 2023,
  url = {https://github.com/togethercomputer/RedPajama-Data}
}

License

Please refer to the licenses of the data subsets you use.
