The Dataset Viewer has been disabled on this dataset.

Dataset Card: PIPPA-ShareGPT

This is a conversion of PygmalionAI's PIPPA deduped dataset to ShareGPT format for finetuning with Axolotl.

The reformat was completed via the following TypeScript project called ShareGPT-Reformat.

Files and explanations

  • pippa_sharegpt_raw.jsonl: The raw deduped dataset file converted to shareGPT. Roles will be defaulted to your finetuning software.
  • pippa_sharegpt.jsonl: A shareGPT dataset with the roles as USER: and CHARACTER: for finetuning with axolotl
  • pippa_sharegpt_trimmed.jsonl: A shareGPT dataset that has trimmed newlines, randomized system prompts, removes empty messages, and removes examples without a character description. Roles are USER and CHARACTER.

The best file to use is pippa_sharegpt_trimmed.jsonl if you want a finetune without bugs or inconsistencies. The best dataset to modify is either the original PIPPA deduped dataset with the ShareGPT reformat project or pippa_sharegpt.jsonl.

Required Axolotl patches

To make this dataset usable in its entirety, some axolotl patches are needed:

  • This patch allows the ability to use custom system prompts with ShareGPT format.
  • This patch allows for custom roles for the USER and ASSISTANT and allows for GPT prompts to come before human ones without cutoff.

You WILL experience unideal results with base axolotl at the time of publishing this README.

Citations

Paper for the original dataset:

@misc{gosling2023pippa,
      title={PIPPA: A Partially Synthetic Conversational Dataset}, 
      author={Tear Gosling and Alpin Dale and Yinhe Zheng},
      year={2023},
      eprint={2308.05884},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Downloads last month
14