Dataset Card for chinese_chatgpt_corpus

Dataset Summary

This repository collects Chinese corpora for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Supported Tasks and Leaderboards

More Information Needed

Languages

Chinese

Dataset Structure

Data Instances

train_data_external_v1.jsonl

  • Size of downloaded dataset files: 5.04 GB
  • Size of the generated dataset: 0 GB
  • Total amount of disk used: 5.04 GB

An example looks as follows:

{
    "prompt": "问题:有没有给未成年贷款的有的联系",
    "answers":
    [
        {
            "answer": "若通过招行办理,我行规定,贷款人年龄需年满18岁,且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡,可尝试办理预借现金。",
            "score": 1
        }
    ],
    "prefix": "回答:"
}

dev_data_external_v1.jsonl

  • Size of downloaded dataset files: 9.55 MB
  • Size of the generated dataset: 0 MB
  • Total amount of disk used: 9.55 MB

An example looks as follows:

{
    "prompt": "初学纹发现1/2\"的管螺纹并不是1\"的一半。不知道其中的原因,请各位指点。",
    "answers":
    [
        {
            "answer": "管螺纹的名义尺寸是“管子”的孔(内)径,而管子的壁厚不是两倍。所以,1/2\"的管螺纹并不是1\"的一半,",
            "score": 1
        }
    ],
    "prefix": "回答:"
}
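The card does not say how the three fields are meant to be combined for SFT. The sketch below shows one plausible reading, assuming that the training text is the prompt, then the prefix, then an answer, and that a higher score marks a better answer. The function name to_sft_text and the sample record are hypothetical and are not taken from the dataset.

```python
def to_sft_text(record: dict) -> str:
    """Join prompt, prefix and the highest-scored answer into one string.

    Assumption (not confirmed by the card): a higher "score" means a
    better answer, and the fields are concatenated without a separator.
    """
    best = max(record["answers"], key=lambda a: a["score"])
    return record["prompt"] + record["prefix"] + best["answer"]


# Hypothetical record that follows the schema shown in the examples.
record = {
    "prompt": "问题:示例问题",
    "answers": [
        {"answer": "较差的回答", "score": 0},
        {"answer": "较好的回答", "score": 1},
    ],
    "prefix": "回答:",
}

print(to_sft_text(record))
```

For reward-model training, the same record could instead yield (better, worse) answer pairs ordered by score; that use is also an assumption rather than something the card documents.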

Data Fields

The data fields are the same among all splits.

train_data_external_v1.jsonl

  • prompt: prompt, string
  • answers: list of answers
    • answer: answer, string
    • score: score of answer, int
  • prefix: prefix to the answer, string

dev_data_external_v1.jsonl

  • prompt: prompt, string
  • answers: list of answers
    • answer: answer, string
    • score: score of answer, int
  • prefix: prefix to the answer, string
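Because both files are JSON Lines (one JSON object per line), the fields listed above can be checked while streaming a file. The following is a minimal sketch using only the Python standard library; the helper names iter_records and check_record are hypothetical, and the in-memory sample stands in for a real file.

```python
import io
import json
from typing import Iterator, TextIO


def check_record(record: object, line_no: int) -> None:
    """Raise ValueError if a record does not match the documented fields."""
    if not isinstance(record, dict):
        raise ValueError(f"line {line_no}: expected a JSON object")
    for key in ("prompt", "prefix"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"line {line_no}: '{key}' must be a string")
    answers = record.get("answers")
    if not isinstance(answers, list):
        raise ValueError(f"line {line_no}: 'answers' must be a list")
    for item in answers:
        if not isinstance(item, dict):
            raise ValueError(f"line {line_no}: each answer must be an object")
        if not isinstance(item.get("answer"), str):
            raise ValueError(f"line {line_no}: 'answer' must be a string")
        if not isinstance(item.get("score"), int):
            raise ValueError(f"line {line_no}: 'score' must be an int")


def iter_records(fp: TextIO) -> Iterator[dict]:
    """Yield validated records from a JSON Lines stream, skipping blank lines."""
    for line_no, line in enumerate(fp, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        check_record(record, line_no)
        yield record


# In-memory stand-in for a file; the content is hypothetical.
sample = io.StringIO(
    '{"prompt": "p", "answers": [{"answer": "a", "score": 1}], "prefix": "x"}\n'
)
records = list(iter_records(sample))
print(len(records))
```

For a downloaded file, the same generator can be fed with open("train_data_external_v1.jsonl", encoding="utf-8"). Streaming line by line avoids holding the whole train file in memory.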

Data Splits

file                            examples
train_data_external_v1.jsonl    5477982
dev_data_external_v1.jsonl      10000

Dataset Creation

Curation Rationale

Link to GitHub: data_prepare

Source Data

Initial Data Collection and Normalization

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed
