Dataset Card for chinese_chatgpt_corpus

Dataset Summary

This repository collects Chinese corpora for supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

Supported Tasks and Leaderboards

More Information Needed

Languages

Chinese

Dataset Structure

Data Instances

train_data_external_v1.jsonl

Size of downloaded dataset files: 5.04 GB
Size of the generated dataset: 0 GB
Total amount of disk used: 5.04 GB

An example looks as follows:

{
    "prompt": "问题：有没有给未成年贷款的有的联系",
    "answers":
    [
        {
            "answer": "若通过招行办理，我行规定，贷款人年龄需年满18岁，且年龄加贷款年限不得超过70岁。如果您持有我行信用卡附属卡，可尝试办理预借现金。",
            "score": 1
        }
    ],
    "prefix": "回答："
}

dev_data_external_v1.jsonl

Size of downloaded dataset files: 9.55 MB
Size of the generated dataset: 0 MB
Total amount of disk used: 9.55 MB

An example looks as follows:

{
    "prompt": "初学纹发现1/2\"的管螺纹并不是1\"的一半。不知道其中的原因，请各位指点。",
    "answers":
    [
        {
            "answer": "管螺纹的名义尺寸是“管子”的孔（内）径，而管子的壁厚不是两倍。所以，1/2\"的管螺纹并不是1\"的一半，",
            "score": 1
        }
    ],
    "prefix": "回答："
}
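Since both files are JSON Lines and the training file is about 5 GB, reading one record at a time is usually safer than loading everything into memory. The following is a minimal sketch, assuming each line holds one JSON object shaped like the examples above; the helper name `iter_records` and the local file path in the usage comment are illustrative and not part of the dataset.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines, e.g. a trailing newline at end of file.
            continue
        yield json.loads(line)


# Usage with a local copy of the file (the path is an assumption):
# with open("train_data_external_v1.jsonl", encoding="utf-8") as f:
#     for record in iter_records(f):
#         print(record["prompt"], len(record["answers"]))
```

Because an open file object is itself an iterable of lines, the same helper works on files and on in-memory lists of strings.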

Data Fields

The data fields are the same across all splits.

train_data_external_v1.jsonl

prompt: prompt, string
answers: list of answers
- answer: answer, string
- score: score of answer, int
prefix: prefix to the answer, string

dev_data_external_v1.jsonl

prompt: prompt, string
answers: list of answers
- answer: answer, string
- score: score of answer, int
prefix: prefix to the answer, string
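To illustrate how these fields might fit together for SFT, the sketch below picks the highest-scored answer and joins it with the prompt and prefix. The concatenation order (prompt, then prefix, then answer) and the helper names `best_answer` and `to_sft_text` are assumptions made for illustration; the card does not specify how the fields are meant to be combined.

```python
from typing import Optional


def best_answer(record: dict) -> Optional[str]:
    """Return the text of the highest-scored answer, or None if there is none."""
    answers = record.get("answers") or []
    if not answers:
        return None
    # max() keeps the first answer when several share the top score.
    return max(answers, key=lambda a: a["score"])["answer"]


def to_sft_text(record: dict) -> Optional[str]:
    """Build one training string; the field order here is an assumption."""
    answer = best_answer(record)
    if answer is None:
        return None
    return record["prompt"] + record["prefix"] + answer


example = {
    "prompt": "Question: what is 1 + 1? ",
    "answers": [
        {"answer": "3", "score": 0},
        {"answer": "2", "score": 1},
    ],
    "prefix": "Answer: ",
}
print(to_sft_text(example))
```

For reward-model training in an RLHF setup, the same `score` field could instead be used to rank answers within a record, provided a record carries more than one answer.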

Data Splits

file	number of examples
train_data_external_v1.jsonl	5477982
dev_data_external_v1.jsonl	10000

Dataset Creation

Curation Rationale

Link to GitHub: data_prepare

Source Data

Initial Data Collection and Normalization

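Encyclopedia entries (百科)
Zhidao question answering (知道问答)
Couplets (对联)
Classical Chinese texts (古文)
Classical poetry (古诗词)
Weibo news comments (微博新闻评论)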

Who are the source language producers?

More Information Needed

Annotations

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed
