Sub-tasks: extractive-qa
Languages: Chinese
Multilinguality: monolingual
Size Categories: 10K<n<100K
Language Creators: crowdsourced
Annotations Creators: crowdsourced
Source Datasets: original

Dataset Card for "cmrc2018"

Dataset Summary

CMRC 2018 is a span-extraction dataset for Chinese machine reading comprehension, created to add language diversity to this area. The dataset comprises nearly 20,000 real questions annotated on Wikipedia paragraphs by human experts. We also annotated a challenge set containing questions that require comprehensive understanding and multi-sentence inference across the context.

Supported Tasks and Leaderboards

The dataset supports extractive question answering (sub-task: extractive-qa). More information is needed on leaderboards.

Languages

The text in the dataset is in Chinese (monolingual).

Dataset Structure

Data Instances

default

  • Size of downloaded dataset files: 11.50 MB
  • Size of the generated dataset: 22.31 MB
  • Total amount of disk used: 33.83 MB

An example from the 'validation' split looks as follows.

This example was too long and was cropped:

{
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"]
    },
    "context": "\"《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。本作以三大故事为主轴,分别是以武田信玄等人为主的《关东三国志》,织田信长等人为主的《战国三杰》,石田三成等人为主的《关原的年轻武者》,丰富游戏内的剧情。此部份专门介绍角色,欲知武...",
    "id": "DEV_0_QUERY_0",
    "question": "《战国无双3》是由哪两个公司合作开发的?"
}
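
The `answer_start` values appear to be character offsets into `context`, following the usual SQuAD-style convention; the card does not state this, so treat it as an assumption. A minimal sketch of a consistency check, using a shortened copy of the cropped context above with the leading escaped quote dropped (the helper name `span_matches` is hypothetical):

```python
def span_matches(context: str, text: str, start: int) -> bool:
    """Return True if `text` occurs in `context` exactly at character offset `start`."""
    return context[start:start + len(text)] == text


# Shortened copy of the cropped example; the leading escaped quote is omitted (assumption).
example = {
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "answers": {
        "answer_start": [11, 11],
        "text": ["光荣和ω-force", "光荣和ω-force"],
    },
}

answers = example["answers"]
# Check every (text, start) pair; the two lists are assumed to be parallel.
checks = [
    span_matches(example["context"], text, start)
    for text, start in zip(answers["text"], answers["answer_start"])
]
print(checks)
```

A check like this can help catch offset drift after any preprocessing that changes the context string, such as stripping or normalizing characters.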

Data Fields

The data fields are the same across all splits.

default

  • id: a string feature.
  • context: a string feature.
  • question: a string feature.
  • answers: a dictionary feature containing:
    • text: a string feature.
    • answer_start: an int32 feature.
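
The field layout above can be mirrored in a small validator. This is only a sketch under assumptions: the helper name `validate_record` is hypothetical, and `text` and `answer_start` are treated as parallel lists because the example instance shows them that way.

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record; an empty list means none were found."""
    problems = []
    # The three top-level text fields are all plain strings.
    for key in ("id", "context", "question"):
        if not isinstance(record.get(key), str):
            problems.append(f"{key} should be a string")
    answers = record.get("answers")
    if not isinstance(answers, dict):
        problems.append("answers should be a dictionary")
        return problems
    texts = answers.get("text")
    starts = answers.get("answer_start")
    if not (isinstance(texts, list) and all(isinstance(t, str) for t in texts)):
        problems.append("answers.text should be a list of strings")
    if not (isinstance(starts, list) and all(isinstance(s, int) for s in starts)):
        problems.append("answers.answer_start should be a list of integers")
    # Assumed invariant: one start offset per answer text.
    if isinstance(texts, list) and isinstance(starts, list) and len(texts) != len(starts):
        problems.append("answers.text and answers.answer_start differ in length")
    return problems


record = {
    "id": "DEV_0_QUERY_0",
    "context": "《战国无双3》()是由光荣和ω-force开发的战国无双系列的正统第三续作。",
    "question": "《战国无双3》是由哪两个公司合作开发的?",
    "answers": {"answer_start": [11, 11], "text": ["光荣和ω-force", "光荣和ω-force"]},
}
print(validate_record(record))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every issue in a record at once.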

Data Splits

name      train   validation   test
default   10142   3219         1002

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

More Information Needed

Citation Information

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}

Contributions

Thanks to @patrickvonplaten, @mariamabarham, @lewtun, @thomwolf for adding this dataset.
