Dataset Viewer
Viewer
The dataset viewer is not available for this split.
Cannot load the dataset split (in streaming mode) to extract the first rows.
Error code:   StreamingRowsError
Exception:    NonStreamableDatasetError
Message:      Streaming is not possible for this dataset because data host server doesn't support HTTP range requests. You can still load this dataset in non-streaming mode by passing `streaming=False` (default)
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
                  file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 439, in open
                  return open_files(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 282, in open_files
                  fs, fs_token, paths = get_fs_token_paths(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 606, in get_fs_token_paths
                  fs = filesystem(protocol, **inkwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/registry.py", line 261, in filesystem
                  return cls(**storage_options)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 76, in __call__
                  obj = super().__call__(*args, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/zip.py", line 59, in __init__
                  self.zip = zipfile.ZipFile(
                File "/usr/local/lib/python3.9/zipfile.py", line 1266, in __init__
                  self._RealGetContents()
                File "/usr/local/lib/python3.9/zipfile.py", line 1329, in _RealGetContents
                  endrec = _EndRecData(fp)
                File "/usr/local/lib/python3.9/zipfile.py", line 263, in _EndRecData
                  fpin.seek(0, 2)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 747, in seek
                  raise ValueError("Cannot seek streaming HTTP file")
              ValueError: Cannot seek streaming HTTP file
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/utils.py", line 257, in get_rows_or_raise
                  return get_rows(
                File "/src/services/worker/src/worker/utils.py", line 198, in decorator
                  return func(*args, **kwargs)
                File "/src/services/worker/src/worker/utils.py", line 235, in get_rows
                  rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1379, in __iter__
                  for key, example in ex_iterable:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
                  yield from self.generate_examples_fn(**self.kwargs)
                File "/tmp/modules-cache/datasets_modules/datasets/narrativeqa/daef7ccc51ec258bef464658d11751bb20f033da9b4c219fd84563b3a4af0422/narrativeqa.py", line 112, in _generate_examples
                  with open(os.path.join(repo_dir, "documents.csv"), encoding="utf-8") as f:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/streaming.py", line 74, in wrapper
                  return function(*args, download_config=download_config, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 499, in xopen
                  raise NonStreamableDatasetError(
              datasets.download.streaming_download_manager.NonStreamableDatasetError: Streaming is not possible for this dataset because data host server doesn't support HTTP range requests. You can still load this dataset in non-streaming mode by passing `streaming=False` (default)

Need help to make the dataset viewer work? Open a discussion for direct support.

Dataset Card for Narrative QA

Dataset Summary

NarrativeQA is an English-lanaguage dataset of stories and corresponding questions designed to test reading comprehension, especially on long documents.

Supported Tasks and Leaderboards

The dataset is used to test reading comprehension. There are 2 tasks proposed in the paper: "summaries only" and "stories only", depending on whether the human-generated summary or the full story text is used to answer the question.

Languages

English

Dataset Structure

Data Instances

A typical data point consists of a question and answer pair along with a summary/story which can be used to answer the question. Additional information such as the url, word count, wikipedia page, are also provided.

A typical example looks like this:

{
    "document": {
        "id": "23jncj2n3534563110",
        "kind": "movie",
        "url": "https://www.imsdb.com/Movie%20Scripts/Name%20of%20Movie.html",
        "file_size": 80473,
        "word_count": 41000,
        "start": "MOVIE screenplay by",
        "end": ". THE END",
        "summary": {
            "text": "Joe Bloggs begins his journey exploring...",
            "tokens": ["Joe", "Bloggs", "begins", "his", "journey", "exploring",...],
            "url": "http://en.wikipedia.org/wiki/Name_of_Movie",
            "title": "Name of Movie (film)"
        },
        "text": "MOVIE screenplay by John Doe\nSCENE 1..."
    },
    "question": {
        "text": "Where does Joe Bloggs live?",
        "tokens": ["Where", "does", "Joe", "Bloggs", "live", "?"],
    },
    "answers": [
        {"text": "At home", "tokens": ["At", "home"]},
        {"text": "His house", "tokens": ["His", "house"]}
    ]
}

Data Fields

  • document.id - Unique ID for the story.
  • document.kind - "movie" or "gutenberg" depending on the source of the story.
  • document.url - The URL where the story was downloaded from.
  • document.file_size - File size (in bytes) of the story.
  • document.word_count - Number of tokens in the story.
  • document.start - First 3 tokens of the story. Used for verifying the story hasn't been modified.
  • document.end - Last 3 tokens of the story. Used for verifying the story hasn't been modified.
  • document.summary.text - Text of the wikipedia summary of the story.
  • document.summary.tokens - Tokenized version of document.summary.text.
  • document.summary.url - Wikipedia URL of the summary.
  • document.summary.title - Wikipedia Title of the summary.
  • question - {"text":"...", "tokens":[...]} for the question about the story.
  • answers - List of {"text":"...", "tokens":[...]} for valid answers for the question.

Data Splits

The data is split into training, valiudation, and test sets based on story (i.e. the same story cannot appear in more than one split):

Train Valid Test
32747 3461 10557

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Stories and movies scripts were downloaded from Project Gutenburg and a range of movie script repositories (mainly imsdb).

Who are the source language producers?

The language producers are authors of the stories and scripts as well as Amazon Turk workers for the questions.

Annotations

Annotation process

Amazon Turk Workers were provided with human written summaries of the stories (To make the annotation tractable and to lead annotators towards asking non-localized questions). Stories were matched with plot summaries from Wikipedia using titles and verified the matching with help from human annotators. The annotators were asked to determine if both the story and the summary refer to a movie or a book (as some books are made into movies), or if they are the same part in a series produced in the same year. Annotators on Amazon Mechanical Turk were instructed to write 10 question–answer pairs each based solely on a given summary. Annotators were instructed to imagine that they are writing questions to test students who have read the full stories but not the summaries. We required questions that are specific enough, given the length and complexity of the narratives, and to provide adiverse set of questions about characters, events, why this happened, and so on. Annotators were encouraged to use their own words and we prevented them from copying. We asked for answers that are grammatical, complete sentences, and explicitly allowed short answers (one word, or a few-word phrase, or ashort sentence) as we think that answering with a full sentence is frequently perceived as artificial when asking about factual information. Annotators were asked to avoid extra, unnecessary information in the question or the answer, and to avoid yes/no questions or questions about the author or the actors.

Who are the annotators?

Amazon Mechanical Turk workers.

Personal and Sensitive Information

None

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

The dataset is released under a Apache-2.0 License.

Citation Information

@article{narrativeqa,
author = {Tom\'a\v s Ko\v cisk\'y and Jonathan Schwarz and Phil Blunsom and
          Chris Dyer and Karl Moritz Hermann and G\'abor Melis and
          Edward Grefenstette},
title = {The {NarrativeQA} Reading Comprehension Challenge},
journal = {Transactions of the Association for Computational Linguistics},
url = {https://TBD},
volume = {TBD},
year = {2018},
pages = {TBD},
}

Contributions

Thanks to @ghomasHudson for adding this dataset.

Downloads last month
7,848

Models trained or fine-tuned on narrativeqa