Dataset Viewer
Viewer
The dataset viewer is not available for this split.
Cannot load the dataset split (in streaming mode) to extract the first rows.
Error code:   StreamingRowsError
Exception:    FileNotFoundError
Message:      https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 417, in _info
                  await _file_info(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 837, in _file_info
                  r.raise_for_status()
                File "/src/services/worker/.venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
                  raise ClientResponseError(
              aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz')
              
              The above exception was the direct cause of the following exception:
              
              Traceback (most recent call last):
                File "/src/services/worker/src/worker/utils.py", line 257, in get_rows_or_raise
                  return get_rows(
                File "/src/services/worker/src/worker/utils.py", line 198, in decorator
                  return func(*args, **kwargs)
                File "/src/services/worker/src/worker/utils.py", line 235, in get_rows
                  rows_plus_one = list(itertools.islice(ds, rows_max_number + 1))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1379, in __iter__
                  for key, example in ex_iterable:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 233, in __iter__
                  yield from self.generate_examples_fn(**self.kwargs)
                File "/tmp/modules-cache/datasets_modules/datasets/bookcorpusopen/98559c92eb612e150a676c5b5131f9f8f07d4cab88e7f3761fda266ad22ff2a7/bookcorpusopen.py", line 102, in _generate_examples
                  for book_file_path, f in book_files:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 840, in __iter__
                  yield from self.generator(*self.args, **self.kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 891, in _iter_from_urlpath
                  with xopen(urlpath, "rb", download_config=download_config) as f:
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen
                  file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open()
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 134, in open
                  return self.__enter__()
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 102, in __enter__
                  f = self.fs.open(self.path, mode=mode)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 1199, in open
                  f = self._open(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 356, in _open
                  size = size or self.info(path, **kwargs)["size"]
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 115, in wrapper
                  return sync(self.loop, func, *args, **kwargs)
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 100, in sync
                  raise return_result
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 55, in _runner
                  result[0] = await coro
                File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 430, in _info
                  raise FileNotFoundError(url) from exc
              FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz

Need help to make the dataset viewer work? Open a discussion for direct support.

Dataset Card for BookCorpusOpen

Dataset Summary

Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This version of bookcorpus has 17868 dataset items (books). Each item contains two fields: title and text. The title is the name of the book (just the file name) while text contains unprocessed book text. The bookcorpus has been prepared by Shawn Presser and is generously hosted by The-Eye. The-Eye is a non-profit, community driven platform dedicated to the archiving and long-term preservation of any and all data including but by no means limited to... websites, books, games, software, video, audio, other digital-obscura and ideas.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

plain_text

  • Size of downloaded dataset files: 2.40 GB
  • Size of the generated dataset: 6.64 GB
  • Total amount of disk used: 9.05 GB

An example of 'train' looks as follows.

This example was too long and was cropped:

{
    "text": "\"\\n\\nzONE\\n\\n## The end and the beginning\\n\\nby\\n\\nPhilip F. Blood\\n\\nSMASHWORDS EDITION\\n\\nVersion 3.55\\n\\nPUBLISHED BY:\\n\\nPhi...",
    "title": "zone-the-end-and-the-beginning.epub.txt"
}

Data Fields

The data fields are the same among all splits.

plain_text

  • title: a string feature.
  • text: a string feature.

Data Splits

name train
plain_text 17868

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The books have been crawled from smashwords.com, see their terms of service for more information.

A data sheet for this dataset has also been created and published in Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus

Citation Information

@InProceedings{Zhu_2015_ICCV,
    title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
    author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
    booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
    month = {December},
    year = {2015}
}

Contributions

Thanks to @vblagoje for adding this dataset.

Downloads last month
543

Models trained or fine-tuned on bookcorpusopen