Datasets:
The dataset viewer is not available for this split.
Error code: StreamingRowsError Exception: FileNotFoundError Message: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz Traceback: Traceback (most recent call last): File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 417, in _info await _file_info( File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 837, in _file_info r.raise_for_status() File "/src/services/worker/.venv/lib/python3.9/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status raise ClientResponseError( aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz') The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/src/services/worker/src/worker/utils.py", line 257, in get_rows_or_raise return get_rows( File "/src/services/worker/src/worker/utils.py", line 198, in decorator return func(*args, **kwargs) File "/src/services/worker/src/worker/utils.py", line 235, in get_rows rows_plus_one = list(itertools.islice(ds, rows_max_number + 1)) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 1379, in __iter__ for key, example in ex_iterable: File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/iterable_dataset.py", line 233, in __iter__ yield from self.generate_examples_fn(**self.kwargs) File "/tmp/modules-cache/datasets_modules/datasets/bookcorpusopen/98559c92eb612e150a676c5b5131f9f8f07d4cab88e7f3761fda266ad22ff2a7/bookcorpusopen.py", line 102, in _generate_examples for book_file_path, f in book_files: File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 840, in __iter__ yield from self.generator(*self.args, **self.kwargs) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 891, in _iter_from_urlpath with xopen(urlpath, "rb", download_config=download_config) as f: File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 496, in xopen file_obj = fsspec.open(file, mode=mode, *args, **kwargs).open() File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 134, in open return self.__enter__() File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/core.py", line 102, in __enter__ f = self.fs.open(self.path, mode=mode) File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/spec.py", line 1199, in open f = self._open( File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 356, in _open size = size or self.info(path, **kwargs)["size"] File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 115, in wrapper return sync(self.loop, func, *args, **kwargs) File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 100, in sync raise return_result File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 55, in _runner result[0] = await coro File "/src/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 430, in _info raise FileNotFoundError(url) from exc FileNotFoundError: https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz
Need help to make the dataset viewer work? Open a discussion for direct support.
Dataset Card for BookCorpusOpen
Dataset Summary
Books are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This version of bookcorpus has 17868 dataset items (books). Each item contains two fields: title and text. The title is the name of the book (just the file name) while text contains unprocessed book text. The bookcorpus has been prepared by Shawn Presser and is generously hosted by The-Eye. The-Eye is a non-profit, community driven platform dedicated to the archiving and long-term preservation of any and all data including but by no means limited to... websites, books, games, software, video, audio, other digital-obscura and ideas.
Supported Tasks and Leaderboards
Languages
Dataset Structure
Data Instances
plain_text
- Size of downloaded dataset files: 2.40 GB
- Size of the generated dataset: 6.64 GB
- Total amount of disk used: 9.05 GB
An example of 'train' looks as follows.
This example was too long and was cropped:
{
"text": "\"\\n\\nzONE\\n\\n## The end and the beginning\\n\\nby\\n\\nPhilip F. Blood\\n\\nSMASHWORDS EDITION\\n\\nVersion 3.55\\n\\nPUBLISHED BY:\\n\\nPhi...",
"title": "zone-the-end-and-the-beginning.epub.txt"
}
Data Fields
The data fields are the same among all splits.
plain_text
title
: astring
feature.text
: astring
feature.
Data Splits
name | train |
---|---|
plain_text | 17868 |
Dataset Creation
Curation Rationale
Source Data
Initial Data Collection and Normalization
Who are the source language producers?
Annotations
Annotation process
Who are the annotators?
Personal and Sensitive Information
Considerations for Using the Data
Social Impact of Dataset
Discussion of Biases
Other Known Limitations
Additional Information
Dataset Curators
Licensing Information
The books have been crawled from smashwords.com, see their terms of service for more information.
A data sheet for this dataset has also been created and published in Addressing "Documentation Debt" in Machine Learning Research: A Retrospective Datasheet for BookCorpus
Citation Information
@InProceedings{Zhu_2015_ICCV,
title = {Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books},
author = {Zhu, Yukun and Kiros, Ryan and Zemel, Rich and Salakhutdinov, Ruslan and Urtasun, Raquel and Torralba, Antonio and Fidler, Sanja},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {December},
year = {2015}
}
Contributions
Thanks to @vblagoje for adding this dataset.
- Downloads last month
- 543