Visual Novel Dataset
This dataset contains parsed Visual Novel scripts for training language models. The dataset consists of approximately 60 million tokens of parsed scripts.
Dataset Structure
The dataset follows a general structure for visual novel scripts:
Dialogue lines: Dialogue lines are formatted with the speaker's name followed by a colon, and the dialogue itself enclosed in quotes. For example:
John: "Hello, how are you?"
Actions and narration: These describe character movements, background settings, or other narrative elements. They are often enclosed in asterisks, though not all visual novels follow this convention. For example:
*John looked around the room, searching for answers.*
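The two conventions above can be told apart with a pair of regular expressions. The following is only a minimal sketch: the patterns and the `classify_line` helper are illustrative assumptions derived from the two example lines, not part of the dataset tooling, and scripts that do not use asterisks will fall through to the "other" case.

```python
import re

# Assumed patterns, based only on the two example lines above.
# Real scripts may deviate (e.g. narration without asterisks).
DIALOGUE_RE = re.compile(r'^(?P<speaker>[^:"]+):\s*"(?P<text>.*)"\s*$')
NARRATION_RE = re.compile(r'^\*(?P<text>.+)\*\s*$')


def classify_line(line):
    """Return a (kind, speaker, text) tuple for a single script line."""
    line = line.strip()
    match = DIALOGUE_RE.match(line)
    if match:
        return ("dialogue", match.group("speaker").strip(), match.group("text"))
    match = NARRATION_RE.match(line)
    if match:
        return ("narration", None, match.group("text"))
    # Anything else is left unclassified rather than guessed at.
    return ("other", None, line)
```

For instance, `classify_line('John: "Hello, how are you?"')` yields a dialogue tuple with `John` as the speaker, while a line wrapped in asterisks is reported as narration.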
Contents
- visual-novels.txt: This file contains all the parsed VNs concatenated within a single plaintext file. Each entry is separated with this string: [ - title - {visual-novel-title-1.txt} ]
- VNDB/: This directory contains .json files that contain VNDB IDs for the corresponding VN's characters. Does not include unparsed VNs.
- Archives/visual-novels-parsed.tar.zst: This archive contains the parsed VNs but with each script in a separate text file (i.e. not concatenated).
- Archives/visual-novels-unparsed.tar.zst: This archive contains all the unparsed VNs along with the original script for the currently parsed VNs.
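Because visual-novels.txt is a single concatenated file, a reader will usually want to split it back into per-title entries using the separator string. The sketch below is one possible approach: the `split_entries` helper is hypothetical, and it assumes each separator sits on its own line and that the curly braces around the file name may or may not be literal. Check both assumptions against the actual file before relying on it.

```python
import re

# Assumed separator shape: "[ - title - NAME ]" on its own line, where NAME
# may or may not be wrapped in curly braces. Verify against the real file.
SEPARATOR_RE = re.compile(
    r'^\[ - title - \{?(?P<title>.+?)\}? \]\s*$', re.MULTILINE
)


def split_entries(text):
    """Split the concatenated text into a list of (title, body) pairs.

    Any text before the first separator is ignored.
    """
    matches = list(SEPARATOR_RE.finditer(text))
    entries = []
    for i, match in enumerate(matches):
        # The body runs from the end of this separator to the start of the next.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append((match.group("title"), text[match.end():end].strip()))
    return entries
```

This version reads the whole file into memory for simplicity; for a file of this size, a line-by-line variant that starts a new entry whenever a line matches the separator may be preferable.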
Usage
You can use this dataset to train language models for text generation. The parsed visual novel scripts let you train models to understand dialogue structures and generate coherent responses, and the unparsed scripts are included for further analysis and processing.
Contribution
This dataset was gathered and parsed by the PygmalionAI Data Processing Team. Listed below are the team members, sorted by contribution amount:
- Suikamelon: HuggingFace (2,787,704 ++ 672,473 --)
- Alpin: HuggingFace - GitHub (1,170,985 ++ 345,120 --)
- Spartan: GitHub (901,046 ++ 467,915 --)
- Unlucky-AI: GitHub (253,316 ++ 256 --)
Citation
If you use this dataset in your research or projects, please cite it appropriately.
Acknowledgements
This dataset is compiled and shared for research and educational purposes. The dataset includes parsed visual novel scripts from various sources, which are predominantly copyrighted and owned by their respective publishers and creators. The inclusion of these scripts in this dataset does not imply any endorsement or authorization from the copyright holders.
We would like to express our sincere gratitude to the original copyright holders and creators of the visual novels for their valuable contributions to the art and storytelling. We respect and acknowledge their intellectual property rights.
We strongly encourage users of this dataset to adhere to copyright laws and any applicable licensing restrictions when using or analyzing the provided content. It is the responsibility of the users to ensure that any use of the dataset complies with the legal requirements governing intellectual property and fair use. Please be aware that the creators and distributors of this dataset disclaim any liability or responsibility for any unauthorized or illegal use of the dataset by third parties.
If you are a copyright holder or have any concerns about the content included in this dataset, please contact us at this email address to discuss the matter further and address any potential issues.