Datasets: alpindale/visual-novels

Name: visual-novels
Creator: Alpin
License: https://choosealicense.com/licenses/apache-2.0/

Tasks: Conversational, Text Generation

Languages: English



Visual Novel Dataset

This dataset contains parsed Visual Novel scripts for training language models. The dataset consists of approximately 60 million tokens of parsed scripts.

Dataset Structure

The dataset follows a general structure for visual novel scripts:

Dialogue lines: The speaker's name is followed by a colon, then the dialogue itself enclosed in double quotes. For example:
```
John: "Hello, how are you?"
```
Actions and narration: These describe character movements, background settings, or other narrative elements. They are often enclosed in asterisks, though not all visual novels follow this convention. For example:
```
*John looked around the room, searching for answers.*
```
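The two conventions above can be told apart with a pair of regular expressions. The following is a minimal sketch, not the team's actual parser: the patterns `DIALOGUE_RE` and `NARRATION_RE` and the function `classify_line` are hypothetical names, and the patterns are inferred only from the two examples shown here. Since not every visual novel follows the asterisk convention, unmatched lines are returned as "other" rather than discarded.

```python
import re

# Assumed patterns, derived only from the two examples in this card.
DIALOGUE_RE = re.compile(r'^(?P<speaker>[^:"*]+):\s*"(?P<text>.*)"$')
NARRATION_RE = re.compile(r'^\*(?P<text>.+)\*$')


def classify_line(line):
    """Return a (kind, speaker, text) tuple for one script line."""
    stripped = line.strip()
    match = DIALOGUE_RE.match(stripped)
    if match:
        return ("dialogue", match.group("speaker").strip(), match.group("text"))
    match = NARRATION_RE.match(stripped)
    if match:
        return ("narration", None, match.group("text"))
    # Lines that follow neither convention are passed through unchanged.
    return ("other", None, stripped)
```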

visual-novels.txt: This file contains all the parsed VNs concatenated within a single plaintext file. Each entry is separated with this string:
```
[          - title - {visual-novel-title-1.txt}          ]
```
VNDB/: This directory contains .json files with the VNDB IDs of the characters in each corresponding VN. Unparsed VNs are not included.
Archives/visual-novels-parsed.tar.zst: This archive contains the parsed VNs but with each script in a separate text file (i.e. not concatenated).
Archives/visual-novels-unparsed.tar.zst: This archive contains all the unparsed VNs, along with the original scripts of the VNs that have already been parsed.
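To work with individual scripts from the concatenated visual-novels.txt, the file can be split on the separator string. The sketch below rests on assumptions: that each separator line precedes the entry it names, that the amount of padding whitespace inside the brackets may vary, and that the file is UTF-8 encoded. `SEPARATOR_RE` and `split_entries` are hypothetical names introduced for illustration.

```python
import re

# Matches a separator such as "[   - title - {some-title.txt}   ]".
# Whitespace is matched loosely because the exact padding is an assumption.
SEPARATOR_RE = re.compile(r'^\[\s*-\s*title\s*-\s*\{(?P<title>[^}]*)\}\s*\]\s*$')


def split_entries(lines):
    """Yield (title, script_lines) pairs from an iterable of text lines."""
    title, body = None, []
    for line in lines:
        match = SEPARATOR_RE.match(line.strip())
        if match:
            # A new separator closes the previous entry, if there was one.
            if title is not None or body:
                yield title, body
            title, body = match.group("title"), []
        else:
            body.append(line.rstrip("\n"))
    # Emit the final entry once the input is exhausted.
    if title is not None or body:
        yield title, body
```

Because `split_entries` accepts any iterable of lines, it can be fed an open file handle (for example `open("visual-novels.txt", encoding="utf-8")`) and will stream entries without loading the whole file into memory.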

Usage

You can use this dataset to train language models, particularly for conversational and text generation tasks. The parsed scripts expose dialogue structure (speaker-labelled lines and narration), so models trained on them can learn to follow that structure and generate coherent responses. The unparsed scripts are included to allow further analysis and processing.
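As one possible way to prepare the parsed scripts for response generation, consecutive dialogue lines can be turned into (context, response) pairs. This is a sketch under stated assumptions rather than the procedure used by the dataset's authors: `DIALOGUE_RE`, `make_pairs`, and the `context_size` parameter are hypothetical, and the pattern assumes the `Name: "text"` dialogue convention described in this card.

```python
import re

# Assumed dialogue pattern: speaker name, colon, then quoted text.
DIALOGUE_RE = re.compile(r'^(?P<speaker>[^:"*]+):\s*"(?P<text>.*)"$')


def make_pairs(script_lines, context_size=2):
    """Build (context, response) pairs from the dialogue lines of one script."""
    # Keep only lines that look like dialogue; narration is skipped here.
    turns = [line.strip() for line in script_lines if DIALOGUE_RE.match(line.strip())]
    pairs = []
    for i in range(1, len(turns)):
        # The context is up to `context_size` dialogue lines before the response.
        context = "\n".join(turns[max(0, i - context_size):i])
        pairs.append((context, turns[i]))
    return pairs
```

Dropping narration keeps the example short; a real pipeline might instead keep it as part of the context.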

Contribution

This dataset was gathered and parsed by the PygmalionAI Data Processing Team. Listed below are the team members, sorted by contribution amount:

Suikamelon: Model Database (2,787,704 ++ 672,473 --)
Alpin: Model Database, GitHub (1,170,985 ++ 345,120 --)
Spartan: GitHub (901,046 ++ 467,915 --)
Unlucky-AI: GitHub (253,316 ++ 256 --)

Citation

If you use this dataset in your research or projects, please cite it appropriately.

Acknowledgements

This dataset is compiled and shared for research and educational purposes. The dataset includes parsed visual novel scripts from various sources, which are predominantly copyrighted and owned by their respective publishers and creators. The inclusion of these scripts in this dataset does not imply any endorsement or authorization from the copyright holders.

We would like to express our sincere gratitude to the original copyright holders and creators of the visual novels for their valuable contributions to the art and storytelling. We respect and acknowledge their intellectual property rights.

We strongly encourage users of this dataset to adhere to copyright laws and any applicable licensing restrictions when using or analyzing the provided content. It is the responsibility of the users to ensure that any use of the dataset complies with the legal requirements governing intellectual property and fair use. Please be aware that the creators and distributors of this dataset disclaim any liability or responsibility for any unauthorized or illegal use of the dataset by third parties.

If you are a copyright holder or have any concerns about the content included in this dataset, please contact us at this email address to discuss the matter further and address any potential issues.

