
Dataset Card for Kathbath

Dataset Summary

Kathbath is a human-labeled ASR dataset containing 1,684 hours of labelled speech data across 12 Indian languages from 1,218 contributors located in 203 districts in India.

Languages

  • Bengali
  • Gujarati
  • Kannada
  • Hindi
  • Malayalam
  • Marathi
  • Odia
  • Punjabi
  • Sanskrit
  • Tamil
  • Telugu
  • Urdu

Dataset Structure

Audio Data
data
├── bengali
│   ├── <split_name>
│   │   ├── 844424931537866-594-f.m4a
│   │   ├── 844424931029859-973-f.m4a
│   │   ├── ...
├── gujarati
├── ...


Transcripts
data
├── bengali
│   ├── <split_name>
│   │   ├── transcription_n2w.txt
├── gujarati
├── ...
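
The layout above can be walked programmatically. The following is a minimal sketch, assuming the audio and transcript archives have been extracted into the same local `data` directory so that each `<language>/<split_name>` folder holds both the `.m4a` files and `transcription_n2w.txt`. The function name `index_kathbath` is hypothetical and not part of any official tooling, and the sketch does not parse the transcript file because its internal format is not described here.

```python
from pathlib import Path
from typing import Dict, Tuple

# File name taken from the Transcripts tree above.
TRANSCRIPT_NAME = "transcription_n2w.txt"


def index_kathbath(root: Path) -> Dict[Tuple[str, str], dict]:
    """Map (language, split) to its audio files and transcript path.

    Hypothetical helper: assumes root/<language>/<split_name>/ as shown
    in the directory trees, with audio and transcripts side by side.
    """
    index = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for split_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
            transcript = split_dir / TRANSCRIPT_NAME
            index[(lang_dir.name, split_dir.name)] = {
                # All .m4a files directly inside the split folder.
                "audio": sorted(split_dir.glob("*.m4a")),
                # None when the transcript file is not present locally.
                "transcript": transcript if transcript.exists() else None,
            }
    return index
```

Calling `index_kathbath(Path("data"))` returns a dictionary keyed by `(language, split)` pairs, which makes it easy to check that every split with audio also has a transcript file before training or evaluation.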

Licensing Information

The IndicSUPERB dataset is released under this licensing scheme:

  • We do not own any of the raw text used in creating this dataset.
  • The text data comes from the IndicCorp dataset which is a crawl of publicly available websites.
  • The audio transcriptions of the raw text and labelled annotations of the datasets have been created by us.
  • We license the actual packaging of all this data under the Creative Commons CC0 license (“no rights reserved”).
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to the IndicSUPERB dataset.
  • This work is published from: India.

Citation Information

@misc{https://doi.org/10.48550/arxiv.2208.11761,
  doi = {10.48550/ARXIV.2208.11761},
  url = {https://arxiv.org/abs/2208.11761},
  author = {Javed, Tahir and Bhogale, Kaushal Santosh and Raman, Abhigyan and Kunchukuttan, Anoop and Kumar, Pratyush and Khapra, Mitesh M.},
  title = {IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Contributions

We would like to thank the Ministry of Electronics and Information Technology (MeitY) of the Government of India and the Centre for Development of Advanced Computing (C-DAC), Pune, for generously supporting this work and providing us access to multiple GPU nodes on the Param Siddhi Supercomputer. We would like to thank the EkStep Foundation and Nilekani Philanthropies for their generous grant, which went into hiring human resources as well as the cloud resources needed for this work. We would like to thank DesiCrew for connecting us to native speakers for collecting data. We would like to thank Vivek Seshadri from Karya Inc. for helping set up the data collection infrastructure on the Karya platform. We would like to thank all the members of the AI4Bharat team for helping create the Query by Example dataset.
