All eight of the datasets in ESB can be downloaded and prepared in a single line of code through the Hugging Face Datasets library:

from datasets import load_dataset

librispeech = load_dataset("esb/datasets", "librispeech", split="train")
  • "esb/datasets": the repository namespace. This is fixed for all ESB datasets.

  • "librispeech": the dataset name. This can be changed to any one of the eight datasets in ESB to download that dataset.

  • split="train": the split. Set this to one of train/validation/test to generate a specific split. Omit the split argument to generate all splits for a dataset.
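As a minimal sketch of how these three arguments combine, the helper below builds the keyword arguments for any of the eight ESB datasets. The names ESB_CONFIGS, esb_load_kwargs and load_esb are hypothetical, and the configuration names are taken from the example calls on this card. The datasets library is imported only inside load_esb, so the argument-building part runs without it.

```python
# Configuration names as they appear in the example calls on this card.
ESB_CONFIGS = [
    "librispeech", "common_voice", "voxpopuli", "tedlium",
    "gigaspeech", "spgispeech", "earnings22", "ami",
]


def esb_load_kwargs(name, split=None):
    """Build load_dataset keyword arguments for one ESB dataset (hypothetical helper)."""
    if name not in ESB_CONFIGS:
        raise ValueError(f"unknown ESB dataset: {name!r}")
    # The repository namespace is fixed for all ESB datasets.
    kwargs = {"path": "esb/datasets", "name": name}
    # Leaving out the split asks the library for all splits.
    if split is not None:
        kwargs["split"] = split
    return kwargs


def load_esb(name, split=None, **extra):
    """Load one ESB dataset; extra keyword arguments are passed through unchanged."""
    from datasets import load_dataset  # imported lazily; triggers a download when called
    return load_dataset(**esb_load_kwargs(name, split), **extra)


print(esb_load_kwargs("librispeech", split="train"))
```

Calling load_esb("librispeech", split="train") would then be equivalent to the one-line call shown above.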

The datasets are fully prepared, such that the audio and transcription files can be used directly in training/evaluation scripts.

Dataset Information

A data point can be accessed by indexing the dataset object loaded through load_dataset:

print(librispeech[0])

A typical data point comprises the path to the audio file and its transcription. Also included are the name of the dataset from which the sample derives and a unique identifier:

{
  'dataset': 'librispeech',
  'audio': {'path': '/home/sanchit-gandhi/.cache/huggingface/datasets/downloads/extracted/d2da1969fe9e7d06661b5dc370cf2e3c119a14c35950045bcb76243b264e4f01/374-180298-0000.flac',
            'array': array([ 7.01904297e-04,  7.32421875e-04,  7.32421875e-04, ...,
                            -2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
            'sampling_rate': 16000},
  'text': 'chapter sixteen i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished',
  'id': '374-180298-0000'
}

Data Fields

  • dataset: name of the ESB dataset from which the sample is taken.

  • audio: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.

  • text: the transcription of the audio file.

  • id: unique id of the data sample.
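To illustrate how these fields fit together, here is a small sketch that summarises one data point. The function describe_sample and the synthetic sample are hypothetical; the sample only mimics the structure shown above (a plain list stands in for the decoded audio array), so the snippet runs without downloading anything.

```python
def describe_sample(sample):
    """Summarise one ESB-style data point (hypothetical helper)."""
    audio = sample["audio"]
    # Duration in seconds = number of audio samples / sampling rate.
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    return {
        "dataset": sample["dataset"],
        "id": sample["id"],
        "duration_s": duration_s,
        "num_words": len(sample["text"].split()),
    }


# Synthetic stand-in with the same keys as a real data point.
sample = {
    "dataset": "librispeech",
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "chapter sixteen",
    "id": "example-0000",
}

summary = describe_sample(sample)
print(summary["duration_s"], summary["num_words"])  # 2.0 2
```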

Data Preparation

Audio

The audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. The Hugging Face datasets library decodes audio files on the fly, reading the segments and converting them to Python arrays. Consequently, the audio requires no further preparation before it is used in training/evaluation scripts.

Note that when the audio column is accessed, as in dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling a large number of audio files can take a significant amount of time. It is therefore important to query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].
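The cost difference between the two access orders can be seen with a toy model. The class ToyAudioDataset below is hypothetical and is not the library's implementation: it only mimics decode-on-access and counts how many files get decoded, under the assumption that reading the whole "audio" column decodes every file.

```python
class ToyAudioDataset:
    """Toy stand-in that decodes audio lazily and counts decode calls."""

    def __init__(self, num_files):
        self.num_files = num_files
        self.decode_calls = 0

    def _decode(self, index):
        # A real dataset would read and resample the audio file here.
        self.decode_calls += 1
        return {"array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested file.
            return {"audio": self._decode(key)}
        if key == "audio":
            # Column access: decode every file before anything can be indexed.
            return [self._decode(i) for i in range(self.num_files)]
        raise KeyError(key)


row_first = ToyAudioDataset(1000)
row_first[0]["audio"]

column_first = ToyAudioDataset(1000)
column_first["audio"][0]

print(row_first.decode_calls, column_first.decode_calls)  # 1 1000
```

Both expressions return the same first sample, but in this model the column-first form decodes all 1,000 files to get it.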

Transcriptions

The transcriptions corresponding to each audio file are provided in their 'error corrected' format. No transcription pre-processing is applied to the text beyond the necessary 'error correction' steps, such as removing junk tokens (<unk>) or converting punctuation tokens to their symbols (<comma> to ,). As such, the transcriptions require no further preparation before they are used in training/evaluation scripts.
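For intuition only, the kind of 'error correction' described above might look like the sketch below. This is not the ESB preparation code: the function error_correct and both token tables are hypothetical and cover only the two examples named on this card (<unk> and <comma>).

```python
import re

# Only the tokens named on this card; the real preparation may handle more.
JUNK_TOKENS = ("<unk>",)
PUNCTUATION_TOKENS = {"<comma>": ","}


def error_correct(text):
    """Remove junk tokens and map punctuation tokens to symbols (illustrative only)."""
    for token in JUNK_TOKENS:
        text = text.replace(token, " ")
    for token, symbol in PUNCTUATION_TOKENS.items():
        text = text.replace(token, symbol)
    # Attach punctuation to the preceding word, then collapse repeated spaces.
    text = re.sub(r"\s+([,])", r"\1", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(error_correct("hello <comma> world <unk> again"))  # hello, world again
```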

Transcriptions are provided for training and validation splits. The transcriptions are not provided for the test splits. ESB requires you to generate predictions for the test sets and upload them to https://huggingface.co/spaces/esb/leaderboard for scoring.

Access

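All eight of the datasets in ESB are accessible, and their licensing information is freely available. Three of the ESB datasets have specific terms of usage that must be agreed to before using the data. To do so, fill in the access forms on the pages of those datasets. In the examples on this card, these are the datasets loaded with use_auth_token=True: Common Voice, GigaSpeech and SPGISpeech.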

Diagnostic Dataset

ESB contains a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the dataset from which they were taken. We encourage participants to use this dataset when evaluating their systems to quickly assess performance on a range of different speech recognition conditions. For more information, visit: esb/diagnostic-dataset.

Summary of ESB Datasets

| Dataset      | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|--------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| LibriSpeech  | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| Common Voice | Wikipedia                   | Narrated              | 1409      | 27      | 27       | Punctuated & Cased | CC0-1.0         |
| VoxPopuli    | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| TED-LIUM     | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| GigaSpeech   | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| SPGISpeech   | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| Earnings-22  | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| AMI          | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |

LibriSpeech

The LibriSpeech corpus is a standard large-scale corpus for assessing ASR systems. It consists of approximately 1,000 hours of narrated audiobooks from the LibriVox project. It is licensed under CC-BY-4.0.

Example usage:

librispeech = load_dataset("esb/datasets", "librispeech")

Training/validation splits:

  • train (combination of train.clean.100, train.clean.360 and train.other.500)
  • validation.clean
  • validation.other

Test splits:

  • test.clean
  • test.other

Also available are subsets of the train split, which can be accessed by setting the subconfig argument:

librispeech = load_dataset("esb/datasets", "librispeech", subconfig="clean.100")
  • clean.100: 100 hours of training data from the 'clean' subset
  • clean.360: 360 hours of training data from the 'clean' subset
  • other.500: 500 hours of training data from the 'other' subset

Common Voice

Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. The speakers are of various nationalities and native languages, with different accents and recording conditions. We use the English subset of version 9.0 (27-4-2022), with approximately 1,400 hours of audio-transcription data. It is licensed under CC0-1.0.

Example usage:

common_voice = load_dataset("esb/datasets", "common_voice", use_auth_token=True)

Training/validation splits:

  • train
  • validation

Test splits:

  • test

VoxPopuli

VoxPopuli is a large-scale multilingual speech corpus consisting of political data sourced from 2009-2020 European Parliament event recordings. The English subset contains approximately 550 hours of speech largely from non-native English speakers. It is licensed under CC0.

Example usage:

voxpopuli = load_dataset("esb/datasets", "voxpopuli")

Training/validation splits:

  • train
  • validation

Test splits:

  • test

TED-LIUM

TED-LIUM consists of English-language TED Talk conference videos covering a range of different cultural, political, and academic topics. It contains approximately 450 hours of transcribed speech data. It is licensed under CC-BY-NC-ND 3.0.

Example usage:

tedlium = load_dataset("esb/datasets", "tedlium")

Training/validation splits:

  • train
  • validation

Test splits:

  • test

GigaSpeech

GigaSpeech is a multi-domain English speech recognition corpus created from audiobooks, podcasts and YouTube. We provide the large train set (2,500 hours) and the standard validation and test splits. It is licensed under apache-2.0.

Example usage:

gigaspeech = load_dataset("esb/datasets", "gigaspeech", use_auth_token=True)

Training/validation splits:

  • train (the large "l" subset of training data, 2,500 h)
  • validation

Test splits:

  • test

Also available are subsets of the train split, which can be accessed by setting the subconfig argument:

gigaspeech = load_dataset("esb/datasets", "gigaspeech", subconfig="xs", use_auth_token=True)
  • xs: extra-small subset of training data (10 h)
  • s: small subset of training data (250 h)
  • m: medium subset of training data (1,000 h)
  • xl: extra-large subset of training data (10,000 h)

SPGISpeech

SPGISpeech consists of company earnings calls that have been manually transcribed by S&P Global, Inc according to a professional style guide. We provide the large train set (5,000 hours) and the standard validation and test splits. It is licensed under a Kensho user agreement.

Loading the dataset requires authorization.

Example usage:

spgispeech = load_dataset("esb/datasets", "spgispeech", use_auth_token=True)

Training/validation splits:

  • train (the large "l" subset of training data, ~5,000 h)
  • validation

Test splits:

  • test

Also available are subsets of the train split, which can be accessed by setting the subconfig argument:

spgispeech = load_dataset("esb/datasets", "spgispeech", subconfig="s", use_auth_token=True)
  • s: small subset of training data (~200 h)
  • m: medium subset of training data (~1,000 h)
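The subconfig options listed on this card for LibriSpeech, GigaSpeech and SPGISpeech can be collected in one place, as in the sketch below. The names TRAIN_SUBCONFIGS, NEEDS_AUTH and subconfig_kwargs are hypothetical; the values are copied from the lists and example calls on this card. The result is meant to be unpacked into load_dataset, and newer releases of the datasets library may expect token instead of use_auth_token.

```python
# Training subsets per dataset, as listed on this card.
TRAIN_SUBCONFIGS = {
    "librispeech": ("clean.100", "clean.360", "other.500"),
    "gigaspeech": ("xs", "s", "m", "xl"),
    "spgispeech": ("s", "m"),
}

# Datasets whose example calls on this card pass use_auth_token=True.
NEEDS_AUTH = ("common_voice", "gigaspeech", "spgispeech")


def subconfig_kwargs(name, subconfig):
    """Build load_dataset keyword arguments for a training subset (hypothetical helper)."""
    allowed = TRAIN_SUBCONFIGS.get(name, ())
    if subconfig not in allowed:
        raise ValueError(f"{name!r} has no subconfig {subconfig!r}; options: {allowed}")
    kwargs = {"path": "esb/datasets", "name": name, "subconfig": subconfig}
    if name in NEEDS_AUTH:
        kwargs["use_auth_token"] = True
    return kwargs


print(subconfig_kwargs("gigaspeech", "xs"))
```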

Earnings-22

Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies, with speakers of many different nationalities and accents. It is licensed under CC-BY-SA-4.0.

Example usage:

earnings22 = load_dataset("esb/datasets", "earnings22")

Training/validation splits:

  • train
  • validation

Test splits:

  • test

AMI

The AMI Meeting Corpus consists of 100 hours of meeting recordings from multiple recording devices synced to a common timeline. It is licensed under CC-BY-4.0.

Example usage:

ami = load_dataset("esb/datasets", "ami")

Training/validation splits:

  • train
  • validation

Test splits:

  • test