All eight datasets in ESB can be downloaded and prepared in a single line of code through the Model Database Datasets library:
from datasets import load_dataset
librispeech = load_dataset("esb/datasets", "librispeech", split="train")
"esb/datasets"
: the repository namespace. This is fixed for all ESB datasets."librispeech"
: the dataset name. This can be changed to any of any one of the eight datasets in ESB to download that dataset.split="train"
: the split. Set this to one of train/validation/test to generate a specific split. Omit thesplit
argument to generate all splits for a dataset.
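The call pattern above can be wrapped in a small helper that checks the dataset name before anything is downloaded. This is only a sketch: `ESB_CONFIGS`, `esb_load_args` and `load_esb` are hypothetical names introduced here, and the list of names is simply collected from the example calls in this card.

```python
# Dataset names as they appear in the example calls in this card.
ESB_CONFIGS = (
    "librispeech", "common_voice", "voxpopuli", "tedlium",
    "gigaspeech", "spgispeech", "earnings22", "ami",
)


def esb_load_args(name, split=None):
    """Build the arguments for load_dataset (hypothetical helper)."""
    if name not in ESB_CONFIGS:
        raise ValueError(f"unknown ESB dataset {name!r}; expected one of {ESB_CONFIGS}")
    args = ("esb/datasets", name)  # the repository namespace is fixed
    # Leaving out `split` asks for all splits of the dataset.
    kwargs = {} if split is None else {"split": split}
    return args, kwargs


def load_esb(name, split=None):
    """Validate the name, then defer to datasets.load_dataset."""
    from datasets import load_dataset  # imported lazily; needs the datasets library

    args, kwargs = esb_load_args(name, split)
    return load_dataset(*args, **kwargs)
```

With this in place, `load_esb("librispeech", split="train")` would issue the same call as the example above, while a misspelt name fails immediately instead of after a network round trip.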
The datasets are fully prepared, such that the audio and transcription files can be used directly in training/evaluation scripts.
Dataset Information
A data point can be accessed by indexing the dataset object loaded through load_dataset:
print(librispeech[0])
A typical data point comprises the path to the audio file and its transcription. Also included are the name of the dataset from which the sample derives and a unique identifier:
{
'dataset': 'librispeech',
'audio': {'path': '/home/sanchit-gandhi/.cache/huggingface/datasets/downloads/extracted/d2da1969fe9e7d06661b5dc370cf2e3c119a14c35950045bcb76243b264e4f01/374-180298-0000.flac',
'array': array([ 7.01904297e-04, 7.32421875e-04, 7.32421875e-04, ...,
-2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
'sampling_rate': 16000},
'text': 'chapter sixteen i might have told you of the beginning of this liaison in a few lines but i wanted you to see every step by which we came i to agree to whatever marguerite wished',
'id': '374-180298-0000'
}
Data Fields
- dataset: name of the ESB dataset from which the sample is taken.
- audio: a dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.
- text: the transcription of the audio file.
- id: unique id of the data sample.
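As an illustration of how these fields fit together, the length of a clip in seconds follows from the decoded array and the sampling rate. The sketch below uses a synthetic sample with the same keys as a real data point; `duration_seconds`, the file name and the id are made up for the example.

```python
def duration_seconds(sample):
    """Clip length in seconds: number of audio samples divided by the sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic stand-in for a data point; a real one holds a decoded array, not zeros.
sample = {
    "dataset": "librispeech",
    "audio": {"path": "example.flac", "array": [0.0] * 32000, "sampling_rate": 16000},
    "text": "a synthetic transcription",
    "id": "example-0000",
}

print(duration_seconds(sample))  # 2.0
```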
Data Preparation
Audio
The audio for all ESB datasets is segmented into sample lengths suitable for training ASR systems. The Model Database datasets library decodes audio files on the fly, reading the segments and converting them to Python arrays. Consequently, no further preparation of the audio is required before it is used in training/evaluation scripts.
Note that when accessing the audio column, dataset[0]["audio"], the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio" column, i.e. dataset[0]["audio"] should always be preferred over dataset["audio"][0].
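The cost difference between the two access orders can be made concrete with a toy model. The class below is not the datasets library; it is a hypothetical stand-in that counts how often a (fake) decode step runs, under the assumption that row access decodes one file and column access decodes every file.

```python
class ToyAudioDataset:
    """Toy stand-in for a dataset with a lazily decoded audio column."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0

    def _decode(self, path):
        # Placeholder for the expensive decode-and-resample step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: only the requested sample is decoded.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: every sample is decoded before indexing.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


ds = ToyAudioDataset(f"clip_{i}.flac" for i in range(1000))

_ = ds[0]["audio"]          # sample index first
row_first = ds.decode_calls

ds.decode_calls = 0
_ = ds["audio"][0]          # column first
column_first = ds.decode_calls

print(row_first, column_first)  # 1 1000
```

Both expressions return the same clip, but in this model the second one decodes all 1,000 files to get it.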
Transcriptions
The transcriptions corresponding to each audio file are provided in their 'error corrected' format. No transcription pre-processing is applied to the text, only necessary 'error correction' steps such as removing junk tokens (<unk>) or converting spelled-out punctuation tags to their symbolic form (<comma> to ,). As such, no further preparation of the transcriptions is required before they are used in training/evaluation scripts.
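To make the two named correction steps concrete, here is a minimal sketch of what such a step could look like. It is not the code used to build ESB, which this card does not show; `error_correct` is a hypothetical function that handles only the two examples mentioned above and leaves casing and all other text untouched.

```python
import re


def error_correct(text):
    """Apply the two example corrections: drop <unk>, turn <comma> into ','."""
    text = re.sub(r"<unk>", " ", text)        # remove junk tokens
    text = re.sub(r"\s*<comma>", ",", text)   # attach the comma to the preceding word
    return " ".join(text.split())             # collapse leftover whitespace


print(error_correct("well <comma> i think <unk> so"))  # well, i think so
```

Since the ESB transcriptions already have these corrections applied, there is no need to run anything like this on the downloaded data.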
Transcriptions are provided for the training and validation splits but not for the test splits. ESB requires you to generate predictions for the test sets and upload them to https://huggingface.co/spaces/esb/leaderboard for scoring.
Access
All eight of the datasets in ESB are accessible, and their licensing terms are freely available. Three of the ESB datasets have specific terms of usage that must be agreed to before using the data. To do so, fill in the access forms on the specific datasets' pages:
- Common Voice: https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0
- GigaSpeech: https://huggingface.co/datasets/speechcolab/gigaspeech
- SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
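Once the forms are accepted, the three datasets above are the ones whose example calls in this card pass use_auth_token=True. A small sketch of adding that argument only where it is needed follows; `GATED_CONFIGS`, `auth_kwargs` and `load_with_access` are hypothetical names, and newer releases of the library may expect a `token` argument in place of `use_auth_token`.

```python
# The three ESB datasets with access forms, by the names used in load_dataset calls.
GATED_CONFIGS = {"common_voice", "gigaspeech", "spgispeech"}


def auth_kwargs(name):
    """Extra keyword arguments for datasets that sit behind an access form."""
    return {"use_auth_token": True} if name in GATED_CONFIGS else {}


def load_with_access(name, **kwargs):
    """Call load_dataset, adding the auth argument for gated datasets only."""
    from datasets import load_dataset  # imported lazily; needs the datasets library

    merged = {**auth_kwargs(name), **kwargs}  # explicit caller arguments win
    return load_dataset("esb/datasets", name, **merged)
```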
Diagnostic Dataset
ESB contains a small, 8h diagnostic dataset of in-domain validation data with newly annotated transcriptions. The audio data is sampled from each of the ESB validation sets, giving a range of different domains and speaking styles. The transcriptions are annotated according to a consistent style guide with two formats: normalised and un-normalised. The dataset is structured in the same way as the ESB dataset, by grouping audio-transcription samples according to the dataset from which they were taken. We encourage participants to use this dataset when evaluating their systems to quickly assess performance on a range of different speech recognition conditions. For more information, visit: esb/diagnostic-dataset.
Summary of ESB Datasets
Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
---|---|---|---|---|---|---|---|
LibriSpeech | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
Common Voice | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
VoxPopuli | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
TED-LIUM | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
GigaSpeech | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
SPGISpeech | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
Earnings-22 | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
AMI | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
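For a sense of overall scale, the hour columns of the table can be summed. The snippet below copies the Train/Dev/Test figures from the table as they are printed; `HOURS` and `totals` are names introduced for the example.

```python
# (train, dev, test) hours per dataset, copied from the table above.
HOURS = {
    "LibriSpeech": (960, 11, 11),
    "Common Voice": (1409, 27, 27),
    "VoxPopuli": (523, 5, 5),
    "TED-LIUM": (454, 2, 3),
    "GigaSpeech": (2500, 12, 40),
    "SPGISpeech": (4900, 100, 100),
    "Earnings-22": (105, 5, 5),
    "AMI": (78, 9, 9),
}

# Column-wise sums across the eight datasets.
totals = [sum(column) for column in zip(*HOURS.values())]
print(totals)  # [10929, 171, 200]
```

That is roughly 10,900 hours of training audio, with 171 hours of validation data and 200 hours of test data.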
LibriSpeech
The LibriSpeech corpus is a standard large-scale corpus for assessing ASR systems. It consists of approximately 1,000 hours of narrated audiobooks from the LibriVox project. It is licensed under CC-BY-4.0.
Example usage:
librispeech = load_dataset("esb/datasets", "librispeech")
Training/validation splits:
train (combination of train.clean.100, train.clean.360 and train.other.500)
validation.clean
validation.other
Test splits:
test.clean
test.other
Also available are subsets of the train split, which can be accessed by setting the subconfig argument:
librispeech = load_dataset("esb/datasets", "librispeech", subconfig="clean.100")
- clean.100: 100 hours of training data from the 'clean' subset
- clean.360: 360 hours of training data from the 'clean' subset
- other.500: 500 hours of training data from the 'other' subset
Common Voice
Common Voice is a series of crowd-sourced open-licensed speech datasets where speakers record text from Wikipedia in various languages. The speakers are of various nationalities and native languages, with different accents and recording conditions. We use the English subset of version 9.0 (27-4-2022), with approximately 1,400 hours of audio-transcription data. It is licensed under CC0-1.0.
Example usage:
common_voice = load_dataset("esb/datasets", "common_voice", use_auth_token=True)
Training/validation splits:
train
validation
Test splits:
test
VoxPopuli
VoxPopuli is a large-scale multilingual speech corpus consisting of political data sourced from 2009-2020 European Parliament event recordings. The English subset contains approximately 550 hours of speech largely from non-native English speakers. It is licensed under CC0.
Example usage:
voxpopuli = load_dataset("esb/datasets", "voxpopuli")
Training/validation splits:
train
validation
Test splits:
test
TED-LIUM
TED-LIUM consists of English-language TED Talk conference videos covering a range of different cultural, political, and academic topics. It contains approximately 450 hours of transcribed speech data. It is licensed under CC-BY-NC-ND 3.0.
Example usage:
tedlium = load_dataset("esb/datasets", "tedlium")
Training/validation splits:
train
validation
Test splits:
test
GigaSpeech
GigaSpeech is a multi-domain English speech recognition corpus created from audiobooks, podcasts and YouTube. We provide the large train set (2,500 hours) and the standard validation and test splits. It is licensed under apache-2.0.
Example usage:
gigaspeech = load_dataset("esb/datasets", "gigaspeech", use_auth_token=True)
Training/validation splits:
train (l subset of training data (2,500 h))
validation
Test splits:
test
Also available are subsets of the train split, which can be accessed by setting the subconfig argument:
gigaspeech = load_dataset("esb/datasets", "gigaspeech", subconfig="xs", use_auth_token=True)
- xs: extra-small subset of training data (10 h)
- s: small subset of training data (250 h)
- m: medium subset of training data (1,000 h)
- xl: extra-large subset of training data (10,000 h)
SPGISpeech
SPGISpeech consists of company earnings calls that have been manually transcribed by S&P Global, Inc according to a professional style guide. We provide the large train set (5,000 hours) and the standard validation and test splits. It is licensed under a Kensho user agreement.
Loading the dataset requires authorization.
Example usage:
spgispeech = load_dataset("esb/datasets", "spgispeech", use_auth_token=True)
Training/validation splits:
train (l subset of training data (~5,000 h))
validation
Test splits:
test
Also available are subsets of the train split, which can be accessed by setting the subconfig argument:
spgispeech = load_dataset("esb/datasets", "spgispeech", subconfig="s", use_auth_token=True)
- s: small subset of training data (~200 h)
- m: medium subset of training data (~1,000 h)
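The subconfig values listed for LibriSpeech, GigaSpeech and SPGISpeech can be gathered into one lookup so that a mismatched pair, such as a GigaSpeech subset name passed with the SPGISpeech dataset, is caught before a download starts. This is a sketch only: `SUBCONFIG_HOURS` and `check_subconfig` are hypothetical names, and the hour figures are the (partly approximate) ones quoted in this card.

```python
# Train subsets and their sizes in hours, as listed in this card.
SUBCONFIG_HOURS = {
    "librispeech": {"clean.100": 100, "clean.360": 360, "other.500": 500},
    "gigaspeech": {"xs": 10, "s": 250, "m": 1000, "xl": 10000},
    "spgispeech": {"s": 200, "m": 1000},  # approximate sizes
}


def check_subconfig(name, subconfig):
    """Return the subset size in hours, or raise if the pair is not listed."""
    options = SUBCONFIG_HOURS.get(name)
    if options is None:
        raise ValueError(f"{name!r} has no train subsets listed in this card")
    if subconfig not in options:
        raise ValueError(
            f"{subconfig!r} is not a listed subset of {name!r}; "
            f"choose from {sorted(options)}"
        )
    return options[subconfig]


print(check_subconfig("gigaspeech", "xs"))  # 10
```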
Earnings-22
Earnings-22 is a 119-hour corpus of English-language earnings calls collected from global companies, with speakers of many different nationalities and accents. It is licensed under CC-BY-SA-4.0.
Example usage:
earnings22 = load_dataset("esb/datasets", "earnings22")
Training/validation splits:
train
validation
Test splits:
test
AMI
The AMI Meeting Corpus consists of 100 hours of meeting recordings from multiple recording devices synced to a common timeline. It is licensed under CC-BY-4.0.
Example usage:
ami = load_dataset("esb/datasets", "ami")
Training/validation splits:
train
validation
Test splits:
test