The dataset viewer is not available for this split.
Error code: StreamingRowsError Exception: NotImplementedError Message: Extraction protocol for TAR archives like 'http://www.openslr.org/resources/12/dev-clean.tar.gz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead. Example usage: url = dl_manager.download(url) tar_archive_iterator = dl_manager.iter_archive(url) for filename, file in tar_archive_iterator: ... Traceback: Traceback (most recent call last): File "/src/services/worker/src/worker/utils.py", line 263, in get_rows_or_raise return get_rows( File "/src/services/worker/src/worker/utils.py", line 204, in decorator return func(*args, **kwargs) File "/src/services/worker/src/worker/utils.py", line 226, in get_rows ds = load_dataset( File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 2129, in load_dataset return builder_instance.as_streaming_dataset(split=split) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1329, in as_streaming_dataset splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)} File "/tmp/modules-cache/datasets_modules/datasets/superb/b8183f71eabe8c559d7f3f528ab37a6a21ad1ee088fd3423574cecad8b3ec67e/superb.py", line 354, in _split_generators archive_path = dl_manager.download_and_extract(_DL_URLS) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1063, in download_and_extract return self.extract(self.download(url_or_urls)) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1015, in extract urlpaths = map_nested(self._extract, url_or_urls, map_tuple=True) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 463, in map_nested mapped = [ File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 464, in <listcomp> _single_map_nested((function, obj, types, None, True, None)) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 366, in _single_map_nested return function(data_struct) File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 1025, in _extract raise NotImplementedError( NotImplementedError: Extraction protocol for TAR archives like 'http://www.openslr.org/resources/12/dev-clean.tar.gz' is not implemented in streaming mode. Please use `dl_manager.iter_archive` instead. Example usage: url = dl_manager.download(url) tar_archive_iterator = dl_manager.iter_archive(url) for filename, file in tar_archive_iterator: ...
Need help to make the dataset viewer work? Open a discussion for direct support.
Dataset Card for SUPERB
Dataset Summary
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data.
Supported Tasks and Leaderboards
The SUPERB leaderboard can be found here https://superbbenchmark.org/leaderboard and consists of the following tasks:
pr
Phoneme Recognition (PR) transcribes an utterance into the smallest content units. This task includes alignment modeling to avoid potentially inaccurate forced alignment. LibriSpeech train-clean-100/dev-clean/test-clean subsets are adopted in SUPERB for training/validation/testing. Phoneme transcriptions are obtained from the LibriSpeech official g2p-model-5 and the conversion script in Kaldi librispeech s5 recipe. The evaluation metric is phone error rate (PER).
asr
Automatic Speech Recognition (ASR) transcribes utterances into words. While PR analyzes the improvement in modeling phonetics, ASR reflects the significance of the improvement in a real-world scenario. LibriSpeech train-clean-100/devclean/test-clean subsets are used for training/validation/testing. The evaluation metric is word error rate (WER).
ks
Keyword Spotting (KS) detects preregistered keywords by classifying utterances into a predefined set of words. The task is usually performed on-device for the fast response time. Thus, accuracy, model size, and inference time are all crucial. SUPERB uses the widely used Speech Commands dataset v1.0 for the task. The dataset consists of ten classes of keywords, a class for silence, and an unknown class to include the false positive. The evaluation metric is accuracy (ACC)
Example of usage:
Use these auxillary functions to:
- load the audio file into an audio data array
- sample from long
_silence_
audio clips
For other examples of handling long _silence_
clips see the S3PRL
or TFDS implementations.
def map_to_array(example):
import soundfile as sf
speech_array, sample_rate = sf.read(example["file"])
example["speech"] = speech_array
example["sample_rate"] = sample_rate
return example
def sample_noise(example):
# Use this function to extract random 1 sec slices of each _silence_ utterance,
# e.g. inside `torch.utils.data.Dataset.__getitem__()`
from random import randint
if example["label"] == "_silence_":
random_offset = randint(0, len(example["speech"]) - example["sample_rate"] - 1)
example["speech"] = example["speech"][random_offset : random_offset + example["sample_rate"]]
return example
qbe
Query by Example Spoken Term Detection (QbE) detects a spoken term (query) in an audio database (documents) by binary discriminating a given pair of query and document into a match or not. The English subset in QUESST 2014 challenge is adopted since we focus on investigating English as the first step. The evaluation metric is maximum term weighted value (MTWV) which balances misses and false alarms.
ic
Intent Classification (IC) classifies utterances into predefined classes to determine the intent of speakers. SUPERB uses the Fluent Speech Commands dataset, where each utterance is tagged with three intent labels: action, object, and location. The evaluation metric is accuracy (ACC).
sf
Slot Filling (SF) predicts a sequence of semantic slot-types from an utterance, like a slot-type FromLocation for a spoken word Taipei, which is known as a slot-value. Both slot-types and slot-values are essential for an SLU system to function. The evaluation metrics thus include slot-type F1 score and slotvalue CER. Audio SNIPS is adopted, which synthesized multi-speaker utterances for SNIPS. Following the standard split in SNIPS, US-accent speakers are further selected for training, and others are for validation/testing.
si
Speaker Identification (SI) classifies each utterance for its speaker identity as a multi-class classification, where speakers are in the same predefined set for both training and testing. The widely used VoxCeleb1 dataset is adopted, and the evaluation metric is accuracy (ACC).
asv
Automatic Speaker Verification (ASV) verifies whether the speakers of a pair of utterances match as a binary classification, and speakers in the testing set may not appear in the training set. Thus, ASV is more challenging than SID. VoxCeleb1 is used without VoxCeleb2 training data and noise augmentation. The evaluation metric is equal error rate (EER).
sd
Speaker Diarization (SD) predicts who is speaking when for each timestamp, and multiple speakers can speak simultaneously. The model has to encode rich speaker characteristics for each frame and should be able to represent mixtures of signals. LibriMix is adopted where LibriSpeech train-clean-100/dev-clean/test-clean are used to generate mixtures for training/validation/testing. We focus on the two-speaker scenario as the first step. The time-coded speaker labels were generated using alignments from Kaldi LibriSpeech ASR model. The evaluation metric is diarization error rate (DER).
Example of usage
Use these auxiliary functions to:
- load the audio file into an audio data array
- generate the label array
def load_audio_file(example, frame_shift=160):
import soundfile as sf
example["array"], example["sample_rate"] = sf.read(
example["file"], start=example["start"] * frame_shift, stop=example["end"] * frame_shift
)
return example
def generate_label(example, frame_shift=160, num_speakers=2, rate=16000):
import numpy as np
start = example["start"]
end = example["end"]
frame_num = end - start
speakers = sorted({speaker["speaker_id"] for speaker in example["speakers"]})
label = np.zeros((frame_num, num_speakers), dtype=np.int32)
for speaker in example["speakers"]:
speaker_index = speakers.index(speaker["speaker_id"])
start_frame = np.rint(speaker["start"] * rate / frame_shift).astype(int)
end_frame = np.rint(speaker["end"] * rate / frame_shift).astype(int)
rel_start = rel_end = None
if start <= start_frame < end:
rel_start = start_frame - start
if start < end_frame <= end:
rel_end = end_frame - start
if rel_start is not None or rel_end is not None:
label[rel_start:rel_end, speaker_index] = 1
example["label"] = label
return example
er
Emotion Recognition (ER) predicts an emotion class for each utterance. The most widely used ER dataset IEMOCAP is adopted, and we follow the conventional evaluation protocol: we drop the unbalance emotion classes to leave the final four classes with a similar amount of data points and cross-validates on five folds of the standard splits. The evaluation metric is accuracy (ACC).
Languages
The language data in SUPERB is in English (BCP-47 en
)
Dataset Structure
Data Instances
pr
asr
An example from each split looks like:
{'chapter_id': 1240,
'file': 'path/to/file.flac',
'audio': {'path': 'path/to/file.flac',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'id': '103-1240-0000',
'speaker_id': 103,
'text': 'CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE '
'LIVED JUST WHERE THE AVONLEA MAIN ROAD DIPPED DOWN INTO A LITTLE '
'HOLLOW FRINGED WITH ALDERS AND LADIES EARDROPS AND TRAVERSED BY A '
'BROOK'}
ks
An example from each split looks like:
{
'file': '/path/yes/af7a8296_nohash_1.wav',
'audio': {'path': '/path/yes/af7a8296_nohash_1.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'label': 0 # 'yes'
}
qbe
ic
{
'file': "/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav",
'audio': {'path': '/path/wavs/speakers/2BqVo8kVB2Skwgyb/063aa8f0-4479-11e9-a9a5-5dbec3b8816a.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'speaker_id': '2BqVo8kVB2Skwgyb',
'text': 'Turn the bedroom lights off',
'action': 3, # 'deactivate'
'object': 7, # 'lights'
'location': 0 # 'bedroom'
}
sf
si
{
'file': '/path/wav/id10003/na8-QEFmj44/00003.wav',
'audio': {'path': '/path/wav/id10003/na8-QEFmj44/00003.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'label': 2 # 'id10003'
}
asv
sd
An example from each split looks like:
{
'record_id': '1578-6379-0038_6415-111615-0009',
'file': 'path/to/file.wav',
'audio': {'path': 'path/to/file.wav',
'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
'sampling_rate': 16000},
'start': 0,
'end': 1590,
'speakers': [
{'speaker_id': '1578', 'start': 28, 'end': 657},
{'speaker_id': '6415', 'start': 28, 'end': 1576}
]
}
er
Data Fields
####Note abouth the audio
fields
When accessing the audio column: dataset[0]["audio"]
the audio file is automatically decoded and resampled to dataset.features["audio"].sampling_rate
. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the "audio"
column, i.e. dataset[0]["audio"]
should always be preferred over dataset["audio"][0]
.
pr
asr
file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.text
(string
): The transcription of the audio file.speaker_id
(integer
): A unique ID of the speaker. The same speaker id can be found for multiple data samples.chapter_id
(integer
): ID of the audiobook chapter which includes the transcription.id
(string
): A unique ID of the data sample.
ks
file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.label
(ClassLabel
): Label of the spoken command. Possible values:0: "yes", 1: "no", 2: "up", 3: "down", 4: "left", 5: "right", 6: "on", 7: "off", 8: "stop", 9: "go", 10: "_silence_", 11: "_unknown_"
qbe
ic
file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.speaker_id
(string
): ID of the speaker.text
(string
): Transcription of the spoken command.action
(ClassLabel
): Label of the command's action. Possible values:0: "activate", 1: "bring", 2: "change language", 3: "deactivate", 4: "decrease", 5: "increase"
object
(ClassLabel
): Label of the command's object. Possible values:0: "Chinese", 1: "English", 2: "German", 3: "Korean", 4: "heat", 5: "juice", 6: "lamp", 7: "lights", 8: "music", 9: "newspaper", 10: "none", 11: "shoes", 12: "socks", 13: "volume"
location
(ClassLabel
): Label of the command's location. Possible values:0: "bedroom", 1: "kitchen", 2: "none", 3: "washroom"
sf
si
file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.label
(ClassLabel
): Label (ID) of the speaker. Possible values:0: "id10001", 1: "id10002", 2: "id10003", ..., 1250: "id11251"
asv
sd
The data fields in all splits are:
record_id
(string
): ID of the record.file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.start
(integer
): Start frame of the audio.end
(integer
): End frame of the audio.speakers
(list
ofdict
): List of speakers in the audio. Each item contains the fields:speaker_id
(string
): ID of the speaker.start
(integer
): Frame when the speaker starts speaking.end
(integer
): Frame when the speaker stops speaking.
er
file
(string
): Path to the WAV audio file.audio
(dict
): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate.label
(ClassLabel
): Label of the speech emotion. Possible values:0: "neu", 1: "hap", 2: "ang", 3: "sad"
Data Splits
pr
asr
train | validation | test | |
---|---|---|---|
asr | 28539 | 2703 | 2620 |
ks
train | validation | test | |
---|---|---|---|
ks | 51094 | 6798 | 3081 |
qbe
ic
train | validation | test | |
---|---|---|---|
ic | 23132 | 3118 | 3793 |
sf
si
train | validation | test | |
---|---|---|---|
si | 138361 | 6904 | 8251 |
asv
sd
The data is split into "train", "dev" and "test" sets, each containing the following number of examples:
train | dev | test | |
---|---|---|---|
sd | 13901 | 3014 | 3002 |
er
The data is split into 5 sets intended for 5-fold cross-validation:
session1 | session2 | session3 | session4 | session5 | |
---|---|---|---|---|---|
er | 1085 | 1023 | 1151 | 1031 | 1241 |
Dataset Creation
Curation Rationale
[More Information Needed]
Source Data
Initial Data Collection and Normalization
[More Information Needed]
Who are the source language producers?
[More Information Needed]
Annotations
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
Personal and Sensitive Information
The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.
Considerations for Using the Data
Social Impact of Dataset
[More Information Needed]
Discussion of Biases
[More Information Needed]
Other Known Limitations
Dataset provided for research purposes only. Please check dataset license for additional information.
Additional Information
Dataset Curators
[More Information Needed]
Licensing Information
pr and asr
The license for Librispeech is the Creative Commons Attribution 4.0 International license ((CC-BY-4.0)[https://creativecommons.org/licenses/by/4.0/]).
ks
The license for Speech Commands is CC BY 4.0
qbe
The license for QUESST 2014 is not known.
ic
The license for Fluent Speech Commands dataset is the Fluent Speech Commands Public License
sf
The license for Audio SNIPS dataset is not known.
si and asv
The license for VoxCeleb1 dataset is the Creative Commons Attribution 4.0 International license (CC-BY-4.0).
sd
LibriMix is based on the LibriSpeech (see above) and Wham! noises datasets. The Wham! noises dataset is distributed under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
er
The IEMOCAP license is ditributed under its own license.
Citation Information
@article{DBLP:journals/corr/abs-2105-01051,
author = {Shu{-}Wen Yang and
Po{-}Han Chi and
Yung{-}Sung Chuang and
Cheng{-}I Jeff Lai and
Kushal Lakhotia and
Yist Y. Lin and
Andy T. Liu and
Jiatong Shi and
Xuankai Chang and
Guan{-}Ting Lin and
Tzu{-}Hsien Huang and
Wei{-}Cheng Tseng and
Ko{-}tik Lee and
Da{-}Rong Liu and
Zili Huang and
Shuyan Dong and
Shang{-}Wen Li and
Shinji Watanabe and
Abdelrahman Mohamed and
Hung{-}yi Lee},
title = {{SUPERB:} Speech processing Universal PERformance Benchmark},
journal = {CoRR},
volume = {abs/2105.01051},
year = {2021},
url = {https://arxiv.org/abs/2105.01051},
archivePrefix = {arXiv},
eprint = {2105.01051},
timestamp = {Thu, 01 Jul 2021 13:30:22 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2105-01051.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Note that each SUPERB dataset has its own citation. Please see the source to see
the correct citation for each contained dataset.
Contributions
Thanks to @lewtun, @albertvillanova and @anton-l for adding this dataset.
- Downloads last month
- 4,890