Datasets:
audio
audio
| label
class label
2 classes
|
---|---|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
0
(clips_16k_mp3) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
|
1
(clips_48k_wav) |
Dataset Card for NBAiLab/NPSC
The Norwegian Parliament Speech Corpus (NPSC) is a corpus for training a Norwegian ASR (Automatic Speech Recognition) models. The corpus is created by Språkbanken at the National Library in Norway.
NPSC is based on sound recording from meeting in the Norwegian Parliament. These talks are orthographically transcribed to either Norwegian Bokmål or Norwegian Nynorsk. In addition to the data actually included in this dataset, there is a significant amount of metadata that is included in the original corpus. Through the speaker id there is additional information about the speaker, like gender, age, and place of birth (ie dialect). Through the proceedings id the corpus can be linked to the official proceedings from the meetings.
The corpus is in total sound recordings from 40 entire days of meetings. This amounts to 140 hours of speech, 65,000 sentences or 1.2 million words.
This corpus is an adaption of the original corpus made for efficiant ASR training. For simplicity and portability, a few of the original datasets features, like the token transcription, is ommitted. You can find the complete dataset at the Resource Catalogue at Språkbanken.
How to Use (This needs to be edited of course)
from datasets import load_dataset
data = load_dataset("nb/NPSC", streaming=True)
Data Fields
Currently there are two versions included in this repo.
Version A
This verison has a short list of the metadata and includes the audio (48k mp3) encoded as a float32 array in the dataset itself.
The current dataloader script is associated with this version.
One line in train.json looks like this:
{
"sentence_id": 7309,
"sentence_order": 0,
"speaker_id": 1,
"speaker_name": "Marit Nybakk",
"sentence_text": "Stortingets møte er lovlig satt",
"sentence_language_code": "nb-NO",
"text": "Stortingets møte er lovlig satt",
"start_time": 302650,
"end_time": 306000,
"normsentence_text": "Stortingets møte er lovlig satt",
"transsentence_text": "Stortingets møte er lovleg sett",
"translated": 1,
"audio": {
"path": "audio/20170207-095506_302650_306000.wav",
"array": [
24,
25,
50,
(...)
],
"sampling_rate": 48000
}
}
Version B
This verison does not contain the audio encoded in the dataset. Instead it has the audio files placed in sub-directories. There are currently both samples in clips_48k_wav and clips_16k_mp3. Only the base filename is referred in the dataset. Please not that there are both sentence-based audio clips as well at meeting-based audio clips. The dataset contains referrals to both, the latter referral has start and stop time as well.
One line in the train/metadata.json looks like this:
{
"meeting_date": "20170207",
"full_audio_file": "20170207-095506",
"proceedings_file": "20170207-095506.ref",
"duration": 4442474,
"transcriber_id": 1,
"reviewer_id": 2,
"data_split": "test",
"speaker_name": "Marit Nybakk",
"speaker_id": 1,
"sentence_id": 7309,
"sentence_language_code": "nb-NO",
"sentence_text": "Stortingets møte er lovlig satt",
"sentence_order": 0,
"audio_file": "20170207-095506_302650_306000",
"start_time": 302650,
"end_time": 306000,
"normsentence_text": "Stortingets møte er lovlig satt",
"transsentence_text": "Stortingets møte er lovleg sett",
"translated": 1
}
Dataset Creation
We are providing a train, dev and test split. These are the same as in the orginal corpus.
Build date: 20012022
Initial Data Collection and Curation
The procedure for the dataset creation is described in detail in the paper.
Statistics
Feature | Value |
---|---|
Duration, pauses included | 140,3 hours |
Duration, pauses not included | 125,7 hours |
Word count | 1,2 million |
Sentence count | 64.531 |
Language distribution | Nynorsk: 12,8% |
Bokmål: 87,2%% | |
Gender distribution | Female: 38,3% |
Male: 61.7% |
Considerations for Using the Data
This corpus contains speech data and is allowed to be used outside the National Library of Norway for speech recognition technology purposes.
Discussion of Biases
Please refer to our paper.
Dataset Curators
Freddy Wetjen, Andre Kaasen and Per Egil Kummervold has contributed to porting it to the Model Database Dataset format.
Licensing Information
Licensed for use outside the National Library of Norway.
License
CC-ZERO(https://creativecommons.org/publicdomain/zero/1.0/)
Citation Information
We are preparing an article with detailed information about this corpus. Until it is published, please cite out paper discussing the first version of this corpus:
ANDRE: TO BE DONE
- Downloads last month
- 341