GMaSC: GEC Barton Hill Malayalam Speech Corpus

GMaSC is a Malayalam text and speech corpus created by the Government Engineering College Barton Hill with an emphasis on Malayalam-accented English. The corpus contains 2,000 text-audio pairs of Malayalam sentences spoken by 2 speakers, totalling approximately 139 minutes of audio. Each sentence contains at least one English word common in Malayalam speech.

Dataset Structure

The dataset consists of 2,000 instances with the fields text, speaker, and audio. The audio is mono, sampled at 48 kHz. The transcription is normalized and includes only Malayalam characters and common punctuation. The table below shows how the 2,000 instances are split between the speakers, along with some basic speaker information:

Speaker  Gender  Age  Time (HH:MM:SS)  Sentences
Sonia    Female  43   01:02:17         1,000
Anil     Male    48   01:17:23         1,000
Total                 02:19:40         2,000
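
The totals in the table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the `SPEAKERS` mapping and the helper names `parse_hms` and `format_hms` are assumptions introduced here, and the figures are copied from the table rather than computed from the audio.

```python
from datetime import timedelta

# Durations and sentence counts copied from the speaker table above.
# SPEAKERS, parse_hms and format_hms are illustrative names, not part of the corpus.
SPEAKERS = {
    "Sonia": ("01:02:17", 1000),
    "Anil": ("01:17:23", 1000),
}


def parse_hms(value: str) -> timedelta:
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_hms(delta: timedelta) -> str:
    """Render a timedelta as a zero-padded HH:MM:SS string."""
    total = int(delta.total_seconds())
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


total_time = sum((parse_hms(t) for t, _ in SPEAKERS.values()), timedelta())
total_sentences = sum(count for _, count in SPEAKERS.values())
mean_seconds = total_time.total_seconds() / total_sentences

print(format_hms(total_time), total_sentences)
print(f"{mean_seconds:.2f} s per sentence on average")
```

The summed duration should match the Total row, and dividing it by the sentence count gives a rough mean utterance length of a little over four seconds.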

Data Instances

An example instance is given below:

{'text': 'സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്',
 'speaker': 'Sonia',
 'audio': {'path': None,
  'array': array([0.00036621, 0.00033569, 0.0005188 , ..., 0.00094604, 0.00091553,
         0.00094604]),
  'sampling_rate': 48000}}

Data Fields

  • text (str): Transcription of the audio file
  • speaker (str): The name of the speaker
  • audio (dict): Audio object including loaded audio array, sampling rate and path to audio (always None)
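
As a minimal sketch of how these fields fit together, the duration of a clip follows from the length of the audio array and the sampling rate. The helper `clip_duration_seconds` is a hypothetical name introduced here, and the instance below is a synthetic stand-in (one second of silence in a plain list) rather than real corpus audio.

```python
def clip_duration_seconds(instance: dict) -> float:
    """Return the clip length in seconds: number of samples / sampling rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic instance with the same field layout as the corpus.
# The array is one second of silence, not actual GMaSC audio.
example = {
    "text": "സൗജന്യ ആയുർവേദ മെഡിക്കൽ ക്യാമ്പ്",
    "speaker": "Sonia",
    "audio": {"path": None, "array": [0.0] * 48000, "sampling_rate": 48000},
}

print(clip_duration_seconds(example))
```

The same function works on a loaded instance, since `len()` applies equally to a NumPy array.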

Data Splits

We provide all the data in a single train split. The loaded dataset object thus looks like this:

DatasetDict({
     train: Dataset({
         features: ['text', 'speaker', 'audio'],
         num_rows: 2000
     })
 })
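
One way to obtain the train split and confirm the per-speaker counts is sketched below, assuming the Hugging Face `datasets` library is installed. The repository identifier is not stated in this card, so `repo_id` is left as a required argument; `load_gmasc` and `sentences_per_speaker` are hypothetical helper names.

```python
from collections import Counter


def load_gmasc(repo_id: str):
    """Load the single train split.

    repo_id is the Hub identifier of the corpus; it is not given in this
    card, so the caller must supply it.
    """
    # Imported lazily so the counting helper below works without the library.
    from datasets import load_dataset  # third-party: pip install datasets

    return load_dataset(repo_id, split="train")


def sentences_per_speaker(speakers) -> Counter:
    """Count how many instances belong to each speaker name."""
    return Counter(speakers)


# Usage (requires network access and the real repository identifier):
#   train = load_gmasc(repo_id)
#   print(sentences_per_speaker(train["speaker"]))
```

On the full corpus, the counts should come to 1,000 for each of the two speakers, in line with the speaker table.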

Additional Information

Licensing

The corpus is made available under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
