You agree to the following license terms:
This material and data is licensed under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0). The full text of the CC BY 4.0 license is available at https://creativecommons.org/licenses/by/4.0/.

Notwithstanding the foregoing, this material and data may only be used, modified and distributed for the express purpose of training AI models, and subject to the foregoing restriction. In addition, this material and data may not be used in order to create audiovisual material that simulates the voice or likeness of the specific individuals appearing or speaking in such materials and data (a “deep-fake”). To the extent this paragraph is inconsistent with the CC-BY-4.0 license, the terms of this paragraph shall govern.

By downloading or using any of this material or data, you agree that the Project makes no representations or warranties in respect of the data, and shall have no liability in respect thereof. These disclaimers and limitations are in addition to any disclaimers and limitations set forth in the CC-BY-4.0 license itself. You understand that the Project is only able to make available the materials and data pursuant to these disclaimers and limitations, and without such disclaimers and limitations the Project would not be able to make available the materials and data for your use.

ivrit.ai is a database of Hebrew audio and text content.

audio-base contains the raw, unprocessed sources.

audio-vad contains audio snippets generated by applying Silero VAD (https://github.com/snakers4/silero-vad) to the base dataset.
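
As a rough illustration of how VAD output can be turned into snippets, here is a minimal Python sketch. It is not the project's actual pipeline: the helper names `cut_snippets` and `detect_speech` are hypothetical, the 16 kHz sampling rate is an assumption, and the `torch.hub` loading pattern follows the usage shown in the Silero VAD repository, which may differ between releases. It assumes the VAD returns a list of dictionaries with `start` and `end` sample offsets.

```python
from typing import Dict, List, Sequence


def cut_snippets(
    samples: Sequence[float], timestamps: List[Dict[str, int]]
) -> List[Sequence[float]]:
    """Slice one waveform into snippets, one per detected speech region.

    `timestamps` is assumed to hold {'start': ..., 'end': ...} sample offsets.
    """
    snippets = []
    for ts in timestamps:
        # Clamp the region to the bounds of the waveform.
        start = max(0, ts["start"])
        end = min(len(samples), ts["end"])
        if end > start:
            snippets.append(samples[start:end])
    return snippets


def detect_speech(path: str, sampling_rate: int = 16000) -> List[Dict[str, int]]:
    """Hypothetical helper: run Silero VAD on one audio file.

    Requires torch and network access for torch.hub; sampling_rate is an assumption.
    """
    import torch

    model, utils = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
    get_speech_timestamps, _, read_audio, _, _ = utils
    wav = read_audio(path, sampling_rate=sampling_rate)
    return get_speech_timestamps(wav, model, sampling_rate=sampling_rate)
```

Keeping the slicing step separate from the model call means the snippet boundaries can be checked on plain lists without downloading the model.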

audio-transcripts contains transcriptions for each snippet in the audio-vad dataset.

The audio-base dataset contains data from the following sources:

Paper: https://arxiv.org/abs/2307.08720

If you use our datasets, please cite the following paper:

@misc{marmor2023ivritai,
      title={ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development}, 
      author={Yanir Marmor and Kinneret Misgav and Yair Lifshitz},
      year={2023},
      eprint={2307.08720},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}