Datasets:

PetraAI
/

PetraAI

Tasks:

Text Classification

Token Classification

Table Question Answering

Languages: Arabic English

Size Categories: 1M<n<10M

Tags: chemistry biology finance

DOI:

License: apache-2.0

Dataset card Files Files and versions Community

Dataset Viewer

Go to dataset viewer

Viewer

The dataset viewer is not available for this dataset.

Cannot get the config names for the dataset.

Error code:   ConfigNamesError
Exception:    ValueError
Message:      Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('parquet', {}), NamedSplit('validation'): (None, {})}
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/dataset/config_names.py", line 55, in compute_config_names_response
                  for config in sorted(get_dataset_config_names(path=dataset, token=hf_token))
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 351, in get_dataset_config_names
                  dataset_module = dataset_module_factory(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1512, in dataset_module_factory
                  raise e1 from None
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1489, in dataset_module_factory
                  return HubDatasetModuleFactoryWithoutScript(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 1054, in get_module
                  module_name, default_builder_kwargs = infer_module_for_data_files(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/load.py", line 513, in infer_module_for_data_files
                  raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
              ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('parquet', {}), NamedSplit('validation'): (None, {})}

Need help to make the dataset viewer work? Open a discussion for direct support.

PETRA

Overview

PETRA is a multilingual dataset for training and evaluating AI systems on a diverse range of tasks across multiple modalities. It contains data in Arabic and English for tasks including translation, summarization, question answering, and more.

Dataset Structure

Data is separated by language into /ar and /en directories
Within each language directory, data is separated by task into subdirectories
Tasks include:
- Translation
- Summarization
- Conversational
- Feature extraction
- Zero-shot classification
- Text generation
- Fill mask
- Sentence similarity
- Text-to-speech
- Automatic speech recognition
- Text classification
- Token classification
- Table question answering
- Question answering
- Text2text generation
- Audio-to-audio
- Audio classification
- Voice activity detection
- Depth estimation
- Image classification
- Object detection
- Image segmentation
- Text-to-image
- Image-to-text
- Image-to-image
- Unconditional image generation
- Reinforcement learning
- Video classification
- Robotics
- Tabular classification
- Tabular regression
- Table-to-text
- Multiple choice
- Text retrieval
- Tabular-to-text
- Text-to-video
- Time series forecasting
- Visual question answering
- Zero-shot image classification
- Graph ML

Dataset Size

1M < n < 10M samples

Licenses

Apache 2.0

Citation

If you use this dataset, please cite it as:

[cite paper, arXiv, etc]

@article{PetraAI2022PetraAI, title={PetraAI: A Massive Multilingual Dataset for Machine Learning}, author={First Last and First Last}, journal={arXiv}, year={2022}, url={https://huggingface.co/datasets/PetraAI/PetraAI} }

Contact

For any questions, please reach out to [[email protected]]

Dataset Cards

What are Dataset Cards?

Each dataset may be documented by the README.md file in the repository. This file is called a dataset card, and the Model Database Hub will render its contents on the dataset’s main page. To inform users about how to responsibly use the data, it’s a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.

You can also add dataset metadata to your card. The metadata describes important information about a dataset such as its license, language, and size. It also contains tags to help users discover a dataset on the Hub. Tags are defined in a YAML metadata section at the top of the README.md file.

Dataset card metadata

A dataset repo will render its README.md as a dataset card. To control how the Hub displays the card, you should create a YAML section in the README file to define some metadata. Start by adding three --- at the top, then include all of the relevant metadata, and close the section with another group of --- like the example below:

The metadata that you add to the dataset card enables certain interactions on the Hub. For example:

Allow users to filter and discover datasets at https://huggingface.co/datasets.
If you choose a license using the keywords listed in the right column of this table, the license will be displayed on the dataset page.

When creating a README.md file in a dataset repository on the Hub, use Metadata UI to fill the main metadata:

To see metadata fields, see the detailed dataset card metadata specification here.

Dataset card creation guide

For a step-by-step guide on creating a dataset card, check out the Create a dataset card guide.

Reading through existing dataset cards, such as the ELI5 dataset card, is a great way to familiarize yourself with the common conventions.

Linking a Paper

If the dataset card includes a link to a paper on arXiv, the Hub will extract the arXiv ID and include it in the dataset tags with the format arxiv:<PAPER ID>. Clicking on the tag will let you:

Visit the Paper page
Filter for other models on the Hub that cite the same paper.

Models trained or fine-tuned on PetraAI/PetraAI

PetraAI/Nashmi

Updated 7 days ago