Dataset Card for "ArtELingo"

Name: artelingo
Creator: mohamed
License: https://choosealicense.com/licenses/other/

Dataset Summary

ArtELingo is a benchmark and dataset introduced in a research paper aimed at promoting work on diversity across languages and cultures. It is an extension of ArtEmis, which is a collection of 80,000 artworks from WikiArt with 450,000 emotion labels and English-only captions. ArtELingo expands this dataset by adding 790,000 annotations in Arabic and Chinese. The purpose of these additional annotations is to evaluate the performance of "cultural-transfer" in AI systems.

The goal of ArtELingo is to encourage research on multilinguality and culturally-aware AI. By including annotations in multiple languages and considering cultural differences, the dataset aims to build more human-compatible AI that is sensitive to emotional nuances across various cultural contexts. The researchers believe that studying emotions in this way is crucial to understanding a significant aspect of human intelligence.

Supported Tasks and Leaderboards

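We have two tasks:

Emotion prediction (see the wecia-emo configuration below)
Affective caption generation (see the wecia-cap configuration below)

Both challenges have a leaderboard on Eval.ai. Submission deadlines are listed on the challenge pages.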

In addition, we are hosting the challenge at the ICCV23 workshop WECIA. We have cash prizes for winners.

Languages

We have 3 languages: English, Arabic, and Chinese. For each image, we have at least 5 captions in each language.

In total we have 80,000 images which are downloaded automatically with the dataset.

Dataset Structure

We show detailed information for all the configurations of the dataset.

Dataset Configurations

We have 4 Configurations:

artelingo

Size of downloaded dataset files: 23 GB
Splits: ['train', 'test', 'val']
Number of samples per split: [920K, 94.1K, 46.9K]
Loading Script:

from datasets import load_dataset
dataset = load_dataset(path="youssef101/artelingo", name='artelingo')

You can also pass a splits parameter (a list of strings) to avoid downloading the large files for every split, especially the train set.

from datasets import load_dataset
dataset = load_dataset(path="youssef101/artelingo", name='artelingo', splits=['val'])

Note that this makes the dev configuration below redundant.

dev

Size of downloaded dataset files: 3 GB
Splits: ['test', 'val']
Number of samples per split: [94.1K, 46.9K]
Loading Script:

from datasets import load_dataset
dataset = load_dataset(path="youssef101/artelingo", name='dev')

wecia-emo

Intended for the WECIA emotion prediction challenge. Instances do not have the emotion or the language attributes.

Size of downloaded dataset files: 1.2 GB
Splits: ['dev']
Number of samples per split: [27.9K]
Loading Script:

from datasets import load_dataset
dataset = load_dataset(path="youssef101/artelingo", name='wecia-emo')

wecia-cap

Intended for the WECIA affective caption generation challenge. Instances do not have the text attribute.

Size of downloaded dataset files: 1.2 GB
Splits: ['dev']
Number of samples per split: [16.3K]
Loading Script:

from datasets import load_dataset
dataset = load_dataset(path="youssef101/artelingo", name='wecia-cap')
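
As a rough sketch, the configuration table above can be captured in a small helper that checks a config name and its requested splits before anything is downloaded. The names CONFIG_SPLITS and build_load_kwargs are hypothetical and not part of the dataset's loading script; the card documents the splits parameter only for the artelingo config, so passing it to other configs is an assumption.

```python
from typing import Dict, List, Optional

# Config names and their splits, copied from the configuration list in this card.
CONFIG_SPLITS: Dict[str, List[str]] = {
    "artelingo": ["train", "test", "val"],
    "dev": ["test", "val"],
    "wecia-emo": ["dev"],
    "wecia-cap": ["dev"],
}


def build_load_kwargs(name: str, splits: Optional[List[str]] = None) -> dict:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if name not in CONFIG_SPLITS:
        raise ValueError(
            f"unknown config {name!r}; expected one of {sorted(CONFIG_SPLITS)}"
        )
    kwargs = {"path": "youssef101/artelingo", "name": name}
    if splits is not None:
        unknown = [s for s in splits if s not in CONFIG_SPLITS[name]]
        if unknown:
            raise ValueError(f"config {name!r} has no split(s) {unknown}")
        # Assumption: the card shows splits= only with the artelingo config.
        kwargs["splits"] = list(splits)
    return kwargs
```

The result can then be unpacked into the loader, for example load_dataset(**build_load_kwargs("artelingo", splits=["val"])), so that a typo in a config or split name fails before any large file is fetched.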

Data Fields

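The data fields are the same across all configs, except that wecia-emo instances omit the emotion and language attributes and wecia-cap instances omit the text attribute, as noted above.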

uid: an int32 feature. A unique identifier for each instance.
image: a PIL.Image feature. The image of the artwork from the WikiArt dataset.
art_style: a string feature. The art style of the artwork. Styles are a subset of the WikiArt styles.
painting: a string feature. The name of the painting according to the WikiArt dataset.
emotion: a string feature. The emotion associated with the image-caption pair.
language: a string feature. The language used to write the caption.
text: a string feature. The affective caption that describes the painting under the context of the selected emotion.
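
The fields above lend themselves to simple per-language summaries. The following is a minimal sketch that works on any iterable of records exposing the painting, language, and emotion fields; the function names and the sample records are made up for illustration, and the actual string values used for language and emotion in the dataset are an assumption here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Mapping, Tuple


def captions_per_painting(records: Iterable[Mapping]) -> Dict[Tuple[str, str], int]:
    """Count captions for each (painting, language) pair."""
    counts: Counter = Counter()
    for rec in records:
        counts[(rec["painting"], rec["language"])] += 1
    return dict(counts)


def emotions_by_language(records: Iterable[Mapping]) -> Dict[str, Counter]:
    """Tally emotion labels separately for each caption language."""
    by_lang: Dict[str, Counter] = defaultdict(Counter)
    for rec in records:
        by_lang[rec["language"]][rec["emotion"]] += 1
    return dict(by_lang)


def pairs_below(counts: Mapping[Tuple[str, str], int], minimum: int) -> List[Tuple[str, str]]:
    """List (painting, language) pairs with fewer than `minimum` captions."""
    return sorted(pair for pair, n in counts.items() if n < minimum)


# Made-up records for illustration only; real rows also carry uid, image, and art_style.
sample = [
    {"painting": "p1", "language": "english", "emotion": "awe", "text": "placeholder"},
    {"painting": "p1", "language": "english", "emotion": "contentment", "text": "placeholder"},
    {"painting": "p1", "language": "arabic", "emotion": "awe", "text": "placeholder"},
    {"painting": "p1", "language": "chinese", "emotion": "sadness", "text": "placeholder"},
    {"painting": "p2", "language": "english", "emotion": "sadness", "text": "placeholder"},
]

counts = captions_per_painting(sample)
by_language = emotions_by_language(sample)
```

On a loaded split, the same functions could be applied to the dataset rows directly; selecting only the text columns first would avoid decoding every image, though whether that is needed depends on how the split is loaded.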

Dataset Creation

Curation Rationale

ArtELingo is a benchmark and dataset designed to promote research on diversity across languages and cultures. It builds upon ArtEmis, a collection of 80,000 artworks from WikiArt with 450,000 emotion labels and English-only captions. ArtELingo extends this dataset by adding 790,000 annotations in Arabic and Chinese, as well as 4,800 annotations in Spanish, allowing for the evaluation of "cultural-transfer" performance in AI systems. With many artworks having multiple annotations in three languages, the dataset enables the investigation of similarities and differences across linguistic and cultural contexts. Additionally, ArtELingo explores captioning tasks, demonstrating how diversity in annotations can improve the performance of baseline AI models. The hope is that ArtELingo will facilitate future research on multilinguality and culturally-aware AI. The dataset is publicly available, including standard splits and baseline models, to support and ease further research in this area.

Source Data

Initial Data Collection and Normalization

ArtELingo uses images from the WikiArt dataset. The images are mainly artworks, since artworks are created with the intention of having an emotional impact on the viewer. ArtELingo assumes that WikiArt is a representative sample of the cultures of interest. While WikiArt is remarkably comprehensive, it has better coverage of the West than of other regions of the world, based on WikiArt’s assignment of artworks to nationalities.

The data was collected via Amazon Mechanical Turk, where only native speakers were allowed to annotate the images. The English, Arabic, and Chinese subsets were collected by 6377, 656, and 745 workers respectively. All workers were paid above the minimum wage of their respective countries.

Who are the source language producers?

The data comes from human annotators who natively speak each respective language.

Considerations for Using the Data

Social Impact of Dataset

When using the ArtELingo dataset, researchers and developers must be mindful of the potential social impact of the data. Emotions, cultural expressions, and artistic representations can be sensitive topics, and AI systems trained on such data may have implications on how they perceive and respond to users. It is crucial to ensure that the dataset's usage does not perpetuate stereotypes or biases related to specific cultures or languages. Ethical considerations should be taken into account during the development and deployment of AI models trained on ArtELingo to avoid any harmful consequences on individuals or communities.

Discussion of Biases

ArtELingo was filtered against hate speech, racism, and obvious stereotypes. However, like any dataset, ArtELingo may contain inherent biases that could influence the performance and behavior of AI systems. These biases could arise from various sources, such as cultural differences in emotional interpretations, variations in annotator perspectives, or imbalances in the distribution of annotations across languages and cultures. Researchers should be cautious about potential biases that might impact the dataset's outcomes and address them appropriately. Transparently discussing and documenting these biases is essential to facilitate a fair understanding of the dataset's limitations and potential areas of improvement.

Additional Information

Dataset Curators

The corpus was put together by Youssef Mohamed, Mohamed Abdelfattah, Shyma Alhuwaider, Feifan Li, Xiangliang Zhang, Kenneth Ward Church and Mohamed Elhoseiny.

Licensing Information

Terms of Use: Before we are able to offer you access to the database, please agree to the following terms of use. After approval, you (the 'Researcher') receive permission to use the ArtELingo database (the 'Database') at King Abdullah University of Science and Technology (KAUST). In exchange for being able to join the ArtELingo community and receive such permission, Researcher hereby agrees to the following terms and conditions:

[1.] The Researcher shall use the Database only for non-commercial research and educational purposes.
[2.] The Universities make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
[3.] Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the Universities, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, and Researcher's use of any copies of copyrighted 2D artworks originally uploaded to http://www.wikiart.org that the Researcher may use in connection with the Database.
[4.] Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
[5.] The Universities reserve the right to terminate Researcher's access to the Database at any time.
[6.] If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
[7.] The international copyright laws shall apply to all disputes under this agreement.

Citation Information

@inproceedings{mohamed2022artelingo,
  title={ArtELingo: A Million Emotion Annotations of WikiArt with Emphasis on Diversity over Language and Culture},
  author={Mohamed, Youssef and Abdelfattah, Mohamed and Alhuwaider, Shyma and Li, Feifan and Zhang, Xiangliang and Church, Kenneth and Elhoseiny, Mohamed},
  booktitle={Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing},
  pages={8770--8785},
  year={2022}
}

Contributions

Thanks to @youssef101 for adding this dataset and to @Faizan for testing it.