Datasets:
caner

Tasks:

Token Classification

Sub-tasks: named-entity-recognition

Languages: Arabic

Multilinguality: monolingual

Size Categories: 100K<n<1M

Language Creators: expert-generated

Annotations Creators: expert-generated

Source Datasets: original

License: unknown

token (string)	ner_tag (class label, 21 classes)
"الجامع"	1 (Book)
"المسند"	1 (Book)
"الصحيح"	1 (Book)
"المختصر"	1 (Book)
"من"	1 (Book)
"أمور"	1 (Book)
"رسول"	1 (Book)
"الله"	1 (Book)
"صلى"	1 (Book)
"الله"	1 (Book)
"عليه"	1 (Book)
"وسلم"	1 (Book)
"وسننه"	1 (Book)
"وأيامه"	1 (Book)
"صحيح"	1 (Book)
"البخاري"	1 (Book)
"المؤلف"	13 (O)
"محمد"	16 (Pers)
"بن"	16 (Pers)
"إسماعيل"	16 (Pers)
"أبو"	16 (Pers)
"عبد"	16 (Pers)
"الله"	16 (Pers)
"البخاري"	16 (Pers)
"الجعفي"	16 (Pers)
"المحقق"	13 (O)
"محمد"	16 (Pers)
"زهير"	16 (Pers)
"بن"	16 (Pers)
"ناصر"	16 (Pers)
"الناصر"	16 (Pers)
"الناشر"	13 (O)
"دار"	14 (Org)
"طوق"	14 (Org)
"النجاة"	14 (Org)
"مصورة"	13 (O)
"عن"	13 (O)
"السلطانية"	13 (O)
"بإضافة"	13 (O)
"ترقيم"	13 (O)
"ترقيم"	13 (O)
"محمد"	16 (Pers)
"فؤاد"	16 (Pers)
"عبد"	16 (Pers)
"الباقي"	16 (Pers)
"الطبعة"	13 (O)
"الأولى"	13 (O)
"1422"	4 (Date)
"ه"	13 (O)
"عدد"	13 (O)
"الأجزاء"	13 (O)
"9"	12 (Number)
"ترقيم"	13 (O)
"الكتاب"	13 (O)
"موافق"	13 (O)
"للمطبوع"	13 (O)
"وهو"	13 (O)
"ضمن"	13 (O)
"خدمة"	13 (O)
"التخريج"	13 (O)
"ومتن"	13 (O)
"مرتبط"	13 (O)
"بشرحه"	13 (O)
"مع"	13 (O)
"الكتاب"	13 (O)
"شرح"	13 (O)
"وتعليق"	13 (O)
"د"	13 (O)
"مصطفى"	16 (Pers)
"ديب"	16 (Pers)
"البغا"	16 (Pers)
"أستاذ"	13 (O)
"الحديث"	13 (O)
"وعلومه"	13 (O)
"في"	13 (O)
"كلية"	14 (Org)
"الشريعة"	14 (Org)
"جامعة"	14 (Org)
"دمشق"	14 (Org)
"كالتالي"	13 (O)
"رقم"	13 (O)
"الحديث"	13 (O)
"والجزء"	13 (O)
"والصفحة"	13 (O)
"في"	13 (O)
"ط"	13 (O)
"البغا"	16 (Pers)
"يليه"	13 (O)
"تعليقه"	13 (O)
"ثم"	13 (O)
"أطرافه"	13 (O)
"مقدمة"	13 (O)
"د"	13 (O)
"مصطفى"	16 (Pers)
"البغا"	16 (Pers)
"بسم"	13 (O)
"الله"	0 (Allah)
"الرحمن"	0 (Allah)
"الرحيم"	0 (Allah)
"الحمد"	13 (O)

Dataset Card for CANER

Dataset Summary

The Classical Arabic Named Entity Recognition corpus (CANERCorpus) is a corpus of tagged Classical Arabic text intended to help address the difficulties of recognizing Arabic named entities.

Supported Tasks and Leaderboards

Named Entity Recognition

Languages

Classical Arabic

Dataset Structure

Data Instances

An example from the dataset:

{'ner_tag': 1, 'token': 'الجامع'}

Where 1 stands for "Book"
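Because each instance holds a single token, entity mentions have to be reassembled from consecutive rows. The following is a minimal sketch of one way to do that, using eleven consecutive rows copied from the dataset preview. It assumes that adjacent tokens sharing the same non-O tag belong to one entity; the card does not state this, and the tag set has no begin/inside markers, so two neighbouring entities of the same class would be merged. The helper name `group_spans` is hypothetical.

```python
from itertools import groupby

# Consecutive rows copied from the dataset preview (one token per row).
rows = [
    {"token": "المحقق", "ner_tag": 13},
    {"token": "محمد", "ner_tag": 16},
    {"token": "زهير", "ner_tag": 16},
    {"token": "بن", "ner_tag": 16},
    {"token": "ناصر", "ner_tag": 16},
    {"token": "الناصر", "ner_tag": 16},
    {"token": "الناشر", "ner_tag": 13},
    {"token": "دار", "ner_tag": 14},
    {"token": "طوق", "ner_tag": 14},
    {"token": "النجاة", "ner_tag": 14},
    {"token": "مصورة", "ner_tag": 13},
]

O_TAG = 13  # class index of the "O" (outside) label


def group_spans(rows, outside=O_TAG):
    """Merge runs of consecutive rows with the same non-O tag into (tag, text) pairs.

    Assumption (not stated in the card): a run of identical tags is one entity.
    """
    spans = []
    for tag, run in groupby(rows, key=lambda r: r["ner_tag"]):
        if tag == outside:
            continue  # skip tokens that are not part of any entity
        spans.append((tag, " ".join(r["token"] for r in run)))
    return spans


spans = group_spans(rows)
for tag, text in spans:
    print(tag, text)
```

Here the five tokens tagged 16 (Pers) form one span and the three tokens tagged 14 (Org) form another.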

Data Fields

id: id of the sample
token: a single token of the text
ner_tag: the NER tag of that token, stored as an integer class index

The integer ner_tag values index into the following list, starting from 0 (so 0 is "Allah", 1 is "Book", 13 is "O"):

"Allah",
"Book",
"Clan",
"Crime",
"Date",
"Day",
"Hell",
"Loc",
"Meas",
"Mon",
"Month",
"NatOb",
"Number",
"O",
"Org",
"Para",
"Pers",
"Prophet",
"Rlig",
"Sect",
"Time"
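The index-to-label mapping implied by this list can be sketched in plain Python. The names `NER_TAGS`, `id2label`, `label2id`, and `decode` are illustrative and not part of the dataset; the ordering is taken from the list above and agrees with the indices shown in the dataset preview (for example 16 for Pers and 14 for Org).

```python
# Label names in the order given by the card; the list position is the class index.
NER_TAGS = [
    "Allah", "Book", "Clan", "Crime", "Date", "Day", "Hell",
    "Loc", "Meas", "Mon", "Month", "NatOb", "Number", "O",
    "Org", "Para", "Pers", "Prophet", "Rlig", "Sect", "Time",
]

# Lookup tables in both directions.
id2label = dict(enumerate(NER_TAGS))
label2id = {label: idx for idx, label in id2label.items()}


def decode(ner_tag: int) -> str:
    """Return the label name for an integer ner_tag value."""
    return id2label[ner_tag]


print(decode(1))         # label for the example instance above
print(label2id["Pers"])  # class index used for person names
```

Keeping the mapping in a single ordered list avoids the two dictionaries drifting apart if a label is ever renamed.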

Data Splits

The dataset provides a training split only.

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

Ramzi Salah and Lailatul Qadri Zakaria

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

[More Information Needed]

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

@article{article,
  author = {Salah, Ramzi and Zakaria, Lailatul},
  year = {2018},
  month = {12},
  pages = {},
  title = {BUILDING THE CLASSICAL ARABIC NAMED ENTITY RECOGNITION CORPUS (CANERCORPUS)},
  volume = {96},
  journal = {Journal of Theoretical and Applied Information Technology}
}