Tafsir Dataset: A Novel Multi-Task Benchmark for Named Entity Recognition and Topic Modeling in Classical Arabic Literature

Sajawel Ahmed, Rob van der Goot, Misbahur Rehman, Carl Kruse, Ömer Özsoy, Alexander Mehler, Gemma Roig

Publication: Conference article in proceedings or book/report chapter; conference contribution in proceedings; research; peer-reviewed

Abstract

Various historical languages, which used to be the lingua franca of science and the arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps towards this research line for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM), using CA literature as an example. We manually annotate the encyclopedic work of Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly generated dataset, which we make available as open source, with current language models (lightweight BiLSTM, transformer-based MaChAmP) along with a novel script compression method, thereby achieving state-of-the-art performance for our target task CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on automatic analysis of CA literature.
Original language: English
Title: Proceedings of the 29th International Conference on Computational Linguistics
Publication date: Oct. 2022
Pages: 3753-3768
Status: Published, Oct. 2022
Event: 29th International Conference on Computational Linguistics
Duration: 12 Oct. 2022 to 17 Nov. 2022

Conference

Conference: 29th International Conference on Computational Linguistics
Period: 12/10/2022 to 17/11/2022
