Tafsir Dataset: A Novel Multi-Task Benchmark for Named Entity Recognition and Topic Modeling in Classical Arabic Literature

Sajawel Ahmed, Rob van der Goot, Misbahur Rehman, Carl Kruse, Ömer Özsoy, Alexander Mehler, Gemma Roig

Research output: Conference Article in Proceeding or Book/Report chapter, Article in proceedings, Research, peer-review

Abstract

Various historical languages, which used to be the lingua franca of science and arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps towards this research line for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM), using CA literature as an example. We manually annotate the encyclopedic work of Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly generated dataset, which we make available as open source, with current language models (lightweight BiLSTM, transformer-based MaChAmP) along with a novel script compression method, thereby achieving state-of-the-art performance for our target task CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on automatic analysis of CA literature.
Original language: English
Title of host publication: Proceedings of the 29th International Conference on Computational Linguistics
Publication date: Oct 2022
Pages: 3753-3768
Publication status: Published - Oct 2022
Event: 29th International Conference on Computational Linguistics
Duration: 12 Oct 2022 to 17 Nov 2022

Conference

Conference: 29th International Conference on Computational Linguistics
Period: 12/10/2022 to 17/11/2022

Keywords

  • Classical Arabic
  • Named Entity Recognition
  • Topic Modeling
  • Tafsir Al-Tabari
  • Natural Language Processing
