Abstract
Various historical languages, which used to be lingua francas of science and the arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps towards this line of research for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM) on the example of CA literature. We manually annotate the encyclopedic work Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly created dataset, which we release as open source, with current language models (a lightweight BiLSTM and the transformer-based MaChAmP) along with a novel script compression method, thereby achieving state-of-the-art performance for our target task, CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on the automatic analysis of CA literature.
Original language | English |
---|---|
Title of host publication | Proceedings of the 29th International Conference on Computational Linguistics |
Publication date | Oct 2022 |
Pages | 3753–3768 |
Publication status | Published - Oct 2022 |
Event | 29th International Conference on Computational Linguistics - Duration: 12 Oct 2022 → 17 Oct 2022 |
Conference
Conference | 29th International Conference on Computational Linguistics |
---|---|
Period | 12/10/2022 → 17/10/2022 |
Keywords
- Classical Arabic
- Named Entity Recognition
- Topic Modeling
- Tafsir Al-Tabari
- Natural Language Processing