Abstract
Many Natural Language Processing (NLP) tasks are labeled on the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered to be a solved problem, gold tokenization is commonly assumed. In this paper, we propose an efficient method for tokenization with subword-based language models, and reflect on the status of performance on the tokenization task by evaluating on 122 languages in 20 different scripts. We show that our proposed model performs on par with the state-of-the-art, and that tokenization performance is mainly dependent on the amount and consistency of annotated data. We conclude that, besides inconsistencies in the data and exceptional cases, the task can be considered solved for Latin-script languages in in-dataset settings (>99.5 F1). However, performance is 0.75 F1 points lower on average for datasets in other scripts, and performance deteriorates in cross-dataset setups.
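The abstract frames tokenization as token-level labeling with a subword-based language model but does not spell out the architecture. The sketch below illustrates the general idea only, not the authors' implementation: a pretrained multilingual subword encoder with a two-class head that marks each subword piece as either starting a new token or continuing the current one. The model name `xlm-roberta-base`, the label scheme, and the helper `predict_tokens` are illustrative assumptions, and the classification head would need fine-tuning on tokenization-annotated data before its predictions are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "xlm-roberta-base"  # assumption: any multilingual subword LM backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# Two labels: 1 = "this subword starts a new token", 0 = "continuation".
# The head is randomly initialized here; it must be fine-tuned on
# tokenization-annotated data before use.
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

def predict_tokens(text: str) -> list[str]:
    """Read token boundaries off per-subword boundary predictions."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()  # char span per subword
    with torch.no_grad():
        labels = model(**enc).logits.argmax(dim=-1)[0].tolist()
    spans = []  # character spans of predicted tokens
    for (begin, end), label in zip(offsets, labels):
        if begin == end:
            continue  # special symbols like <s> have empty character spans
        if label == 1 or not spans:
            spans.append([begin, end])  # boundary: start a new token
        else:
            spans[-1][1] = end          # continuation: extend current token
    return [text[b:e] for b, e in spans]

print(predict_tokens("Gold tokenization isn't always available."))
```

In this framing, evaluation would compare predicted character spans against gold token spans, which is consistent with the span-level F1 numbers the abstract reports.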
Original language | English |
---|---|
Title | Findings of the Association for Computational Linguistics: EACL 2024 |
Editors | Yvette Graham, Matthew Purver |
Number of pages | 20 |
Place of publication | St. Julian's, Malta |
Publisher | Association for Computational Linguistics |
Publication date | 1 Mar 2024 |
Pages | 118-137 |
Status | Published - 1 Mar 2024 |