Where are we Still Split on Tokenization?

Publication: Conference article in proceedings / book or report chapter – Conference contribution in proceedings – Research – peer reviewed

Abstract

Many Natural Language Processing (NLP) tasks are labeled on the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered to be a solved problem, gold tokenization is commonly assumed. In this paper, we propose an efficient method for tokenization with subword-based language models, and reflect on the status of performance on the tokenization task by evaluating on 122 languages in 20 different scripts. We show that our proposed model performs on par with the state-of-the-art, and that tokenization performance is mainly dependent on the amount and consistency of annotated data. We conclude that, besides inconsistencies in the data and exceptional cases, the task can be considered solved for Latin languages in in-dataset settings (>99.5 F1). However, performance is 0.75 F1 points lower on average for datasets in other scripts, and performance deteriorates in cross-dataset setups.
Original language: English
Title: Findings of the Association for Computational Linguistics: EACL 2024
Editors: Yvette Graham, Matthew Purver
Number of pages: 20
Place of publication: St. Julian's, Malta
Publisher: Association for Computational Linguistics
Publication date: 1 Mar 2024
Pages: 118-137
Status: Published - 1 Mar 2024
