Where are we Still Split on Tokenization?

Research output: Conference article in proceeding or book/report chapter › Article in proceedings › Research › peer-review

Abstract

Many Natural Language Processing (NLP) tasks are labeled on the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered to be a solved problem, gold tokenization is commonly assumed. In this paper, we propose an efficient method for tokenization with subword-based language models, and reflect on the status of performance on the tokenization task by evaluating on 122 languages in 20 different scripts. We show that our proposed model performs on par with the state of the art, and that tokenization performance is mainly dependent on the amount and consistency of annotated data. We conclude that, besides inconsistencies in the data and exceptional cases, the task can be considered solved for Latin languages in in-dataset settings (>99.5 F1). However, performance is 0.75 F1 point lower on average for datasets in other scripts, and performance deteriorates in cross-dataset setups.
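To make the task concrete: tokenization of the kind evaluated here can be cast as predicting, for each character, whether a new token starts there, and scoring the resulting spans against gold spans with F1. The sketch below is purely illustrative and is not the paper's model; the rule-based boundary heuristic, the example sentence, and the gold spans are all hypothetical stand-ins for a learned (e.g. subword-based language model) predictor and annotated data.

```python
# Illustrative sketch only: tokenization as character-level boundary prediction.
# The heuristic below stands in for whatever classifier scores each character
# as "starts a new token"; it is not the method proposed in the paper.

def predict_boundaries(text):
    """Return one flag per character: True if a token is predicted to start here (toy heuristic)."""
    flags = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else " "
        flags.append((not ch.isspace())
                     and (prev.isspace() or not ch.isalnum() or not prev.isalnum()))
    return flags

def spans_from_boundaries(text, flags):
    """Turn per-character boundary flags into (start, end) token spans."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif flags[i]:
            if start is not None:
                spans.append((start, i))
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans

def span_f1(pred, gold):
    """Token-level F1 over exact span matches, as typically used for tokenization evaluation."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

text = "Don't split me."                                   # hypothetical example
gold = [(0, 2), (2, 5), (6, 11), (12, 14), (14, 15)]       # Do | n't | split | me | .
pred = spans_from_boundaries(text, predict_boundaries(text))
print(pred, "F1 =", round(span_f1(pred, gold), 3))
```

The toy heuristic over-segments the clitic in "Don't", which illustrates why such exceptional cases, rather than plain whitespace splitting, are where the remaining errors concentrate.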
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: EACL 2024
Editors: Yvette Graham, Matthew Purver
Number of pages: 20
Place of publication: St. Julian's, Malta
Publisher: Association for Computational Linguistics
Publication date: 1 Mar 2024
Pages: 118-137
Publication status: Published - 1 Mar 2024
