When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish

Martin Kroon, Masha Medvedeva, Barbara Plank

    Publication: Conference article in proceedings or book/report chapter; Conference contribution in proceedings; Research; Peer-reviewed

    Abstract

    In this paper we present the results of our participation in the Discriminating between Dutch and Flemish in Subtitles shared task at VarDial 2018. We try techniques proven to work well for discriminating between language varieties and explore the potential of syntactic features, i.e. hierarchical syntactic subtrees. We experiment with different combinations of features. Discriminating between these two languages turned out to be a very hard task, and not only for a machine: human performance reaches an F1 score of only around 0.51. Our best system is still a simple Naive Bayes model with word unigrams and bigrams. The system achieved a macro F1 score of 0.62, which ranked us 4th in the shared task.
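As an illustration of the kind of model the abstract describes, the following is a minimal sketch of a multinomial Naive Bayes classifier over word unigrams and bigrams, together with a macro-averaged F1 score. It is not the authors' implementation: the whitespace tokenization, the add-one smoothing (`alpha`), the label names `NL` and `BE`, and the toy sentences are all assumptions made for this example and are not taken from the shared-task data.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n_values=(1, 2)):
    """Lowercase, split on whitespace, and return word unigrams and bigrams."""
    tokens = text.lower().split()
    feats = []
    for n in n_values:
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feat_counts[label].update(word_ngrams(text))
        self.vocab = {f for counts in self.feat_counts.values() for f in counts}
        return self

    def predict(self, text):
        # n-grams never seen in training are ignored.
        feats = [f for f in word_ngrams(text) if f in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[label] / total_docs)
            score += sum(math.log((counts[f] + self.alpha) / denom) for f in feats)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data, invented for this sketch.
train_texts = ["ik vind het heel leuk", "dat is echt mooi",
               "ik vind het heel plezant", "dat is echt schoon"]
train_labels = ["NL", "NL", "BE", "BE"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("het is plezant"))
print(macro_f1(["NL", "BE", "NL", "BE"], ["NL", "BE", "BE", "BE"]))
```

Macro averaging gives each class equal weight regardless of how many examples it has, which is why it is a common choice for reporting results on classification tasks like this one.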
    Original language: English
    Title: Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
    Publisher: Association for Computational Linguistics
    Publication date: 2018
    Pages: 244-253
    ISBN (Print): 978-1-948087-55-1
    Status: Published - 2018

    Keywords

    • Dutch-Flemish Discrimination
    • VarDial 2018
    • Syntactic Features
    • Naive Bayes Classifier
    • Language Variety Classification
