Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora

Aleksandr Drozd, Anna Gladkova, Satoshi Matsuoka

    Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

    Abstract

    This paper presents a case study of discovering and classifying verbs in large web-corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
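
The idea of counting co-occurrences in a ternary tree can be illustrated with a minimal sketch. This is not the authors' kernel: it is a toy Python version that assumes a symmetric context window and stores each (target, context) pair as a single tab-separated key in a ternary search tree. The names `Node`, `TernaryCounter`, `count_cooccurrences`, and the `window` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Optional

SEP = "\t"  # assumed separator between target and context word in a key


class Node:
    """One character of a key in a ternary search tree."""
    __slots__ = ("ch", "lo", "eq", "hi", "count")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["Node"] = None  # keys whose next char is smaller
        self.eq: Optional["Node"] = None  # keys that continue past this char
        self.hi: Optional["Node"] = None  # keys whose next char is larger
        self.count = 0                    # non-zero only where a full key ends


class TernaryCounter:
    """Maps string keys to integer counts; shared prefixes share nodes."""

    def __init__(self) -> None:
        self.root: Optional[Node] = None

    def add(self, key: str, n: int = 1) -> None:
        if not key:
            raise ValueError("empty key")
        if self.root is None:
            self.root = Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = Node(ch)
                node = node.hi
            else:
                i += 1
                if i == len(key):      # last character matched: bump the count
                    node.count += n
                    return
                if node.eq is None:
                    node.eq = Node(key[i])
                node = node.eq

    def get(self, key: str) -> int:
        node, i = self.root, 0
        while node is not None and key:
            ch = key[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(key):
                    return node.count
                node = node.eq
        return 0                       # key was never inserted


def count_cooccurrences(tokens: List[str], window: int = 2) -> TernaryCounter:
    """Count every (target, context) pair within `window` tokens on either side."""
    counter = TernaryCounter()
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counter.add(target + SEP + tokens[j])
    return counter
```

Because all pairs with the same target word share the path for that word, the tree avoids storing the target string once per pair, which is one plausible source of the memory savings the abstract describes. A production kernel would presumably be written in a compiled language and tuned well beyond this sketch.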
    Original language: English
    Title of host publication: Proceedings of 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS)
    Number of pages: 8
    Publication date: 2015
    Pages: 61-68
    Publication status: Published - 2015

    Keywords

    • Verb Classification
    • Large Web-Corpora
    • Co-occurrence Extraction
    • Vector Space Models
    • Natural Language Processing
