Python, Performance, and Natural Language Processing

Aleksandr Drozd, Anna Gladkova, Satoshi Matsuoka

    Publication: Journal or conference article, Conference article, Research, Peer-reviewed

    Abstract

    We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps that require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors). A Python implementation of each of these steps requires a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of a co-occurrence extraction module using an IPython.parallel cluster.
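
    As an illustration of the co-occurrence extraction step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a symmetric context window and treats each sentence as an independent chunk; the names `cooccurrences`, `merge`, and `window`, and the toy English corpus, are hypothetical. The per-chunk counting is the part a cluster would distribute across workers; here the built-in `map` runs it serially.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count (word, context) pairs within a symmetric window over one chunk."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                counts[(word, tokens[j])] += 1
    return counts


def merge(partials):
    """Sum the per-chunk counters into one global counter (the reduce step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Toy corpus (assumption): each sentence is one independent chunk.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# Map step: a parallel map over workers could replace the built-in map here.
partials = map(cooccurrences, sentences)
counts = merge(partials)
```

    Because windows never cross chunk boundaries in this sketch, the partial counts can be computed in any order and summed afterwards. The resulting pair counts could then be loaded into a sparse matrix for the later steps of the workflow.
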
    Original language: English
    Journal: PyHPC '15
    Pages (from-to): 1:1-1:10
    DOI
    Status: Published - 2015

    Keywords

    • Python implementation
    • Natural language processing
    • Word classification
    • Vector space models
    • Distributed computing
