Python, Performance, and Natural Language Processing

Aleksandr Drozd, Anna Gladkova, Satoshi Matsuoka

    Research output: Journal Article or Conference Article in Journal, Conference article, Research, peer-review

    Abstract

    We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps, each of which requires transforming the data into a vastly different format (in our case, raw text to sparse matrices to dense vectors). A Python implementation of each of these steps would require a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
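To make the first transformation in the abstract concrete (raw text to a sparse co-occurrence matrix), here is a minimal single-process sketch. It is not the authors' implementation: the function names `cooccurrence_counts` and `to_coo`, the symmetric window of two tokens, and the toy English sentence are all assumptions chosen for illustration, and the distributed IPython.parallel step is not shown.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=2):
    """Count symmetric co-occurrences within `window` tokens (hypothetical helper)."""
    counts = Counter()
    for i, target in enumerate(tokens):
        # Scan only the right-hand context and record both directions,
        # so each pair of positions is visited exactly once.
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            context = tokens[j]
            counts[(target, context)] += 1
            counts[(context, target)] += 1
    return counts


def to_coo(counts):
    """Turn pair counts into coordinate-format triplets with integer word ids."""
    words = sorted({w for pair in counts for w in pair})
    vocab = {w: k for k, w in enumerate(words)}
    rows, cols, vals = [], [], []
    for (a, b), n in sorted(counts.items()):
        rows.append(vocab[a])
        cols.append(vocab[b])
        vals.append(n)
    return vocab, rows, cols, vals


# Toy input; a real corpus would be streamed and tokenized properly.
tokens = "she read the book and read the letter".split()
counts = cooccurrence_counts(tokens, window=2)
vocab, rows, cols, vals = to_coo(counts)
```

The `rows`, `cols`, and `vals` lists are in the coordinate layout that sparse-matrix libraries commonly accept as input, so the later reduction to dense vectors can start from them. Because the counts for separate chunks of text can simply be added together, this step lends itself to the kind of distribution the abstract describes.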
    Original language: English
    Journal: PyHPC '15
    Pages (from-to): 1:1-1:10
    DOIs
    Publication status: Published - 2015

    Keywords

    • Python implementation
    • Natural language processing
    • Word classification
    • Vector space models
    • Distributed computing
