TY - GEN
T1 - Python, Performance, and Natural Language Processing
AU - Drozd, Aleksandr
AU - Gladkova, Anna
AU - Matsuoka, Satoshi
PY - 2015
Y1 - 2015
N2 - We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification using vector space model methodology. Problems in the area of natural language processing are typically solved in many steps that require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors). A Python implementation of each of these steps would require a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
AB - We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification using vector space model methodology. Problems in the area of natural language processing are typically solved in many steps that require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors). A Python implementation of each of these steps would require a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
KW - Python implementation
KW - Natural language processing
KW - Word classification
KW - Vector space models
KW - Distributed computing
U2 - 10.1145/2835857.2835858
DO - 10.1145/2835857.2835858
M3 - Conference article
SP - 1:1
EP - 1:10
JO - PyHPC '15
JF - PyHPC '15
ER -