Abstract
This paper presents a case study of discovering and classifying verbs in large web-corpora. Many tasks in natural language processing require corpora containing billions of words, and at such volumes co-occurrence extraction becomes one of the performance bottlenecks in the Vector Space Models of computational linguistics. We propose a co-occurrence extraction kernel based on ternary trees as an alternative (or a complementary stage) to the conventional map-reduce-based approach; this kernel achieves an order-of-magnitude improvement in memory footprint and processing speed. Our classifier successfully and efficiently identified verbs in a 1.2-billion-word untagged corpus of Russian fiction and distinguished between their two aspectual classes. The model proved efficient even for low-frequency vocabulary, including nonce verbs and neologisms.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of 2015 IEEE International Conference on Data Science and Data Intensive Systems (DSDIS) |
| Number of pages | 8 |
| Publication date | 2015 |
| Pages | 61-68 |
| Publication status | Published - 2015 |
Keywords
- Verb Classification
- Large Web-Corpora
- Co-occurrence Extraction
- Vector Space Models
- Natural Language Processing
Fingerprint
Dive into the research topics of 'Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora'. Together they form a unique fingerprint.