Abstract
This paper presents RuSentiment, a new dataset for sentiment analysis of social media posts in Russian, and a new set of comprehensive annotation guidelines that are extensible to other languages. RuSentiment is currently the largest dataset in its class for Russian: it contains 31,185 posts with 3 annotations per post, and inter-annotator agreement, measured by Fleiss’ kappa, is 0.58. To diversify the dataset, 6,950 posts were pre-selected with an active learning-style strategy. We report baseline classification results, and we also release the best-performing embeddings trained on 3.2B tokens of Russian VKontakte posts.
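The agreement figure in the abstract is a Fleiss' kappa over three annotations per post. The following minimal sketch shows how that statistic is computed in general. The function name, the label names, and the toy ratings are illustrative assumptions; they are not the authors' code, the actual RuSentiment label set, or data from the corpus.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Compute Fleiss' kappa.

    ratings: list of per-item label lists; every item must carry the same
             number of labels (e.g. 3 annotations per post).
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two annotations per item are required")
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of annotations")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: for each item, the share of annotator pairs that agree.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is ever used")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels and toy annotations, for illustration only.
labels = ["positive", "negative", "neutral"]
toy = [
    ["positive", "positive", "negative"],
    ["negative", "negative", "negative"],
    ["positive", "neutral", "neutral"],
    ["neutral", "neutral", "neutral"],
]
kappa = fleiss_kappa(toy, labels)
```

Kappa is 1 when all annotators agree on every item and approaches 0 when agreement is no better than chance, so a value such as 0.58 lies between the two.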
Original language | English |
---|---|
Title of host publication | Proceedings of the 27th International Conference on Computational Linguistics |
Number of pages | 9 |
Place of Publication | Santa Fe, New Mexico, USA |
Publisher | Association for Computational Linguistics |
Publication date | 2018 |
Pages | 755-763 |
Publication status | Published - 2018 |
Keywords
- RuSentiment dataset
- sentiment analysis
- social media
- Russian language
- annotation guidelines