One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations
Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review
We present 50-8-8, a new data set for the outlier identification task. It avoids limitations of the original data set, such as ambiguous words, infrequent words, and multi-word tokens, and increases the number of test cases. The data set contains both semantic and syntactic tests and is multilingual (English, German, and Italian).
We provide an in-depth analysis of word embedding models across a range of hyper-parameters. The analysis shows which models and hyper-parameter settings suit which tasks, and that German and Italian are harder to represent than English.
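For readers unfamiliar with the outlier identification task, the following is a minimal sketch of one common formulation: given a group of words, the outlier is the word whose vector has the lowest mean cosine similarity to the others. This is not necessarily the exact scoring used in the paper. The function names (`cosine`, `find_outlier`) and the three-dimensional toy vectors are hypothetical, invented purely for illustration, and are not taken from the 50-8-8 data set or from any trained embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_outlier(words, vectors):
    """Return the word with the lowest mean cosine similarity to the rest."""

    def mean_similarity(word):
        others = [w for w in words if w != word]
        total = sum(cosine(vectors[word], vectors[o]) for o in others)
        return total / len(others)

    return min(words, key=mean_similarity)


# Hypothetical toy vectors (assumption: made up for this example only).
toy_vectors = {
    "apple": [0.90, 0.10, 0.00],
    "pear": [0.80, 0.20, 0.10],
    "plum": [0.85, 0.15, 0.05],
    "car": [0.10, 0.90, 0.30],
}

outlier = find_outlier(list(toy_vectors), toy_vectors)
print(outlier)
```

In practice the toy dictionary would be replaced by vectors looked up from a trained non-contextual embedding model, and accuracy would be the fraction of test groups in which the predicted outlier matches the labelled one.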
| Title of host publication | Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing and the 10th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) |
| Number of pages | 11 |
| Publisher | Association for Computational Linguistics |
| Publication date | Nov 2020 |
| Publication status | Published - Nov 2020 |