Abstract
In the realm of Computational Social Science (CSS), practitioners often navigate complex, low-resource domains and face the costly and time-intensive challenges of acquiring and annotating data. We aim to establish a set of guidelines to address such challenges, comparing the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks of varying complexity. Additionally, we examine the impact of training data sizes on performance. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts. Nevertheless, synthetic augmentation proves beneficial, particularly in improving performance on rare classes within multi-class tasks. Furthermore, we leverage GPT-4 and Llama-2 for zero-shot classification and find that, while they generally display strong performance, they often fall short when compared to specialized classifiers trained on moderately sized training sets.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics |
| Place of Publication | St. Julian's, Malta |
| Publisher | Association for Computational Linguistics |
| Publication date | Mar 2024 |
| Pages | 179-192 |
| Publication status | Published - Mar 2024 |
| Event | Conference of the European Chapter of the Association for Computational Linguistics - St. Julian's, Malta; Duration: 17 Mar 2024 → 22 Mar 2024; Conference number: 18; https://dblp.org/db/conf/eacl/eacl2024-2.html https://aclanthology.org/volumes/2024.eacl-long/ https://dblp.org/db/conf/eacl/eacl2024f.html |
Conference
| Conference | Conference of the European Chapter of the Association for Computational Linguistics |
|---|---|
| Number | 18 |
| Country/Territory | Malta |
| City | St. Julian's |
| Period | 17/03/2024 → 22/03/2024 |
Keywords
- Computational Social Science
- Data Annotation
- Synthetic Data Augmentation
- Zero-shot Classification
- Text Classification
Fingerprint
Dive into the research topics of 'The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks'. Together they form a unique fingerprint.