The Problems of LLM-generated Data in Social Science Research

Luca Rossi, Katherine Harrison, Irina Shklovski

Research output: Journal Article or Conference Article in Journal › Journal article › Research › peer-review

Abstract

Beyond being used as fast and cheap annotators for otherwise complex classification tasks, LLMs have seen a growing adoption for generating synthetic data for social science and design research. Researchers have used LLM-generated data for data augmentation and prototyping, as well as for direct analysis where LLMs acted as proxies for real human subjects. LLM-based synthetic data build on fundamentally different epistemological assumptions than previous synthetically generated data and are justified by a different set of considerations. In this essay, we explore the various ways in which LLMs have been used to generate research data and consider the underlying epistemological (and accompanying methodological) assumptions. We challenge some of the assumptions made about LLM-generated data, and we highlight the main challenges that social sciences and humanities need to address if they want to adopt LLMs as synthetic data generators.
Original language: English
Journal: Sociologica
Volume: 18
Issue number: 2
Pages (from-to): 145-168
Number of pages: 24
ISSN: 1971-8853
Publication status: Published - 30 Oct 2024

Keywords

  • LLM
  • synthetic data
  • social science
  • research methods
