Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

  • University of Cambridge
  • Johns Hopkins University
  • University of Michigan

Research output: Conference article in proceeding or book/report chapter › Article in proceedings › Research › peer-review

Abstract

How do product teams evaluate LLM-powered products? As organizations integrate large language models (LLMs) into digital products, the models' unpredictable behavior makes traditional evaluation approaches inadequate, yet little is known about how practitioners navigate this challenge. Through interviews with nineteen practitioners across diverse sectors, we identify ten evaluation practices, ranging from informal 'vibe checks' to organizational meta-work. Beyond confirming four documented challenges, we identify a novel fifth challenge, which we call the results-actionability gap: practitioners gather evaluation data but cannot translate findings into concrete improvements. Drawing on patterns from successful teams, we contribute strategies to bridge this gap, supporting practitioners' formalization journey from ad-hoc interpretive practices (e.g., vibe checks) toward systematic evaluation. Our analysis suggests these interpretive practices are necessary adaptations to LLM characteristics rather than methodological failures. For HCI researchers, this presents an opportunity to support practitioners in systematizing emerging practices rather than developing new evaluation frameworks.
Original language: English
Title of host publication: Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems
Number of pages: 17
Place of publication: New York, NY, USA
Publisher: Association for Computing Machinery
Publication date: 13 Apr 2026
Pages: 1-17
ISBN (electronic): 979-8-4007-2278-3
DOIs
Publication status: Published - 13 Apr 2026
Event: Conference on Human Factors in Computing Systems, Centre de Convencions Internacional de Barcelona, Barcelona, Spain
Duration: 13 Apr 2026 - 17 Apr 2026
https://chi2026.acm.org/

Conference

Conference: Conference on Human Factors in Computing Systems
Location: Centre de Convencions Internacional de Barcelona
Country/Territory: Spain
City: Barcelona
Period: 13/04/2026 - 17/04/2026
Internet address: https://chi2026.acm.org/

Keywords

  • Large language models (LLMs)
  • Evaluation
  • Industry practice
  • Interview study
  • Best practices