Query-level evaluation metrics such as nDCG, which originate from the field of Information Retrieval (IR), have seen widespread adoption in the Recommender Systems (RS) community for comparing the quality of different ranked lists of recommendations with different levels of relevance to the user. However, the traditional (offline) RS evaluation paradigm is typically restricted to evaluating a single results list. In contrast, IR researchers have also developed evaluation metrics over the past decade for the session-based evaluation of more complex search tasks. Here, the sessions consist of multiple queries and multi-round search interactions, and the metrics evaluate the quality of the session as a whole. Despite the popularity of the more traditional single-list evaluation paradigm, RS can also be used to assist users with complex information access tasks. In this paper, we explore the usefulness of session-level evaluation metrics for evaluating and comparing the performance of both recommender systems and search engines. We show that, despite possible misconceptions that comparing the two scenarios is akin to comparing apples to oranges, it is indeed possible to compare recommendation results from a single ranked list to the results from a whole search session. In doing so, we address the following questions: (1) how can we fairly and realistically compare the quality of an individual list of recommended items to the quality of an entire manual search session; and (2) how can we measure the contribution that the RS makes to the entire search session? We contextualize our claims by focusing on a particular complex search scenario: the problem of talent search. An example of professional search, talent search involves recruiters searching for relevant candidates for a specific job posting by issuing multiple queries in the course of a search session.
We show that it is possible to compare the search behavior and success of recruiters to that of a matchmaking recommender system that generates a single ranked list of relevant candidates for a given job posting. In particular, we adopt a session-based metric from IR and motivate how it can be used to perform valid and realistic comparisons of recommendation lists to multiple-query search sessions.
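To make the query-level baseline concrete, the sketch below shows how nDCG, the metric named above, scores a single ranked list of graded relevance labels. This is a minimal illustration of the standard metric, not code from the paper; the relevance values are hypothetical.

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain of one ranked list of graded relevance labels."""
    return sum((2 ** rel - 1) / log2(rank + 2)
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalised DCG: DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels (0-3) for a single recommendation list:
print(round(ndcg([3, 2, 3, 0, 1, 2]), 4))
```

Session-based metrics generalize this idea by aggregating such per-list gains across the multiple result lists produced in a multi-query session, typically with an additional discount for later queries.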
|Journal||CEUR Workshop Proceedings|
|Status||Published - 2022|
|Event||2022 Perspectives on the Evaluation of Recommender Systems Workshop, PERSPECTIVES 2022 - Seattle, USA|
Duration: 22 Sep 2022 → …