
Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values

Barrie Kersbergen
Olivier Sprangers
Bojan Karlaš
Maarten de Rijke
Sebastian Schelter

September 22, 2025

Machine learning-powered recommendation systems help users find items they like. Issues in the interaction data processed by these systems frequently lead to problems, e.g., to the accidental recommendation of low-quality products or dangerous items. Such data issues are hard to anticipate upfront, and are typically detected post-deployment after they have already impacted the user experience. We argue that a principled data debugging process is required during which human experts identify potentially hurtful data issues and preemptively mitigate them. Recent notions of “data importance,” such as the Data Shapley Value (DSV), represent a promising direction to identify training data points likely to cause issues. However, the scale of real-world interaction datasets makes it infeasible to apply existing techniques to compute the DSV in recommendation scenarios. We tackle this problem by introducing the KMC-Shapley algorithm for the scalable estimation of Data Shapley Values in neighborhood-based recommendation on sparse interaction data. We conduct an experimental evaluation of the efficiency and scalability of our algorithm on both public and proprietary datasets with millions of interactions, and showcase that the DSV identifies impactful data points for two recommendation tasks in e-commerce. Furthermore, we discuss applications of the DSV on real-world click and purchase data in e-commerce, such as identifying dangerous products or improving the ecological sustainability of product recommendations.
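
To make the notion of data importance concrete: the DSV of a training point is its marginal contribution to a utility function, averaged over all orderings of the training data. The sketch below is a generic Monte Carlo estimator based on sampled permutations. It is not the KMC-Shapley algorithm, and it is meant only to illustrate the quantity being estimated. The function names, the toy interaction labels, and the additive toy utility are all hypothetical. In a recommendation setting, the utility would typically be a ranking metric of the model trained on the given subset.

```python
import random
from typing import Callable, Dict, FrozenSet, Hashable, Sequence


def monte_carlo_shapley(
    points: Sequence[Hashable],
    utility: Callable[[FrozenSet[Hashable]], float],
    num_permutations: int = 200,
    seed: int = 0,
) -> Dict[Hashable, float]:
    """Estimate Data Shapley Values by averaging marginal contributions
    over randomly sampled permutations of the training points."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in points}
    for _ in range(num_permutations):
        order = list(points)
        rng.shuffle(order)
        subset = set()
        prev = utility(frozenset(subset))  # utility of the empty set
        for p in order:
            subset.add(p)
            curr = utility(frozenset(subset))
            totals[p] += curr - prev  # marginal contribution of p
            prev = curr
    return {p: total / num_permutations for p, total in totals.items()}


def toy_utility(subset: FrozenSet[Hashable]) -> float:
    """Hypothetical additive utility: one interaction helps, one hurts."""
    score = 0.0
    if "good_click" in subset:
        score += 1.0
    if "bad_click" in subset:
        score -= 0.5
    return score


values = monte_carlo_shapley(
    ["good_click", "bad_click", "neutral_click"], toy_utility
)
```

Because the toy utility is additive, every permutation yields the same marginal contribution per point, so the estimate assigns a positive value to `good_click`, a negative value to `bad_click`, and zero to `neutral_click`. Each sampled permutation requires one utility evaluation per data point, which is why this naive approach does not scale to datasets with millions of interactions.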