Projects per year
Abstract
Set similarity join, as well as the corresponding indexing problem set similarity search, are fundamental primitives for managing noisy or uncertain data. For example, these primitives can be used in data cleaning to identify different representations of the same object. In many cases one can represent an object as a sparse 01 vector, or equivalently as the set of nonzero entries in such a vector. A set similarity join can then be used to identify those pairs that have an exceptionally large dot product (or intersection, when viewed as sets). We choose to focus on identifying vectors with large Pearson correlation, but results extend to other similarity measures. In particular, we consider the indexing problem of identifying correlated vectors in a set S of vectors sampled from 0,1d. Given a query vector y and a parameter alpha in (0,1), we need to search for an alphacorrelated vector x in a data structure representing the vectors of S. This kind of similarity search has been intensely studied in worstcase (nonrandom data) settings. Existing theoretically wellfounded methods for set similarity search are often inferior to heuristics that take advantage of skew in the data distribution, i.e., widely differing frequencies of 1s across the d dimensions. The main contribution of this paper is to analyze the set similarity problem under a random data model that reflects the kind of skewed data distributions seen in practice, allowing theoretical results much stronger than what is possible in worstcase settings. Our indexing data structure is a recursive, datadependent partitioning of vectors inspired by recent advances in set similarity search. Previous datadependent methods do not seem to allow us to exploit skew in item frequencies, so we believe that our work sheds further light on the power of data dependence.
Original language  English 

Title of host publication  Proceedings of Principles of Database Systems (PODS) 
Publisher  Association for Computing Machinery 
Publication date  2018 
ISBN (Electronic)  9781450347068 
DOIs  
Publication status  Published  2018 
Keywords
 Set similarity join
 Indexing problem
 Data cleaning
 Pearson correlation
 Skewed data distributions
Fingerprint
Dive into the research topics of 'Set Similarity Search for Skewed Data'. Together they form a unique fingerprint.Projects
 1 Finished

SSS: Scalable Similarity Search
Pagh, R. (PI), Christiani, T. L. (CoI), Pham, N. D. (CoI), Faithfull, A. (CoI), Silvestri, F. (CoI), Mikkelsen, J. W. (CoI), Sivertsen, J. V. T. (CoI), Aumüller, M. (CoI), Skala, M. (CoI), Ceccarello, M. (CoI), Themsen, R. (CoI), Jacob, R. (CoI), McCauley, S. (CoI) & Ahle, T. D. (CoI)
01/05/2014 → 30/04/2019
Project: Research