Projekter pr. år
Abstract
Set similarity join is a fundamental and wellstudied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important  indeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present stateoftheart exact methods are based on prefixfiltering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than stateoftheart exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in highdimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
Originalsprog  Engelsk 

Titel  Proceedings of IEEE 34th International Conference on Data Engineering (ICDE) 
Forlag  IEEE 
Publikationsdato  2018 
ISBN (Elektronisk)  9781538655207 
DOI  
Status  Udgivet  2018 
Navn  Proceedings of the International Conference on Data Engineering 

ISSN  10636382 
Emneord
 Set similarity join
 Randomized algorithm
 Jaccard similarity
 Highdimensional sketching
 Prefixfiltering
Fingeraftryk
Dyk ned i forskningsemnerne om 'Scalable and Robust Set Similarity Join'. Sammen danner de et unikt fingeraftryk.Projekter
 1 Afsluttet

SSS: Scalable Similarity Search
Pagh, R. (PI), Christiani, T. L. (CoI), Pham, N. D. (CoI), Faithfull, A. (CoI), Silvestri, F. (CoI), Mikkelsen, J. W. (CoI), Sivertsen, J. V. T. (CoI), Aumüller, M. (CoI), Skala, M. (CoI), Ceccarello, M. (CoI), Themsen, R. (CoI), Jacob, R. (CoI), McCauley, S. (CoI) & Ahle, T. D. (CoI)
01/05/2014 → 30/04/2019
Projekter: Projekt › Forskning