Scalable and Robust Set Similarity Join

Tobias Lybecker Christiani, Rasmus Pagh, Johan von Tangen Sivertsen

Publication: Conference article in proceedings or book/report chapter; Conference contribution in proceedings; Research; Peer-reviewed

Abstract

Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important - indeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
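
To make the abstract's terms concrete, here is a minimal sketch of an exact set similarity self-join under Jaccard similarity, with a helper that measures the recall of an approximate result against it. This is a brute-force baseline for illustration only, not the randomized algorithm of the paper; the function names `jaccard`, `exact_join`, and `recall`, and the toy data, are assumptions introduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def exact_join(sets, threshold):
    """Brute-force self-join: every index pair (i, j), i < j,
    whose Jaccard similarity is at least `threshold`."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    }


def recall(reported, exact):
    """Fraction of the exact join result that `reported` contains."""
    if not exact:
        return 1.0
    return len(reported & exact) / len(exact)


# Hypothetical toy data: four small token sets.
sets = [{1, 2, 3, 4}, {2, 3, 4, 5}, {1, 2, 3, 4, 5}, {7, 8, 9}]
exact = exact_join(sets, threshold=0.5)

# An approximate method that misses one qualifying pair has recall 2/3.
approximate = {(0, 2), (1, 2)}
print(sorted(exact), recall(approximate, exact))
```

The brute-force join compares all pairs and so takes quadratic time, which is the cost that both prefix-filtering methods and approximate methods aim to avoid.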
Original language: English
Title: Proceedings of IEEE 34th International Conference on Data Engineering (ICDE)
Publisher: IEEE
Publication date: 2018
ISBN (electronic): 978-1-5386-5520-7
DOI
Status: Published - 2018
Series name: Proceedings of the International Conference on Data Engineering
ISSN: 1063-6382

Keywords

  • Set similarity join
  • Randomized algorithm
  • Jaccard similarity
  • High-dimensional sketching
  • Prefix-filtering

  • SSS: Scalable Similarity Search

    Pagh, R. (PI), Christiani, T. L. (CoI), Pham, N. D. (CoI), Faithfull, A. (CoI), Silvestri, F. (CoI), Mikkelsen, J. W. (CoI), Sivertsen, J. V. T. (CoI), Aumüller, M. (CoI), Skala, M. (CoI), Ceccarello, M. (CoI), Themsen, R. (CoI), Jacob, R. (CoI), McCauley, S. (CoI) & Ahle, T. D. (CoI)

    European Commission

    01/05/2014 to 30/04/2019

    Projects: Project, Research
