Abstract
Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
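The exclusive-or property stated in the abstract can be illustrated with a minimal sketch. The construction below is an assumption made for illustration and is not taken from the paper's text: each element is hashed to one of `n_bits` positions and flips that bit, so a bit records the parity of the elements landing on it. The function names, the SHA-256 hash, and the 64-bit width are all hypothetical choices. The estimator is likewise an assumption: it inverts the standard expected number of odd bins when m items are thrown into n bins, and it is not claimed to match the paper's exact estimator.

```python
import hashlib
import math


def odd_sketch(items, n_bits=64):
    """Build a parity sketch: bit i is 1 iff an odd number of distinct
    items hash to position i. (Hypothetical helper for illustration.)"""
    sketch = 0
    for item in set(items):
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        position = int.from_bytes(digest[:8], "big") % n_bits
        sketch ^= 1 << position  # flip the bit at the hashed position
    return sketch


def estimate_symmetric_difference(sketch_xor, n_bits=64):
    """Estimate |A symmetric-difference B| from the XOR of two sketches.

    Assumption: with m items in n bins, the expected number of odd bins is
    n * (1 - (1 - 2/n)**m) / 2; this solves that expression for m.
    """
    z = bin(sketch_xor).count("1")  # number of set bits
    if 2 * z >= n_bits:
        return float("inf")  # sketch is saturated; no finite estimate
    return math.log(1 - 2 * z / n_bits) / math.log(1 - 2 / n_bits)


a = {"apple", "banana", "cherry", "date"}
b = {"banana", "cherry", "date", "elderberry"}

# Shared elements flip the same bit in both sketches and cancel under XOR,
# so only the symmetric difference remains.
combined = odd_sketch(a) ^ odd_sketch(b)
assert combined == odd_sketch(a ^ b)
```

Because shared elements cancel, the number of set bits in `combined` depends only on the symmetric difference, which is why such a sketch is well suited to sets of high similarity.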
Original language | English |
---|---|
Title | Proceedings of the 23rd international conference on World wide web : WWW '14 |
Number of pages | 10 |
Publisher | Association for Computing Machinery |
Publication date | 2014 |
Pages | 109-118 |
ISBN (Electronic) | 978-1-4503-2744-2 |
DOI | |
Status | Published - 2014 |
Keywords
- Set Similarity
- Odd Sketch
- Jaccard Similarity
- Symmetric Difference
- Web Duplicate Detection
- Collaborative Filtering
- Association Rule Learning
- Weighted Jaccard Similarity
- TF-IDF Vector Comparison
- Minwise Hashing
Activities
- 1 Other (prizes, external teaching and other). - Prizes, scholarships, appointments
-
Best Paper Award
Pham, N. D. (Participant)
7 Apr 2014 → 11 Apr 2014
Activity: Other activity types › Other (prizes, external teaching and other). - Prizes, scholarships, appointments
Press/Media
-
University student invents algorithm speeds up internet searches
Pham, N. D.
27/11/2014
1 item of Media coverage
Press/Media
-
Opfindelse fra Danmark gør computere hurtigere (Invention from Denmark makes computers faster)
Pham, N. D.
26/11/2014
1 item of Media coverage
Press/Media
Projects
- 1 Finished
-
MaDaMS: Massive Data Mining by Sampling
Pagh, R. (PI), Stöckel, M. (CoI) & Pham, N. D. (CoI)
01/01/2011 → 31/12/2014
Projects: Project › Research