High-dimensional density-based clustering using locality-sensitive hashing

Camilla Birch Okkels, Martin Aumüller, Viktor Bello Thomsen, Arthur Zimek

Publikation: Konference artikel i Proceeding eller bog/rapport kapitelKonferencebidrag i proceedingsForskningpeer review

Abstract

The DBSCAN algorithm is a popular density-based clustering method to find clusters of arbitrary shapes without requiring an initial guess on the number of clusters. While there are methods to run DBSCAN efficiently in low-dimensional data in near-linear time, there remains a need for an efficient DBSCAN algorithm that scales to high-dimensional data. The bottleneck in highdimensional data is that the range queries necessary in carrying out the algorithm suffer from the curse of dimensionality. In this paper we present the SRRDBSCAN algorithm. This algorithm is an implementation of approximate DBSCAN using locality-sensitive hashing. We prove sub-quadratic running time bounds under reasonable assumptions about the data. An important ingredient in the design of the data structure is the use of a multi-level LSH data structure, which automatically adapts to the density of data points. An extensive empirical analysis shows that the approximation does not significantly impact the quality of the clustering found by the algorithm as compared to the exact DBSCAN clustering. Moreover, our algorithm is competitive with other approaches even in low-dimensional settings, and thus provides a general-purpose DBSCAN implementation for arbitrary data.
OriginalsprogEngelsk
TitelAdvances in Database Technology - EDBT
Antal sider13
Publikationsdato2025
Sider694-706
DOI
StatusUdgivet - 2025

Emneord

  • Clustering
  • approximate nearest neighbor algorithms
  • DBSCAN
  • Locality-Sensitive Hashing (LSH)

Fingeraftryk

Dyk ned i forskningsemnerne om 'High-dimensional density-based clustering using locality-sensitive hashing'. Sammen danner de et unikt fingeraftryk.

Citationsformater