TY - GEN
T1 - High-dimensional density-based clustering using locality-sensitive hashing
AU - Okkels, Camilla Birch
AU - Aumüller, Martin
AU - Thomsen, Viktor Bello
AU - Zimek, Arthur
N1 - DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2025
Y1 - 2025
N2 - The DBSCAN algorithm is a popular density-based clustering method to find clusters of arbitrary shapes without requiring an initial guess on the number of clusters. While there are methods to run DBSCAN efficiently in low-dimensional data in near-linear time, there remains a need for an efficient DBSCAN algorithm that scales to high-dimensional data. The bottleneck in highdimensional data is that the range queries necessary in carrying out the algorithm suffer from the curse of dimensionality. In this paper we present the SRRDBSCAN algorithm. This algorithm is an implementation of approximate DBSCAN using locality-sensitive hashing. We prove sub-quadratic running time bounds under reasonable assumptions about the data. An important ingredient in the design of the data structure is the use of a multi-level LSH data structure, which automatically adapts to the density of data points. An extensive empirical analysis shows that the approximation does not significantly impact the quality of the clustering found by the algorithm as compared to the exact DBSCAN clustering. Moreover, our algorithm is competitive with other approaches even in low-dimensional settings, and thus provides a general-purpose DBSCAN implementation for arbitrary data.
AB - The DBSCAN algorithm is a popular density-based clustering method to find clusters of arbitrary shapes without requiring an initial guess on the number of clusters. While there are methods to run DBSCAN efficiently in low-dimensional data in near-linear time, there remains a need for an efficient DBSCAN algorithm that scales to high-dimensional data. The bottleneck in highdimensional data is that the range queries necessary in carrying out the algorithm suffer from the curse of dimensionality. In this paper we present the SRRDBSCAN algorithm. This algorithm is an implementation of approximate DBSCAN using locality-sensitive hashing. We prove sub-quadratic running time bounds under reasonable assumptions about the data. An important ingredient in the design of the data structure is the use of a multi-level LSH data structure, which automatically adapts to the density of data points. An extensive empirical analysis shows that the approximation does not significantly impact the quality of the clustering found by the algorithm as compared to the exact DBSCAN clustering. Moreover, our algorithm is competitive with other approaches even in low-dimensional settings, and thus provides a general-purpose DBSCAN implementation for arbitrary data.
KW - Clustering
KW - approximate nearest neighbor algorithms
KW - DBSCAN
KW - Locality-Sensitive Hashing (LSH)
UR - https://openproceedings.org/2025/conf/edbt/paper-208.pdf
U2 - 10.48786/EDBT.2025.56
DO - 10.48786/EDBT.2025.56
M3 - Article in proceedings
SP - 694
EP - 706
BT - Advances in Database Technology - EDBT
ER -