Abstract
In this thesis we explore different uses of techniques for approximate nearest neighbor search (ANN search). Finding a set of close points is necessary for many diverse tasks such as clustering and outlier detection, but finding the exact closest points with guarantee can be expensive, often quadratic in running time complexity. Considering the increasing size of data available, this is a problem that will only get more relevant as time goes on.
ANN search can be used to speed up the process of finding near neighbors at the cost of accuracy by introducing a degree of approximation.
We introduce implementations of algorithms using Locality Sensitive Hashing (LSH) and Hierarchical Navigable Small Worlds (HNSW) for ANN search for different tasks and explore the tradeoffs they provide between accuracy and running time. The main contributions of the thesis are:
• We introduce an implementation of an LSH-based algorithm for approximate DBSCAN clustering. Fast algorithms for DBSCAN clustering exist in literature, but they often lack theoretical guarantees or rely on the dataset being low-dimensional to work optimally. The proposed algorithm was compared to other
state-of-the-art DBSCAN algorithms. We show that while other algorithms failed in some settings for different reasons, ours was consistently either the fastest implementation or performing competitively.
• We explore the benefits and limitations of HNSW graphs for outlier detection. We evaluate direct implementations that compute thekNNand LOF scores fromapproximate neighborhoods aswell as white-box methods. The white-box methods compute outlier-scores directly from the underlying graph and serve as
an example to break newground to make more task-aware tools based on ANN search techniques.
• We present three algorithms for approximate Single-Linkage clustering. As with DBSCAN, the issue lies in the quadratic scaling of the problem. All three utilize HNSW graphs to varying degrees of approximation. We explore their respective tradeoffs in accuracy vs. running time as well as compare how they
perform against exact implementations. One of the algorithms in particular explores recent methods using heap-of-searchers for incremental ANN search.
ANN search can be used to speed up the process of finding near neighbors at the cost of accuracy by introducing a degree of approximation.
We introduce implementations of algorithms using Locality Sensitive Hashing (LSH) and Hierarchical Navigable Small Worlds (HNSW) for ANN search for different tasks and explore the tradeoffs they provide between accuracy and running time. The main contributions of the thesis are:
• We introduce an implementation of an LSH-based algorithm for approximate DBSCAN clustering. Fast algorithms for DBSCAN clustering exist in literature, but they often lack theoretical guarantees or rely on the dataset being low-dimensional to work optimally. The proposed algorithm was compared to other
state-of-the-art DBSCAN algorithms. We show that while other algorithms failed in some settings for different reasons, ours was consistently either the fastest implementation or performing competitively.
• We explore the benefits and limitations of HNSW graphs for outlier detection. We evaluate direct implementations that compute thekNNand LOF scores fromapproximate neighborhoods aswell as white-box methods. The white-box methods compute outlier-scores directly from the underlying graph and serve as
an example to break newground to make more task-aware tools based on ANN search techniques.
• We present three algorithms for approximate Single-Linkage clustering. As with DBSCAN, the issue lies in the quadratic scaling of the problem. All three utilize HNSW graphs to varying degrees of approximation. We explore their respective tradeoffs in accuracy vs. running time as well as compare how they
perform against exact implementations. One of the algorithms in particular explores recent methods using heap-of-searchers for incremental ANN search.
| Originalsprog | Engelsk |
|---|---|
| Vejleder(e) |
|
| Status | Udgivet - 2026 |