Abstract
ABSTRACT
Similarity joins are a basic primitive in data mining. Given two
sets of points, we are interested in reporting all pairs of points
whose similarity is above a user-defined threshold. Solving the
problem naively entails verifying all possible pairs, which can
be infeasible for large inputs. In such contexts, Locality Sensitive Hashing (LSH) is often considered to reduce the number of
pairs to verify. However, while it provides subquadratic running
time, large input sets make it nevertheless necessary to resort
to distributed computing. Hu, Yi, and Tao (PODS’17, TODS’19)
proposed a nearly load-optimal LSH-based join algorithm and
provided a small-scale experimental study in a distributed setting.
This paper provides further analysis of their approach. It shows
that the load-minimizing parameter settings by Hu et al. incur
too much local work, rendering it impractical.
To remedy this drawback, we propose two approaches: The
first distributes work in a data-independent way, while the second
adapts to the data distribution using LSH. Both schemes then use
LSH to solve subproblems locally. This allows to balance load
and the amount of local work.
Through an experimental evaluation, we show that the transition from theory to practice for Hu et al.’s approach is challenging: it is hard to strike a good tradeoff between the load and the
amount of local work of each processor, load balancing is itself
an issue, and LSH may introduce duplicates in the output. Our
extensive experimental evaluation is supported by an efficient
open source implementation of all the methods we test.
Our results highlight the need for an holistic approach: only
focusing on the load, as tradition in the MPC model, might not
make efficient use the available resources and better trade-offs
between local work and load are possible.
Similarity joins are a basic primitive in data mining. Given two
sets of points, we are interested in reporting all pairs of points
whose similarity is above a user-defined threshold. Solving the
problem naively entails verifying all possible pairs, which can
be infeasible for large inputs. In such contexts, Locality Sensitive Hashing (LSH) is often considered to reduce the number of
pairs to verify. However, while it provides subquadratic running
time, large input sets make it nevertheless necessary to resort
to distributed computing. Hu, Yi, and Tao (PODS’17, TODS’19)
proposed a nearly load-optimal LSH-based join algorithm and
provided a small-scale experimental study in a distributed setting.
This paper provides further analysis of their approach. It shows
that the load-minimizing parameter settings by Hu et al. incur
too much local work, rendering it impractical.
To remedy this drawback, we propose two approaches: The
first distributes work in a data-independent way, while the second
adapts to the data distribution using LSH. Both schemes then use
LSH to solve subproblems locally. This allows to balance load
and the amount of local work.
Through an experimental evaluation, we show that the transition from theory to practice for Hu et al.’s approach is challenging: it is hard to strike a good tradeoff between the load and the
amount of local work of each processor, load balancing is itself
an issue, and LSH may introduce duplicates in the output. Our
extensive experimental evaluation is supported by an efficient
open source implementation of all the methods we test.
Our results highlight the need for an holistic approach: only
focusing on the load, as tradition in the MPC model, might not
make efficient use the available resources and better trade-offs
between local work and load are possible.
Original language | English |
---|---|
Title of host publication | EDBT : International Conference on Extending Database Technology |
Number of pages | 13 |
Publication date | 2022 |
DOIs | |
Publication status | Published - 2022 |
Keywords
- Similarity Joins
- Locality Sensitive Hashing (LSH)
- Distributed Computing
- Load Balancing
- Data Mining Methods