Abstract
While there has been a lot of work on finding frequent itemsets in transaction data streams, none of these solve the problem of finding similar pairs according to standard similarity measures.
This paper is a first attempt at dealing with this, arguably more important, problem.
We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets:
Any algorithm that (even only approximately and with a chance of error) finds the most frequent $k$-itemset must use space $\Omega(\min\{mb,n^k,(mb/\varphi)^k\})$ bits, where $mb$ is the number of items in the stream so far, $n$ is the number of distinct items and $\varphi$ is a support threshold.
To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream.
We work under the model that the transactions come in random order, and show that surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy {\em improves\/} with the length of the stream for any fixed support threshold.
This paper is a first attempt at dealing with this, arguably more important, problem.
We start out with a negative result that also explains the lack of theoretical upper bounds on the space usage of data mining algorithms for finding frequent itemsets:
Any algorithm that (even only approximately and with a chance of error) finds the most frequent $k$-itemset must use space $\Omega(\min\{mb,n^k,(mb/\varphi)^k\})$ bits, where $mb$ is the number of items in the stream so far, $n$ is the number of distinct items and $\varphi$ is a support threshold.
To achieve any non-trivial space upper bound we must thus abandon a worst-case assumption on the data stream.
We work under the model that the transactions come in random order, and show that surprisingly, not only is small-space similarity mining possible for the most common similarity measures, but the mining accuracy {\em improves\/} with the length of the stream for any fixed support threshold.
Originalsprog | Engelsk |
---|---|
Titel | KDCloud 2010 : Proceedings of the International Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms |
Antal sider | 8 |
Forlag | IEEE Computer Society Press |
Publikationsdato | 14 dec. 2010 |
Status | Udgivet - 14 dec. 2010 |
Begivenhed | IEEE International Conference on Data Mining - Varighed: 2 jul. 2010 → … Konferencens nummer: 9 |
Konference
Konference | IEEE International Conference on Data Mining |
---|---|
Nummer | 9 |
Periode | 02/07/2010 → … |
Emneord
- frequent itemsets
- transaction data streams
- similarity measures
- space complexity
- random order transactions