Frequent Pairs in Data Streams: Exploiting Parallelism and Skew

Andrea Campagna, Konstantin Kutzkow, Rasmus Pagh

    Publikation: Konference artikel i Proceeding eller bog/rapport kapitelKonferencebidrag i proceedingsForskningpeer review

    Abstract

    We introduce the Pair Streaming Engine (PairSE) that detects frequent pairs in a data stream of transactions. Our algorithm finds the most frequent pairs with high probability, and gives tight bounds on their frequency. It is particularly space efficient for skewed distribution of pair supports, confirmed for several real-world datasets. Additionally, the algorithm parallelizes easily, which opens up for real-time processing of large transactions. Unlike previous algorithms we make no assumptions on the order of arrival of transactions and pairs. Our algorithm builds upon approaches for frequent items mining in data streams. We show how to efficiently scale these approaches to handle large transactions. We report experimental results showcasing precision and recall of our method. In particular, we find that often our method achieves excellent precision, returning identical upper and lower bounds on the supports of the most frequent pairs.
    OriginalsprogEngelsk
    TitelProceedings of IEEE International Conference on Data Mining Workshops: ICDMW 2011
    ForlagIEEE
    Publikationsdato2011
    Sider145 - 150
    ISBN (Trykt)978-1-4673-0005-6
    DOI
    StatusUdgivet - 2011

    Emneord

    • Frequent pair detection
    • Data stream algorithms
    • Space-efficient computation
    • Parallel processing
    • Skewed distribution handling

    Fingeraftryk

    Dyk ned i forskningsemnerne om 'Frequent Pairs in Data Streams: Exploiting Parallelism and Skew'. Sammen danner de et unikt fingeraftryk.

    Citationsformater