On the Power of Randomization in Big Data Analytics

Ninh Dang Pham

Research output: Ph.D. thesis

Abstract

We are experiencing a big data deluge, a result not only of the internetization and computerization of our society, but also of the fast development of affordable and powerful data collection and storage devices. The explosive growth of data, in both size and form, poses a fundamental challenge for big data analytics: how to efficiently handle and analyze such data in order to bridge the gap between data and information.

In a wide range of application domains, data are represented as high-dimensional vectors in Euclidean space in order to benefit from computationally advanced techniques from numerical linear algebra. The demand for computational efficiency and scalability in such techniques has driven not only novel platform and system architectures but also efficient and effective algorithms to address fast-paced big data needs.

In this thesis we tackle the challenges of big data analytics on the algorithmic side. Our solutions leverage simple but fast randomized numerical linear algebra techniques to approximate fundamental data relationships such as norms, pairwise Euclidean distances and dot products. These approximation properties are then used to solve fundamental data analysis tasks, including outlier detection, classification and similarity search.
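As a minimal illustration of this style of technique (not one of the thesis's exact constructions), a Johnson-Lindenstrauss-type Gaussian random projection maps high-dimensional vectors to a much lower dimension while approximately preserving norms, pairwise distances and dot products; all dimensions and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
d, k = 1000, 200  # original and projected dimensions (illustrative)

# Gaussian random projection: entries are i.i.d. N(0, 1/k), so
# squared Euclidean norms are preserved in expectation.
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

x = rng.normal(size=d)
y = rng.normal(size=d)

x_p, y_p = R @ x, R @ y

# Norms and distances are approximately preserved after projection,
# with relative error on the order of 1/sqrt(k).
print(np.linalg.norm(x), np.linalg.norm(x_p))
print(np.linalg.norm(x - y), np.linalg.norm(x_p - y_p))
```

The key point is that `R` is data-independent and cheap to apply, so the projection scales to massive datasets and parallelizes trivially.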

The main contribution of the PhD dissertation is the demonstration of the power of randomization in big data analytics. We illustrate a happy marriage between randomized algorithms and large-scale data analysis in data mining, machine learning and information retrieval. In particular,
• We introduced FastVOA, a near-linear time algorithm to approximate the variance of angles between pairs of data points, a robust outlier score for detecting outlier patterns in high-dimensional data.
• We proposed Tensor Sketch, a fast random feature mapping for approximating non-linear kernels and accelerating the training of kernel machines for large-scale classification problems.
• We presented Odd Sketch, a space-efficient probabilistic data structure for estimating high Jaccard similarities between sets, a central problem in information retrieval applications.
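To make the second contribution concrete, the sketch below illustrates the Tensor Sketch idea for the polynomial kernel (x·y)^p: take p independent Count Sketches of the input and combine them by FFT-domain multiplication (convolution theorem), so that dot products of sketches estimate the kernel. This is a simplified sketch of the standard construction; the hash family, dimensions and degree chosen here are illustrative, not those used in the thesis:

```python
import numpy as np

def count_sketch(x, h, s, D):
    # Hash coordinate i into bucket h[i] with random sign s[i].
    sk = np.zeros(D)
    np.add.at(sk, h, s * x)
    return sk

def tensor_sketch(x, hashes, signs, D):
    # Sketch of the degree-p tensor product of x with itself:
    # elementwise product of the FFTs of p independent Count
    # Sketches, then inverse FFT (convolution theorem).
    fft_prod = np.ones(D, dtype=complex)
    for h, s in zip(hashes, signs):
        fft_prod *= np.fft.fft(count_sketch(x, h, s, D))
    return np.real(np.fft.ifft(fft_prod))

rng = np.random.default_rng(0)
d, D, p = 50, 4096, 2  # input dim, sketch size, kernel degree
hashes = [rng.integers(0, D, size=d) for _ in range(p)]
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(p)]

x = rng.normal(size=d)
y = x + 0.1 * rng.normal(size=d)

exact = (x @ y) ** p  # polynomial kernel (x . y)^p
approx = tensor_sketch(x, hashes, signs, D) @ tensor_sketch(y, hashes, signs, D)
print(exact, approx)
```

Because the sketch is an explicit feature map, a linear classifier trained on the sketched features approximates a kernel machine at a fraction of the cost, which is what enables large-scale kernel classification.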

The proposed randomized algorithms are not only simple and easy to program, but also well suited to massively parallel computing environments, so we can exploit distributed parallel architectures for big data. In the future we hope to exploit the power of randomness not only in the algorithmic aspects but also in platform and system architectures for big data analytics.
Original language: English
Place of Publication: IT University of Copenhagen
Publisher: IT-Universitetet i København
Number of pages: 117
ISBN (Print): 978-87-7949-300-1
Publication status: Published - 7 Oct 2014
Series: ITU-DS
Number: 105
Volume: 1602-3536
ISSN: 1602-3536
