Abstract
We are experiencing a big data deluge, a result of not only the internetization and computerization of our society, but also the fast development of affordable and powerful data collection and storage devices. The explosively growing data, in both size and form, has posed a fundamental challenge for big data analytics. That is how to efficiently handle and analyze such big data in order to bridge the gap between data and information.
In wide range of application domains, data are represented as high-dimensional vectors in the Euclidean space in order to benefit from computationally advanced techniques from numerical linear algebra. The computational efficiency and scalability of such techniques have been growing demands for not only novel platform system architectures but also efficient and effective algorithms to address the fast paced big data needs.
In the thesis we will tackle the challenges of big data analytics in the algorithmic aspects. Our solutions have leveraged simple but fast randomized numerical linear algebra techniques to approximate fundamental data relationships, such as data norm, pairwise Euclidean distances and dot products, etc. Such relevant and useful approximation properties will be used to solve fundamental data analysis tasks, including outlier detection, classification and similarity search.
The main contribution of the PhD dissertation is the demonstration of the power of randomization in big data analytics. We illustrate a happy marriage between randomized algorithms and large-scale data analysis in data mining, machine learning and information retrieval. In particular,
• We introduced FastVOA, a near-linear time algorithm to approximate the variance of angles between pairs of data points, a robust outlier score to detect high-dimensional outlier patterns.
• We proposed Tensor Sketch, a fast random feature mapping for approximating non-linear kernels and accelerating the training kernel machines for large-scale classification problems.
• We presented Odd Sketch, a space-efficient probabilistic data structure for estimating high Jaccard similarities between sets, a central problem in information retrieval applications.
The proposed randomized algorithms are not only simple and easy to program, but also well suited to massively parallel computing environments so that we can exploit distributed parallel architectures for big data. In future we hope to exploit the power of randomness not only on the algorithmic aspects but also on the platform system architectures for big data analytic.
In wide range of application domains, data are represented as high-dimensional vectors in the Euclidean space in order to benefit from computationally advanced techniques from numerical linear algebra. The computational efficiency and scalability of such techniques have been growing demands for not only novel platform system architectures but also efficient and effective algorithms to address the fast paced big data needs.
In the thesis we will tackle the challenges of big data analytics in the algorithmic aspects. Our solutions have leveraged simple but fast randomized numerical linear algebra techniques to approximate fundamental data relationships, such as data norm, pairwise Euclidean distances and dot products, etc. Such relevant and useful approximation properties will be used to solve fundamental data analysis tasks, including outlier detection, classification and similarity search.
The main contribution of the PhD dissertation is the demonstration of the power of randomization in big data analytics. We illustrate a happy marriage between randomized algorithms and large-scale data analysis in data mining, machine learning and information retrieval. In particular,
• We introduced FastVOA, a near-linear time algorithm to approximate the variance of angles between pairs of data points, a robust outlier score to detect high-dimensional outlier patterns.
• We proposed Tensor Sketch, a fast random feature mapping for approximating non-linear kernels and accelerating the training kernel machines for large-scale classification problems.
• We presented Odd Sketch, a space-efficient probabilistic data structure for estimating high Jaccard similarities between sets, a central problem in information retrieval applications.
The proposed randomized algorithms are not only simple and easy to program, but also well suited to massively parallel computing environments so that we can exploit distributed parallel architectures for big data. In future we hope to exploit the power of randomness not only on the algorithmic aspects but also on the platform system architectures for big data analytic.
Originalsprog | Engelsk |
---|
Udgivelsessted | IT University of Copenhagen |
---|---|
Forlag | IT-Universitetet i København |
Antal sider | 117 |
ISBN (Trykt) | 978-87-7949-300-1 |
Status | Udgivet - 7 okt. 2014 |
Navn | ITU-DS |
---|---|
Nummer | 105 |
Vol/bind | 1602-3536 |
ISSN | 1602-3536 |