TY - JOUR
T1 - Prototyping a Web-Scale Multimedia Retrieval Service Using Spark
AU - Guðmundsson, Gylfi Þór
AU - Jónsson, Björn Thór
AU - Amsaleg, Laurent
AU - Franklin, Michael J.
PY - 2018
Y1 - 2018/06//
AB - The world has experienced phenomenal growth in data production and storage in recent years, much of which has taken the form of media files. At the same time, computing power has become abundant with multi-core machines, grids and clouds. Yet it remains a challenge to harness the available power and move towards gracefully searching and retrieving from web-scale media collections. Several researchers have experimented with using automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly using small collections on small computing clusters. In this paper, we describe a prototype of a (near) web-scale throughput-oriented MM retrieval service using the Spark framework running on the AWS cloud service. We present retrieval results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration retrieval system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice using standard image benchmarks, and downloaded for research purposes. Finally, we describe a method to evaluate retrieval quality of the ever-growing high-dimensional index of the prototype, without actually indexing a web-scale media collection.
KW - Data production
KW - Media retrieval
KW - Distributed computing
KW - Hadoop
KW - Spark
KW - AWS cloud
KW - High-dimensional feature vectors
KW - SIFT
KW - YFCC 100M
KW - Information retrieval system
U2 - 10.1145/3209662
DO - 10.1145/3209662
M3 - Journal article
SN - 1551-6857
VL - 14
SP - 65:1
EP - 65:24
JO - ACM Transactions on Multimedia Computing, Communications, and Applications
JF - ACM Transactions on Multimedia Computing, Communications, and Applications
IS - 3s
M1 - 65
ER -