TY - GEN
T1 - DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
AU - Damme, Patrick
AU - Birkenbach, Marius
AU - Bitsakos, Constantinos
AU - Boehm, Matthias
AU - Bonnet, Philippe
AU - Ciorba, Florina
AU - Dokter, Mark
AU - Dowgiallo, Pawel
AU - Eleliemy, Ahmed
AU - Faerber, Christian
AU - Goumas, Georgios
AU - Habich, Dirk
AU - Hedam, Niclas
AU - Hofer, Marlies
AU - Huang, Wenjun
AU - Innerebner, Kevin
AU - Karakostas, Vasileios
AU - Kern, Roman
AU - Kosar, Tomaž
AU - Krause, Alexander
AU - Krems, Daniel
AU - Laber, Andreas
AU - Lehner, Wolfgang
AU - Mier, Eric
AU - Paradies, Marcus
AU - Peischl, Bernhard
AU - Poerwawinata, Gabrielle
AU - Psomadakis, Stratos
AU - Rabl, Tilmann
AU - Ratuszniak, Piotr
AU - Silva, Pedro
AU - Skuppin, Nikolai
AU - Starzacher, Andreas
AU - Steinwender, Benjamin
AU - Tolovski, Ilin
AU - Tözün, Pinar
AU - Ulatowski, Wojciech
AU - Wang, Yuanyuan
AU - Wrosz, Izajasz
AU - Zamuda, Aleš
AU - Zhang, Ce
AU - Zhu, Xiao Xiang
N1 - No publisher listed. (jcg: 14/02/2022)
N1 - In the call for papers it is stated that "Final versions of accepted submissions will be published in the electronic proceedings of the CIDR conference." Please insert these proceedings as the place of publication. (jcg: 22/02/2022)
PY - 2022/1/9
Y1 - 2022/1/9
N2 - Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet the programming paradigms, cluster resource management, data formats and representations, and execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage, with the aim of increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, and local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.
KW - Integrated Data Analysis
KW - High-Performance Computing
KW - Machine Learning Pipelines
KW - DAPHNE System
KW - Vectorized Execution Engine
M3 - Article in proceedings
BT - Conference on Innovative Data Systems Research
CY - Santa Cruz, California, USA
ER -