DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines

Research output: Conference Article in Proceeding or Book/Report chapter; Article in proceedings; Research; peer-review

  • Patrick Damme
  • Marius Birkenbach
  • Constantinos Bitsakos
  • Matthias Boehm
  • Florina Ciorba
  • Mark Dokter
  • Pawl Dowgiallo
  • Ahmed Eleliemy
  • Christian Faerber
  • Georgios Goumas
  • Dirk Habich
  • Marlies Hofer
  • Wenjun Huang
  • Kevin Innerebner
  • Vasileios Karakostas
  • Roman Kern
  • Tomaž Kosar
  • Alexander Krause
  • Daniel Krems
  • Andreas Laber
  • Wolfgang Lehner
  • Eric Mier
  • Marcus Paradies
  • Bernhard Peischl
  • Gabrielle Poerwawinata
  • Stratos Psomadakis
  • Tilmann Rabl
  • Piotr Ratuszniak
  • Pedro Silva
  • Nikolai Skuppin
  • Andreas Starzacher
  • Benjamin Steinwender
  • Ilin Tolovski
  • Wojciech Ulatowski
  • Yuanyuan Wang
  • Izajasz Wrosz
  • Aleš Zamuda
  • Ce Zhang
  • Xiao Xiang Zhu

Integrated data analysis (IDA) pipelines, which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring, are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the increasingly heterogeneous hardware infrastructure they use is converging as well. Yet, the programming paradigms, cluster resource management, data formats and representations, as well as execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage for increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, as well as local and distributed operations. Preliminary experiments that compare DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.
Original language: English
Title of host publication: Conference on Innovative Data Systems Research
Place of Publication: Santa Cruz, California, USA
Publication date: 9 Jan 2022
Publication status: Published - 9 Jan 2022

Bibliographical note

No publisher listed. (jcg: 14/02/2022)
The call for papers states that "Final versions of accepted submissions will be published in the electronic proceedings of the CIDR conference." Please insert these proceedings as the place of publication (jcg: 22/02/2022)

ID: 86467838