Integrated Data Analysis Pipelines for Large-Scale Data Management, High-Performance Computing, and Machine Learning

Project: Research

Project Details

Description

Over the last decade, increasing digitization efforts, sensor-equipped everything, and feedback loops for data acquisition have led to growing data sizes and a wide variety of valuable but heterogeneous data sources. Modern data-driven applications from almost every domain aim to leverage these large data collections in order to find interesting patterns and build robust machine learning (ML) models. Together, large data sizes and complex analysis requirements spurred the development and adoption of data-parallel computation frameworks like Apache Spark, Flink, and Beam, as well as distributed ML systems like Spark MLlib, TensorFlow, and PyTorch. A key observation is that these new distributed systems share many compilation and runtime techniques with traditional high-performance computing (HPC) systems, but are geared toward distributed data analysis pipelines and ML model training and scoring. Similarly, the cluster hardware for these systems appears to be converging (two-socket servers, partially equipped with GPUs and custom ASICs). Yet the software stacks in use, the related programming paradigms, and the data formats and representations differ substantially, which can be attributed to the different research communities behind them. Interestingly, there is a trend toward complex data analysis pipelines that combine these different systems. Examples are workflows that leverage data-parallel data integration, cleaning, and preprocessing, tuned HPC libraries for subtasks, and dedicated ML systems, but also classical HPC applications that leverage ML models for more cost-effective computation without much accuracy degradation. Unfortunately, community efforts alone (such as centers, hubs, and workshops) are unlikely to consolidate these disconnected software stacks.
Therefore, in this project, we aim to assemble a joint consortium from the data management, ML systems, and HPC communities in order to systematically investigate the system infrastructure, language abstractions, compilation and runtime techniques, and systems and tools necessary to increase productivity when building such heterogeneous data analysis pipelines and to eliminate unnecessary performance bottlenecks.
Short title: DAPHNE
Acronym: DAPHNE
Status: Active
Effective start/end date: 01/12/2020 - 30/11/2024

Collaborative partners

  • IT University of Copenhagen
  • Know-Center Graz (lead)
  • AVL List GmbH (Project partner)
  • Deutsches Zentrum für Luft- und Raumfahrt e.V. (Project partner)
  • ZÜRCHER HOCHSCHULE FÜR ANGEWANDTE WISSENSCHAFTEN (Project partner)
  • Hasso Plattner Institute (Project partner)
  • National Technical University of Athens (Project partner)
  • Infineon Technologies Austria AG (Project partner)
  • INTEL TECHNOLOGY POLAND SPÓŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ (Project partner)
  • KAI Kompetenzzentrum Automobil- und Industrieelektronik GmbH (Project partner)
  • TU Dresden (Project partner)
  • University of Maribor (Project partner)
  • University of Basel (Project partner)
  • Erevnitiko Panepistimiako Institouto Systimaton Epikoinonion Kai Ypologiston (Project partner)
  • Technical University of Berlin (Project partner)

Funding

  • European Commission: DKK 49,242,004.00

Keywords

  • AI, HPC, Data Management
