Integrated Data Analysis Pipelines for Large-Scale Data Management, High-Performance Computing, and Machine Learning


Project details

Description

Over the last decade, increasing digitization efforts, the ubiquity of sensor-equipped devices, and feedback loops for data acquisition have led to growing data sizes and a wide variety of valuable but heterogeneous data sources. Modern data-driven applications from almost every domain aim to leverage these large data collections to find interesting patterns and build robust machine learning (ML) models. Together, large data sizes and complex analysis requirements spurred the development and adoption of data-parallel computation frameworks like Apache Spark, Flink, and Beam, as well as distributed ML systems like Spark MLlib, TensorFlow, and PyTorch. A key observation is that these new distributed systems share many compilation and runtime techniques with traditional high-performance computing (HPC) systems, but are geared toward distributed data analysis pipelines and ML model training and scoring. Similarly, the cluster hardware for these systems increasingly converges (two-socket servers, partially equipped with GPUs and custom ASICs). Yet the software stacks, related programming paradigms, and data formats and representations in use differ substantially, which can be attributed to the different research communities behind them. Interestingly, there is a trend toward complex data analysis pipelines that combine these different systems. Examples are workflows that leverage data-parallel data integration, cleaning, and preprocessing; tuned HPC libraries for subtasks; and dedicated ML systems – but also classical HPC applications that leverage ML models for more cost-effective computation without much accuracy degradation. Unfortunately, community efforts alone – like centers, hubs, and workshops – are unlikely to consolidate these disconnected software stacks.
Therefore, in this project, we aim to assemble a joint consortium from the data management, ML systems, and HPC communities to systematically investigate the system infrastructure, language abstractions, compilation and runtime techniques, and systems and tools needed to increase productivity when building such heterogeneous data analysis pipelines and to eliminate unnecessary performance bottlenecks.
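To illustrate the kind of heterogeneous pipeline described above – data integration and preprocessing in a data-parallel style, followed by a dedicated model-training step – here is a minimal, hypothetical sketch using pandas and NumPy as stand-ins for the distinct systems involved. The column names (`temp`, `wear`) and the synthetic data are assumptions for illustration only; real DAPHNE pipelines would span separate systems such as Spark, tuned HPC kernels, and TensorFlow.

```python
import numpy as np
import pandas as pd

# 1) Data integration/cleaning: merge two heterogeneous sources on a
#    shared key and drop incomplete records (a data-parallel-style step).
rng = np.random.default_rng(0)
sensors = pd.DataFrame({"id": range(100), "temp": rng.normal(20.0, 5.0, 100)})
targets = pd.DataFrame({
    "id": range(100),
    "wear": 0.8 * sensors["temp"].to_numpy() + rng.normal(0.0, 0.1, 100),
})
data = sensors.merge(targets, on="id").dropna()

# 2) Preprocessing: standardize the feature and add an intercept column
#    (the kind of subtask a tuned HPC kernel might execute).
feat = (data["temp"] - data["temp"].mean()) / data["temp"].std()
X = np.c_[np.ones(len(data)), feat.to_numpy()]
y = data["wear"].to_numpy()

# 3) Training/scoring: a least-squares fit standing in for a dedicated
#    ML system at the end of the pipeline.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = float(np.sqrt(np.mean((X @ coef - y) ** 2)))
print(f"training RMSE: {rmse:.3f}")
```

Each numbered step would, in a real heterogeneous pipeline, run on a different system with its own data representation – which is exactly the source of the friction the project targets.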
Short title: DAPHNE
Acronym: DAPHNE
Status: Finished
Effective start/end date: 01/12/2020 – 30/11/2024

Collaborative partners

  • IT-Universitetet i København
  • Know-Center Graz (lead)
  • AVL List GmbH (project partner)
  • Deutsches Zentrum für Luft- und Raumfahrt e.V. (project partner)
  • Zürcher Hochschule für Angewandte Wissenschaften (project partner)
  • Hasso Plattner Institute for Software Systems Engineering (project partner)
  • National Technical University of Athens (project partner)
  • Infineon Technologies Austria AG (project partner)
  • Intel Technology Poland sp. z o.o. (project partner)
  • KAI Kompetenzzentrum Automobil- und Industrieelektronik GmbH (project partner)
  • TU Dresden (project partner)
  • University of Maribor - Faculty of Electrical Engineering and Computer Science (project partner)
  • University of Basel (project partner)
  • Erevnitiko Panepistimiako Institouto Systimaton Epikoinonion Kai Ypologiston (project partner)
  • Technical University of Berlin (project partner)

Funding

  • European Commission: DKK 49,242,004.00

Keywords

  • AI, HPC, Data Management
