GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations

  • University of Copenhagen

Research output: Conference Article in Proceeding or Book/Report chapter > Article in proceedings > Research > peer-review

Abstract

Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation, a key proxy for contention, enables interference-aware scheduling. Existing GPU memory estimators span three paradigms (analytical models, CPU-side libraries, and ML-based estimators), each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization estimation remains comparatively understudied and is further complicated by non-additive utilization metrics and GPU heterogeneity. We conduct a systematic analysis of representative memory estimators from each paradigm (Horus [22], PyTorch FakeTensor [3], and our lightweight ML-based estimator), evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates the estimators against real-world unseen models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse model architectures and GPU hardware variants. We release all datasets, tools, and artifacts to support further research.
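
To make the "analytical model" paradigm concrete, the following is a minimal sketch of a first-order estimate for the training memory of a plain MLP. It is an illustration under stated assumptions, not the formula used by Horus or by the authors' estimators: the function name, the fp32 element size, and the Adam-style count of two optimizer states per parameter are all hypothetical choices made for this example. It deliberately ignores the CUDA context, allocator fragmentation, and kernel workspaces, which are among the reasons such estimates can diverge from measured footprints.

```python
def estimate_mlp_training_memory_bytes(layer_sizes, batch_size,
                                       bytes_per_elem=4,
                                       optimizer_states_per_param=2):
    """First-order training-memory estimate for a fully connected network.

    layer_sizes: widths from input to output, e.g. [784, 512, 10].
    All defaults are illustrative assumptions (fp32, Adam-like optimizer).
    """
    # Weights plus biases of every linear layer.
    params = sum(n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

    # One copy for the weights, one for the gradients, plus optimizer states.
    param_bytes = params * bytes_per_elem * (2 + optimizer_states_per_param)

    # Activations kept for the backward pass: the input and each layer output.
    activation_bytes = batch_size * sum(layer_sizes) * bytes_per_elem

    return param_bytes + activation_bytes


if __name__ == "__main__":
    estimate = estimate_mlp_training_memory_bytes([784, 512, 10], batch_size=64)
    print(f"Estimated training memory: {estimate / 2**20:.2f} MiB")
```

Because every term depends on the layer widths, the sketch also shows why analytical estimators need a detailed model specification as input.
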
Original language: English
Title of host publication: EuroMLSys '26: Proceedings of the Sixth European Workshop on Machine Learning and Systems
Number of pages: 12
Publisher: Association for Computing Machinery
Publication date: 2026
Pages: 127–138
ISBN (Print): 979-8-4007-2605-7
DOIs
Publication status: Published - 2026
Event: Computer Systems - Edinburgh, United Kingdom
Duration: 27 Apr 2026 to 30 Apr 2026
Conference number: 21

Conference

Conference: Computer Systems
Number: 21
Country/Territory: United Kingdom
City: Edinburgh
Period: 27/04/2026 to 30/04/2026
Series: EuroMLSys '26

Keywords

  • GPU memory estimation
  • deep learning training
  • resource management
  • workload collocation
