Abstract
Collocating deep learning training tasks improves GPU utilization but risks resource contention, severe slowdowns, and out-of-memory (OOM) failures. Accurate memory estimation is essential for robust collocation, and GPU utilization estimation, a key proxy for contention, enables interference-aware scheduling.

Existing GPU memory estimators span three paradigms (analytical models, CPU-side libraries, and ML-based estimators), each with distinct limitations: dependence on detailed model specifications, intrusive integration, poor generalization, and varying latency overhead. GPU heterogeneity further complicates estimation, as identical tasks can exhibit different memory footprints across hardware generations. GPU utilization estimation remains comparatively understudied and is further complicated by non-additive utilization metrics and GPU heterogeneity.

We conduct a systematic analysis of representative memory estimators from each paradigm (Horus [22], PyTorch FakeTensor [3], and our lightweight ML-based estimator), evaluating accuracy, generalizability, and overhead. We construct a synthetic dataset spanning MLPs, CNNs, and Transformers with controlled architectural variations, train MLP- and Transformer-based estimators for memory prediction, and experiment with utilization estimation. Our evaluation reveals key tradeoffs and validates the estimators against real-world unseen models. Significant challenges remain: analytical models lack generalization and cannot easily be extended to new GPU architectures or accurately reflect memory optimization savings; CPU-side libraries impose intrusive integration overhead; and both analytical and ML-based estimators rely on model specifications or computation graphs, limiting generalization across diverse model architectures and GPU hardware variants. We release all datasets, tools, and artifacts to support further research.
| Original language | English |
|---|---|
| Title of host publication | EuroMLSys '26 : Proceedings of the Sixth European Workshop on Machine Learning and Systems |
| Number of pages | 12 |
| Publisher | Association for Computing Machinery |
| Publication date | 2026 |
| Pages | 127–138 |
| ISBN (Print) | 979-8-4007-2605-7 |
| DOIs | |
| Publication status | Published - 2026 |
| Event | Computer Systems - Edinburgh, United Kingdom; Duration: 27 Apr 2026 → 30 Apr 2026; Conference number: 21 |
Conference
| Conference | Computer Systems |
|---|---|
| Number | 21 |
| Country/Territory | United Kingdom |
| City | Edinburgh |
| Period | 27/04/2026 → 30/04/2026 |
| Series | EuroMLSys '26 |
Keywords
- GPU memory estimation
- deep learning training
- resource management
- workload collocation
Fingerprint
Dive into the research topics of 'GPU Memory and Utilization Estimation for Training-Aware Resource Management: Opportunities and Limitations'. Together they form a unique fingerprint.
- RAD: Resource-Aware Data Science
  Tözün, P. (PI), Yousefzadeh-Asl-Miandoab, E. (CoI) & Robroek, T. T. (CoI)
  Independent Research Fund Denmark
  01/04/2021 → 30/04/2025
  Project: Research
- DEEP: Deep Learning Resource-Efficient GPU Orchestrator
  Tözün, P. (PI) & Yousefzadeh-Asl-Miandoab, E. (Collaborator)
  Swiss National Science Foundation
  01/08/2025 → 31/07/2029
  Project: Research