Abstract
Data science has experienced large-scale and rapid development over the last decade. The main drivers of this development are the availability of large amounts of data, steadily growing computational power, and improving learning and data analysis algorithms. The extensive adoption of deep learning in this field demands computational support for the training phase of the models. For this purpose, enterprises share GPU clusters among different production teams to increase GPU utilization. However, such clusters are utilized sub-optimally, due to (1) the lack of fine-grained mechanisms for sharing GPUs, and (2) scheduling tasks as black boxes, without any knowledge of their resource requirements.
In this thesis, we start by determining the right set of monitoring tools and metrics that are relevant when reasoning about the hardware utilization of deep learning training. Then, we study collocating deep learning training tasks using the sharing capabilities of NVIDIA GPUs (GPU streams, MPS, and MIG) to investigate the impact on GPU utilization. While our results emphasize the benefits of collocation, they also demonstrate the challenge of fitting within the available GPU memory when tasks of different deep learning models are collocated. Therefore, as a next step, we propose a machine learning-based mechanism to estimate the GPU memory consumption of deep learning model architectures during training. These estimates help cluster schedulers and resource managers map training tasks to processors more efficiently.
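To make the estimation idea concrete, the sketch below shows how such a predictor could be trained on coarse architecture-level features. The feature set, the choice of gradient-boosted trees, and the numbers are illustrative assumptions, not the exact method or data used in the thesis.

```python
# Minimal sketch of an ML-based GPU memory estimator (illustrative only).
# Features and training data are hypothetical; the thesis's actual feature
# set and model may differ.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical per-job features: [trainable parameters (millions),
# batch size, input pixels (thousands), number of layers]
X_train = np.array([
    [25.6, 32, 150.5, 50],   # e.g., a ResNet-50-sized model
    [11.7, 64, 150.5, 18],   # e.g., a ResNet-18-sized model
    [138.0, 16, 150.5, 16],  # e.g., a VGG-16-sized model
])
y_train = np.array([7.9, 5.2, 10.4])  # measured peak GPU memory (GB), made-up values

estimator = GradientBoostingRegressor(n_estimators=200, max_depth=3)
estimator.fit(X_train, y_train)

# Predict peak training memory for an unseen model configuration.
predicted_gb = estimator.predict([[60.2, 32, 150.5, 101]])[0]
print(f"Estimated peak GPU memory: {predicted_gb:.1f} GB")
```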
In the final step, we build a flexible resource manager that provides automated workload collocation, allowing end-users to choose among different collocation and scheduling options. Furthermore, we perform a wide range of experiments and share our findings and insights with the community. The findings of this thesis have significant implications for resource-efficient deep learning. As deep learning models continue to grow in size and complexity, efficient GPU utilization becomes a critical challenge. By enabling better workload collocation and accurate GPU memory estimation, this research contributes to reducing wasted computational resources, improving throughput, and making deep learning training more sustainable. These insights are particularly valuable for cloud service providers, research institutions, and enterprises that rely on shared GPU clusters, ensuring that deep learning training workloads run more efficiently, cost-effectively, and with minimal resource waste.
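As a rough illustration of how memory estimates can drive collocation decisions in such a resource manager, the sketch below implements a simple first-fit placement policy. The class and function names are hypothetical and do not correspond to the actual system built in the thesis.

```python
# Illustrative first-fit collocation policy driven by memory estimates.
# Names (Gpu, place_task) are hypothetical, not the thesis's actual API.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    total_mem_gb: float
    tasks: list = field(default_factory=list)  # list of (task_name, estimated_gb)

    def free_mem_gb(self) -> float:
        return self.total_mem_gb - sum(gb for _, gb in self.tasks)

def place_task(gpus: list[Gpu], task_name: str, estimated_gb: float) -> bool:
    """Collocate the task on the first GPU whose estimated free memory fits it."""
    for gpu in gpus:
        if gpu.free_mem_gb() >= estimated_gb:
            gpu.tasks.append((task_name, estimated_gb))
            return True
    return False  # no GPU can host the task; it would be queued instead

cluster = [Gpu(total_mem_gb=40.0), Gpu(total_mem_gb=40.0)]
for name, mem in [("resnet50", 7.9), ("bert-base", 12.3), ("vgg16", 10.4)]:
    placed = place_task(cluster, name, mem)
    print(f"{name}: {'placed' if placed else 'queued'}")
```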
| Original language | English |
|---|---|
| Qualification | PhD |
| Awarding institution | |
| Supervisor(s) | |
| Date of award | 9 Apr 2025 |
| Publisher | |
| ISBNs, print | 978-87-7949-542-5 |
| ISBNs, electronic | 978-87-7949-560-9 |
| Status | Published - 2025 |