Profiling and Monitoring Deep Learning Training Tasks

Publication: Conference article in proceedings or book/report chapter › Conference contribution in proceedings › Research › Peer-reviewed

Abstract

The embarrassingly parallel nature of deep learning training tasks makes CPU-GPU co-processors the primary commodity hardware for them. The computing and memory requirements of these tasks, however, do not always align well with the available GPU resources. It is therefore important to monitor and profile the behavior of training tasks on co-processors to better understand the requirements of different use cases. In this paper, our goal is to shed more light on the variety of tools for profiling and monitoring deep learning training tasks on server-grade NVIDIA GPUs. In addition to surveying the main characteristics of the tools, we analyze the functional limitations and overheads of each tool using both a light and a heavy training scenario. Our results show that monitoring tools like nvidia-smi and dcgm can be integrated with resource managers for online decision making thanks to their low overheads. On the other hand, one has to be careful about the set of metrics chosen in order to reason correctly about GPU utilization. When it comes to profiling, each tool has its time to shine: a framework-based or system-wide GPU profiler can first detect the frequent kernels or bottlenecks, and a lower-level GPU profiler can then focus on particular kernels at the micro-architectural level.
Original language: English
Title: Proceedings of the 3rd Workshop on Machine Learning and Systems
Number of pages: 7
Place of publication: New York
Publisher: Association for Computing Machinery
Publication date: 8 May 2023
Pages: 18-25
ISBN (Electronic): 979-8-4007-0084-2
DOI
Status: Published - 8 May 2023
Event: EuroMLSys '23: Proceedings of the 3rd Workshop on Machine Learning and Systems - Rome, Italy
Duration: 8 May 2023 - 8 May 2023

Conference

Conference: EuroMLSys '23: Proceedings of the 3rd Workshop on Machine Learning and Systems
Country/Territory: Italy
City: Rome
Period: 08/05/2023 - 08/05/2023
