Profiling and Monitoring Deep Learning Training Tasks

Research output: Conference Article in Proceeding or Book/Report chapter › Article in proceedings › Research › peer-review

Abstract

The embarrassingly parallel nature of deep learning training tasks makes CPU-GPU co-processors the primary commodity hardware for them. The computing and memory requirements of these tasks, however, do not always align well with the available GPU resources. It is therefore important to monitor and profile the behavior of training tasks on co-processors to better understand the requirements of different use cases. In this paper, our goal is to shed more light on the variety of tools for profiling and monitoring deep learning training tasks on server-grade NVIDIA GPUs. In addition to surveying the main characteristics of the tools, we analyze the functional limitations and overheads of each tool using both a light and a heavy training scenario. Our results show that monitoring tools like nvidia-smi and dcgm can be integrated with resource managers for online decision making thanks to their low overheads. On the other hand, one has to choose the set of metrics carefully to reason correctly about GPU utilization. When it comes to profiling, each tool has its time to shine: a framework-based or system-wide GPU profiler can first detect the frequent kernels or bottlenecks, and then a lower-level GPU profiler can focus on particular kernels at the micro-architectural level.
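As an illustration of the kind of low-overhead monitoring the abstract refers to, the following is a minimal sketch (not taken from the paper) that polls nvidia-smi through its CSV query interface and parses the result. The choice of fields and the helper names `parse_gpu_csv`, `query_gpus`, and `memory_fraction` are assumptions made for this example. Reading memory usage next to `utilization.gpu` is one simple way to avoid reasoning about GPU utilization from a single metric.

```python
import csv
import io
import subprocess

# Fields requested from `nvidia-smi --query-gpu`; this selection is an
# assumption for illustration, not the metric set used in the paper.
QUERY_FIELDS = [
    "index",
    "utilization.gpu",
    "utilization.memory",
    "memory.used",
    "memory.total",
]


def _to_number(value):
    """Convert a CSV cell to float; unsupported cells such as '[N/A]' become None."""
    try:
        return float(value.strip())
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:  # skip blank lines
            continue
        rows.append(dict(zip(QUERY_FIELDS, (_to_number(v) for v in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed metrics (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


def memory_fraction(row):
    """Fraction of device memory in use, or None if it cannot be computed."""
    used, total = row["memory.used"], row["memory.total"]
    if used is None or not total:
        return None
    return used / total
```

A resource manager could call `query_gpus()` periodically and combine `utilization.gpu` with `memory_fraction(row)` before making a placement decision; the polling interval and the decision rule are left open here.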
Original language: English
Title of host publication: Proceedings of the 3rd Workshop on Machine Learning and Systems
Number of pages: 7
Place of Publication: New York
Publisher: Association for Computing Machinery
Publication date: 8 May 2023
Pages: 18-25
ISBN (Electronic): 979-8-4007-0084-2
Publication status: Published - 8 May 2023
Event: Machine Learning and Systems - Italy, Rome, Italy
Duration: 8 May 2023 to 8 May 2023
Conference number: 3
https://2023.euromlsys.eu/

Conference

Conference: Machine Learning and Systems
Number: 3
Location: Italy
Country/Territory: Italy
City: Rome
Period: 08/05/2023 to 08/05/2023

Keywords

  • Deep Learning
  • CPU-GPU Co-processors
  • Training Task Profiling
  • NVIDIA GPUs
  • Performance Monitoring
