An Analysis of Collocation on GPUs for Deep Learning Training

Publication: Conference article in proceedings or book/report chapter; Conference contribution in proceedings; Research; Peer-reviewed

Abstract

Deep learning training is an expensive process that extensively uses GPUs. However, not all model training saturates modern powerful GPUs. To create guidelines for such cases, this paper examines the performance of the different collocation methods available on NVIDIA GPUs: naïvely submitting multiple processes on the same GPU using multiple streams, utilizing Multi-Process Service (MPS), and enabling Multi-Instance GPU (MIG). Our results demonstrate that collocating multiple model training runs yields significant benefits, leading to up to three times the training throughput despite increased epoch time. On the other hand, the aggregate memory footprint and compute needs of the models trained in parallel must fit the available memory and compute resources of the GPU. MIG can be beneficial thanks to its interference-free partitioning but can suffer from sub-optimal GPU utilization with dynamic or mixed workloads. In general, we recommend MPS as the best-performing and most flexible form of collocation for a single user submitting training jobs.
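The trade-off the abstract describes (each epoch gets slower, yet total throughput rises) can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the function names `collocation_speedup` and `fits_on_gpu` are hypothetical, and the epoch times and memory sizes are invented numbers chosen only to show the calculation. It also uses summed memory as a crude stand-in for the "must fit" condition, which ignores compute contention.

```python
def collocation_speedup(n_jobs, solo_epoch_s, collocated_epoch_s):
    """Aggregate throughput gain of running n_jobs side by side on one GPU,
    relative to running the same jobs one after another.

    solo_epoch_s:       epoch time of one job running alone on the GPU
    collocated_epoch_s: epoch time of each job while sharing the GPU
    """
    if n_jobs < 1 or solo_epoch_s <= 0 or collocated_epoch_s <= 0:
        raise ValueError("n_jobs must be >= 1 and epoch times must be positive")
    # Sequential: n_jobs epochs take n_jobs * solo_epoch_s seconds.
    # Collocated: the same n_jobs epochs finish in collocated_epoch_s seconds.
    return n_jobs * solo_epoch_s / collocated_epoch_s


def fits_on_gpu(job_memory_gib, gpu_memory_gib):
    """Simplified feasibility check: summed footprints must fit GPU memory."""
    return sum(job_memory_gib) <= gpu_memory_gib


# Hypothetical numbers, not measurements from the paper.
speedup = collocation_speedup(n_jobs=4, solo_epoch_s=60.0, collocated_epoch_s=96.0)
print(f"{speedup:.2f}x")                   # 2.50x
print(fits_on_gpu([10, 12, 14], 40))       # True
print(fits_on_gpu([20, 25], 40))           # False
```

In this made-up case each epoch is 60% slower when four jobs share the GPU, but the four jobs together still finish 2.5 times faster than they would back to back.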
Original language: English
Title: Proceedings of the 4th Workshop on Machine Learning and Systems, EuroMLSys 2024, Athens, Greece, 22 April 2024
Number of pages: 10
Publisher: Association for Computing Machinery
Publication date: 22 Apr 2024
Pages: 81-90
ISBN (Print): 9798400705410
DOI
Status: Published - 22 Apr 2024
Event: Workshop on Machine Learning and Systems - Athens, Greece
Duration: 22 Apr 2024 → 22 Apr 2024
Conference number: 4
https://dblp.org/rec/conf/euromlsys/2024.html

Workshop

Workshop: Workshop on Machine Learning and Systems
Number: 4
Country/Territory: Greece
City: Athens
Period: 22/04/2024 → 22/04/2024
Internet address

Keywords

  • Deep learning training
  • GPU utilization
  • Collocation methods
  • NVIDIA GPUs
  • Multi-Process Service (MPS)
  • Multi-Instance GPU (MIG)
  • Training throughput
  • Model training
  • Memory footprint
  • Compute resources
