Resourceful Learning: Training More Models with Fewer Resources

Publication: Book / Anthology / Report › PhD thesis

Abstract

Data Science is a field that has seen rapid development in recent years due to the introduction of more powerful and specialised hardware. Massive algorithms such as Deep Neural Networks can now feasibly be run on current-day accelerators. This hardware is by design efficient at solving embarrassingly parallel tasks, such as matrix multiplication.

New models are sometimes trained on systems spanning over 10,000 nodes equipped with these accelerators, pushing the state of the art further but incurring massive resource costs [1, 2, 3]. Firstly, larger hardware setups require significant space, data-centre cooling and electricity [4, 5]. Secondly, with the state of the art utilising more and more expensive hardware setups, doing groundbreaking research in Machine Learning is becoming more expensive and, by extension, less attainable. It is vital to keep research accessible not just to those with private budgets [4].

Previous research has shown that the time to train a Deep Learning model to a given accuracy does not decrease linearly with the price of the server infrastructure. In some cases, models can be trained to similar accuracy on cheaper hardware with only a slight increase in training time [6]. It is thus crucial that we do not look only at maximum accuracy, but also at how long it takes to get there [7, 1].

Paramount to training models in a resource-conscious way is understanding how Deep Learning training interacts with the hardware. We have therefore introduced radT, a data collection and visualisation framework that makes this information accessible and allows Deep Learning researchers to make more informed choices about how they use their hardware. Additionally, we have used radT ourselves to run extensive benchmarks of Deep Learning training across a wide range of configurations.
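
As an illustration of the kind of hardware telemetry involved, the sketch below polls GPU utilisation, memory use and power draw once per second and writes them to a CSV file. It uses NVIDIA's NVML bindings (pynvml) rather than radT's actual API; the device index, sampling interval and output file name are arbitrary choices made for illustration.

```python
# Not radT's API: a minimal sketch of the kind of telemetry such a
# framework collects, polled here via NVIDIA's NVML bindings (pynvml).
import csv
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed

with open("gpu_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "gpu_util_pct", "mem_used_mib", "power_w"])
    for _ in range(60):  # one sample per second for one minute
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        writer.writerow([time.time(), util.gpu, mem.used // 2**20, power_w])
        time.sleep(1)

pynvml.nvmlShutdown()
```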

Furthermore, current-day GPU hardware may be more powerful than what a single model requires for training. Rather than letting the unused resources go to waste, one can train multiple models at the same time on the same GPU (collocation). With radT, we investigated the effectiveness of several methods of GPU collocation. This showed that GPU collocation can be very effective when models are small, or when models complement each other's hardware requirements.
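
One such method is plain process-level collocation, sketched below: two independent training jobs share cuda:0 simply by running as separate processes. The model sizes, batch shapes and synthetic data are illustrative assumptions rather than configurations from the thesis; mechanisms such as NVIDIA MPS or MIG partitioning offer alternative ways to share a GPU.

```python
# Minimal sketch of process-level GPU collocation: two independent training
# jobs share one GPU by running as separate processes on cuda:0.
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def train_job(job_id: int, hidden: int, steps: int = 200):
    device = torch.device("cuda:0")
    model = nn.Sequential(
        nn.Linear(512, hidden), nn.ReLU(), nn.Linear(hidden, 10)
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x = torch.randn(256, 512, device=device)           # synthetic batch
        y = torch.randint(0, 10, (256,), device=device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"job {job_id} finished, final loss {loss.item():.3f}")

if __name__ == "__main__":
    mp.set_start_method("spawn")
    # Two differently sized models collocated on the same device.
    jobs = [mp.Process(target=train_job, args=(i, h))
            for i, h in enumerate((128, 1024))]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()
```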

Another way to increase efficiency via collocation is to streamline data loading and processing pipelines. We introduced a system that eliminates redundant data work by decoupling data loading from model training. This way, a server runs a single data-loading process that serves all models being trained at the same time. This results in significant CPU savings and can even yield GPU savings when parts of the data pre-processing pipeline run on the GPU.
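
The sketch below illustrates the decoupling idea under simplified assumptions: one loader process prepares each batch exactly once and fans it out to several trainer processes over queues. The queue-based transport, synthetic batches and sentinel protocol are assumptions made for illustration, not the system described in the thesis.

```python
# Minimal sketch of decoupled data loading: a single loader process prepares
# each batch once and fans it out to several trainers, instead of every
# trainer running its own redundant loading pipeline.
import multiprocessing as mp
import numpy as np

def loader(queues, n_batches=100, batch_size=64):
    for step in range(n_batches):
        # Simulate decode + augmentation, done exactly once per batch.
        batch = np.random.rand(batch_size, 3, 224, 224).astype(np.float32)
        for q in queues:              # fan the same batch out to every trainer
            q.put((step, batch))
    for q in queues:
        q.put(None)                   # sentinel: no more data

def trainer(name, q):
    while (item := q.get()) is not None:
        step, batch = item
        # ... forward/backward pass on `batch` would go here ...
    print(f"{name} consumed all batches")

if __name__ == "__main__":
    queues = [mp.Queue(maxsize=8) for _ in range(2)]
    procs = [mp.Process(target=loader, args=(queues,))]
    procs += [mp.Process(target=trainer, args=(f"trainer-{i}", q))
              for i, q in enumerate(queues)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```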

While the aforementioned projects improve the transparency of resource utilisation and the efficiency of collocation, they do not improve the training performance of a model trained in isolation. Our final contribution builds on our data loading expertise to design a new data loader that progressively increases the complexity of the data. By starting with easier data points before progressing to more complex ones, akin to how a human would learn, we reduce the FLOPs required for comparable training steps. This leads to sharper accuracy growth and overall faster training.
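
A minimal sketch of this idea, assuming a PyTorch-style sampler and a precomputed per-sample difficulty score, is shown below. The pacing schedule (a pool that starts at 20% of the data and grows linearly with training progress) is an illustrative assumption, not the loader proposed in the thesis.

```python
# Minimal sketch of a curriculum-style sampler: samples are ordered by a
# precomputed difficulty score and the pool of candidates grows over training.
import torch
from torch.utils.data import Sampler

class ProgressiveDifficultySampler(Sampler):
    def __init__(self, difficulty_scores, start_frac=0.2):
        # Indices sorted from easiest to hardest.
        self.order = torch.argsort(torch.as_tensor(difficulty_scores)).tolist()
        self.start_frac = start_frac
        self.progress = 0.0            # 0.0 .. 1.0, updated by the training loop

    def set_progress(self, progress: float):
        self.progress = min(max(progress, 0.0), 1.0)

    def _pool_size(self):
        frac = self.start_frac + (1.0 - self.start_frac) * self.progress
        return max(1, int(frac * len(self.order)))

    def __iter__(self):
        pool = self.order[: self._pool_size()]
        perm = torch.randperm(len(pool)).tolist()
        return iter(pool[i] for i in perm)

    def __len__(self):
        return self._pool_size()
```

A training loop would pass this sampler to a DataLoader and call set_progress(epoch / num_epochs) at the start of each epoch, so that later epochs draw from progressively harder data.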

With all of these resource-aware techniques, this thesis demonstrates that it is possible to achieve more in model training by using fewer resources.
Original language: English
Publisher: IT-Universitetet i København
Number of pages: 127
ISBN (Print): N/A
Status: Published - 2024
Series: ITU-DS
Number: 230
ISSN: 1602-3536
