TensorSocket: Shared Data Loading for Deep Learning Training

Publication: Journal article and conference article in journal · Research · Peer-reviewed

Abstract

Training deep learning models is a repetitive and resource-intensive process. Data scientists often train several models before settling on the set of parameters (e.g., during hyper-parameter tuning), the model architecture (e.g., during neural architecture search), and other choices that yield the highest accuracy. The computational efficiency of these training tasks depends highly on how well the training data is supplied to the training process. The repetitive nature of these tasks results in the same data processing pipelines running over and over, exacerbating the need for and costs of computational resources.

In this paper, we present TensorSocket to reduce the computational needs of deep learning training by enabling simultaneous training processes to share the same data loader. TensorSocket mitigates CPU-side bottlenecks in cases where the collocated training workloads have high throughput on GPU, but are held back by lower data-loading throughput on CPU. TensorSocket achieves this by reducing redundant computations and data duplication across collocated training processes and leveraging modern GPU-GPU interconnects. While doing so, TensorSocket can train and balance differently-sized models, serve multiple batch sizes simultaneously, and remain hardware- and pipeline-agnostic.
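TensorSocket's actual design shares batches across separate training processes and leverages GPU-GPU interconnects; none of that is reproduced here. As a rough illustration of the core idea only (load and preprocess each batch once, then fan it out to every collocated trainer), here is a minimal thread-based sketch; all names and structures are hypothetical and are not TensorSocket's API:

```python
import queue
import threading

def shared_loader(outputs, num_batches):
    # Load and preprocess each batch ONCE, then broadcast it to every
    # trainer, instead of each trainer running its own redundant pipeline.
    for i in range(num_batches):
        batch = [x * x for x in range(i, i + 4)]  # stand-in for costly CPU preprocessing
        for q in outputs:
            q.put(batch)  # every collocated trainer receives the same batch
    for q in outputs:
        q.put(None)  # sentinel: end of data

def trainer(inputs, results):
    # Consume shared batches; summing stands in for a training step.
    seen = []
    while (batch := inputs.get()) is not None:
        seen.append(sum(batch))
    results.append(seen)

def run(num_trainers=2, num_batches=3):
    outputs = [queue.Queue() for _ in range(num_trainers)]
    results = []
    threads = [threading.Thread(target=trainer, args=(q, results))
               for q in outputs]
    for t in threads:
        t.start()
    shared_loader(outputs, num_batches)
    for t in threads:
        t.join()
    return results  # each trainer saw identical batches, loaded only once

if __name__ == "__main__":
    print(run())
```

In this sketch the preprocessing loop runs once regardless of how many trainers are attached, which is the source of the CPU savings the paper describes; the real system additionally avoids copying data between processes and GPUs.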

Our evaluation shows that TensorSocket enables scenarios that are infeasible without data sharing, increases training throughput by up to 100%, and, when utilizing cloud instances, achieves cost savings of 50% by reducing the hardware resource needs on the CPU side. Furthermore, TensorSocket outperforms state-of-the-art shared data loading solutions such as CoorDL and Joader: it is easier to deploy and maintain, and it matches or exceeds their throughput while requiring fewer CPU resources.
Original language: English
Article number: 267
Journal: Proceedings of the ACM on Management of Data
Volume: 3
Issue number: 4
Pages (from-to): 1-26
Number of pages: 27
DOI
Status: Published - 22 Sep 2025
Event: The ACM Symposium on Principles of Database Systems - Bengaluru, India
Duration: 31 May 2026 - 5 Jun 2026
https://2026.sigmod.org/

