TY - JOUR
T1 - OSM: Off-Chip Shared Memory for GPUs
AU - Darabi, Sina
AU - Yousefzadeh-Asl-Miandoab, Ehsan
AU - Akbarzadeh, Negar
AU - Falahati, Hajar
AU - Lotfi-Kamran, Pejman
AU - Sadrosadati, Mohammad
AU - Sarbazi-Azad, Hamid
N1 - DBLP License: DBLP's bibliographic metadata records provided through http://dblp.org/ are distributed under a Creative Commons CC0 1.0 Universal Public Domain Dedication. Although the bibliographic metadata records are provided consistent with CC0 1.0 Dedication, the content described by the metadata records is not. Content may be subject to copyright, rights of privacy, rights of publicity and other restrictions.
PY - 2022/02/24
Y1 - 2022/02/24
AB - Graphics Processing Units (GPUs) employ a shared memory, a software-managed cache for programmers, in each streaming multiprocessor to accelerate data sharing among the threads in a thread block. Although 60% of the shared memory space is underutilized, on average, there are some workloads that demand higher shared memory capacities. Therefore, improving shared memory utilization while satisfying the needs of shared memory intensive workloads is challenging. We make a key observation that the lifetime of each shared memory address is significantly shorter than the execution time of a thread block. In this paper, we first propose Off-Chip Shared Memory (OSM) that allocates shared memory space in the off-chip memory, and accelerates accesses to it via a small on-chip cache. Using an 8 KB cache for shared memory addresses, OSM provides almost the same performance as the baseline GPU that uses 96 KB on-chip shared memory. OSM improves GPU performance in two ways. First, it allocates higher shared memory capacities in the off-chip memory, and improves thread-level parallelism (TLP). Second, it designs a unified cache for shared memory and global address spaces, providing more caching space for global memory address space even for the workloads with high shared memory utilization. Our experimental results show an average 21% and 18% IPC improvement compared to the baseline and the state-of-the-art architectures.
KW - Instruction sets
KW - Graphics processing units
KW - System-on-chip
KW - Memory management
KW - Proposals
KW - Bandwidth
KW - Registers
U2 - 10.1109/TPDS.2022.3154315
DO - 10.1109/TPDS.2022.3154315
M3 - Journal article
VL - 33
SP - 3415
EP - 3429
JO - IEEE Trans. Parallel Distributed Syst.
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 12
M1 - 12
ER -