[Vega 64/gfx900]: poor memory bandwidth in comparison to native HIP

Hello,

I’ve recently tested the NVIDIA plugin for oneAPI with the BabelStream benchmarks.
The resulting bandwidths match native CUDA very well.

However, there is a discrepancy between native HIP and SYCL which I wish to understand and rectify.

Spec:

  • CPU: AMD Ryzen 7 5800X
  • GPU: Radeon RX Vega 64 (gfx900)
  • OS: Rocky Linux 9.2 with 5.14.0 kernel

Software stack:

  • oneAPI: basekit 2023.2.1
  • ROCm: rocm-hip-sdk5.4.3

rocminfo

*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-021505f72c144864               
  Marketing Name:          AMD Radeon RX Vega                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                         

sycl-ls

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 7 5800X 8-Core Processor              3.0 [2023.16.7.0.21_160000]
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, AMD Radeon RX Vega gfx900:xnack- [HIP 50422.80]

BabelStream

$ git clone https://github.com/UoB-HPC/BabelStream

HIP Bandwidth

$ cd BabelStream/src/hip/
$ hipcc -O2 -DHIP -I. -I.. HIPStream.cpp ../main.cpp -o stream.x 
$ ./stream.x 
BabelStream
Version: 4.0
Implementation: HIP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using HIP device AMD Radeon RX Vega
Driver: 50422804
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        393206.592  0.00137     0.00141     0.00138     
Mul         395044.431  0.00136     0.00141     0.00137     
Add         359068.066  0.00224     0.00229     0.00225     
Triad       358939.871  0.00224     0.00229     0.00225     
Dot         363135.835  0.00148     0.00149     0.00148     
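
For context, the HIP kernel being timed above is essentially a one-thread-per-element streaming kernel along the following lines. This is only a sketch, not the exact BabelStream source, and the names are illustrative.

// Rough sketch of a BabelStream-style HIP triad kernel (not the exact source).
#include <hip/hip_runtime.h>

__global__ void triad_kernel(double *a, const double *b, const double *c,
                             double scalar, size_t n) {
  const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    a[i] = b[i] + scalar * c[i];
}

// Launched with, e.g., 1024-thread blocks covering the whole array:
//   hipLaunchKernelGGL(triad_kernel, dim3((n + 1023) / 1024), dim3(1024), 0, 0,
//                      a, b, c, scalar, n);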

SYCL Bandwidth

$ cd BabelStream/src/sycl2020 
$ icpx -O2 -DSYCL2020 \
         -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
         -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx900 \
         -I. -I.. SYCLStream2020.cpp ../main.cpp -o stream.x
$ ./stream.x --device 2
BabelStream
Version: 4.0
Implementation: SYCL 2020
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device AMD Radeon RX Vega
Driver: HIP 50422.80
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        205989.209  0.00261     0.00297     0.00273     
Mul         201782.697  0.00266     0.00299     0.00275     
Add         220715.935  0.00365     0.00394     0.00371     
Triad       226468.311  0.00356     0.00402     0.00367     
Dot         348870.293  0.00154     0.00342     0.00166     

The theoretical peak memory bandwidth of the Vega 64 is 484 GB/s (2048-bit HBM2 at 945 MHz).
The efficiencies are therefore about 74% and 46% for native HIP and SYCL, respectively.

Perhaps the parallel_for kernel is not mapped well to AMD hardware?
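
For reference, the SYCL kernel in question is essentially a plain parallel_for over a 1D range, along the following lines. Again this is only a sketch, not the exact BabelStream source; the USM-pointer style and names are illustrative.

// Rough sketch of a BabelStream-style SYCL 2020 triad kernel
// (not the exact source).
#include <sycl/sycl.hpp>

void triad(sycl::queue &q, double *a, const double *b, const double *c,
           double scalar, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> idx) {
    const size_t i = idx[0];
    a[i] = b[i] + scalar * c[i];
  }).wait();
}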

Thanks.

Thanks for the report. It looks like this might be the same issue as reported here. I’d recommend keeping an eye on that issue and we will also try to reply on this thread.

Hi @vitduck

We investigated the linked memory bandwidth issue causing poor SYCL performance on the AMD HIP backend compared with native HIP, found the culprit, and have a solution, which at the moment is an LLVM compiler flag and is not yet enabled by default. See the more detailed discussion here.

The problem was that unnecessary stack stores of zeroes were generated for each kernel argument, causing stalls that led to the poor memory bandwidth. Unfortunately these stores were not optimised away in the compiler optimisation pipeline. They were only meant to affect ND-range kernels with a global offset, a feature which is deprecated in SYCL 2020, so while it is still supported it is not something that should be used anyway.
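
For context, the deprecated "global offset" form referred to above is the nd_range constructor that takes an explicit id offset. The sketch below is illustrative only (the function and variable names are made up); ordinary kernels such as BabelStream's never pass an offset, which is why disabling the support is safe here.

// Sketch of the deprecated SYCL 2020 "global offset" ND-range form.
#include <sycl/sycl.hpp>

void offset_copy(sycl::queue &q, double *dst, const double *src,
                 size_t n, size_t offset) {  // assumes n is a multiple of 256
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(
        // nd_range constructor with an explicit global offset
        // (deprecated in SYCL 2020)
        sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{256},
                          sycl::id<1>{offset}},
        [=](sycl::nd_item<1> item) {
          const size_t i = item.get_global_id(0);  // id includes the offset
          dst[i] = src[i];
        });
  }).wait();
}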

To improve performance now, you will need to add -mllvm -enable-global-offset=false to your compile command (for either clang++ or icpx), which disables the transformation that generates the extra stack stores.
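
For example, applied to the compile line from earlier in this thread, the flag is the only change:

$ icpx -O2 -DSYCL2020 \
         -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
         -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx900 \
         -mllvm -enable-global-offset=false \
         -I. -I.. SYCLStream2020.cpp ../main.cpp -o stream.x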

Let us know if that solves the problem you are seeing and if anything is outstanding we will take a further look.

Thanks!