[Vega 64/gfx900]: poor SYCL memory bandwidth compared to native HIP

Hello,

I recently tested the NVIDIA plugin for oneAPI with the BabelStream benchmark.
The resulting bandwidths match native CUDA very well.

However, on AMD hardware there is a discrepancy between native HIP and SYCL that I would like to understand and rectify.

Spec:

  • CPU: AMD Ryzen 7 5800X
  • GPU: Radeon RX Vega 64 (gfx900)
  • OS: Rocky Linux 9.2 with 5.14.0 kernel

Software stack:

  • oneAPI: basekit 2023.2.1
  • ROCm: rocm-hip-sdk5.4.3

rocminfo

*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Uuid:                    GPU-021505f72c144864               
  Marketing Name:          AMD Radeon RX Vega                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                         

sycl-ls

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.7.0.21_160000]
[opencl:cpu:1] Intel(R) OpenCL, AMD Ryzen 7 5800X 8-Core Processor              3.0 [2023.16.7.0.21_160000]
[ext_oneapi_hip:gpu:0] AMD HIP BACKEND, AMD Radeon RX Vega gfx900:xnack- [HIP 50422.80]

BabelStream

$ git clone https://github.com/UoB-HPC/BabelStream

HIP Bandwidth

$ cd BabelStream/src/hip/
$ hipcc -O2 -DHIP -I. -I.. HIPStream.cpp ../main.cpp -o stream.x 
$ ./stream.x 
BabelStream
Version: 4.0
Implementation: HIP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using HIP device AMD Radeon RX Vega
Driver: 50422804
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        393206.592  0.00137     0.00141     0.00138     
Mul         395044.431  0.00136     0.00141     0.00137     
Add         359068.066  0.00224     0.00229     0.00225     
Triad       358939.871  0.00224     0.00229     0.00225     
Dot         363135.835  0.00148     0.00149     0.00148     

SYCL Bandwidth

$ cd BabelStream/src/sycl2020 
$ icpx -O2 -DSYCL2020 \
         -fsycl -fsycl-targets=amdgcn-amd-amdhsa \
         -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx900 \
         -I. -I.. SYCLStream2020.cpp ../main.cpp -o stream.x
$ ./stream.x --device 2
BabelStream
Version: 4.0
Implementation: SYCL 2020
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Using SYCL device AMD Radeon RX Vega
Driver: HIP 50422.80
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        205989.209  0.00261     0.00297     0.00273     
Mul         201782.697  0.00266     0.00299     0.00275     
Add         220715.935  0.00365     0.00394     0.00371     
Triad       226468.311  0.00356     0.00402     0.00367     
Dot         348870.293  0.00154     0.00342     0.00166     

The theoretical peak memory bandwidth of the Vega 64 is 484 GB/s.
That gives an efficiency of roughly 74% for native HIP and 46% for SYCL.
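
For reference, the efficiency figures can be reproduced with a small helper like the one below. This is only a sketch: the function name `efficiency_percent` is made up for illustration, and it assumes BabelStream's default decimal units (1 GB = 1000 MB) and the 484 GB/s peak quoted above.

```cpp
#include <cassert>

// Hypothetical helper (not part of BabelStream): convert a reported
// bandwidth in MBytes/sec into a percentage of the theoretical peak in GB/s.
// Assumes decimal units on both sides, i.e. 1 GB = 1000 MB.
double efficiency_percent(double mbytes_per_s, double peak_gb_per_s) {
    assert(peak_gb_per_s > 0.0);  // a zero or negative peak makes no sense
    const double measured_gb_per_s = mbytes_per_s / 1000.0;
    return 100.0 * measured_gb_per_s / peak_gb_per_s;
}
```

With the Triad rows from the two runs above, `efficiency_percent(358939.871, 484.0)` comes out at about 74.2 for HIP and `efficiency_percent(226468.311, 484.0)` at about 46.8 for SYCL.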

Perhaps the parallel_for kernels are not mapped well to AMD hardware?

Thanks.

Thanks for the report. It looks like this might be the same issue as reported here. I’d recommend keeping an eye on that issue and we will also try to reply on this thread.