SYCL Poor FFT Performance Compared to CUDA in Double Precision

Describe the bug

The double precision performance of fft-sycl (HeCBench) lags behind fft-cuda, achieving only about 70% of CUDA’s performance. Profiling with nsys (nsys nvprof --print-gpu-trace) reveals that the SYCL kernel uses far more registers (255 registers/thread) than its CUDA counterpart (72 registers/thread), leading to register spills, as observed with -Xcuda-ptxas --verbose.
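
For reference, these are the two measurements behind the register numbers above (the profiler command is run on the binary built in the steps below; one way to see the spills is to add the ptxas option to the compile flags in the Makefile):

nsys nvprof --print-gpu-trace ./main 3 1000
-Xcuda-ptxas --verbose

The GPU trace reports the registers/thread for each kernel launch, and the verbose ptxas output reports register usage plus any spill stores/loads at compile time.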

Interestingly, the single precision version of the code shows comparable performance between SYCL and CUDA. Although SYCL uses more registers (83 registers/thread) than CUDA (48 registers/thread), there are no register spills in either version.

Any recommendations to improve the performance of the double precision SYCL implementation would be greatly appreciated.

To reproduce

For SYCL (double precision):

  • cd HeCBench/src/fft-sycl
  • make CUDA=yes CUDA_ARCH=sm_80 GCC_TOOLCHAIN=""
  • Run code: ./main 3 1000

For CUDA (double precision):

  • cd HeCBench/src/fft-cuda
  • make ARCH=sm_80
  • Run code: ./main 3 1000

To build the single precision versions, ensure the SINGLE_PRECISION preprocessor macro is defined by editing main.cpp or the Makefile.
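
A minimal sketch of that switch (assuming the usual HeCBench precision guard; the actual names in main.cpp may differ):

// In main.cpp, which already includes <sycl/sycl.hpp>; alternatively pass
// -DSINGLE_PRECISION via the compile flags in the Makefile.
#define SINGLE_PRECISION   // comment this out to build the double precision version

#ifdef SINGLE_PRECISION
typedef float         T;   // real scalar type used by the benchmark
typedef sycl::float2  T2;  // complex value: x() = real part, y() = imaginary part
#else
typedef double        T;
typedef sycl::double2 T2;
#endif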

Here is the result of my run on NVIDIA A100:

For Double Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=16777216
FFT PASS
iFFT PASS
Average kernel execution time 0.00117844 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=1.67772e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000832436 (s)

For Single Precision

SYCL:

~/HeCBench/src/fft-sycl$ ./main 3 1000
used_bytes=268435456, n_cmplx=33554432
FFT PASS
iFFT PASS
Average kernel execution time 0.000819632 (s)

CUDA:

~/HeCBench/src/fft-cuda$ ./main 3 1000
used_bytes=268435456, n_cmplx=3.35544e+07
FFT PASS
iFFT PASS
Average kernel execution time 0.000798072 (s)

Environment

  • OS: Ubuntu 23.10
  • NVIDIA A100
  • clang++ --version:
$ clang++ --version
clang version 19.0.0git (https://github.com/intel/llvm 666cf66258363ba1c416d054cab38c85c04fe389)
Target: x86_64-unknown-linux-gnu
Thread model: posix
Build config: +assertions
  • sycl-ls --verbose:
$ sycl-ls --verbose
[opencl:fpga][opencl:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu][opencl:1] Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[cuda:gpu][cuda:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]

Platforms: 3
Platform [#1]:
    Version  : OpenCL 1.2 Intel(R) FPGA SDK for OpenCL(TM), Version 20.3
    Name     : Intel(R) FPGA Emulation Platform for OpenCL(TM)
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#0]:
        Type              : fpga
        Version           : OpenCL 1.2 
        Name              : Intel(R) FPGA Emulation Device
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : accelerator fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_atomic_host_allocations usm_atomic_shared_allocations ext_oneapi_srgb ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_intel_fpga_task_sequence ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: unknown
Platform [#2]:
    Version  : OpenCL 3.0 LINUX
    Name     : Intel(R) OpenCL
    Vendor   : Intel(R) Corporation
    Devices  : 1
        Device [#1]:
        Type              : cpu
        Version           : OpenCL 3.0 (Build 0)
        Name              :            Intel(R) Xeon(R) CPU @ 2.20GHz
        Vendor            : Intel(R) Corporation
        Driver            : 2024.17.5.0.08_160000.xmain-hotfix
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : cpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations usm_system_allocations usm_atomic_host_allocations usm_atomic_shared_allocations atomic64 ext_oneapi_srgb ext_oneapi_native_assert ext_intel_legacy_image ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_tangle_group ext_oneapi_private_alloca
        info::device::sub_group_sizes: 4 8 16 32 64
        Architecture: x86_64
Platform [#3]:
    Version  : CUDA 12.4
    Name     : NVIDIA CUDA BACKEND
    Vendor   : NVIDIA Corporation
    Devices  : 1
        Device [#0]:
        Type              : gpu
        Version           : 8.0
        Name              : NVIDIA A100-SXM4-40GB
        Vendor            : NVIDIA Corporation
        Driver            : CUDA 12.4
        UUID              : 9031191727913913712018114216122172201180135
        Num SubDevices    : 0
        Num SubSubDevices : 0
        Aspects           : gpu fp16 fp64 online_compiler online_linker queue_profiling usm_device_allocations usm_host_allocations usm_shared_allocations ext_intel_pci_address usm_atomic_shared_allocations atomic64 ext_intel_device_info_uuid ext_oneapi_bfloat16_math_functions ext_intel_free_memory ext_intel_device_id ext_intel_memory_clock_rate ext_intel_memory_bus_width ext_oneapi_bindless_images ext_oneapi_bindless_images_shared_usm ext_oneapi_bindless_images_2d_usm ext_oneapi_interop_memory_import ext_oneapi_interop_semaphore_import ext_oneapi_mipmap ext_oneapi_mipmap_anisotropy ext_oneapi_mipmap_level_reference ext_oneapi_ballot_group ext_oneapi_fixed_size_group ext_oneapi_opportunistic_group ext_oneapi_graph ext_oneapi_limited_graph ext_oneapi_cubemap ext_oneapi_cubemap_seamless_filtering ext_oneapi_bindless_sampled_image_fetch_1d_usm ext_oneapi_bindless_sampled_image_fetch_2d_usm ext_oneapi_bindless_sampled_image_fetch_2d ext_oneapi_bindless_sampled_image_fetch_3d ext_oneapi_queue_profiling_tag
        Images are not fully supported by the CUDA BE, their support is disabled by default. Their partial support can be activated by setting SYCL_PI_CUDA_ENABLE_IMAGE_SUPPORT environment variable at runtime.
        info::device::sub_group_sizes: 32
        Architecture: nvidia_gpu_sm_80
default_selector()      : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
accelerator_selector()  : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
cpu_selector()          : cpu, Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
gpu_selector()          : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(gpu)    : gpu, NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-40GB 8.0 [CUDA 12.4]
custom_selector(cpu)    : cpu, Intel(R) OpenCL,            Intel(R) Xeon(R) CPU @ 2.20GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
custom_selector(acc)    : fpga, Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]

Hi @yencal,
I could reproduce the issue and we will be looking into why it’s happening. Right now, I can offer a workaround forcing the max number of registers in the PTX->assembly compilation. Try adding:

-Xcuda-ptxas --maxrregcount=64

in the compiler invocation. This recovered the performance for me on the A100 GPU. Please let us know if you see the same.
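
For context, a standalone compile of the SYCL version with the workaround could look like this (a sketch; in practice it’s easier to append the last option to the compile flags in the fft-sycl Makefile):

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -Xcuda-ptxas --maxrregcount=64 main.cpp -o main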

Thanks,
Rafal

Hi @yencal,
we found the reason for the difference and it’s because the code differs between the two versions. The complex helper functions (exp_i, cmplx_mul, cm_fl_mul, cmplx_add, cmplx_sub) are declared inline in the SYCL version but are not marked inline in the CUDA version.

When I changed the inline to [[clang::noinline]], I see the SYCL version matching the CUDA performance.

Hope this helps!
Thanks,
Rafal

Hi @rbielski,
Thanks for the insight. I did notice the inline functions in the SYCL code; removing the inline keyword did not help, but using [[clang::noinline]] to force the compiler not to inline them does the trick (about 90% of the CUDA performance, even with sycl::ext::oneapi::property::queue::discard_events{}). It would be nice if the SYCL compiler could make better decisions about when to inline or not, instead of requiring a manual [[clang::noinline]].

Thanks for your help.

Caleb Yenusah

Hi @rbielski
I actually noticed that only the exp_i function should be excluded from inlining ([[clang::noinline]]) to match the CUDA performance. If [[clang::noinline]] is applied to all the other functions as well, the performance drops substantially (to roughly 1% of the CUDA performance).

Here is what gave the best performance:

[[clang::noinline]] T2 exp_i( T phi ) { 
  return (T2){ sycl::cos(phi), sycl::sin(phi) };
}

inline T2 cmplx_mul( T2 a, T2 b ) { return (T2){ a.x()*b.x()-a.y()*b.y(), a.x()*b.y()+a.y()*b.x() }; }

inline T2 cm_fl_mul( T2 a, T  b ) { return (T2){ b*a.x(), b*a.y() }; }

inline T2 cmplx_add( T2 a, T2 b ) { return (T2){ a.x() + b.x(), a.y() + b.y() }; }

inline T2 cmplx_sub( T2 a, T2 b ) { return (T2){ a.x() - b.x(), a.y() - b.y() }; }

Hi @rbielski,

Upon further analysis, although the CUDA versions of the functions did not explicitly specify inline, these functions were indeed inlined, as evident from the PTX. Similarly, in the SYCL versions, the functions are also inlined regardless of whether they are explicitly marked as inline. Therefore, the difference in performance between the two versions is not due to the inline keyword.

The issue seems to lie with the sycl::cos and sycl::sin functions. When these functions are inlined, they significantly increase register usage, which adversely affects performance. That is why when [[clang::noinline]] is applied to the exp_i function, register usage is reduced, improving performance in the SYCL version.
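
One untested variant of the same idea (a sketch, not something I have measured) is to keep exp_i inline and instead stop only the trig calls from being inlined, since their bodies are what drive the register pressure:

// Hypothetical wrappers: only the sycl::sin/sycl::cos bodies stay out of the kernel.
[[clang::noinline]] T sin_ni( T x ) { return sycl::sin(x); }
[[clang::noinline]] T cos_ni( T x ) { return sycl::cos(x); }

inline T2 exp_i( T phi ) { return (T2){ cos_ni(phi), sin_ni(phi) }; }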

Interestingly, when cos and sin are inlined in the CUDA version, the register usage does not increase dramatically as it does in the SYCL version.

Thank you,
Caleb

Thank you for your analysis of the benchmark.

@rbielski
Any comments on the impact of the sycl::sin()/sycl::cos() functions?

Hi @zjin-lcf. We’re actively looking into the problem you’ve raised and I’ll get back to you with our findings.

I thought I’d write up our initial findings since it’s been a while without a response. I have since been pulled in a few other directions but we will get back to the investigation at some point.

We started investigating this by reducing the kernel(s) as far as we could while still seeing interesting differences in register usage.

We narrowed things down to a SYCL kernel as follows:

class fft1D_512 {
public:
  fft1D_512(T2 *_work) : work{_work} {}

  void operator()(sycl::nd_item<1> item) const {
    int tid = item.get_local_id(0);
    int gid = item.get_group(0) + tid;
    T fid = tid;
    work[gid].x() = sycl::sin(fid);
    work[gid].y() = sycl::sin(fid);
  }

private:
  T2 *work;
};

The CUDA equivalent should be fairly obvious.
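
For completeness, that CUDA kernel looks roughly like this (a sketch of the equivalent; T and T2 are double and double2 in the double precision build):

// Reduced CUDA kernel used for the register-count comparison.
__global__ void fft1D_512(T2 *work) {
  int tid = threadIdx.x;
  int gid = blockIdx.x + tid;
  T fid = tid;
  work[gid].x = sin(fid);  // double-precision sin from the CUDA device math library
  work[gid].y = sin(fid);
}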

For SM89, we saw 30 registers for CUDA and 40 registers for SYCL. This is despite (or because of) the SYCL compiler doing a much better job at optimizing redundant code. The PTX for the SYCL kernel is far smaller than the CUDA PTX, and profiling suggests it executes 10% fewer instructions. The CUDA-generated PTX calls the __internal_trig_reduction_slowpathd method - twice, for equivalent calls to sin. The SYCL compiler is able to remove both of those calls completely through range analysis and constant propagation. Perhaps nvcc will learn these optimizations too in a future version, since it’s based on LLVM. Or maybe it’ll leave that sort of optimization to ptxas.

Playing around with the SYCL-generated PTX I was able to manually predicate some of the control flow and get down to 30 registers with one method, 26 with another, and 40 with a third.

This is unfortunate for us because the advice we’ve received is not to predicate PTX and instead let ptxas do the optimal thing. In this reduced example, though, ptxas is clearly not doing the optimal thing (in terms of register usage, anyway). We don’t have any insight into what ptxas does or when, or how best to get LLVM to generate the kind of PTX that ptxas will do the best job on across the board. I suspect this would be a big undertaking, especially to enable it by default. The wide range in register counts (26, 30, 40) for equivalent PTX code shows how brittle this sort of thing can be.

Whether or not this reduced example is representative of the original problem is unclear, but I hope that some background and context will help. For the original problem, I can also say that something in the unrolled sin/cos loop is causing LLVM to sink a lot of code from all loop iterations into the epilogue, which might be extending the live ranges too far and causing undue register pressure. But we’ve also seen that in extremely small examples, control-flow patterns present in sin/cos cause LLVM to produce PTX that ptxas struggles to make the optimal decisions with.

That’s what I need to come back to when I have the time. For now I’d still stick with using --maxrregcount until we find a better solution.
