Memory access fault with AMD plugin only

flg · 2 April 2025 14:22

Hi,

We have some (non-trivial) SYCL code compiled for both Nvidia and AMD with:

-fsycl-targets=nvidia_gpu_sm_80,amd_gpu_gfx90a -Wno-unknown-cuda-version --gcc-toolchain=/opt/rh/gcc-toolset-13/root -Xclang -opaque-pointers

SYCL compiler is IntelLLVM 2025.1 (clang)
ROCm 6.3.4
CUDA 12.6

It works like a charm on an Nvidia GPU (A100 80GB) but fails on an AMD GPU (MI210) with the following error:

Memory access fault by GPU node-4 (Agent handle: 0xf10ad0) on address 0x111000. Reason: Unknown.

Does this point to a problem in the compiler, or in the SYCL AMD plugin?

What would be the best way to investigate this kind of issue?

Cheers,
Fabrice

duncan · 2 April 2025 18:20

Hi @flg ,
It’s possible that you have found an issue here, and we have a few recommendations for diagnosing it further. Most of them involve using the sanitiser tools to check whether some unexpected memory access is going on.

One option is to use the normal address sanitiser while compiling your application, then use the OpenCL CPU implementation to run the kernels. To the best of my knowledge, this should find any out-of-bounds errors in the kernel as it runs.
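A minimal sketch of that first option, under stated assumptions: `app.cpp` and `app_asan` are hypothetical names, and the build uses the default `-fsycl` target rather than the Nvidia/AMD targets so that the OpenCL CPU device can run the kernels. `ONEAPI_DEVICE_SELECTOR=opencl:cpu` restricts the SYCL runtime to that device. Whether host-side `-fsanitize=address` also covers kernels executed by the OpenCL CPU runtime is not confirmed here; the Intel article linked below describes device-side instrumentation via `-Xarch_device -fsanitize=address`, which may be needed instead.

```shell
#!/bin/sh
# Sketch only: SRC and BIN are hypothetical names, not taken from the thread.
SRC=app.cpp
BIN=./app_asan

# Build with AddressSanitizer and debug info for the default SYCL target
# (no nvidia_gpu_* / amd_gpu_* targets, so the kernels can run on a CPU device).
BUILD="icpx -fsycl -g -O1 -fsanitize=address $SRC -o $BIN"

# Restrict the SYCL runtime to the OpenCL CPU device for this run.
RUN="ONEAPI_DEVICE_SELECTOR=opencl:cpu $BIN"

if command -v icpx >/dev/null 2>&1 && [ -f "$SRC" ]; then
    # Compiler and source are present: build, then run on the CPU device.
    $BUILD && env ONEAPI_DEVICE_SELECTOR=opencl:cpu "$BIN"
else
    # Otherwise just print the commands so they can be adapted by hand.
    echo "would run: $BUILD"
    echo "would run: $RUN"
fi
```

If the fault is a genuine out-of-bounds access in the kernel, a CPU run like this is also easier to inspect in an ordinary debugger than the GPU run.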

Otherwise, you can try using the address sanitiser with the actual GPU hardware. Intel have recently improved their support for this, as detailed here: https://www.intel.com/content/www/us/en/developer/articles/technical/find-bugs-quickly-using-sanitizers-with-oneapi-compiler.html#inpage-nav-5
AMD seem to have their own support for similar debugging, as described here: Using the AddressSanitizer on a GPU (beta release) — ROCm Documentation

I’ve never tried this, so I can’t say that it will definitely work with icpx on AMD hardware, but it has to be worth a shot!
