Memory access fault with AMD plugin only

Hi,

We have some (non-trivial) SYCL code compiled for both Nvidia and AMD with:

-fsycl-targets=nvidia_gpu_sm_80,amd_gpu_gfx90a -Wno-unknown-cuda-version --gcc-toolchain=/opt/rh/gcc-toolset-13/root -Xclang -opaque-pointers

SYCL compiler is IntelLLVM 2025.1 (clang)
ROCm 6.3.4
CUDA 12.6

It works like a charm on an Nvidia GPU (A100 80GB) but fails on an AMD GPU (MI210) with the following error:

Memory access fault by GPU node-4 (Agent handle: 0xf10ad0) on address 0x111000. Reason: Unknown.

Does this point to a problem in the compiler, or in the SYCL AMD plugin?

What would be the best way to investigate this kind of issue?

Cheers,
Fabrice

Hi @flg ,
It’s possible that you have found an issue here, and we have a few recommendations for diagnosing it further. Most of them involve using the sanitiser tools to check whether some unexpected memory access is going on.

One option is to use the normal address sanitiser while compiling your application, then use the OpenCL CPU implementation to run the kernels. To the best of my knowledge, this should find any out-of-bounds errors in the kernel as it runs.
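
A minimal sketch of that first option, under some assumptions: `app.cpp` and `app_asan` are hypothetical stand-ins for your real sources and binary, an OpenCL CPU runtime is installed, and the `spir64` target is added because your current build only targets NVIDIA and AMD. The script prints the commands instead of running them if the compiler or source file is not found.

```shell
#!/usr/bin/env bash
# Sketch only: app.cpp and app_asan are hypothetical names.
SYCL_CXX="${SYCL_CXX:-icpx}"   # or clang++ from an IntelLLVM build
SRC="app.cpp"
BIN="./app_asan"

# Build with the address sanitiser enabled. spir64 device code is needed
# so that the OpenCL CPU device can run the kernels.
BUILD_CMD="$SYCL_CXX -fsycl -fsycl-targets=spir64 -fsanitize=address -g -O1 $SRC -o $BIN"

# ONEAPI_DEVICE_SELECTOR restricts the SYCL runtime to the OpenCL CPU device.
RUN_CMD="env ONEAPI_DEVICE_SELECTOR=opencl:cpu $BIN"

if command -v "$SYCL_CXX" >/dev/null 2>&1 && [ -f "$SRC" ]; then
  $BUILD_CMD && $RUN_CMD
else
  # Compiler or source not available: show the commands as a dry run.
  echo "$BUILD_CMD"
  echo "$RUN_CMD"
fi
```

If the fault comes from an out-of-bounds access in a kernel, the hope is that the sanitiser report names the offending source line, which is why `-g` is included.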

Otherwise, you can try using the address sanitiser with the actual GPU hardware. Intel have recently improved their support for this, as detailed here: https://www.intel.com/content/www/us/en/developer/articles/technical/find-bugs-quickly-using-sanitizers-with-oneapi-compiler.html#inpage-nav-5
AMD seem to have their own support for similar debugging, described in the ROCm documentation page “Using the AddressSanitizer on a GPU (beta release)”.

I’ve never tried this, so I can’t say that it will definitely work with icpx on AMD hardware, but it has to be worth a shot!
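
For reference, a rough sketch of what the device-side attempt could look like. The `-Xarch_device -fsanitize=address` flags are the ones Intel documents for device-side sanitising; whether they are accepted or effective for the `amd_gpu_gfx90a` target in your compiler release is an untested assumption, and `app.cpp` / `app_dev_asan` are hypothetical names. As above, the script only prints the commands if the compiler or source is missing.

```shell
#!/usr/bin/env bash
# Sketch only: untested on AMD hardware; file names are hypothetical.
SYCL_CXX="${SYCL_CXX:-icpx}"   # or clang++ from an IntelLLVM build
SRC="app.cpp"
BIN="./app_dev_asan"

# -Xarch_device applies -fsanitize=address to the device compilation only.
BUILD_CMD="$SYCL_CXX -fsycl -fsycl-targets=amd_gpu_gfx90a -Xarch_device -fsanitize=address -g -O2 $SRC -o $BIN"

# Select the HIP backend so the kernels run on the AMD GPU.
RUN_CMD="env ONEAPI_DEVICE_SELECTOR=hip:gpu $BIN"

if command -v "$SYCL_CXX" >/dev/null 2>&1 && [ -f "$SRC" ]; then
  $BUILD_CMD && $RUN_CMD
else
  # Compiler or source not available: show the commands as a dry run.
  echo "$BUILD_CMD"
  echo "$RUN_CMD"
fi
```

If the compiler rejects the flag for this target, that in itself answers the question of whether this route is available on AMD.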