AMD/HIP sycl-ls segfault

Hello,

I’m getting a segfault from sycl-ls when working with an AMD GPU. I’m working in a constrained environment, so there may well be environment or incompatibility issues, but I can’t figure out how to investigate the problem, let alone solve it. Any help would be appreciated.

Here’s what I get:

$ SYCL_PI_TRACE=-1 oneapi/sycl-ls 
SYCL_PI_TRACE[-1]: dlopen(/net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_level_zero.so) failed with <libze_loader.so.1: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[-1]: dlopen(/net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_cuda.so) failed with </net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_cuda.so: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[-1]: dlopen(/net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_native_cpu.so) failed with </net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_native_cpu.so: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_level_zero.so
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_cuda.so
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_hip.so [ PluginVersion: 15.49.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_native_cpu.so
---> piPlatformsGet(
	<unknown> : 0
	<nullptr>
	<unknown> : 0x7ffee0bf871c
) ---> 	pi_result : PI_SUCCESS

---> piPlatformsGet(
	<unknown> : 1
	<unknown> : 0x55690b557dc0
	<nullptr>
) ---> 	pi_result : PI_SUCCESS
	[out]<unknown> ** : 0x55690b557dc0[ 0x7f7f19ff2f90 ... ]

---> piPlatformGetInfo(
	pi_platform : 0x7f7f19ff2f90
	<unknown> : 135168
	<unknown> : 4
	<unknown> : 0x7ffee0bf867c
	<nullptr>
) ---> 	pi_result : PI_SUCCESS

---> piPlatformGetInfo(
	pi_platform : 0x7f7f19ff2f90
	<unknown> : 2306
	<unknown> : 0
	<nullptr>
	<unknown> : 0x7ffee0bf85b8
) ---> 	pi_result : PI_SUCCESS

---> piPlatformGetInfo(
	pi_platform : 0x7f7f19ff2f90
	<unknown> : 2306
	<unknown> : 36
	<char * > : 0x55690b7615b0
	<nullptr>
) ---> 	pi_result : PI_SUCCESS

---> piPlatformGetInfo(
	pi_platform : 0x7f7f19ff2f90
	<unknown> : 2306
	<unknown> : 0
	<nullptr>
	<unknown> : 0x7ffee0bf85b8
) ---> 	pi_result : PI_SUCCESS

---> piPlatformGetInfo(
	pi_platform : 0x7f7f19ff2f90
	<unknown> : 2306
	<unknown> : 36
	<char * > : 0x55690b657cf0
	<nullptr>
) ---> 	pi_result : PI_SUCCESS

SYCL_PI_TRACE[all]: AMD Accelerated Parallel Processing OpenCL platform found but is not compatible.
---> piPlatformsGet(
	<unknown> : 0
	<nullptr>
	<unknown> : 0x7ffee0bf871c
Segmentation fault (core dumped)

The stack trace:

(gdb) bt
#0  0x00007ffff42e1c3d in amd::Command::enqueue() () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#1  0x00007ffff40bc895 in hip::Event::enqueueRecordCommand(ihipStream_t*, amd::Command*, bool) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#2  0x00007ffff40be6d3 in hip::Event::addMarker(ihipStream_t*, amd::Command*, bool) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#3  0x00007ffff40c2c67 in hipEventRecord_common(ihipEvent_t*, ihipStream_t*) () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#4  0x00007ffff40c32f6 in hipEventRecord () from /opt/rocm-6.0.0/lib/libamdhip64.so.6
#5  0x00007ffff596dee7 in std::call_once<urPlatformGet::{lambda(ur_result_t&)#1}, ur_result_t&>(std::once_flag&, urPlatformGet::{lambda(ur_result_t&)#1}&&, ur_result_t&)::{lambda()#2}::_FUN() ()
   from /net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_hip.so
#6  0x00007ffff748ee18 in __pthread_once_slow () from /lib64/libc.so.6
#7  0x00007ffff596d7e8 in urPlatformGet () from /net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_hip.so
#8  0x00007ffff597b912 in piPlatformsGet () from /net/home/ppd/flegoff/oneapi/compiler/2024.2/lib/libpi_hip.so
#9  0x00007ffff7e11d7a in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)0, int, decltype(nullptr), unsigned int*>(int, decltype(nullptr), unsigned int*) const ()
   from /home/ppd/flegoff/oneapi/compiler/2024.2/lib/libsycl.so.7
#10 0x00007ffff7e0c72b in sycl::_V1::detail::platform_impl::get_platforms()::$_0::operator()(std::shared_ptr<sycl::_V1::detail::plugin>&) const () from /home/ppd/flegoff/oneapi/compiler/2024.2/lib/libsycl.so.7
#11 0x00007ffff7e0ba81 in sycl::_V1::detail::platform_impl::get_platforms() () from /home/ppd/flegoff/oneapi/compiler/2024.2/lib/libsycl.so.7
#12 0x00007ffff7f12899 in sycl::_V1::platform::get_platforms() () from /home/ppd/flegoff/oneapi/compiler/2024.2/lib/libsycl.so.7
#13 0x000055555555ba09 in main ()
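
For anyone wanting to reproduce this without sycl-ls: the backtrace ends in sycl::platform::get_platforms(), so a minimal program that only enumerates platforms and devices should hit the same code path. A sketch of what I mean (nothing beyond the standard SYCL API; built with something like icpx -fsycl):

// Minimal platform/device enumeration, essentially what sycl-ls does (sketch).
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // This is the call that segfaults in the backtrace above.
  for (const auto &P : sycl::platform::get_platforms()) {
    std::cout << P.get_info<sycl::info::platform::name>() << "\n";
    for (const auto &D : P.get_devices())
      std::cout << "  " << D.get_info<sycl::info::device::name>() << "\n";
  }
  return 0;
}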

Here’s how I installed it:

$ ls -1 oneapi/
compiler
debugger
dpl
sycl-ls
tbb
tcm
umf
$ export LD_LIBRARY_PATH=$PWD/oneapi/tcm/1.2/lib:$PWD/oneapi/umf/0.9/lib:$PWD/oneapi/tbb/2022.0/env/../lib/intel64/gcc4.8:$PWD/oneapi/dpl/2022.6/lib:$PWD/oneapi/debugger/2024.2/opt/debugger/lib:$PWD/oneapi/compiler/2024.2/opt/compiler/lib:$PWD/oneapi/compiler/2024.2/lib

For what it’s worth, sycl-ls works just fine on the installation machine, but since there is no GPU there, the HIP plugin is not loaded on that machine.

Best,
Fabrice

Hi @flg,
The first thing I’ve spotted is that your oneAPI and plugin versions appear to be mismatched (oneAPI 2024.2, plugin 2024.1), if those are the URLs you are using in your install scripts. I don’t imagine that two patch versions would make a difference either, but we test against ROCm 6.0.2 internally, and that’s another potential source of mismatch.

Beyond that, 2025.0 has been released and includes some large changes, including the removal of the PI layer. It would be a touch easier to debug with the improved Unified Runtime output.

Hi @duncan,

My bad, it’s a typo: the oneAPI plugin version is 2024.2.1. I’ll give it a try with 2025.0.

Thanks.

I switched to oneAPI 2025.0 as suggested. The ROCm version is now 6.1.0, and the plugin is libur_adapter_hip.so.0.10.0.

I also built an Apptainer container to rule out environment/copying issues and to make this easier to reproduce.

I still get a segfault.

rocm-smi finds the devices OK:

Apptainer> rocm-smi 


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
0    32.0c           41.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
1    33.0c           40.0W   800Mhz  1600Mhz  0%   auto  300.0W    0%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================

Listing the HIP devices segfaults:

Apptainer> ONEAPI_DEVICE_SELECTOR="hip:*" SYCL_UR_TRACE=-1 sycl-ls
INFO: Output filtered by ONEAPI_DEVICE_SELECTOR environment variable, which is set to hip:*.
To see device ids, use the --ignore-device-selectors CLI option.

<LOADER>[INFO]: loaded adapter 0x0x55f31f76b140 (libur_adapter_hip.so.0)
---> urAdapterGet(.NumEntries = 0, .phAdapters = {}, .pNumAdapters = 0x7ffcd7117f1c (1)) -> UR_RESULT_SUCCESS;
---> urAdapterGet(.NumEntries = 1, .phAdapters = {0x7f5735b6d190}, .pNumAdapters = nullptr) -> UR_RESULT_SUCCESS;
---> urAdapterGetInfo(.hAdapter = 0x7f5735b6d190, .propName = UR_ADAPTER_INFO_BACKEND, .propSize = 4, .pPropValue = 0x7ffcd7117f80 (UR_ADAPTER_BACKEND_HIP), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetSegmentation fault (core dumped)

Full trace without filtering:

Apptainer> SYCL_UR_TRACE=-1 sycl-ls
<LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: libze_loader.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0' with error: libze_loader.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: loaded adapter 0x0x55b461a5ad50 (libur_adapter_opencl.so.0)
<LOADER>[INFO]: failed to load adapter 'libur_adapter_cuda.so.0' with error: libur_adapter_cuda.so.0: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_cuda.so.0' with error: /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_cuda.so.0: cannot open shared object file: No such file or directory
<LOADER>[INFO]: loaded adapter 0x0x55b461a5d4e0 (libur_adapter_hip.so.0)
<LOADER>[INFO]: failed to load adapter 'libur_adapter_native_cpu.so.0' with error: libur_adapter_native_cpu.so.0: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_native_cpu.so.0' with error: /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_native_cpu.so.0: cannot open shared object file: No such file or directory
---> urAdapterGet(.NumEntries = 0, .phAdapters = {}, .pNumAdapters = 0x7ffe7aa2912c (2)) -> UR_RESULT_SUCCESS;
---> urAdapterGet(.NumEntries = 2, .phAdapters = {0x55b461b520c0, 0x55b461b52100}, .pNumAdapters = nullptr) -> UR_RESULT_SUCCESS;
---> urAdapterGetInfo(.hAdapter = 0x55b461b520c0, .propName = UR_ADAPTER_INFO_BACKEND, .propSize = 4, .pPropValue = 0x7ffe7aa29190 (UR_ADAPTER_BACKEND_OPENCL), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urAdapterGetInfo(.hAdapter = 0x55b461b52100, .propName = UR_ADAPTER_INFO_BACKEND, .propSize = 4, .pPropValue = 0x7ffe7aa29190 (UR_ADAPTER_BACKEND_HIP), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGet(.phAdapters = {0x55b461b520c0}, .NumAdapters = 1, .NumEntries = 0, .phPlatforms = {}, .pNumPlatforms = 0x7ffe7aa291dc (1)) -> UR_RESULT_SUCCESS;
---> urPlatformGet(.phAdapters = {0x55b461b520c0}, .NumAdapters = 1, .NumEntries = 1, .phPlatforms = {0x55b461cdc730}, .pNumPlatforms = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_BACKEND, .propSize = 4, .pPropValue = 0x7ffe7aa2924c (UR_PLATFORM_BACKEND_OPENCL), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa29178 (16)) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 16, .pPropValue = 0x55b461cedae0 (Intel(R) OpenCL), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa29178 (16)) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 16, .pPropValue = 0x55b461cedae0 (Intel(R) OpenCL), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGet(.hPlatform = 0x55b461cdc730, .DeviceType = UR_DEVICE_TYPE_ALL, .NumEntries = 0, .phDevices = {}, .pNumDevices = 0x7ffe7aa291c4 (1)) -> UR_RESULT_SUCCESS;
---> urDeviceGet(.hPlatform = 0x55b461cdc730, .DeviceType = UR_DEVICE_TYPE_ALL, .NumEntries = 1, .phDevices = {0x55b461cedef0}, .pNumDevices = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_VERSION, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28f08 (17)) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_VERSION, .propSize = 17, .pPropValue = 0x55b461cee450 (OpenCL 3.0 LINUX), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28f08 (16)) -> UR_RESULT_SUCCESS;
---> urPlatformGetInfo(.hPlatform = 0x55b461cdc730, .propName = UR_PLATFORM_INFO_NAME, .propSize = 16, .pPropValue = 0x55b461cee450 (Intel(R) OpenCL), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_TYPE, .propSize = 4, .pPropValue = 0x55b461c4b468 (UR_DEVICE_TYPE_CPU), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_PARENT_DEVICE, .propSize = 8, .pPropValue = 0x55b461c4b470 (nullptr), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceRetain(.hDevice = 0x55b461cedef0) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_EXTENSIONS, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28bf8 (1012)) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_EXTENSIONS, .propSize = 1012, .pPropValue = 0x55b461cf4b20 (cl_khr_spirv_linkonce_odr cl_khr_fp64 cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_extended_bit_ops cl_khr_icd cl_khr_il_program cl_khr_suggested_local_work_size cl_intel_unified_shared_memory cl_intel_devicelib_assert cl_khr_subgroup_ballot cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_clustered_reduce cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_required_subgroup_size cl_intel_spirv_subgroups cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_intel_device_attribute_query cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_intel_device_partition_by_names cl_khr_spir cl_khr_image2d_from_buffer cl_intel_concurrent_dispatch), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_TYPE, .propSize = 4, .pPropValue = 0x7ffe7aa28e3c (UR_DEVICE_TYPE_CPU), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_VENDOR_ID, .propSize = 4, .pPropValue = 0x7ffe7aa28f28 (32902), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_DRIVER_VERSION, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28ec8 (23)) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_DRIVER_VERSION, .propSize = 23, .pPropValue = 0x55b461cf5c60 (2024.18.10.0.08_160000), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_NAME, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28ea8 (48)) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_NAME, .propSize = 48, .pPropValue = 0x55b461cf5d10 (AMD EPYC 7F72 24-Core Processor                ), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceRelease(.hDevice = 0x55b461cedef0) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_TYPE, .propSize = 4, .pPropValue = 0x55b461c4b0c8 (UR_DEVICE_TYPE_CPU), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_PARENT_DEVICE, .propSize = 8, .pPropValue = 0x55b461c4b0d0 (nullptr), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceRetain(.hDevice = 0x55b461cedef0) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_EXTENSIONS, .propSize = 0, .pPropValue = nullptr, .pPropSizeRet = 0x7ffe7aa28f18 (1012)) -> UR_RESULT_SUCCESS;
---> urDeviceGetInfo(.hDevice = 0x55b461cedef0, .propName = UR_DEVICE_INFO_EXTENSIONS, .propSize = 1012, .pPropValue = 0x55b461cf4f20 (cl_khr_spirv_linkonce_odr cl_khr_fp64 cl_khr_fp16 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_extended_bit_ops cl_khr_icd cl_khr_il_program cl_khr_suggested_local_work_size cl_intel_unified_shared_memory cl_intel_devicelib_assert cl_khr_subgroup_ballot cl_khr_subgroup_shuffle cl_khr_subgroup_shuffle_relative cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_arithmetic cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_clustered_reduce cl_intel_subgroups cl_intel_subgroups_char cl_intel_subgroups_short cl_intel_subgroups_long cl_intel_required_subgroup_size cl_intel_spirv_subgroups cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_intel_device_attribute_query cl_intel_exec_by_local_thread cl_intel_vec_len_hint cl_intel_device_partition_by_names cl_khr_spir cl_khr_image2d_from_buffer cl_intel_concurrent_dispatch), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
---> urDeviceRelease(.hDevice = 0x55b461cedef0) -> UR_RESULT_SUCCESS;
---> urDeviceRelease(.hDevice = 0x55b461cedef0) -> UR_RESULT_SUCCESS;
---> urPlatformGetSegmentation fault (core dumped)

Stack trace from gdb-oneapi sycl-ls (no debug info from libamdhip64, unfortunately):

Thread 1 "sycl-ls" received signal SIGSEGV, Segmentation fault.
0x00007fffeabfad0d in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
(gdb) bt
#0  0x00007fffeabfad0d in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
#1  0x00007fffea9a4cb9 in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
#2  0x00007fffea9a6be7 in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
#3  0x00007fffea9ab36f in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
#4  0x00007fffea9aba2e in ?? () from /opt/rocm-6.1.0/lib/libamdhip64.so.6
#5  0x00007ffff4f40626 in std::call_once<urPlatformGet::{lambda(ur_result_t&)#1}, ur_result_t&>(std::once_flag&, urPlatformGet::{lambda(ur_result_t&)#1}&&, ur_result_t&)::{lambda()#2}::_FUN() ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_hip.so.0
#6  0x00007ffff7766ec3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x00007ffff4f40198 in urPlatformGet () from /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_hip.so.0
#8  0x00007ffff70923c8 in ur_loader::urPlatformGet(ur_adapter_handle_t_**, unsigned int, unsigned int, ur_platform_handle_t_**, unsigned int*) () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#9  0x00007ffff70a189a in urPlatformGet () from /opt/intel/oneapi/compiler/2025.0/lib/libur_loader.so.0
#10 0x00007ffff7ec0422 in std::call_once<sycl::_V1::detail::plugin::getUrPlatforms()::{lambda()#1}>(std::once_flag&, sycl::_V1::detail::plugin::getUrPlatforms()::{lambda()#1}&&)::{lambda()#2}::__invoke() ()
   from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#11 0x00007ffff7766ec3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007ffff7ebb179 in sycl::_V1::detail::platform_impl::get_platforms() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#13 0x00007ffff7fa374d in sycl::_V1::platform::get_platforms() () from /opt/intel/oneapi/compiler/2025.0/lib/libsycl.so.8
#14 0x000055555555d709 in main ()

I’ll try to get the stack trace with the symbols from the ROCm libraries.

For what it’s worth, kernel and driver versions:

name:           amdgpu
vermagic:       5.14.0-427.42.1.el9_4.x86_64 SMP preempt mod_unload modversions
rhelversion:    9.4
srcversion:     8DF76864569ABCCA399E1E1

Any idea would be appreciated.

Thank you for the updated information with the latest release. I have one more idea before debugging further: could you try with a recent open-source DPC++ daily build from:

We made a recent fix to the function which crashes for you, urPlatformGet:

The symptom of that issue was a clear error from HIP rather than a segfault, but it might still be related.
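
For context, your backtrace shows the crash happening inside the HIP adapter’s one-time platform initialisation, during a hipEventRecord call. Very roughly, and this is a simplified sketch of my reading rather than the actual Unified Runtime source, that initialisation does something like:

// Simplified sketch (assumption, not the actual adapter code): one-time platform
// discovery records a "base" event on every device, used later for timestamps.
#include <hip/hip_runtime.h>
#include <mutex>

static std::once_flag PlatformInitFlag;

void initHipPlatformsOnce() {
  std::call_once(PlatformInitFlag, [] {
    int NumDevices = 0;
    (void)hipInit(0);
    (void)hipGetDeviceCount(&NumDevices);
    for (int D = 0; D < NumDevices; ++D) {
      (void)hipSetDevice(D);
      hipEvent_t EvBase;               // per-device base timestamp event (assumption)
      (void)hipEventCreate(&EvBase);
      (void)hipEventRecord(EvBase, 0); // the frame where your backtrace segfaults
    }
  });
}

So if plain HIP event recording misbehaves on one of your devices, you would see exactly this kind of crash during platform enumeration.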

Hello Rafal,

Thanks for the prompt reply. With the build from 19/11/2024 I still get a segfault. The output is slightly different (mostly due to different module loading, IIUC):

Apptainer> SYCL_UR_TRACE=-1 sycl-ls
<LOADER>[INFO]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: libze_loader.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_level_zero.so.0' with error: libze_loader.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter 'libur_adapter_opencl.so.0' with error: libOpenCL.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_opencl.so.0' with error: libOpenCL.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter 'libur_adapter_cuda.so.0' with error: libcuda.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_cuda.so.0' with error: libcuda.so.1: cannot open shared object file: No such file or directory
<LOADER>[INFO]: loaded adapter 0x0x5600f1e1f590 (libur_adapter_hip.so.0)
<LOADER>[INFO]: failed to load adapter 'libur_adapter_native_cpu.so.0' with error: libur_adapter_native_cpu.so.0: cannot open shared object file: No such file or directory
<LOADER>[INFO]: failed to load adapter '/opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_native_cpu.so.0' with error: /opt/intel/oneapi/compiler/2025.0/lib/libur_adapter_native_cpu.so.0: cannot open shared object file: No such file or directory
   ---> urAdapterGet
   <--- urAdapterGet(.NumEntries = 0, .phAdapters = {}, .pNumAdapters = 0x7ffe2b70ca3c (1)) -> UR_RESULT_SUCCESS;
   ---> urAdapterGet
   <--- urAdapterGet(.NumEntries = 1, .phAdapters = {0x7ff7d2086090}, .pNumAdapters = nullptr) -> UR_RESULT_SUCCESS;
   ---> urAdapterGetInfo
   <--- urAdapterGetInfo(.hAdapter = 0x7ff7d2086090, .propName = UR_ADAPTER_INFO_BACKEND, .propSize = 4, .pPropValue = 0x7ffe2b70ca48 (UR_ADAPTER_BACKEND_HIP), .pPropSizeRet = nullptr) -> UR_RESULT_SUCCESS;
   ---> urPlatformGet
Segmentation fault (core dumped)

Hi Fabrice,
From your rocm-smi output I understand you have two AMD GPUs on this machine, is that right? Which model are they?

Just to narrow down whether the issue is related to the presence of multiple GPUs, could you try:

HIP_VISIBLE_DEVICES=0 sycl-ls  # only first GPU visible
HIP_VISIBLE_DEVICES=1 sycl-ls  # only second GPU visible
HIP_VISIBLE_DEVICES= sycl-ls  # no AMD GPU visible to SYCL runtime

If it still crashes with one GPU visible, then let’s try to see whether your ROCm/driver installation is working correctly before we blame the SYCL runtime. Could you try compiling the code below with hipcc and running it? It is the reproducer for the issue from that PR I linked, which, as we just saw, is not related to your crash, but it executes some basic HIP commands that sycl-ls would also run.

#include <hip/hip_runtime.h>
#include <iostream>
#include <string>

int s_errors{0};

// Print any HIP error (and count it), but keep going.
void check(hipError_t res, std::string fname) {
  if (res!=hipSuccess) {
    ++s_errors;
    const char *ErrorString = hipGetErrorString(res);
    const char *ErrorName = hipGetErrorName(res);
    std::cout << "HIP error in function " << fname
              << "\nvalue: " << res
              << "\nname: " << ErrorName
              << "\ndescription: " << ErrorString
              << std::endl;
  }
}

// Record a new event on Device and return the time elapsed since EvBase.
float getTime(hipEvent_t& EvBase, int Device) {
  std::cout << "called getTime for device " << Device << std::endl;
  check(hipSetDevice(Device),"hipSetDevice");
  hipEvent_t Event;
  float Milliseconds{0.0f};
  check(hipEventCreateWithFlags(&Event, hipEventDefault),"hipEventCreateWithFlags");
  check(hipEventRecord(Event),"hipEventRecord");
  check(hipEventSynchronize(EvBase),"hipEventSynchronize");
  check(hipEventSynchronize(Event),"hipEventSynchronize");
  check(hipEventElapsedTime(&Milliseconds, EvBase, Event),"hipEventElapsedTime");
  return Milliseconds;
}

int main() {
  // Create and record a base event on each device (0 first, then 1).
  hipEvent_t EvBase0;
  check(hipSetDevice(0),"hipSetDevice");
  check(hipEventCreate(&EvBase0),"hipEventCreate");
  check(hipEventRecord(EvBase0, 0),"hipEventRecord");

  hipEvent_t EvBase1;
  check(hipSetDevice(1),"hipSetDevice");
  check(hipEventCreate(&EvBase1),"hipEventCreate");
  check(hipEventRecord(EvBase1, 0),"hipEventRecord");

  float time0 = getTime(EvBase0, 0);
  float time1 = getTime(EvBase1, 1);

  std::cout << "time0: " << time0 << " ms\n"
            << "time1: " << time1 << " ms\n"
            << "errors: " << s_errors << std::endl;

  return 0;
}
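
To build and run it, something like this should work (the file name is just an example):

hipcc hip_events.cpp -o hip_events
./hip_events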

Good catch:

Apptainer> HIP_VISIBLE_DEVICES=0 sycl-ls
Segmentation fault (core dumped)
Apptainer> HIP_VISIBLE_DEVICES=1 sycl-ls
[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60140.9]
Apptainer> HIP_VISIBLE_DEVICES= sycl-ls
Apptainer> 

I’m not sure these are two physical GPUs; apparently this Aldebaran/MI210 has two dies.

Do you still think it’s worth running the code you provided? If so, I’ll do it later.

Thanks a lot for the idea.

Interesting! Could it be picking up the same GPU twice, with two versions of the driver or two different ROCm versions? Or perhaps GPU0 is an integrated GPU on your CPU chipset? I have a machine with an MI210, and it only shows one GPU:

$ rocm-smi


========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                    
====================================================================================================================
0       2     0x740f,   11308  36.0°C  55.0W  N/A, N/A, 0         1700Mhz  1600Mhz  0%   high  300.0W  0%     0%    
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================

$ sycl-ls
[hip:gpu][hip:0] AMD HIP BACKEND, AMD Instinct MI210 gfx90a:sramecc+:xnack- [HIP 60241.13]
[native_cpu:cpu][native_cpu:0] SYCL_NATIVE_CPU, SYCL Native CPU 0.1 [0.0.0]

This is with the 2024-11-19 DPC++ nightly and ROCm 6.2.3.

Could you share the output of rocm-smi -a?

Hm OK. That’s interesting.

Edit: the admin confirmed that my understanding was wrong: there are in fact two MI210 cards in this machine. Sorry for any misunderstanding.

I’m tempted to ask to have this machine power-cycled. That could fix a hardware issue with the “first” GPU, or even resolve the two-GPU situation. Is there anything you’d like to investigate while it’s in this state? I guess that even if the hardware is faulty, the software should not crash.

Apptainer> rocm-smi -a


========================= ROCm System Management Interface =========================
=========================== Version of System Component ============================
Driver version: 5.14.0-427.42.1.el9_4.x86_64
====================================================================================
======================================== ID ========================================
GPU[0]		: GPU ID: 0x740f
GPU[1]		: GPU ID: 0x740f
====================================================================================
==================================== Unique ID =====================================
GPU[0]		: Unique ID: 0xf7126fa176ea44fb
GPU[1]		: Unique ID: 0x58706ff0c3973c40
====================================================================================
====================================== VBIOS =======================================
GPU[0]		: VBIOS version: 113-D67301-064D
GPU[1]		: VBIOS version: 113-D67301-064D
====================================================================================
=================================== Temperature ====================================
GPU[0]		: Temperature (Sensor edge) (C): 32.0
GPU[0]		: Temperature (Sensor junction) (C): 33.0
GPU[0]		: Temperature (Sensor memory) (C): 47.0
GPU[0]		: Temperature (Sensor HBM 0) (C): 44.0
GPU[0]		: Temperature (Sensor HBM 1) (C): 47.0
GPU[0]		: Temperature (Sensor HBM 2) (C): 45.0
GPU[0]		: Temperature (Sensor HBM 3) (C): 43.0
GPU[1]		: Temperature (Sensor edge) (C): 32.0
GPU[1]		: Temperature (Sensor junction) (C): 32.0
GPU[1]		: Temperature (Sensor memory) (C): 48.0
GPU[1]		: Temperature (Sensor HBM 0) (C): 48.0
GPU[1]		: Temperature (Sensor HBM 1) (C): 46.0
GPU[1]		: Temperature (Sensor HBM 2) (C): 45.0
GPU[1]		: Temperature (Sensor HBM 3) (C): 47.0
====================================================================================
============================ Current clock frequencies =============================
GPU[0]		: fclk clock level: 0: (400Mhz)
GPU[0]		: mclk clock level: 3: (1600Mhz)
GPU[0]		: sclk clock level: 1: (800Mhz)
GPU[0]		: socclk clock level: 3: (1090Mhz)
GPU[0]		: pcie clock level: 5 (2.5GT/s x16)
GPU[1]		: fclk clock level: 0: (400Mhz)
GPU[1]		: mclk clock level: 3: (1600Mhz)
GPU[1]		: sclk clock level: 1: (800Mhz)
GPU[1]		: socclk clock level: 3: (1090Mhz)
GPU[1]		: pcie clock level: 5 (2.5GT/s x16)
====================================================================================
================================ Current Fan Metric ================================
GPU[0]		: Unable to detect fan speed for GPU 0
GPU[1]		: Unable to detect fan speed for GPU 1
====================================================================================
============================== Show Performance Level ==============================
GPU[0]		: Performance Level: auto
GPU[1]		: Performance Level: auto
====================================================================================
================================= OverDrive Level ==================================
GPU[0]		: GPU OverDrive value (%): 0
GPU[1]		: GPU OverDrive value (%): 0
====================================================================================
================================= OverDrive Level ==================================
GPU[0]		: GPU Memory OverDrive value (%): 0
GPU[1]		: GPU Memory OverDrive value (%): 0
====================================================================================
==================================== Power Cap =====================================
GPU[0]		: Max Graphics Package Power (W): 300.0
GPU[1]		: Max Graphics Package Power (W): 300.0
====================================================================================
=============================== Show Power Profiles ================================
GPU[0]		: get_power_profiles, Not supported on the given system
GPU[1]		: get_power_profiles, Not supported on the given system
====================================================================================
================================ Power Consumption =================================
GPU[0]		: Average Graphics Package Power (W): 41.0
GPU[1]		: Average Graphics Package Power (W): 40.0
====================================================================================
=========================== Supported clock frequencies ============================
GPU[0]		: 
GPU[0]		: Supported fclk frequencies on GPU0
GPU[0]		: 0: 400Mhz *
GPU[0]		: 
GPU[0]		: Supported mclk frequencies on GPU0
GPU[0]		: 0: 400Mhz
GPU[0]		: 1: 700Mhz
GPU[0]		: 2: 1200Mhz
GPU[0]		: 3: 1600Mhz *
GPU[0]		: 
GPU[0]		: Supported sclk frequencies on GPU0
GPU[0]		: 0: 500Mhz
GPU[0]		: 1: 800Mhz *
GPU[0]		: 2: 1700Mhz
GPU[0]		: 
GPU[0]		: Supported socclk frequencies on GPU0
GPU[0]		: 0: 666Mhz
GPU[0]		: 1: 857Mhz
GPU[0]		: 2: 1000Mhz
GPU[0]		: 3: 1090Mhz *
GPU[0]		: 4: 1333Mhz
GPU[0]		: 
GPU[0]		: Supported PCIe frequencies on GPU0
GPU[0]		: 0: 2.5GT/s x1
GPU[0]		: 1: 2.5GT/s x2
GPU[0]		: 2: 2.5GT/s x4
GPU[0]		: 3: 2.5GT/s x8
GPU[0]		: 4: 2.5GT/s x12
GPU[0]		: 5: 2.5GT/s x16 *
GPU[0]		: 6: 5.0GT/s x1
GPU[0]		: 7: 5.0GT/s x2
GPU[0]		: 8: 5.0GT/s x4
GPU[0]		: 9: 5.0GT/s x8
GPU[0]		: 10: 5.0GT/s x12
GPU[0]		: 11: 5.0GT/s x16
GPU[0]		: 12: 8.0GT/s x1
GPU[0]		: 13: 8.0GT/s x2
GPU[0]		: 14: 8.0GT/s x4
GPU[0]		: 15: 8.0GT/s x8
GPU[0]		: 16: 8.0GT/s x12
GPU[0]		: 17: 8.0GT/s x16
GPU[0]		: 18: 16.0GT/s x1
GPU[0]		: 19: 16.0GT/s x2
GPU[0]		: 20: 16.0GT/s x4
GPU[0]		: 21: 16.0GT/s x8
GPU[0]		: 22: 16.0GT/s x12
GPU[0]		: 23: 16.0GT/s x16
GPU[0]		: 
------------------------------------------------------------------------------------
GPU[1]		: 
GPU[1]		: Supported fclk frequencies on GPU1
GPU[1]		: 0: 400Mhz *
GPU[1]		: 
GPU[1]		: Supported mclk frequencies on GPU1
GPU[1]		: 0: 400Mhz
GPU[1]		: 1: 700Mhz
GPU[1]		: 2: 1200Mhz
GPU[1]		: 3: 1600Mhz *
GPU[1]		: 
GPU[1]		: Supported sclk frequencies on GPU1
GPU[1]		: 0: 500Mhz
GPU[1]		: 1: 800Mhz *
GPU[1]		: 2: 1700Mhz
GPU[1]		: 
GPU[1]		: Supported socclk frequencies on GPU1
GPU[1]		: 0: 666Mhz
GPU[1]		: 1: 857Mhz
GPU[1]		: 2: 1000Mhz
GPU[1]		: 3: 1090Mhz *
GPU[1]		: 4: 1333Mhz
GPU[1]		: 
GPU[1]		: Supported PCIe frequencies on GPU1
GPU[1]		: 0: 2.5GT/s x1
GPU[1]		: 1: 2.5GT/s x2
GPU[1]		: 2: 2.5GT/s x4
GPU[1]		: 3: 2.5GT/s x8
GPU[1]		: 4: 2.5GT/s x12
GPU[1]		: 5: 2.5GT/s x16 *
GPU[1]		: 6: 5.0GT/s x1
GPU[1]		: 7: 5.0GT/s x2
GPU[1]		: 8: 5.0GT/s x4
GPU[1]		: 9: 5.0GT/s x8
GPU[1]		: 10: 5.0GT/s x12
GPU[1]		: 11: 5.0GT/s x16
GPU[1]		: 12: 8.0GT/s x1
GPU[1]		: 13: 8.0GT/s x2
GPU[1]		: 14: 8.0GT/s x4
GPU[1]		: 15: 8.0GT/s x8
GPU[1]		: 16: 8.0GT/s x12
GPU[1]		: 17: 8.0GT/s x16
GPU[1]		: 18: 16.0GT/s x1
GPU[1]		: 19: 16.0GT/s x2
GPU[1]		: 20: 16.0GT/s x4
GPU[1]		: 21: 16.0GT/s x8
GPU[1]		: 22: 16.0GT/s x12
GPU[1]		: 23: 16.0GT/s x16
GPU[1]		: 
------------------------------------------------------------------------------------
====================================================================================
================================ % time GPU is busy ================================
GPU[0]		: GPU use (%): 0
GPU[0]		: GFX Activity: 1873610
GPU[1]		: GPU use (%): 0
GPU[1]		: GFX Activity: 25615
====================================================================================
================================ Current Memory Use ================================
GPU[0]		: GPU memory use (%): 0
GPU[0]		: Memory Activity: 33385
GPU[1]		: GPU memory use (%): 0
GPU[1]		: Memory Activity: 45
====================================================================================
================================== Memory Vendor ===================================
GPU[0]		: GPU memory vendor: hynix
GPU[1]		: GPU memory vendor: hynix
====================================================================================
=============================== PCIe Replay Counter ================================
GPU[0]		: PCIe Replay Count: 0
GPU[1]		: PCIe Replay Count: 0
====================================================================================
================================== Serial Number ===================================
GPU[0]		: Serial Number: 692231000045
GPU[1]		: Serial Number: 692231000014
====================================================================================
================================== KFD Processes ===================================
No KFD PIDs currently running
====================================================================================
=============================== GPUs Indexed by PID ================================
No KFD PIDs currently running
====================================================================================
==================== GPU Memory clock frequencies and voltages =====================
GPU[0]		: OD_SCLK:
GPU[0]		: 0: 500Mhz
GPU[0]		: 1: 1700Mhz
GPU[0]		: OD_MCLK:
GPU[0]		: 1: 1600Mhz
GPU[0]		: OD_VDDC_CURVE:
GPU[0]		: 0: 0Mhz 0mV
GPU[0]		: 1: 0Mhz 0mV
GPU[0]		: 2: 0Mhz 0mV
GPU[0]		: OD_RANGE:
GPU[0]		: SCLK:     0Mhz        0Mhz
GPU[0]		: MCLK:     0Mhz        0Mhz
GPU[0]		: VDDC_CURVE_SCLK[0]:     0Mhz
GPU[0]		: VDDC_CURVE_VOLT[0]:     0mV
GPU[0]		: VDDC_CURVE_SCLK[1]:     0Mhz
GPU[0]		: VDDC_CURVE_VOLT[1]:     0mV
GPU[0]		: VDDC_CURVE_SCLK[2]:     0Mhz
GPU[0]		: VDDC_CURVE_VOLT[2]:     0mV
GPU[1]		: OD_SCLK:
GPU[1]		: 0: 500Mhz
GPU[1]		: 1: 1700Mhz
GPU[1]		: OD_MCLK:
GPU[1]		: 1: 1600Mhz
GPU[1]		: OD_VDDC_CURVE:
GPU[1]		: 0: 0Mhz 0mV
GPU[1]		: 1: 0Mhz 0mV
GPU[1]		: 2: 0Mhz 0mV
GPU[1]		: OD_RANGE:
GPU[1]		: SCLK:     0Mhz        0Mhz
GPU[1]		: MCLK:     0Mhz        0Mhz
GPU[1]		: VDDC_CURVE_SCLK[0]:     0Mhz
GPU[1]		: VDDC_CURVE_VOLT[0]:     0mV
GPU[1]		: VDDC_CURVE_SCLK[1]:     0Mhz
GPU[1]		: VDDC_CURVE_VOLT[1]:     0mV
GPU[1]		: VDDC_CURVE_SCLK[2]:     0Mhz
GPU[1]		: VDDC_CURVE_VOLT[2]:     0mV
====================================================================================
================================= Current voltage ==================================
GPU[0]		: Voltage (mV): 793
GPU[1]		: Voltage (mV): 793
====================================================================================
==================================== PCI Bus ID ====================================
GPU[0]		: PCI Bus: 0000:23:00.0
GPU[1]		: PCI Bus: 0000:83:00.0
====================================================================================
=============================== Firmware Information ===============================
GPU[0]		: ASD firmware version: 	0x00000000
GPU[0]		: CE firmware version: 		0
GPU[0]		: DMCU firmware version: 	0
GPU[0]		: MC firmware version: 		0
GPU[0]		: ME firmware version: 		0
GPU[0]		: MEC firmware version: 	78
GPU[0]		: MEC2 firmware version: 	78
GPU[0]		: PFP firmware version: 	0
GPU[0]		: RLC firmware version: 	17
GPU[0]		: RLC SRLC firmware version: 	0
GPU[0]		: RLC SRLG firmware version: 	0
GPU[0]		: RLC SRLS firmware version: 	0
GPU[0]		: SDMA firmware version: 	8
GPU[0]		: SDMA2 firmware version: 	8
GPU[0]		: SMC firmware version: 	00.68.59.00
GPU[0]		: SOS firmware version: 	0x00270082
GPU[0]		: TA RAS firmware version: 	27.00.01.60
GPU[0]		: TA XGMI firmware version: 	32.00.00.19
GPU[0]		: UVD firmware version: 	0x00000000
GPU[0]		: VCE firmware version: 	0x00000000
GPU[0]		: VCN firmware version: 	0x0110101b
GPU[1]		: ASD firmware version: 	0x00000000
GPU[1]		: CE firmware version: 		0
GPU[1]		: DMCU firmware version: 	0
GPU[1]		: MC firmware version: 		0
GPU[1]		: ME firmware version: 		0
GPU[1]		: MEC firmware version: 	78
GPU[1]		: MEC2 firmware version: 	78
GPU[1]		: PFP firmware version: 	0
GPU[1]		: RLC firmware version: 	17
GPU[1]		: RLC SRLC firmware version: 	0
GPU[1]		: RLC SRLG firmware version: 	0
GPU[1]		: RLC SRLS firmware version: 	0
GPU[1]		: SDMA firmware version: 	8
GPU[1]		: SDMA2 firmware version: 	8
GPU[1]		: SMC firmware version: 	00.68.59.00
GPU[1]		: SOS firmware version: 	0x00270082
GPU[1]		: TA RAS firmware version: 	27.00.01.60
GPU[1]		: TA XGMI firmware version: 	32.00.00.19
GPU[1]		: UVD firmware version: 	0x00000000
GPU[1]		: VCE firmware version: 	0x00000000
GPU[1]		: VCN firmware version: 	0x0110101b
====================================================================================
=================================== Product Info ===================================
GPU[0]		: Card series: 		Instinct MI210
GPU[0]		: Card model: 		0x0c34
GPU[0]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]		: Card SKU: 		D67301
GPU[1]		: Card series: 		Instinct MI210
GPU[1]		: Card model: 		0x0c34
GPU[1]		: Card vendor: 		Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1]		: Card SKU: 		D67301
====================================================================================
==================================== Pages Info ====================================
====================================================================================
============================== Show Valid sclk Range ===============================
GPU[0]		: Valid sclk range: 500Mhz - 1700Mhz
GPU[1]		: Valid sclk range: 500Mhz - 1700Mhz
====================================================================================
============================== Show Valid mclk Range ===============================
GPU[0]		: Valid mclk range: 400Mhz - 1600Mhz
GPU[1]		: Valid mclk range: 400Mhz - 1600Mhz
====================================================================================
============================= Show Valid voltage Range =============================
ERROR: GPU[0]	: Voltage curve regions unsupported.
ERROR: GPU[1]	: Voltage curve regions unsupported.
====================================================================================
=============================== Voltage Curve Points ===============================
GPU[0]		: Voltage point 0: 0Mhz 0mV
GPU[0]		: Voltage point 1: 0Mhz 0mV
GPU[0]		: Voltage point 2: 0Mhz 0mV
GPU[1]		: Voltage point 0: 0Mhz 0mV
GPU[1]		: Voltage point 1: 0Mhz 0mV
GPU[1]		: Voltage point 2: 0Mhz 0mV
====================================================================================
================================= Consumed Energy ==================================
GPU[0]		: Energy counter: 4202582676223
GPU[0]		: Accumulated Energy (uJ): 64299515747790.93
GPU[1]		: Energy counter: 4183076740630
GPU[1]		: Accumulated Energy (uJ): 64001074929497.57
====================================================================================
============================ Current Compute Partition =============================
GPU[0]		: Not supported on the given system
GPU[1]		: Not supported on the given system
====================================================================================
================================= Current NPS Mode =================================
GPU[0]		: Not supported on the given system
GPU[1]		: Not supported on the given system
====================================================================================
=============================== End of ROCm SMI Log ================================

This certainly looks like two separate MI210 GPUs, as they have different serial numbers and PCIe bus IDs. I think it’s a good call to power-cycle the machine, or maybe even check the physical connection to the PCIe slot.

You could still try to compile that simple program above with hipcc and run it. If it also crashes when accessing GPU0, that would confirm the problem isn’t related to the SYCL runtime.

With the posted code, any call to device 0 triggers a segfault; the same calls on device 1 work fine.

Thanks a lot for helping me investigate this.