I’m trying to use the experimental root_group extension (docs link). I’ve looked at the experimental features listed in the docs (link) for the CUDA plugin, and root_group isn’t mentioned there, but sycl_ext_oneapi_non_uniform_groups is supported, which gave me hope since that extension mentions root_group. However, it does not seem to work for me.
I’ve checked my CUDA version and verified that the macro is defined, using the following code:
for (auto& plat : sycl::platform::get_platforms())
{
    std::cout << "CUDA-SYCL platform name: " << plat.get_info<sycl::info::platform::name>() << "\n"
              << "Reported version: " << plat.get_info<sycl::info::platform::version>() << "\n";
}

#ifdef SYCL_EXT_ONEAPI_ROOT_GROUP
printf("ROOT GROUP SUPPORTED\n");
#else
printf("ROOT GROUP NOT SUPPORTED\n");
#endif
// Resulting printout:
/*
CUDA-SYCL platform name: Intel(R) OpenCL
Reported version: OpenCL 3.0 LINUX
CUDA-SYCL platform name: NVIDIA CUDA BACKEND
Reported version: CUDA 12.8
ROOT GROUP SUPPORTED
*/
So the macro says it’s supported, but that could be coming from the CPU target.
I tested it using these lines:
namespace syclexp = sycl::ext::oneapi::experimental;
auto props = syclexp::properties{syclexp::use_root_sync};
// ...
cgh.parallel_for(sycl::nd_range{sg_region.global, sg_region.local},
props,
kernel());
// ... Inside the kernel
auto root = it.ext_oneapi_get_root_group();
sycl::group_barrier(root);
This results in the following error:
terminate called after throwing an instance of 'sycl::_V1::exception'
what(): Native API failed. Native API returns: 66 (UR_RESULT_ERROR_ADAPTER_SPECIFIC)
This feature requires cuda 11.8 or later.
zsh: IOT instruction (core dumped)
So I wonder whether I’m doing something wrong, whether my install is not working properly, or whether the feature is simply not supported? I would love some feedback on this.
If it’s not supported, am I better off using an atomic add and a spinlock, or are there other good options for synchronising a wave? (My kernel is configured to be a complete wave or less.) A rough sketch of the atomic-counter approach I have in mind is below.
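For reference, this is roughly what I mean by an atomic add and a spinlock. It’s only a sketch: counter and generation are placeholder names for zero-initialised device allocations, and it only works (and avoids deadlock) if all the work-groups are resident at the same time, which is why I keep the launch to one wave or less.

// Hand-rolled wave/grid barrier sketch (NOT the root_group extension).
// Assumes counter and generation point at zero-initialised device memory and
// that all num_groups work-groups are co-resident; otherwise the spin loop
// can deadlock.
void wave_barrier(sycl::nd_item<1> it, unsigned num_groups,
                  unsigned* counter, unsigned* generation)
{
    sycl::atomic_ref<unsigned, sycl::memory_order::acq_rel,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> cnt(*counter);
    sycl::atomic_ref<unsigned, sycl::memory_order::acq_rel,
                     sycl::memory_scope::device,
                     sycl::access::address_space::global_space> gen(*generation);

    sycl::group_barrier(it.get_group()); // whole group arrives before its leader signals
    if (it.get_local_linear_id() == 0)
    {
        const unsigned my_gen = gen.load();
        if (cnt.fetch_add(1u) + 1u == num_groups)
        {
            cnt.store(0u);     // last group to arrive resets the counter...
            gen.fetch_add(1u); // ...and releases the other groups
        }
        else
        {
            while (gen.load() == my_gen) { /* spin */ }
        }
    }
    sycl::group_barrier(it.get_group()); // the rest of the group waits for its leader
}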
Hi @Jorgen ,
I’ve been able to reproduce this, but haven’t been able to debug why the feature isn’t working. I’ll take another look tomorrow and see if it is simply that the hardware doesn’t support the feature, and this is being misreported. Can I ask what GPU you’re trying to run this on?
Duncan.
Hi @Duncan,
I’ve tried to run it on both an RTX 3070 Ti running Arch with the open NVIDIA drivers and an RTX 4080 running Ubuntu with the proprietary drivers (I believe). It results in the same error message on both computers, and the SYCL_EXT_ONEAPI_ROOT_GROUP macro is defined in both cases as well, but I guess that could be due to CPU or overall compiler support for the feature?
Best regards,
Jorgen
That’s good to know, thanks. I’ll dig into things here and see if I can find out what error is being triggered and why.
The macro is indeed just set when the implementation supports the feature; it’s not enabled per backend in the compiler.
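If it’s useful, the extension also defines a runtime query for how many work-groups can actually be synchronised, which is a more reliable check than the macro. Something along these lines should work (an untested sketch from my reading of the extension spec; RootSyncKernel stands in for whatever kernel name you use):

namespace syclexp = sycl::ext::oneapi::experimental;

// Ask how many work-groups the backend can synchronise for this kernel on
// this queue; a value of 0 would mean root-group sync is not usable here.
sycl::queue q{sycl::gpu_selector_v};
auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
auto kernel = bundle.get_kernel(sycl::get_kernel_id<RootSyncKernel>());
auto max_wgs = kernel.ext_oneapi_get_info<
    syclexp::info::kernel_queue_specific::max_num_work_group_sync>(q);
std::cout << "Max work-groups for root-group sync: " << max_wgs << "\n";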
OK, I think I have figured out the problem: the feature is supported in the code, but it relies on a component that is built as part of the oneAPI distribution, not on anything that is built when you compile your own code. There are two ways I can see to fix it.
If you are familiar with LLVM, you can try checking out the Intel fork of LLVM and building that, then using that as your compiler. It’s very similar to icpx, though there are a lot of additions inside icpx that aren’t in the open source. When you build this, the code in question (llvm/unified-runtime/source/adapters/cuda/enqueue.cpp at f6a9c4d11e1a772f9bab6799196c9217c4dd1b08 · intel/llvm · GitHub) will see the newer CUDA version and work fine (this is what I did).
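Once the fork is built, compiling for the CUDA backend with it looks roughly like this ($DPCPP_HOME is just a placeholder for wherever your intel/llvm checkout and build live):

$DPCPP_HOME/llvm/build/bin/clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda test.cpp -o test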
The alternative is to build just the component you require, which is the Unified Runtime CUDA adapter. This can be done in two ways: one is as part of the LLVM build. Instead of building all of LLVM, you can simply do something like
git clone git@github.com:intel/llvm
cd llvm
git checkout sycl-rel-6_1_0
python buildbot/configure.py --cuda
ninja -C build libur_adapter_cuda.so
This should build the one component that needs to be updated (the sycl-rel-6_1_0 tag builds the version matching oneAPI 2025.1). If you make sure that this version of the library is on the library path ahead of the version shipped with oneAPI, it should work.
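For example, something like this before running your binary (assuming the library ends up under build/lib; adjust the path if your build puts it elsewhere):

export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH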
It might also be possible to download and build Unified Runtime from GitHub, as there’s a mirror there. Let me know if you’d like to try that option (I haven’t yet tried it myself).
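If you want to go that route, I’d expect the standalone build to look something like the following, but treat it as an untested sketch and double-check the repo’s README for the exact CMake options (UR_BUILD_ADAPTER_CUDA is my best guess at the relevant one):

git clone https://github.com/oneapi-src/unified-runtime
cd unified-runtime
cmake -B build -DUR_BUILD_ADAPTER_CUDA=ON
cmake --build build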
I hope this helps.
Duncan.