Work-group local accessors for hierarchical kernels

I have this example:

#include <CL/sycl.hpp>
#include <iostream>
int main()
{
    sycl::queue q(sycl::gpu_selector_v);
    auto device = q.get_device();
    auto max_wg_size = device.get_info<sycl::info::device::max_work_group_size>();
    auto N0 = 1;
    auto N1 = max_wg_size;
    auto N2 = 1;
    auto buf_range = sycl::range<3>(N0, N1, N2);
    sycl::buffer<uint8_t, 3> buf(buf_range);
    for (int wg_size = 1; wg_size < max_wg_size; wg_size++) // a work-group size of 0 is not valid
    {
        q.submit([&](sycl::handler &h)
                 {
        auto acc = buf.template get_access<sycl::access::mode::read_write>(h);
        h.parallel_for_work_group(sycl::range<1>(N0), sycl::range<1>(wg_size), [=](sycl::group<1> gr)
        {
            auto group_id = gr.get_id(0);
            gr.parallel_for_work_item([&](sycl::h_item<1> it)
            {
                auto local_id = it.get_local_id(0);
                for(int i = 0; i < N2; i++)
                {
                    acc[group_id][local_id][i] = 1;
                }

            });
        }); })
            .wait();
    }
    return 0;
}

The kernel tries to perform an assignment on wg_size work items in a single compute unit. This results in the following error on my RTX 3060 Ti when wg_size = 897 (< max_wg_size = 1024):

PI CUDA ERROR:
        Value:           701
        Name:            CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
        Description:     too many resources requested for launch
        Function:        cuda_piEnqueueKernelLaunch
        Source Location: /root/intel-llvm-mirror/sycl/plugins/cuda/pi_cuda.cpp:3179

The total buffer size is fairly small, so why isn’t the GPU able to handle the load?

Is it possible to use any of the device info to determine the true maximum work group size available?
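One possible way to query a per-kernel limit, rather than the device-wide maximum, is the SYCL 2020 kernel_device_specific query. A minimal sketch, assuming the hierarchical kernel has been given the hypothetical name wg_kernel and is compiled for the device:

#include <CL/sycl.hpp>

class wg_kernel; // hypothetical name given to the parallel_for_work_group kernel

size_t kernel_wg_limit(sycl::queue &q)
{
    // The executable bundle for the context contains the named kernel.
    auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
    auto krn = bundle.get_kernel(sycl::get_kernel_id<wg_kernel>());
    // This limit can be lower than info::device::max_work_group_size when the
    // kernel needs many registers or much local memory per work-item.
    return krn.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
}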

GPU-info:

name: NVIDIA GeForce RTX 3060 Ti
vendor: NVIDIA Corporation
version: 0.0
max_compute_units: 38
max_work_group_size: 1024
max_clock_frequency: 1695
global_mem_size: 8360755200
local_mem_size: 49152
max_mem_alloc_size: 1073741824
global_mem_cache_size: 3145728
global_mem_cacheline_size: 0
max_work_item_sizes_1D: 1024
max_work_item_sizes_2D: 1024, 1024
max_work_item_sizes_3D: 64, 1024, 1024

Hi,

We can’t reproduce this on other GPUs, although we don’t have a 3060 Ti to test on. Which version of DPC++ are you using? And how did you compile the program?

Please also make sure that your GPU is free of other processes before running the program, by checking that nvidia-smi shows no processes.

Full system info:

Linux man 6.1.44-1-MANJARO #1 SMP PREEMPT_DYNAMIC Wed Aug 9 09:02:26 UTC 2023 x86_64 GNU/Linux

Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm

Displays are rendered on the GTX 1060, which leaves the RTX 3060 Ti completely free.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1060 6GB    Off | 00000000:08:00.0  On |                  N/A |
| 65%   55C    P0              36W / 140W |   1250MiB /  6144MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3060 Ti     Off | 00000000:09:00.0 Off |                  N/A |
|  0%   45C    P8              12W / 200W |     10MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

The executable is compiled with:

source /opt/intel/oneapi/setvars.sh --include-intel-llvm
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda compute_units.cpp

The GTX 1060 is able to run the hierarchical kernels with the maximum compute-unit (N0) and work-group (N1) sizes.

nvcc-compiler:

NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Hi,

Please update to the latest oneAPI version / CUDA plugin available here: https://developer.codeplay.com/

Hopefully that will work on the 3060.

The issue still persists with the newest versions of Intel oneAPI DPC++, oneAPI for NVIDIA GPUs (2023.2.1) and CUDA (12.2).

It still works fine with a reduced work-item range.

I’ve now tried it with CUDA 12.2 on an A100. Someone else tried it with an earlier CUDA driver version on a 3060 (not Ti). In both cases the test passes. It is strange that it would only occur on the 3060 Ti.

Thanks for reporting this issue. We will keep it in mind, but it is harder for us to fix an issue we can’t reproduce.

Note that hierarchical kernels are currently not recommended. From the SYCL 2020 spec:

“Based on developer and implementation feedback, the hierarchical data parallel kernel feature described next is undergoing improvements to better align with the frameworks and patterns prevalent in modern programming. As this is a key part of the SYCL API and we expect to make changes to it, we temporarily recommend that new codes refrain from using this feature until the new API is finished in a near-future version of the SYCL specification, when full use of the updated feature will be recommended for use in new code. Existing codes using this feature will of course be supported by conformant implementations of this specification.”


I find hierarchical kernels to give the clearest relationship between the kernel tasks specified and the threads launched on the GPU.

Say I were to enqueue all my threads in a single h.parallel_for(N_sims, ...) instead: then I would have no way to index work-group local memory, since sycl::item<> does not have get_global_id() or get_local_id().

I have also considered the alternative of submitting N_compute parallel_for kernels in the hope of achieving concurrent execution, but the kernel submissions (for some reason) do not spread out over the compute units.

This is apparent when running NVIDIA Nsight Compute on the following example, which enqueues N_compute parallel_for kernels:

#include <CL/sycl.hpp>
#include <algorithm>
#include <cassert>
#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>


auto single_wg_kernel = [](auto &buf, auto max_wg_size, auto offset)
{
    return [&, max_wg_size, offset](sycl::handler &h)
    {
        auto acc = sycl::accessor<int, 1, sycl::access_mode::read_write>(buf, h, sycl::range<1>(max_wg_size), sycl::range<1>(offset));
        auto loc_acc = sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local>(sycl::range<1>(max_wg_size), h);
        h.parallel_for<class single_wg_kernel>(max_wg_size, [=](sycl::item<1> it)
        {
            loc_acc[it[0]] = 0;

            // arbitrary local-memory operation
            for (auto i = 0; i < 10e4; i++)
            {
                loc_acc[it[0]] += 1;
            }

            acc[it[0]] = loc_acc[it[0]];
        });
    };
};

using hclock = std::chrono::high_resolution_clock;
using namespace std::chrono;

int main()
{
    sycl::queue q(sycl::gpu_selector_v);


    auto device = q.get_device();
    auto max_wg_size = device.get_info<sycl::info::device::max_work_group_size>();
    auto N_compute_units = device.get_info<sycl::info::device::max_compute_units>();

    auto N_sims = N_compute_units * max_wg_size;
    sycl::buffer<int, 1> buf((sycl::range<1>(N_sims)));


    std::vector<sycl::event> events(N_compute_units);
    auto qstart = hclock::now();

    std::generate_n(events.begin(), N_compute_units, [&, idx = 0]() mutable
                    {
        auto offset = max_wg_size * idx;
        idx++;
        return q.submit(single_wg_kernel(buf, max_wg_size, offset)); });


    auto qend = hclock::now();
    std::cout << "Time to submit " << N_compute_units << " kernels: " << duration_cast<milliseconds>(qend - qstart).count() << " ms" << std::endl;
    std::vector<int> result(N_sims);
    auto cpy_event = q.submit([&](sycl::handler &h)
             {
        h.depends_on(events);
        auto acc = buf.template get_access<sycl::access::mode::read>(h);
        h.copy(acc, result.data()); });


    cpy_event.wait();
    std::cout << "Wait time: " << duration_cast<milliseconds>(hclock::now() - qend).count() << " ms" << std::endl;

    assert(std::all_of(result.begin(), result.end(), [](auto x)
                       { return x == 10e4; }));

    return 0;
}

(Nsight Compute gives the following recommendation)

The grid for this launch is configured to execute only 2 blocks, which is less than the GPU’s 38 multiprocessors. This can underutilize some multiprocessors. If you do not intend to execute this kernel concurrently with other workloads, consider reducing the block size to have at least one block per multiprocessor or increase the size of the grid to fully utilize the available hardware resources…

Are there any ways to keep track of work-group local memory with a guaranteed concurrent execution on multiple compute units?

Lots to unpack here. But first, be aware that if you use

sycl::buffer<int, 1> buf((sycl::range<1>(N_sims)));

and then submit a bunch of

q.submit(single_wg_kernel(buf, max_wg_size, offset));

that all operate on buf, the SYCL runtime introduces a dependency between consecutive kernel submissions: each submission must wait for the previous one to complete. So all of these kernel submissions will be serialized, which, from your comment above, it sounds like you want to avoid. I will answer fully later, and I don’t think this is the best solution for you, but I think this buffer issue answers one part of your question.

Note that if you used USM instead of buffers, the SYCL runtime would not introduce this dependency and the kernel executions would not be serialized. This is all discussed in the SYCL 2020 spec.
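As a minimal sketch of what the USM variant could look like (the sizes and trivial kernel body below are placeholders, not your original kernel): each submission writes to its own slice of a USM allocation, so the runtime adds no accessor-based dependency between them and they are free to overlap on an out-of-order queue.

#include <CL/sycl.hpp>

int main()
{
    // The default queue is out-of-order, so independent submissions may overlap.
    sycl::queue q(sycl::gpu_selector_v);

    const size_t wg_size = 256;   // placeholder work size per submission
    const size_t n_kernels = 38;  // e.g. one per compute unit
    int *data = sycl::malloc_device<int>(n_kernels * wg_size, q);

    for (size_t k = 0; k < n_kernels; k++)
    {
        // Each kernel touches a disjoint slice of the USM allocation; no
        // implicit dependency is created between these submissions.
        q.parallel_for(sycl::range<1>(wg_size), [=](sycl::id<1> i)
                       { data[k * wg_size + i[0]] = 1; });
    }
    q.wait();

    sycl::free(data, q);
    return 0;
}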

By the way, running an N_sims = N_compute*max_wg_size parallel_for kernel also caused the RTX 3060 Ti to fail (but not the GTX 1060), so this is likely a software version/hardware issue on my end.

Have you tried using an nd_range kernel? I think this is the answer.


ND-range kernels are just the thing I want, but these kernels don’t seem to spread out across the compute units either. When running this example:

#include <CL/sycl.hpp>
#include <numeric>
#include <iostream>
#include <vector>


auto single_wg_kernel = [](auto &p_buf, auto range)
{
    return [&, range](sycl::handler &h)
    {
        auto loc_acc = sycl::accessor<int, 2, sycl::access::mode::read_write, sycl::access::target::local>(range, h);
        h.parallel_for<class single_wg_kernel>(sycl::nd_range<2>(range, range), [=](sycl::nd_item<2> it)
        {
            auto gid = it.get_global_linear_id();
            auto lid = it.get_local_id();
            loc_acc[lid] = 0;
            for (int i = 0; i < 10e4; i++)
            {
                loc_acc[lid] += 1;
            }

            p_buf[gid] = loc_acc[lid];
        });
    };
};


int main()
{

    auto q = sycl::queue(sycl::gpu_selector_v);

    auto device = q.get_device();
    auto N_compute_units = device.get_info<sycl::info::device::max_compute_units>();
    // 14*14 * 14*14 is roughly under max_compute_units * max_work_group_size on the RTX 3060 Ti
    auto range = sycl::range<2>(14, 14);

    auto N_sims = range[0]*range[1]*range[0]*range[1];
    auto p_buf = sycl::malloc_shared<int>(N_sims, q);

    std::vector<sycl::event> events(N_compute_units);
    auto i = 0;

    auto event = q.submit(single_wg_kernel(p_buf, range));

    std::vector<int> result(N_sims);
    event.wait();

    for(int i = 0; i < N_sims; i++)
    {
        result[i] = p_buf[i];
    }

    return 0;
}

The Nsight profiler reports that the kernel only utilizes 1 compute unit, regardless of the dimensions specified for the nd-range.

Running multiple parallel_for kernel submissions with USM memory did not resolve this issue either. However, running a single parallel_for with USM results in a successful workload distribution across all of the compute units on the RTX 3060 Ti.

Will device-allocated USM memory perform efficiently with a single parallel_for?

Will it try to utilize local memory within the compute units?

I think the problem may be that you are misunderstanding the meaning of nd_range. You have specified:

auto range = sycl::range<2>(14, 14);

which is used as both the global and the local range in the nd_range:

sycl::nd_range<2>(range, range)

This means you are telling the runtime to execute a kernel with work-group size 14*14 and only one work-group. If, for example, you wanted 10 work-groups in each dimension (100 in total), you would instead have:

sycl::nd_range<2>(sycl::range<2>(14*10, 14*10), sycl::range<2>(14, 14))

Does that make sense?
In answer to your other question above, if you use USM then you still need to use a local_accessor in the way you have done above if you want to access CUDA shared memory. This is currently the only way that CUDA shared (SYCL local) memory is accessible in SYCL 2020.
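For illustration, here is a minimal sketch that puts these pieces together: a corrected nd_range (global range = local range * number of work-groups), USM for the results, and a local accessor for the CUDA shared (SYCL local) memory. The sizes (14x14 work-groups, 38 groups) are illustrative assumptions only.

#include <CL/sycl.hpp>

int main()
{
    sycl::queue q(sycl::gpu_selector_v);

    const sycl::range<2> local(14, 14);  // work-group size
    const size_t n_groups = 38;          // e.g. one work-group per compute unit
    const sycl::range<2> global(local[0] * n_groups, local[1]);

    int *out = sycl::malloc_shared<int>(global.size(), q);

    q.submit([&](sycl::handler &h)
    {
        // Work-group local (CUDA shared) memory is still requested through a
        // local accessor, even though the results live in USM.
        sycl::local_accessor<int, 2> loc_acc(local, h);
        h.parallel_for(sycl::nd_range<2>(global, local), [=](sycl::nd_item<2> it)
        {
            auto lid = it.get_local_id();
            loc_acc[lid] = 0;
            for (int i = 0; i < 10e4; i++)
            {
                loc_acc[lid] += 1;
            }
            out[it.get_global_linear_id()] = loc_acc[lid];
        });
    }).wait();

    sycl::free(out, q);
    return 0;
}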

This resource may be useful to you: https://github.com/codeplaysoftware/syclacademy/tree/main/Code_Exercises/Exercise_14_ND_Range_Kernel
Which documentation did you read initially? Do you have any feedback on this documentation that could lead to its improvement? Were there certain parts that led to confusion? Thanks


That clears up a lot!
I initially tested nd-range kernels on a device supporting non-uniform work-group sizes, which allowed the global range to be set to sycl::range<1>(N_compute) instead of sycl::range<1>(N_compute*N_workgroup).

I followed the Revision 7 specification on this one (which has a clear definition of the global range), so that’s my mistake. I thought there was some connection between the global-range issue and the hierarchical-kernel issue on the RTX 3060 Ti, but the RTX 3060 Ti achieves full throughput on all 38 compute units with the correct nd-range.

Thanks for the help!

No problem. We will try to improve the documentation.

You may find that the free DPC++ book (Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL | SpringerLink) is a better guide in many cases than the specification. You may also find SYCL Academy (GitHub - codeplaysoftware/syclacademy: SYCL Academy, a set of learning materials for SYCL heterogeneous programming) useful. For plugin-specific issues we also have backend-specific guides that we are actively improving.


For example, here is the CUDA backend one: Install oneAPI for NVIDIA GPUs - Guides - oneAPI for NVIDIA® GPUs - Products - Codeplay Developer. The troubleshooting section may be particularly useful.
