SYCL kernel hangs and never finishes

I have an issue where my SYCL kernel hangs and never finishes after it has been computing for some time (~10 s). This only happens when sycl::gpu_selector_v is used for the sycl::queue; with sycl::cpu_selector_v there are no issues.
The kernel only hangs when there is a lot of computation to do (see the details about LOOP_ITERATION and N below). One detail that might be relevant: just after launching the program, my laptop is barely usable because the integrated GPU is running at its maximum. After a few seconds, the laptop becomes usable again (as if it weren't computing anything anymore), but the program is still running (and never stops). At that point, my CPU shows ~50% usage (when using sycl::gpu_selector_v) until I manually stop the program.

I managed to reproduce this issue with a simple example:

#include <sycl/sycl.hpp>
#include <iostream> // for std::cout
#include <vector>

#define LOOP_ITERATION 10000000
#define N 1000000

int main()
{
    std::vector<float> v(N);

    sycl::queue q{sycl::gpu_selector_v};
    sycl::buffer buf{v};

    q.submit([&](sycl::handler& cgh)
    {
        auto acc {buf.get_access(cgh,sycl::read_write)};

        cgh.parallel_for(N, [=](sycl::id<1> id)
        {
            float x = 0.0f;
            for (int i = 0; i < LOOP_ITERATION; i++)
                x += i / 2;

            acc[id] = x;
        });
    }).wait();

    std::cout << "Done!" << std::endl;

    return 0;
}

With the code posted above, I can only get the kernel to hang when N * LOOP_ITERATION is greater than 1 000 000 * 10 000 000. However, the kernel can still hang with lower LOOP_ITERATION (or N) values if the code inside the for (int i = 0; i < LOOP_ITERATION; i++) loop is made more complex:

#define LOOP_ITERATION (10000000 / 10) // 10 times fewer iterations
#define N 1000000

cgh.parallel_for(N, [=](sycl::id<1> id)
{
    float x = 0.0f;
    for (int i = 0; i < LOOP_ITERATION; i++)
    {
        x += i / 2;

        float cosine = sycl::cos(sycl::sqrt(x));
        float sine = sycl::sin(x);
        float length = sycl::sqrt(cosine * cosine + sine * sine);

        x /= sycl::cos(length) * sycl::sin(length);
    }

    acc[id] = x;
});

With LOOP_ITERATION divided by 10, the kernel no longer hangs unless the loop body is made more computationally demanding, as above.

Sometimes, but not always, this runtime error is thrown:

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -14 (PI_ERROR_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST) -14 (PI_ERROR_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST)

Here is a pastebin of the output of clinfo on my system, in case it's relevant.

Hi @Adhesive_Bagels,
I have to imagine in this case that the problem is due to the GPU hangcheck: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-0/gpu-disable-hangcheck.html

I would say that it is normally unusual to have such a long-running kernel (I would expect most workloads to take on the order of tens of milliseconds or less), but there are of course valid cases where you might need an increased running time.

Hi @duncan,

Disabling the GPU hangcheck sounded like the perfect solution to the issue, but unfortunately, even after disabling the hangcheck (using the persistent, across-reboots method), my kernel still hangs after some time. I don't seem to be getting the

terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  Native API failed. Native API returns: -14 (PI_ERROR_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST) -14 (PI_ERROR_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST)

error anymore though.

I also tried writing N to /sys/module/i915/parameters/enable_hangcheck, as well as the solutions proposed here, but none of this changed anything either…

Hi @Adhesive_Bagels,
I could reproduce what you see and managed to confirm that the inconsistency in the behaviour comes from interference between your application and the display manager running on the same iGPU. When I run this example on a machine without a display (so nothing else is running on the iGPU), I get the SYCL runtime error reproducibly every time.

Unfortunately, I was also not able to stop the GPU hangs using the recommended ways of disabling the check. With the check disabled, I still see this in my system logs:

Oct 23 16:38:53 myhostname kernel: Fence expiration time out i915-0000:00:02.0:hang[4632]:4!
Oct 23 16:38:53 myhostname kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
Oct 23 16:38:53 myhostname kernel: i915 0000:00:02.0: [drm] hang[4632] context reset due to GPU hang
Oct 23 16:38:53 myhostname kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:8ed9fff3, in hang [4632]

I'm afraid my expertise on Intel GPU driver features and settings ends here. If you want to get to the bottom of this, I can only suggest trying the Intel support forums (maybe here) or elsewhere. However, your example code is quite an extreme case, and kernels would not normally be expected to run such long workloads (hence the hangcheck). Do you have an actual use case for running this kind of work? Perhaps the work could be divided in a way that is better suited to the hardware you have?
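To illustrate what I mean by dividing the work, here is a minimal sketch based on your reproducer (the CHUNKS value is just an arbitrary illustrative choice, not a recommendation): the long loop is split across several shorter kernel launches, and each launch resumes from the partial result written by the previous one, so no single launch runs long enough to trip the watchdog.

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

#define LOOP_ITERATION 10000000
#define N 1000000
#define CHUNKS 100 // illustrative: split the long loop into 100 shorter launches

int main()
{
    std::vector<float> v(N);

    sycl::queue q{sycl::gpu_selector_v};
    sycl::buffer buf{v};

    constexpr int chunk = LOOP_ITERATION / CHUNKS;

    for (int c = 0; c < CHUNKS; c++)
    {
        q.submit([&](sycl::handler& cgh)
        {
            auto acc {buf.get_access(cgh, sycl::read_write)};

            cgh.parallel_for(N, [=](sycl::id<1> id)
            {
                // Resume from the partial sum written by the previous launch.
                float x = (c == 0) ? 0.0f : acc[id];

                for (int i = c * chunk; i < (c + 1) * chunk; i++)
                    x += i / 2;

                acc[id] = x;
            });
        });
    }
    q.wait();

    std::cout << "Done!" << std::endl;

    return 0;
}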

Hi @rbielski,

I will try asking on the Intel forum and see if I can find a solution there, thanks.

I am currently writing a ray tracing application, and as the number of samples increases, the kernel obviously takes longer to complete. Instead of launching 1 ray tracing kernel of 1024 samples, I could launch 2 kernels of 512 samples each (or even 4 kernels of 256, …), effectively reducing the time taken by each kernel. That's what I've been doing to avoid the issue so far, but it feels more like a workaround than the right solution. In the meantime, and for my immediate use case, it will be sufficient.

I think you might run into serious performance issues if you try to brute-force the whole scene inside one giant kernel launch. I would strongly recommend running a subset of the rays per kernel launch. (You can use a 2D nd_range to split up the image that way, for example.)
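As a rough sketch of that idea (WIDTH, HEIGHT, SAMPLES_PER_LAUNCH, TOTAL_SAMPLES and trace_ray below are all hypothetical placeholders, not names from your code), you could accumulate a fixed number of samples per launch into an image buffer:

#include <sycl/sycl.hpp>
#include <iostream>

// Dummy stand-in for the real per-sample ray tracing work.
inline float trace_ray(sycl::id<2> pixel, int sample)
{
    float x = static_cast<float>(pixel[0] + pixel[1] + sample);
    return sycl::cos(x) * sycl::cos(x);
}

int main()
{
    constexpr size_t WIDTH = 1920, HEIGHT = 1080; // illustrative resolution
    constexpr int SAMPLES_PER_LAUNCH = 32;
    constexpr int TOTAL_SAMPLES = 1024;

    sycl::queue q{sycl::gpu_selector_v};
    sycl::buffer<float, 2> image{sycl::range<2>{HEIGHT, WIDTH}};

    for (int s = 0; s < TOTAL_SAMPLES; s += SAMPLES_PER_LAUNCH)
    {
        q.submit([&](sycl::handler& cgh)
        {
            sycl::accessor img{image, cgh, sycl::read_write};

            // 2D nd_range: one work-item per pixel, 8x8 work-groups.
            cgh.parallel_for(sycl::nd_range<2>{{HEIGHT, WIDTH}, {8, 8}},
                             [=](sycl::nd_item<2> item)
            {
                sycl::id<2> pixel = item.get_global_id();
                float accum = (s == 0) ? 0.0f : img[pixel];

                // Only a small batch of samples per launch keeps each kernel short.
                for (int i = 0; i < SAMPLES_PER_LAUNCH; i++)
                    accum += trace_ray(pixel, s + i);

                img[pixel] = accum; // progressive accumulation across launches
            });
        });
    }
    q.wait();

    std::cout << "Accumulated " << TOTAL_SAMPLES << " samples per pixel" << std::endl;

    return 0;
}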

There are other great resources that describe how to stratify sampling and accumulate those samples into a progressively generated image. I think we even published some blogs on it:

Hi @duncan,

I thought that avoiding the overhead of launching kernels, getting the accessors to the buffers, etc. would be beneficial, hence the “maximum” number of samples per kernel, so that the kernel can do as much work as possible without having to communicate with the CPU.

From a hardware perspective, why can launching one giant kernel be a concern for performance?

Generally speaking, yes, it is better (where possible) to pack as much work into one kernel launch as you can. That said, you also have to think about the performance of the kernel itself. Because rays can bounce effectively anywhere across the scene, I would worry in this case about divergent control flow (work-items in the same work-group taking different branches of an if, which is costly) and about memory access patterns (e.g. the ray-intersection tests happening on entirely different models in the scene). You want work-groups to be accessing the same memory and taking the same branches whenever you can make that happen.
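A contrived sketch of what that divergence looks like in practice (path_a and path_b are just stand-ins for, say, intersection tests against different parts of the scene; none of these names come from your code):

#include <sycl/sycl.hpp>
#include <vector>

// Two deliberately different "work" functions standing in for different code paths.
inline float path_a(float x) { for (int i = 0; i < 256; i++) x = sycl::sin(x); return x; }
inline float path_b(float x) { for (int i = 0; i < 256; i++) x = sycl::cos(x); return x; }

int main()
{
    constexpr size_t N = 1 << 20;
    std::vector<float> v(N, 1.0f);

    sycl::queue q{sycl::gpu_selector_v};
    sycl::buffer buf{v};

    q.submit([&](sycl::handler& cgh)
    {
        sycl::accessor acc{buf, cgh, sycl::read_write};

        cgh.parallel_for(sycl::nd_range<1>{N, 64}, [=](sycl::nd_item<1> item)
        {
            size_t gid = item.get_global_id(0);

            // Divergent: neighbouring work-items in the same work-group take
            // different branches, so the hardware effectively executes both paths.
            if (gid % 2 == 0)
                acc[gid] = path_a(acc[gid]);
            else
                acc[gid] = path_b(acc[gid]);

            // A more coherent alternative would make the condition uniform across
            // the work-group, e.g. if (item.get_group(0) % 2 == 0), so each group
            // only executes one of the two paths.
        });
    }).wait();

    return 0;
}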
