Not every item of nd_range<2> is executed with handler.parallel_for(range, kernel_class)

Adhesive_Bagels · 14 October 2023 21:18

I started writing a ray tracer using SYCL and I’m still learning the basics. My current issue is that, when trying to use a kernel class to organize my code, calling handler.parallel_for() on my range AND on my kernel class doesn’t seem to be executing every item of the range.

#include <CL/sycl.hpp>

#include "render_kernel.h"
#include "triangle.h"

int main(int argc, char* argv[])
{
    const int width = 128;
    const int height = 128;

    sycl::queue queue;

    std::cout << "Using " << queue.get_device().get_info<sycl::info::device::name>() << std::endl;

    Image image(width, height);

    std::vector<Triangle> triangles_buffer_host;
    triangles_buffer_host.push_back(Triangle(Point(0, 0, -2), Point(1, 0, -2), Point(0.5, 0.5, -2.0)));

    sycl::buffer<Color> image_buffer(image.color_data(), image.width() * image.height());
    sycl::buffer<Triangle> triangle_buffer(triangles_buffer_host.data(), triangles_buffer_host.size());

    queue.submit([&] (sycl::handler& handler) {
        auto image_buffer_access = image_buffer.get_access<sycl::access::mode::write>(handler);
        auto triangle_buffer_access = triangle_buffer.get_access<sycl::access::mode::read>(handler);

        const auto global_range = sycl::range<2>(width, height);
        const auto local_range = sycl::range<2>(8, 8);
        const auto coordinates_indices = sycl::nd_range<2>(global_range, local_range);

        sycl::stream debug_out_stream(1024, 128, handler);

        auto render_kernel = RenderKernel(width, height,
                                          image_buffer_access,
                                          triangle_buffer_access,
                                          debug_out_stream);

        handler.parallel_for(coordinates_indices, render_kernel);

        // The commented code below works as expected and the whole image is red
//      handler.parallel_for(coordinates_indices, [image_buffer_access] (sycl::nd_item<2> coordinates)
//      {
//          int x = coordinates.get_global_id(0);
//          int y = coordinates.get_global_id(1);

//          image_buffer_access[y * width + x] = Color(1.0, 0.0, 0.0);
//      });
    }).wait();

    return 0;
}

render_kernel.h:

#include <CL/sycl.hpp>

#include "color.h"
#include "triangle.h"

class RenderKernel
{
public:
    RenderKernel(int width, int height,
                 sycl::accessor<Color, 1, sycl::access::mode::write, sycl::access::target::device> frame_buffer_accessor,
                 sycl::accessor<Triangle, 1, sycl::access::mode::read, sycl::access::target::device> triangle_buffer_accessor,
                 sycl::stream debug_out_stream) :
        m_frame_buffer_access(frame_buffer_accessor),
        m_triangle_buffer_access(triangle_buffer_accessor),
        m_out_stream(debug_out_stream) {}

    SYCL_EXTERNAL void operator()(const sycl::nd_item<2>& coordinates) const
    {
        int x = coordinates.get_global_id(0);
        int y = coordinates.get_global_id(1);
        m_frame_buffer_access[y * m_width + x] = Color(1.0, 0.0, 0.0);
    }


private:
    int m_width, m_height;

    sycl::accessor<Color, 1, sycl::access::mode::write, sycl::access::target::device> m_frame_buffer_access;
    sycl::accessor<Triangle, 1, sycl::access::mode::read, sycl::access::target::device> m_triangle_buffer_access;
    sycl::stream m_out_stream;
};

Executing my kernel with a lambda function like so:

handler.parallel_for(coordinates_indices, [image_buffer_access] (sycl::nd_item<2> coordinates)
{
    int x = coordinates.get_global_id(0);
    int y = coordinates.get_global_id(1);

    image_buffer_access[y * width + x] = Color(1.0, 0.0, 0.0);
});

does work as expected. It really is when I’m using my kernel class as a functor that only a few items get executed.

Here is a result image as an example:
test

What could be the source of the issue?

Adhesive_Bagels · 15 October 2023 08:10

It turns out that this was because I had forgotten to initialize the m_width and m_height attributes in the constructor of RenderKernel.

m_frame_buffer_access[y * m_width + x] = Color(1.0, 0.0, 0.0);

was then using the garbage value m_width.

The right number of items do get executed whether or not m_width is correctly initialized. I was just mislead by my sycl::stream debug_out_stream because it only has a buffer size of 1024 so when I used it to print the IDs of the threads spawned in the kernel, it only showed a bunch until the buffer was eventually full and not able to show any more, leading me to think that only a fraction of the threads were being executed.

duncan · 16 October 2023 13:42

Hi @Adhesive_Bagels, thanks for posting a question (and answer!). Did the compiler emit a warning for the m_width variable? Similarly, was the debug out stream helpful in this case or a bit confusing? We’re always looking for ways we can improve either the specification or implementations.

Adhesive_Bagels · 17 October 2023 13:24

Hi @duncan

The compiler is not emitting any warning when removing the inline initialization of m_width (default compiler flags).

As for the sycl::stream, it was only confusing because of the limit of its buffer size but that’s probably more of me problem being new with SYCL rather than an implementation issue…

duncan · 17 October 2023 14:04

Thanks for the information. We could still do better in both cases, I think, firstly with m_width being used while uninitialised there could be a warning, and perhaps the stream could somehow warn that it has been filled (or has been written to lots while full).

Topic		Replies	Views
Using nd_range with parallel_for does not work	8	1852	27 May 2021
Problems with 'parallel_for'. ComputeCpp does not recognize "parallel_for"	3	866	8 March 2021
Creating variable length array in a SYCL kernel	4	1204	3 August 2021
Wrong data output SYCL development	2	438	17 February 2023
Passing multiple accessors to the kernel SYCL development	5	1380	18 May 2021

Not every item of nd_range<2> is executed with handler.parallel_for(range, kernel_class)

Related topics