Increasing Performance of a For Loop inside an NDRange

Hi,
I have the following kernel:

    nd_range<1> range(global_size, local_size);

    auto e = q.parallel_for<class TestKernel>(range, [=](nd_item<1> item) [[intel::kernel_args_restrict]] {
        int tid = item.get_global_linear_id();
        int lid = item.get_local_id(0);

        // Each work-group processes a segment of the nodes
        auto sg = item.get_sub_group();
        int sgSize = sg.get_local_range()[0];
        int i = (tid / sgSize) * sgSize * 16 + (tid % sgSize);

        // Process nodes assigned to this work-item within the work-group
        for (int j = 0; j < sgSize * 16; j += sgSize) {
            if (j < lim) { // making sure that j is smaller than gws
                // Read from the pipe
                unsigned int index = A[i + node];
                // Process the current node in tiles
                unsigned int start = lower_limit[index];
                unsigned int end = upper_limit[index];

                for (unsigned int k = start; k < end; k++) {
                    int id = B[k];
                    // there are more instructions here truncated to keep it short
                }
            }
        }
    });

In short, we don’t know the inner for-loop bounds at compile time. Is there an approach to make this perform better? Although we access A and B serially, is it possible to do a speculative block retrieval from memory? Is there any guidance for optimising nested for loops inside an nd_range? This version is actually much slower than the version I had without the sgSize striding, which got me thinking that there must be something I am missing.

I am running it on an NVIDIA A100, for context.

Thanks!

Hi @br-ko,

I’ve been looking at your code, and it’s been a little hard to intuit exactly what it’s trying to do. Part of the problem, I suspect, is that the compiler has very little information about what your code will do at runtime. For example, looking at start and end, are there any guarantees about these values? Can they be the same on one thread, and 0-100 on another? Beyond that, pointer-chasing code like this is generally going to perform less well on a GPU than on, for example, a CPU, by nature of the hardware.

There are ways to do preloads in some fashion. You can also use group functions to load a chunk from global memory into local memory, though in that case you would need to benchmark the code, as this optimisation generally benefits your code most when there is lots of data reuse between the individual threads; a rough sketch is below.
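Something along these lines (purely a sketch, not your actual kernel: `data` and `TILE` are placeholder names, and it assumes USM pointers, `#include <sycl/sycl.hpp>` and a `sycl::queue q` as in your snippet):

    constexpr size_t TILE = 256; // placeholder tile size, one tile per work-group

    q.submit([&](sycl::handler& cgh) {
        // Scratch space in local (shared) memory for this work-group
        sycl::local_accessor<unsigned int, 1> tile(sycl::range<1>(TILE), cgh);

        cgh.parallel_for(sycl::nd_range<1>(global_size, local_size),
                         [=](sycl::nd_item<1> item) {
            size_t lid  = item.get_local_id(0);
            size_t lsz  = item.get_local_range(0);
            size_t base = item.get_group(0) * TILE; // start of this group's tile

            // Each work-item copies a strided slice of the tile from global memory
            for (size_t j = lid; j < TILE; j += lsz)
                tile[j] = data[base + j];

            // Make the tile visible to every work-item in the group
            sycl::group_barrier(item.get_group());

            // ... all work-items can now read tile[0..TILE) out of local memory ...
        });
    });

Whether this wins depends entirely on how often each work-item in the group re-reads the same tile; if every item touches disjoint data, the extra copy is pure overhead.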

I would also recommend running the NVIDIA profiling tools over this code to see if they have any recommendations. They will be able to highlight what they think is making the kernel slow compared to peak device performance. There is information here: Nsight Compute | NVIDIA Developer
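For example, from the command line (the binary name here is just a placeholder):

    # Per-kernel metrics plus guided analysis hints (writes kernel_report.ncu-rep)
    ncu --set full -o kernel_report ./my_sycl_app
    # Whole-application timeline with Nsight Systems
    nsys profile -o timeline ./my_sycl_app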

I hope this helps,
Duncan.

Thank you. OK, I think it is better to go step by step. Why does this code perform more or less the same as the following?

    nd_range<1> range(global_size, local_size);

    auto e = q.parallel_for<class TestKernel>(range, [=](nd_item<1> item) [[intel::kernel_args_restrict]] {
        int tid = item.get_global_linear_id();

        // Process nodes assigned to this work-item within the work-group
        if (tid < global_size) { // making sure that tid is smaller than gws
            // Read from the pipe
            unsigned int index = A[tid];
            // Process the current node in tiles
            unsigned int start = lower_limit[index];
            unsigned int end = upper_limit[index];

            for (unsigned int k = start; k < end; k++) {
                int id = B[k];
                // there are more instructions here truncated to keep it short
            }
        }
    });

I am having a hard time understanding all these work-group, sub-group, and local_id concepts, since none of them gives me any performance benefit.

Secondly, using prefetch for B[k] didn’t give me any performance gains. What am I doing wrong here?

    ...
    sycl::global_ptr<unsigned int> b_ptr(B + start);
    b_ptr.prefetch(end - start);
    for (unsigned int k = 0; k < end - start; k++) {
        int id = b_ptr[k];
        // there are more instructions here truncated to keep it short
    }
    ...
So, I saw that the Prefetch example uses two pointers and seems to alternate between them, but the issue is that I am using tid, which is a global id. How could I employ this?

Hi @br-ko,

I don’t think prefetch will do much here, because there are no instructions between the prefetch and the loop. I was instead suggesting something like the work group copy functions, if you know that different threads will access the same data: SYCL™ 2020 Specification (revision 8)

As I say, however, this style of code is just going to perform poorly: you’re not doing the sort of work that GPUs tend to be good at. Before the code can even start to do anything with B, it has to perform three loads, and two of those depend on the result of the first. I would really recommend using the profiling tools to see if they have any specific recommendations, though fundamentally I would also recommend that you try to change the code so that it loads memory as coherently and consistently as possible; one possible direction is sketched below.
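Purely as a sketch (the `starts`/`ends` arrays are hypothetical USM allocations, not part of your code): you could split the index chase into its own pass, so the main kernel begins with independent, coalesced loads rather than a chain of dependent ones.

    // Hypothetical pre-pass: materialise the per-work-item ranges so the main
    // kernel no longer has to load lower_limit/upper_limit through A.
    q.parallel_for(sycl::range<1>(global_size), [=](sycl::id<1> i) {
        unsigned int index = A[i];
        starts[i] = lower_limit[index];
        ends[i]   = upper_limit[index];
    }).wait();

    // The main kernel then reads starts[tid] and ends[tid] directly (coalesced),
    // leaving only the data-dependent reads of B; profile to see whether it helps.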
