Parallelise a for loop across threads

Inside an nd_range kernel I have the following for loop, which I would like to partition across threads. For example:

Thread 0: elements 0-31
Thread 1: elements 32-63

When I change the k++ to k+=WG_SIZE it produces completely wrong results. With k++ the results are correct for the initial kernel call, but in the next call some of the values that should be updated are missed. How can I fix this?

const int test_id = item.get_global_id(0);
const int tid     = item.get_local_id(0);                                     // threadIdx.x
const int gtid    = item.get_group(0) * item.get_local_range().get(0) + tid;  // blockIdx.x * blockDim.x + threadIdx.x
const int MAXWG   = item.get_group_range(0);                                  // gridDim.x
const int WG_SIZE = item.get_local_range(0);                                  // blockDim.x
...
    for (unsigned int k = tid; k < lim_end - lim_start; k += WG_SIZE) {
        // usm_data[k + lim_start] = 1;
        atomic_ref<unsigned int, memory_order::relaxed, memory_scope::device,
                   access::address_space::global_space> atomic_mask(usm_data[k + lim_start]);
        atomic_mask.store(1);
    }
}

Can you post the launch config of the kernel please?

Hi @rod,


    const size_t local_size = NUMBER_OF_WORKGROUPS;  // Number of work-items per work-group
    const size_t global_size = ((no_of_elems + local_size - 1) / local_size) * local_size;

    // Setup the range
    nd_range<1> range(global_size, local_size);

    auto e = q.parallel_for<class Test<krnl_id>>(range, [=](nd_item<1> item) [[intel::kernel_args_restrict]] {

We still don’t see key parts of your code, so it’s difficult to judge what might be wrong. For example, we don’t know what your lim_end and lim_start are. It would be helpful if you could post a full minimal reproducer showing the problem. Nevertheless, I’ll try to explain below what I’m guessing you’re looking for.

If you had an array of 1024 elements and wanted to fill it in parallel using 4 work groups with 256 work items each, you would do:

  • group 0 fills elements 0-255
  • group 1 fills elements 256-511
  • group 2 fills elements 512-767
  • group 3 fills elements 768-1023

and within each group of 256 items, each item would fill the element 256*groupId + itemId.

So that would be:

const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt = item.get_local_range(0);

data[gid*nt + tid] = 1;
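
In case it helps, a complete launch for that 1024-element example might look roughly like the sketch below. The queue setup and the USM allocation here are placeholder assumptions of mine, not taken from your code:

#include <sycl/sycl.hpp>

int main() {
  constexpr size_t num_elements = 1024;
  constexpr size_t local_size   = 256;           // work-items per work-group
  constexpr size_t global_size  = num_elements;  // 4 work-groups of 256 items

  sycl::queue q;
  unsigned int *data = sycl::malloc_device<unsigned int>(num_elements, q);

  // Each work-item writes exactly one element: index = group * local_size + local id.
  q.parallel_for(sycl::nd_range<1>(global_size, local_size),
                 [=](sycl::nd_item<1> item) {
    const int tid = item.get_local_id(0);
    const int gid = item.get_group(0);
    const int nt  = item.get_local_range(0);
    data[gid * nt + tid] = 1;
  }).wait();

  sycl::free(data, q);
}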

Now if you wanted to fill an array of 4096 elements with the same number of work groups and items, you would use each item 4 times. This can be done in two ways, either with:

  • group 0, item 0 filling elements 0,1,2,3
  • group 0, item 1 filling elements 4,5,6,7
  • group 1, item 0 filling elements 1024, 1025, 1026, 1027
  • group 1, item 1 filling elements 1028, 1029, 1030, 1031
  • and so on,

or it can be done with:

  • group 0, item 0 filling elements 0, 1024, 2048, 3072
  • group 0, item 1 filling elements 1, 1025, 2049, 3073
  • group 1, item 0 filling elements 256, 1280, 2304, 3328
  • group 1, item 1 filling elements 257, 1281, 2305, 3329
  • and so on

The memory access pattern is better in the second case, because neighbouring threads write to neighbouring elements in the same clock cycle: when thread 0 writes to element 0, thread 1 writes to element 1 and thread 2 writes to element 2, all at the same time. In the next cycle they fill 1024, 1025, 1026, and so on. This way the GPU can coalesce the writes into a batched memory operation over a contiguous range.

This would be written in the code like this:

const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int ng  = item.get_group_range(0);
const int num_elements = 4096;

// Grid-stride loop: start at this item's global index and step by the total
// number of work-items (nt * ng = 1024 here), so neighbouring items keep
// writing neighbouring elements.
for (int i = gid * nt + tid; i < num_elements; i += nt * ng) {
  data[i] = 1;
}
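
For comparison, the first (blocked) variant could be written roughly like the sketch below. Same assumptions as above: this sits inside the same nd_item<1> kernel body, data is the USM pointer being filled, and num_elements is a multiple of the total number of work-items.

const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int ng  = item.get_group_range(0);
const int num_elements = 4096;
const int per_item     = num_elements / (nt * ng);  // 4 consecutive elements per work-item
const int global_id    = gid * nt + tid;

// Each work-item fills its own contiguous chunk, e.g. group 0, item 0 fills 0,1,2,3.
for (int j = 0; j < per_item; ++j) {
  data[global_id * per_item + j] = 1;
}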

Please let us know if that helps.
Thanks,
Rafal
