Inside an nd_range kernel I have the following for loop, which I would like to partition across threads. For example:
Thread 0: 0 - 31
Thread 1: 32 - 63
…
When I change the k++ to k += WG_SIZE it produces completely wrong results.
The version with k++ is correct for the initial kernel call, but in the next call it misses some of the values that should be updated. How could I fix this?
const int test_id = item.get_global_id(0);
const int tid = item.get_local_id(0); // threadIdx.x
const int gtid = item.get_group(0) * item.get_local_range().get(0) + tid;//blockIdx.x * blockDim.x + threadIdx.x
const int MAXWG = item.get_group_range(0); // gridDim.x
const int WG_SIZE = item.get_local_range(0); // blockDim.x
...
for (unsigned int k = tid; k < lim_end - lim_start; k += WG_SIZE) {
// usm_data[k+lim_start] = 1;
atomic_ref<unsigned int, memory_order::relaxed, memory_scope::device, access::address_space::global_space> atomic_mask(usm_data[k+lim_start]);
atomic_mask.store(1);
}
}
We still don’t see key parts of your code, so it’s difficult to judge what might be wrong. For example, we don’t know what your lim_start and lim_end are. It would help if you could post a full minimal reproducer showing the problem. Nevertheless, I’ll try to explain below what I’m guessing you’re looking for.
If you had an array of 1024 elements and wanted to fill it in parallel using 4 work groups with 256 work items each, you would do:
group 0 fills elements 0-255
group 1 fills elements 256-511
group 2 fills elements 512-767
group 3 fills elements 768-1023
and within each group of 256 items, each item would fill the element 256*groupId + itemId.
So that would be:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt = item.get_local_range(0);
data[gid*nt + tid] = 1;
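For completeness, a minimal host-side sketch of this 1024-element case could look like the following. The queue q, the USM allocation and the kernel launch are assumptions for illustration, not taken from your code, which may well use buffers/accessors instead:
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    constexpr size_t num_elements = 1024;  // 4 work groups * 256 work items each
    constexpr size_t wg_size = 256;

    // Device USM allocation (assumed here; adapt to your own memory management).
    unsigned int* data = sycl::malloc_device<unsigned int>(num_elements, q);

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{num_elements}, sycl::range<1>{wg_size}},
                   [=](sycl::nd_item<1> item) {
        const int tid = item.get_local_id(0);
        const int gid = item.get_group(0);
        const int nt  = item.get_local_range(0);
        // One element per work item: group gid covers indices [gid*nt, gid*nt + nt).
        data[gid * nt + tid] = 1;
    }).wait();

    sycl::free(data, q);
    return 0;
}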
Now if you wanted to fill an array of 4096 elements with the same number of work groups and items, you would use each item 4 times. This can be done in two ways, either with:
group 0, item 0 filling elements 0,1,2,3
group 0, item 1 filling elements 4,5,6,7
group 1, item 0 filling elements 1024, 1025, 1026, 1027
group 1, item 1 filling elements 1028, 1029, 1030, 1031
and so on,
or it can be done with:
group 0, item 0 filling elements 0, 1024, 2048, 3072
group 0, item 1 filling elements 1, 1025, 2049, 3073
group 1, item 0 filling elements 256, 1280, 2304, 3328
group 1, item 1 filling elements 257, 1281, 2305, 3329
and so on
Memory access is better in the second case, because neighbouring threads write to neighbouring elements in the same clock cycle: when thread 0 writes element 0, thread 1 writes element 1 and thread 2 writes element 2, all at the same time; in the next cycle they fill 1024, 1025, 1026. This way the GPU can issue one batched (coalesced) memory operation over a contiguous range.
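For comparison, the first (blocked) layout could be sketched roughly as below; elems_per_item is an illustrative name, not something from your code, and here equals 4096 / (4 * 256) = 4:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int elems_per_item = 4;                         // 4096 / (4 * 256)
const int start = (gid*nt + tid) * elems_per_item;    // each item owns a consecutive chunk
for (int i = start; i < start + elems_per_item; ++i) {
    data[i] = 1;
}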
The second, strided layout would be written in code like this:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int num_groups = item.get_group_range(0);
const int num_elements = 4096;
// Start at this item's global id and stride by the total number of work items,
// so neighbouring items touch neighbouring elements on every iteration.
for (int i = gid*nt + tid; i < num_elements; i += nt*num_groups) {
    data[i] = 1;
}
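Applied back to your loop: if the intention is that all work items of all groups together cover one [lim_start, lim_end) range (rather than every group walking the whole range on its own, which is what starting at k = tid and striding by WG_SIZE gives you), then the stride has to be the total number of work items, WG_SIZE * MAXWG. A rough sketch using the variable names from your snippet (lim_start, lim_end and usm_data are assumed to be visible in the kernel):
const int tid     = item.get_local_id(0);
const int WG_SIZE = item.get_local_range(0);
const int MAXWG   = item.get_group_range(0);
const int gtid    = item.get_group(0) * WG_SIZE + tid;

// Grid-stride loop: the whole grid shares the [lim_start, lim_end) range,
// and no element is written by more than one work item.
for (unsigned int k = gtid; k < lim_end - lim_start; k += WG_SIZE * MAXWG) {
    usm_data[k + lim_start] = 1;  // a plain store is enough here; the relaxed atomic store works too
}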