Inside an nd_range kernel I have the following for loop, which I would like to partition across threads. For example:
Thread 0: 0 - 31
Thread 1: 32 - 63
…
When I change the k++ to k += WG_SIZE it produces completely wrong results.
The version with k++ is correct for the initial kernel call, but in the next call it misses some of the values that should be updated. How could I fix this?
const int test_id = item.get_global_id(0);
const int tid = item.get_local_id(0); // threadIdx.x
const int gtid = item.get_group(0) * item.get_local_range().get(0) + tid;//blockIdx.x * blockDim.x + threadIdx.x
const int MAXWG = item.get_group_range(0); // gridDim.x
const int WG_SIZE = item.get_local_range(0); // blockDim.x
...
for (unsigned int k = tid; k < lim_end - lim_start; k += WG_SIZE) {
// usm_data[k+lim_start] = 1;
atomic_ref<unsigned int, memory_order::relaxed, memory_scope::device, access::address_space::global_space> atomic_mask(usm_data[k+lim_start]);
atomic_mask.store(1);
}
}
We still don’t see key parts of your code, so it’s difficult to judge what might be wrong. For example, we don’t know what your lim_start and lim_end are. It would help if you could post a full minimal reproducer showing the problem. Nevertheless, I’ll try to explain below what I’m guessing you’re looking for.
If you had an array of 1024 elements and wanted to fill it in parallel using 4 work groups with 256 work items each, you would do:
group 0 fills elements 0-255
group 1 fills elements 256-511
group 2 fills elements 512-767
group 3 fills elements 768-1023
and within each group of 256 items, each item would fill the element 256*groupId + itemId.
So that would be:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt = item.get_local_range(0);
data[gid*nt + tid] = 1;
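For completeness, a minimal host-side sketch of this 1024-element case could look like the following. The queue q, the USM allocation and the kernel launch are assumptions for illustration, not taken from your code, which may well use buffers/accessors instead:
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    constexpr size_t num_elements = 1024;  // 4 work groups * 256 work items each
    constexpr size_t wg_size = 256;

    // Device USM allocation (assumed here; adapt to your own memory management).
    unsigned int* data = sycl::malloc_device<unsigned int>(num_elements, q);

    q.parallel_for(sycl::nd_range<1>{sycl::range<1>{num_elements}, sycl::range<1>{wg_size}},
                   [=](sycl::nd_item<1> item) {
        const int tid = item.get_local_id(0);
        const int gid = item.get_group(0);
        const int nt  = item.get_local_range(0);
        // One element per work item: group gid covers indices [gid*nt, gid*nt + nt).
        data[gid * nt + tid] = 1;
    }).wait();

    sycl::free(data, q);
    return 0;
}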
Now if you wanted to fill an array of 4096 elements with the same number of work groups and items, you would use each item 4 times. This can be done in two ways, either with:
group 0, item 0 filling elements 0,1,2,3
group 0, item 1 filling elements 4,5,6,7
group 1, item 0 filling elements 1024, 1025, 1026, 1027
group 1, item 1 filling elements 1028, 1029, 1030, 1031
and so on,
or it can be done with:
group 0, item 0 filling elements 0, 1024, 2048, 3072
group 0, item 1 filling elements 1, 1025, 2049, 3073
group 1, item 0 filling elements 256, 1280, 2304, 3328
group 1, item 1 filling elements 257, 1281, 2305, 3329
and so on
Memory access is better in the second case, because neighbouring threads write to neighbouring elements in the same clock cycle: when thread 0 writes element 0, thread 1 writes element 1 and thread 2 writes element 2, all at the same time; in the next cycle they fill 1024, 1025, 1026. This way the GPU can issue one batched (coalesced) memory operation over a contiguous range.
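For comparison, the first (blocked) layout could be sketched roughly as below; elems_per_item is an illustrative name, not something from your code, and here equals 4096 / (4 * 256) = 4:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int elems_per_item = 4;                         // 4096 / (4 * 256)
const int start = (gid*nt + tid) * elems_per_item;    // each item owns a consecutive chunk
for (int i = start; i < start + elems_per_item; ++i) {
    data[i] = 1;
}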
The second, strided layout would be written in code like this:
const int tid = item.get_local_id(0);
const int gid = item.get_group(0);
const int nt  = item.get_local_range(0);
const int num_groups = item.get_group_range(0);
const int num_elements = 4096;
// Start at this item's global id and stride by the total number of work items,
// so neighbouring items touch neighbouring elements on every iteration.
for (int i = gid*nt + tid; i < num_elements; i += nt*num_groups) {
    data[i] = 1;
}
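Applied back to your loop: if the intention is that all work items of all groups together cover one [lim_start, lim_end) range (rather than every group walking the whole range on its own, which is what starting at k = tid and striding by WG_SIZE gives you), then the stride has to be the total number of work items, WG_SIZE * MAXWG. A rough sketch using the variable names from your snippet (lim_start, lim_end and usm_data are assumed to be visible in the kernel):
const int tid     = item.get_local_id(0);
const int WG_SIZE = item.get_local_range(0);
const int MAXWG   = item.get_group_range(0);
const int gtid    = item.get_group(0) * WG_SIZE + tid;

// Grid-stride loop: the whole grid shares the [lim_start, lim_end) range,
// and no element is written by more than one work item.
for (unsigned int k = gtid; k < lim_end - lim_start; k += WG_SIZE * MAXWG) {
    usm_data[k + lim_start] = 1;  // a plain store is enough here; the relaxed atomic store works too
}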