Hi,
I am trying to follow the example from the optimisation guide (https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-1/sub-group.html):
constexpr int N = 1024 * 1024;
int *data = sycl::malloc_shared<int>(N, q);
auto e = q.submit([&](auto &h) {
  h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}),
                 [=](sycl::nd_item<1> it) {
                   int i = it.get_global_linear_id();
                   auto sg = it.get_sub_group();
                   int sgSize = sg.get_local_range()[0];
                   i = (i / sgSize) * sgSize * 16 + (i % sgSize);
                   for (int j = 0; j < sgSize * 16; j += sgSize) {
                     data[i + j] = -1;
                   }
                 });
});
I tried to apply the same pattern to improve my code's performance, but my code is now 16 times slower. I noticed that the example uses N/16 as the
global work size, which makes sense. When I tried to reduce my global work size by a factor of 16, I got this error message:
Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'sycl::_V1::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
When I checked the Codeplay website (Troubleshooting - Guides - oneAPI for NVIDIA® GPUs - Products - Codeplay Developer), it said this message can be caused by a range such as range(48, 32), which makes sense because 48 is not divisible by 32. But in my case:
const size_t local_size = 32; // Number of work-items per work-group
const size_t global_size = (((no_of_nodes + local_size - 1) / local_size) * local_size)/16;
this does not look like a divisibility issue?
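
To make the arithmetic concrete, here is a minimal host-only sketch (no SYCL required) that evaluates the global_size formula above. The helper names round_up, global_as_posted, and global_padded_last are hypothetical, and 1030 is a made-up value for no_of_nodes, not the real one. It suggests that padding to a multiple of 32 and then dividing by 16 only guarantees a multiple of 2, so the result is not necessarily divisible by local_size.

```cpp
#include <cstddef>

// Round n up to the next multiple of m (m > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t m) {
  return ((n + m - 1) / m) * m;
}

// The formula from the post: pad to a multiple of local_size first,
// then divide by 16. The division can break the divisibility again.
constexpr std::size_t global_as_posted(std::size_t n, std::size_t local_size) {
  return round_up(n, local_size) / 16;
}

// One possible alternative (an assumption, not taken from the guide):
// compute the number of work-items needed when each handles per_item
// elements, then pad that count to a multiple of local_size.
constexpr std::size_t global_padded_last(std::size_t n, std::size_t local_size,
                                         std::size_t per_item) {
  return round_up((n + per_item - 1) / per_item, local_size);
}

// Hypothetical size: 1030 pads to 1056, and 1056 / 16 = 66.
static_assert(global_as_posted(1030, 32) == 66, "posted formula");
static_assert(global_as_posted(1030, 32) % 32 == 2, "not a multiple of 32");

// Dividing first gives 65 work-items, which pad to 96.
static_assert(global_padded_last(1030, 32, 16) == 96, "padded last");
static_assert(global_padded_last(1030, 32, 16) % 32 == 0, "multiple of 32");
```

With the padded variant, 96 work-items times 16 elements covers more than 1030 elements, so the kernel would presumably need a bounds check such as if (i + j < no_of_nodes) before each access.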