GPU optimisation of memory accesses

br-ko · 12 June 2024 15:07

Hi,
I am trying to follow the example from the optimisation guide (https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-1/sub-group.html):

  constexpr int N = 1024 * 1024;
  int *data = sycl::malloc_shared<int>(N, q);

  auto e = q.submit([&](auto &h) {
    h.parallel_for(sycl::nd_range(sycl::range{N / 16}, sycl::range{32}),
                   [=](sycl::nd_item<1> it) {
                     int i = it.get_global_linear_id();
                     auto sg = it.get_sub_group();
                     int sgSize = sg.get_local_range()[0];
                     i = (i / sgSize) * sgSize * 16 + (i % sgSize);
                     for (int j = 0; j < sgSize * 16; j += sgSize) {
                       data[i + j] = -1;
                     }
                   });
  });

I tried to apply the same pattern to improve my own code's performance, but the code is now 16 times slower. I noticed that the example uses a global work size of N/16, which makes sense because each work-item writes 16 elements. When I reduced my global work size by a factor of 16, I got this error message:

Caught a SYCL host exception:
Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
terminate called after throwing an instance of 'sycl::_V1::nd_range_error'
  what():  Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)

When I checked the Codeplay website (Troubleshooting - Guides - oneAPI for NVIDIA® GPUs - Products - Codeplay Developer), it said this message can be caused by something like range(48,32), which makes sense because 48 is not divisible by 32. But in my case:

    const size_t local_size = 32;  // Number of work-items per work-group
    const size_t global_size = (((no_of_nodes + local_size - 1) / local_size) * local_size)/16;

this does not look like a divisibility issue?

duncan · 12 June 2024 16:10

Hi @br-ko,

The problem could be related to the precise value of no_of_nodes. You’ll need to make sure that, after this calculation, the global size is rounded up to the nearest multiple of 32. If no_of_nodes is itself a power of two (and at least 512, so that dividing by 16 still leaves a multiple of 32), your calculation works well, but outside of that restriction you can easily end up with an incompatible global and local range.
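
As a rough illustration of how this goes wrong, here is a small sketch of the arithmetic. The helper names round_up and original_global_size are made up for this example, and 1500 and 1024 are arbitrary values for no_of_nodes; only the formula itself comes from the post above.

```cpp
#include <cstddef>

// Hypothetical helper: round n up to the next multiple of `multiple`.
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
  return ((n + multiple - 1) / multiple) * multiple;
}

// The global size as computed in the original post:
// pad to a multiple of local_size, then divide by 16.
constexpr std::size_t original_global_size(std::size_t no_of_nodes,
                                           std::size_t local_size) {
  return round_up(no_of_nodes, local_size) / 16;
}

// no_of_nodes = 1500: padded to 1504, divided by 16 gives 94.
// 94 is not a multiple of 32, so the last work-group would be partial.
static_assert(original_global_size(1500, 32) == 94, "1504 / 16");
static_assert(original_global_size(1500, 32) % 32 == 30, "non-uniform");

// no_of_nodes = 1024 (a power of two): 1024 / 16 = 64, a multiple of 32.
static_assert(original_global_size(1024, 32) % 32 == 0, "uniform");
```

The division by 16 happens after the rounding, so the result is only guaranteed to be a multiple of 2, not of 32.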

I hope this helps,
Duncan.

br-ko · 12 June 2024 21:04

Hi @duncan, yes, you were right. I needed to do:

const size_t global_size = (((no_of_nodes + local_size - 1) / local_size) * local_size)/16;
const size_t global_size_final = (((global_size + local_size - 1) / local_size) * local_size);

This makes sure that the final global size is divisible by local_size. I had missed one of the rounding steps.
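
For anyone who wants to check the numbers, the two lines can be sketched as compile-time arithmetic. The helper names round_up and final_global_size are hypothetical, 1500 is an arbitrary example value for no_of_nodes, and the factor 16 is the number of elements per work-item from the guide's example.

```cpp
#include <cstddef>

// Hypothetical helper: round n up to the next multiple of `multiple`.
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
  return ((n + multiple - 1) / multiple) * multiple;
}

// The two-step calculation: pad, divide by 16, then pad again so the
// result is a multiple of local_size.
constexpr std::size_t final_global_size(std::size_t no_of_nodes,
                                        std::size_t local_size) {
  return round_up(round_up(no_of_nodes, local_size) / 16, local_size);
}

// no_of_nodes = 1500: 1504 / 16 = 94, padded up to 96 (3 work-groups of 32).
static_assert(final_global_size(1500, 32) == 96, "94 is padded to 96");
static_assert(final_global_size(1500, 32) % 32 == 0, "uniform work-groups");

// 96 work-items * 16 elements = 1536, which is more than 1500.
static_assert(final_global_size(1500, 32) * 16 == 1536, "over-covers");
```

Because the padded range can cover more elements than no_of_nodes (1536 versus 1500 here), the kernel probably needs a bounds check on each index before writing, assuming the buffer holds exactly no_of_nodes elements.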

Thank you again,
