Question about max group size

Hi everyone! I have a question about migrating this CUDA kernel:

kernel<<< blocks, threads >>>(...);

DPCT migrates this kernel as:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, blocks) *
                          sycl::range<3>(1, 1, threads),
                      sycl::range<3>(1, 1, threads)),
    [=](sycl::nd_item<3> item_ct1) {
        kernel(...);
    });

But DPCT alerts:

DPCT1049:55: The workgroup size passed to the SYCL kernel may
exceed the limit. To get the device limit, query
info::device::max_work_group_size. Adjust the workgroup size if
needed.

So, how can I ask for the maximum blocks and threads? Thank you so much!

The converter tool has made this use a 3D range, which doesn’t seem like the best solution. Assuming your range really is 1D, we would suggest something like this:

const sycl::nd_range<1> range{sycl::range<1>(blocks * threads),
                              sycl::range<1>(threads)};
cgh.parallel_for(range, [=]...);

With the 3D form the blocking could be poor. With the 1D approach it should just work, and the blocking will be better.

You could also make sure that threads is less than or equal to the maximum work-group size, but it probably already is.
Does that solve your problem?

Hey! Thank you so much.

And how can I get the maximum number of blocks and threads for the current device?

If you need to do that, you can call:

device.get_info<sycl::info::device::max_work_group_size>()

There’s some useful info on this sort of topic on this page too.

Hey Rod, sorry I was trying what you said and other alternatives.

The problem is that I’m getting different results. I don’t know why dpct migrated the kernel code with 3D, but in my kernel, for example, I index like this:

  int tid = item_ct1.get_local_id(2) +
            item_ct1.get_group(2) * item_ct1.get_local_range().get(2);

Converting from 3D to 1D (sycl::nd_item<1> item_ct1), should I modify that indexing? One thing I noticed is that my new code (using 1D) runs 50% faster, so maybe there is a problem with the blocks and threads? Thank you

Hi, it might be best to post or link to the original CUDA kernel code, and the code you have written in SYCL; otherwise we would be guessing a bit. Thanks.

Hi, I’m trying to migrate this kernel.

I’m not launching it exactly like that because the CUDA code uses a function variable, so I had to modify that part, but for this problem it’s the same because I only have to use the swSolveShortGpu kernel.

The oneapi code generated is the following:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, BLOCKS) *
                          sycl::range<3>(1, 1, THREADS),
                      sycl::range<3>(1, 1, THREADS)),
    [=](sycl::nd_item<3> item_ct1) {
        swSolveShortGpu(…);
    });

So, my problem is that I want to use the best “group size (threads)” for the current device, but I don’t know what is the correct way to achieve that.

Thank you

We can’t see anything obviously wrong with your code. In terms of choosing the best work group size, there’s a blog post about this on our website. There is no definitive way to do it and some experimentation is likely to be needed.


Thank you so much Rod, I was able to solve it.

Have a great day.