Question about max group size

Hi everyone! I have a question about migrating this CUDA kernel:

kernel<<< blocks, threads >>>(...);

DPCT migrates this kernel as:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, blocks) *
                          sycl::range<3>(1, 1, threads),
                      sycl::range<3>(1, 1, threads)),
    [=](sycl::nd_item<3> item_ct1) {
        kernel(...);
    });

But DPCT alerts:

DPCT1049:55: The workgroup size passed to the SYCL kernel may
exceed the limit. To get the device limit, query
info::device::max_work_group_size. Adjust the workgroup size if
needed.

So, how can I ask for the maximum blocks and threads? Thank you so much!

The converter tool has made this use a 3D range, which doesn’t seem like the best solution. Assuming your range really is 1D, we would suggest something like this:

const sycl::nd_range<1> range{sycl::range<1>(blocks * threads),
                              sycl::range<1>(threads)};
cgh.parallel_for(range, [=]...);

With the 3D form the blocking could be poor. With the 1D approach it should just work, and the blocking will be better.

You could also make sure that threads is less than or equal to the maximum work-group size, but it probably already is.
Does that solve your problem?

Hey! Thank you so much.

And how can I get the maximum number of blocks and threads for the current device?

If you need to do that, you can call:

device.get_info<sycl::info::device::max_work_group_size>()

There’s some useful info on this sort of topic on this page too.

Hey Rod, sorry I was trying what you said and other alternatives.

The problem is that I’m getting different results. I don’t know why dpct migrated the kernel code with 3D, but in my kernel, for example, I index like this:

  int tid = item_ct1.get_local_id(2) +
            item_ct1.get_group(2) * item_ct1.get_local_range().get(2);

Converting from 3D to 1D (sycl::nd_item<1> item_ct1), should I modify that indexing? One thing I noticed is that my new code (using 1D) runs 50% faster, so maybe there is a problem with the blocks and threads? Thank you

Hi, it might be best to post or link to the original CUDA kernel code, and the code you have written in SYCL; otherwise we would be guessing a bit. Thanks.

Hi, I’m trying to migrate this kernel.

I’m not launching it exactly like that because the CUDA code uses a function variable, so I had to modify that part, but for this problem it’s the same because I only have to use the swSolveShortGpu kernel.

The oneapi code generated is the following:

cgh.parallel_for(
    sycl::nd_range<3>(sycl::range<3>(1, 1, BLOCKS) *
                          sycl::range<3>(1, 1, THREADS),
                      sycl::range<3>(1, 1, THREADS)),
    [=](sycl::nd_item<3> item_ct1) {
        swSolveShortGpu(…);
    });

So, my problem is that I want to use the best “group size (threads)” for the current device, but I don’t know what is the correct way to achieve that.

Thank you

We can’t see anything obviously wrong with your code. In terms of choosing the best work group size, there’s a blog post about this on our website. There is no definitive way to do it and some experimentation is likely to be needed.


Thank you so much Rod, I was able to solve it.

Have a great day.