DPCT1049:55: The workgroup size passed to the SYCL kernel may
exceed the limit. To get the device limit, query
info::device::max_work_group_size. Adjust the workgroup size if
needed.
So, how can I query the maximum number of blocks and threads? Thank you so much!
The converter tool has migrated this to a 3D range, which doesn’t seem like the best fit here. Assuming your range is indeed 1D, we would suggest that you use something like this:
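The snippet from this reply is not included above, so what follows is only a sketch of the usual pattern, not the original suggestion. It is plain C++ so it can be checked without a SYCL toolchain. `Launch1D` and `make_launch_1d` are hypothetical names, and `device_limit` stands in for the value that `dev.get_info<sycl::info::device::max_work_group_size>()` would return on a real device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper type: the two sizes a 1D sycl::nd_range needs.
struct Launch1D {
    std::size_t global;  // total work-items, always a multiple of `local`
    std::size_t local;   // work-items per work-group
};

// `device_limit` is assumed to come from
// dev.get_info<sycl::info::device::max_work_group_size>().
inline Launch1D make_launch_1d(std::size_t n, std::size_t wanted_local,
                               std::size_t device_limit) {
    assert(wanted_local > 0 && device_limit > 0);
    // Never request more work-items per group than the device allows.
    const std::size_t local = std::min(wanted_local, device_limit);
    // Round the group count up so all n elements are covered.
    const std::size_t groups = (n + local - 1) / local;
    return {groups * local, local};
}
```

The result would then feed `sycl::nd_range<1>(sycl::range<1>(cfg.global), sycl::range<1>(cfg.local))`. Because the global size is padded up to a whole number of work-groups, the kernel needs a guard such as `if (tid < n)` before touching memory.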
Hey Rod, sorry, I was trying what you suggested along with some other alternatives.
The problem is that I’m getting different results. I don’t know why dpct migrated the kernel code to 3D, but in my kernel, for example, I compute the index like this:
int tid = item_ct1.get_local_id(2) +
item_ct1.get_group(2) * item_ct1.get_local_range().get(2);
When converting from 3D to 1D (sycl::nd_item<1> item_ct1), should I modify that index calculation? One thing I noticed is that my new code (using 1D) runs 50% faster, so maybe there is a problem with the blocks and threads? Thank you
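On the indexing question, a small plain C++ sketch may help. dpct maps the CUDA x dimension to SYCL dimension 2, so the migrated 3D kernel reads index 2. A 1D `nd_item` only has dimension 0, so the same formula has to read index 0 there, and leaving it at 2 would be out of range. `Item3`, `tid_3d` and `tid_1d` are hypothetical stand-ins for the values the kernel gets from `item_ct1`.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ stand-in for the three values the kernel reads from nd_item<3>.
struct Item3 {
    std::size_t local_id[3];
    std::size_t group[3];
    std::size_t local_range[3];
};

// Index as written in the migrated 3D kernel (dimension 2 holds CUDA's x).
inline std::size_t tid_3d(const Item3& it) {
    return it.local_id[2] + it.group[2] * it.local_range[2];
}

// Same formula for a 1D nd_item: the only valid dimension is 0.
inline std::size_t tid_1d(std::size_t local_id0, std::size_t group0,
                          std::size_t local_range0) {
    return local_id0 + group0 * local_range0;
}
```

In SYCL terms the 1D version would be `item_ct1.get_local_id(0) + item_ct1.get_group(0) * item_ct1.get_local_range(0)`, which, assuming no global offset, is the same value as `item_ct1.get_global_id(0)`. If the work-group size and group count match the 3D launch, both forms should give the same `tid`.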
Hi, it might be best to post or link to the original CUDA kernel code and the SYCL code you have written; otherwise we would be guessing a bit. Thanks.
I’m not launching it exactly like that, because in CUDA they are using a function variable, so I had to modify that part. For this problem it is the same, though, because I only need the swSolveShortGpu kernel.
So, my problem is that I want to use the best “group size (threads)” for the current device, but I don’t know what is the correct way to achieve that.
We can’t see anything obviously wrong with your code. In terms of choosing the best work group size, there’s a blog post about this on our website. There is no definitive way to do it and some experimentation is likely to be needed.
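As a starting point for that experimentation, one option is to time the kernel over a handful of candidate sizes and keep the fastest. The sketch below only generates the candidates. `candidate_local_sizes` is a hypothetical helper, and `device_limit` is again assumed to come from `info::device::max_work_group_size`. Doubling from a small starting size is just a common heuristic, not a rule.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Candidate work-group sizes: start at `smallest` and keep doubling
// while the value stays within the device limit.
inline std::vector<std::size_t> candidate_local_sizes(std::size_t smallest,
                                                      std::size_t device_limit) {
    assert(smallest > 0);
    std::vector<std::size_t> sizes;
    for (std::size_t s = smallest; s <= device_limit; s *= 2) {
        sizes.push_back(s);
        // Stop before the next doubling could exceed the limit or overflow.
        if (s > device_limit / 2) break;
    }
    return sizes;
}
```

Each candidate would be used as the local size of the launch, with the kernel timed over several runs so that one-off startup costs do not distort the comparison.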