Write access to a 2D buffer results in an illegal memory access (CUDA_ERROR_ILLEGAL_ADDRESS) for specific buffer sizes n*n.
When running on one of my Nvidia GPUs, the following code works well for n = 1000, 1400, 1800, and 2000, but it crashes for n = 1200 and 1600. Running the same code on a CPU works well for all tested n.

	sycl::queue q;
	const int n = 1000;  // n = 1200;
	std::vector<int> c(n*n);
	sycl::range<2> r(n, n);
	sycl::buffer cBuf(c.data(), r);

	q.submit([&](auto &h) {
		sycl::accessor cAcc(cBuf, h, sycl::write_only);
		h.parallel_for(r, [=](auto index) {
			cAcc[index] = 0; // crash
		});
	});

Hi @christoph.stamm,

Thanks for your report, I’ve confirmed that this happens on the 2024 release of oneAPI. I will try to reproduce it on the master branch as well.

Meanwhile, you can use USM, which shouldn’t have this problem, or you can try using buffers without the (pointer, range) constructor, then initialise the buffers on the device. As far as I can tell the problem is somewhere in the 2D copy code, but I’ve not been able to find out much more than that.

You can also use try/catch blocks and the queue exception handler to get more information when things like this happen in future, though CUDA has a bit of a habit of returning errors from previous launches through the next API call you make, which can confuse the issue a little.

Thank you for the report,

This isn’t a buffer-specific issue; it is a problem with parallel_for when using a range. The problem has been fixed, and parallel_for with a range works with any size in the open-source intel/llvm. This fix will make its way into a release soon, probably in 2024.2.
If you switch to using an nd_range kernel then there is no issue in the current release, so that might be a solution for now, unless you want to use the open source repo that already has the fix.

Thanks a lot for your reply! nd_range works well, and it gives much better performance by specifying a useful local range.

Yes, nd_range should pretty much always be used for GPU programming. Even for a program that you want to be portable across GPUs and CPUs, I suspect it makes sense to use nd_range because it:

  • allows specification of the ideal GPU work-group size (block size in CUDA parlance), which is absolutely key for GPU performance;
  • shouldn’t affect the CPU implementation negatively, irrespective of the work-group size.

I can’t stake my life on the second point, because I’ve never tested this supposition myself and it is possible that I am wrong, but I don’t think the concept of work-group size makes much sense for CPUs generally, and I guess that the implementation effectively ignores it. I think the sub-group maps more naturally onto the number of CPU threads allocated as the “device”, and this is normally set without user specification.