Explicit copy from host buffer to device buffer uses uninitialized data

I realize that the spec has only been clarified recently regarding all the possible combinations of accessors/pointers when it comes to explicit memory operations - so maybe this has simply not been implemented yet. Anyhow, I believe according to SYCL 1.2.1 Rev 6 something like this should be possible:

    #include <cassert>
    #include <cstdio>
    #include <vector>

    #include <CL/sycl.hpp>

    int main(int argc, char* argv[]) {
    	constexpr size_t range = 16;

    	cl::sycl::queue queue{cl::sycl::cpu_selector{}};

    	std::vector<size_t> host_data(range);
    	for(size_t i = 0; i < host_data.size(); ++i) {
    		host_data[i] = i;
    	}

    	cl::sycl::buffer<size_t, 1> host_buf(host_data.data(), range);
    	cl::sycl::buffer<size_t, 1> device_buf(range);

    	auto r_host = host_buf.get_access<cl::sycl::access::mode::read>();
    	queue.submit([&](cl::sycl::handler& cgh) {
    		auto dw_a = device_buf.get_access<cl::sycl::access::mode::discard_write>(cgh);
    		cgh.copy(r_host, dw_a);
    	});

    	auto r_a = device_buf.get_access<cl::sycl::access::mode::read>();

    	for(size_t i = 0; i < range; ++i) {
    		printf("%llu\n", r_a[i]);
    		assert(r_a[i] == i);
    	}

    	return 0;
    }

In short, I’m trying to copy data from a host-initialized buffer into a device buffer using cl::sycl::handler::copy. This means I’m passing a host-accessor for the source, and a device-accessor for the target. I would then expect the same data that host_buf was initialized with would also be in device_buf. What actually happens as of ComputeCpp 1.1.6 however is that apparently uninitialized memory is being copied to the device_buf.

Hi psalz,

Yes, that’s right, so in the recent SYCL 1.2.1 revision, a clarification has been made which explicitly permits the explicit asynchronous copy operations now permits accessors of different element types, dimensionalities, access modes and access targets.

However, the restriction that host accessors must be used outside of a command group and device accessors must be used inside of a command group remains, as host accessors provide immediate blocking access for the host application on construction and releases it on destruction. Perhaps we should make the specification clearer on this.

So the issue in the code example above is that the source accessor in the copy operation is a host accessor which is not permitted so the resulting behaviour is undefined.

It’s also best not to think about buffers in terms of host buffers and device buffers. Buffers in SYCL represent data that can exist on the host and/or on any number of devices, and ther requirements for the data to be available for a given access mode in any given place is specified using accessors. This abstraction is designed to allow implementations the freedom to move data around (or not) to most efficiently meet the requirements of each command group.

For your use case, if you are simply looking to take some data initialized on the host and copy this to a buffer being accessed on the device then a way to achieve this is to use the copy operation which takes a host pointer as the source parameter and the device accessor as the destination parameter.

cl::sycl::buffer<size_t, 1> device_buf(range);

queue.submit([&](cl::sycl::handler& cgh) {
  auto dw_a = device_buf.get_access<cl::sycl::access::mode::discard_write>(cgh);
  cgh.copy(host_data.data(), dw_a);
});

However, if you need already have two buffers and you need to copy data from one to the other you should use the copy operation you have there but with two device accessors, and the SYCL runtime will copy from one to the other regardless of where the latest copy of the data resides.

cl::sycl::buffer<size_t, 1> src_buf(host_data.data(), range);
cl::sycl::buffer<size_t, 1> dest_buf(range);

queue.submit([&](cl::sycl::handler& cgh) {
  auto src_a = src_buf.get_access<cl::sycl::access::mode::read>(cgh);
  auto dest_a = dest_buf.get_access<cl::sycl::access::mode::discard_write>(cgh);
  cgh.copy(src_a , dest_a );
});

I hope this helps,

Gordon

Hello Gordon,

thanks for your response. I see, I did not know that this was in fact undefined behavior. I checked the spec and it seems you’re right, although the wording could certainly be more clear. In section 6.7.6.3 it says that

If an accessor has the access target access::target::global_buffer […] then it is considered a device accessor, and therefore can only be used within a SYCL kernel function and must be associated with a command group.

and

If an accessor has the access target access::target::host_buffer […] then it is considered a host accessor and can only be used on the host.

Now in my mind the cl::sycl::handler::copy function is not a kernel and not really running on either the host or device (but rather both), but since it is described in section 4.8. “Expressing parallelism through kernels” I guess it is consistent within the spec.


Now back to what I’m actually trying to do here. Other APIs such as OpenCL and CUDA provide me with something like a cudaMemcpy2D, which I can use to do strided copies of host memory to device memory. If the above is not allowed, then as far as I can see SYCL does not provide any such facility, and instead always requires host memory to be contiguous.

The reason why I’m referring to “host buffer” and “device buffer” is because in my envisioned use case I would create a temporary SYCL buffer wrapping an existing block of const host data. I could then use an accessor with a smaller range and/or offset to do a strided copy of said host memory into the actual device buffer where I need it. After this, the SYCL buffer on the host could be free’d again right away. If I were using two device accessors however, I would incur the additional cost of allocating the memory on the device for both the source (which on top of that, to my knowledge in any existing SYCL implementation would be the full buffer, not just my accessed subrange) and the target , which is not always possible when working near the limit of available device memory.

So really the only option I have is to first copy my strided data into a contiguous host side staging buffer, and then do a copy using the raw host pointer. Granted, this might actually be faster in many cases, but I can’t benchmark it either…