Memory access from the host

Hi,

I am following this example Optimizing Memory Movement Between Host and Accelerator for optimizing memory access and copy between the host and the device. The link posted explains how to avoid copying the array allocated on the host to the device in case you will just read it. The problem that I have if I use the code locally is that even with const sycl::property_list props = {sycl::property::buffer::use_host_ptr()}; the memory address of the host array is not the same with accessor.

I am declaring it as follows:

const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
sycl::buffer<uchar, 1> image_in_buf(in.data, sycl::range<1>(in.rows * in.cols), props);
sycl::buffer<uchar, 1> image_out_buf(out.data, sycl::range<1>(frm.rows * frm.cols), props);

If I do explicit host memory allocation with int *host_array = sycl::malloc_host<int>(in.rows * in.cols, q); then it is fine.

On my machine, I have both integrated and discrete GPU, but I use the integrated one and still have the issue.

Why is it that with buffers this is not working?

Regards,
Nedim

Hi Nedim,
SYCL specification of use_host_ptr defines only that the host pointer will be reused and no new host allocation will happen. The use_host_ptr property does not specify anything about the device pointer.

The zero-copy CPU to iGPU data transfer is a feature of OpenCL, and as such may be only expected when running SYCL code with the OpenCL backend. Indeed I confirmed your observations where malloc_host address remains the same on the device with OpenCL. However, running with Level Zero I see different addresses.

I believe the zero-copy feature ensures only that physical memory is shared between the CPU and iGPU while the virtual addresses may still differ. Different virtual addresses could map to the same physical memory. I think this is down to the driver implementation. Were you able to verify that the unwanted copy actually happens and impacts your application’s performance?

I personally think the article you linked is a bit misleading in this one sentence about iGPU memory address. Perhaps it could use a bit more explanation, or maybe shouldn’t mention this feature in this context.

Thanks,
Rafal

Hi Rafal,

Thank you for the detailed answer! In that case, the article is misleading since there is no mention of OpenCL backend. I found it odd that even in this book (https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf) use_host_ptr is described in Data access and storage section under SYCL programming interface, which gives then the “hope” that it should work in general.

I did a simple test, converting the color image to grayscale. Compared to OpenCV cvtColor method and OpenCV performs better (10x faster). Even though this might be not the best example to do the comparison it just lets me believe that there exists copy between CPU and GPU.

I assume there is a copy in the end when we copy the result back to CPU. Is there more copying going on?

Hi @nedimhadzic,

as Rafal mentioned, the use_host_ptr really only enforces that the SYCL runtime won’t make its own allocation internally to back the buffer on the host. For example, SYCL might make its own backing allocation to ensure that the size and alignment requirements for this sharing are met, regardless of what the user’s allocation looks like.

If you use device-only buffers, i.e. ones without a host pointer, just a range argument, what sorts of results do you get? You might also be able to use shared USM allocations to get the behaviour you want.

Hi @duncan

Got more understanding now, thanks!

The thing is that I get big difference between OpenCV convert to grayscale (which is done on CPU) and with SYCL. 100 microseconds vs 1300 microseconds. Right now I use buffers since it is more convinient and as I am taking the data from OpenCV structures. Here is the SYCL code below:

    const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
    sycl::buffer<uchar, 1> image_in_buf(frm.data, sycl::range<1>(frm.rows * frm.cols), props);
    sycl::buffer<uchar, 1> image_out_buf(out.data, sycl::range<1>(frm.rows * frm.cols));
  
    size_t frameStep = frm.step;
    size_t channels = frm.channels();
    size_t columns = frm.cols;
    sycl::range<2> num_items{(uint32_t)frm.rows, (uint32_t)frm.cols};

    q.submit([&](sycl::handler &h) {
      sycl::accessor srcPtr(image_in_buf, h, sycl::read_only);
      sycl::accessor dstPtr(image_out_buf, h, sycl::write_only, sycl::no_init);
      h.parallel_for(num_items, [=](sycl::id<2> item)
      {
        int row = item[0];
        int col = item[1];
        int index = frameStep * row + col * channels;
        unsigned char b = srcPtr[index] ;
        unsigned char g = srcPtr[index + 1];
        unsigned char r = srcPtr[index + 2];
        dstPtr[columns * row + col] = (uchar)(0.299 * r + 0.587 * g + 0.144 * b);

      }
    );
  });

I would not expect to be such a big difference. Am I missing something here?

So looking at the sample, the out buffer doesn’t have the property, so at bare minimum there will be one copy from there (internal allocation → user pointer). If you’re interested you can use something like the OpenCL tracing library or L0 tracing tool to see what calls SYCL is making to the underlying hardware, i.e. when it’s making device allocations, when it’s copying and so on. You could also check the size and alignment of the OpenCV allocations to see if they’re suitable for sharing with the device.

Hi,

Just to reply quickly now since my previous post was hidden for some reason (maybe because I posted the logs from the tracing tool). Until it appears hopefully, just to reply that missing property is not an issue. Even with that one, the execution time remains the same.

In short, the tracing tool shows that there are two memory copies, one at the beginning and one at the end (I guess when committing the result back to the host). I then tried with USM memory allocations, one host and one shared, and there is no memory copy, but allocation time is there so overall time execution, there are no benefits.

I can’t really do anything about the hidden comment, I’m afraid. I’d be really interested to see the trace! I might be able to ask someone to check what’s happening.

Here is the trace when using buffers: LINK

And here is the using the USM allocation LINK

I also tested using OpenCL platform and I can see an improvement, there is no copy then, but still OpenCV performs better. I thought that my image is small to see the difference, but then I just multiplied the image 10 times just to simulate big data. OpenCV is still better by 4 times.