Hi,
I am following this example Optimizing Memory Movement Between Host and Accelerator for optimizing memory access and copies between the host and the device. The linked article explains how to avoid copying an array allocated on the host to the device when it will only be read. The problem I have when running the code locally is that even with const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
the memory address of the host array is not the same as the address seen through the accessor.
I am declaring it as follows:
const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
sycl::buffer<uchar, 1> image_in_buf(in.data, sycl::range<1>(in.rows * in.cols), props);
sycl::buffer<uchar, 1> image_out_buf(out.data, sycl::range<1>(frm.rows * frm.cols), props);
If I do an explicit host memory allocation with int *host_array = sycl::malloc_host<int>(in.rows * in.cols, q);
then the addresses do match.
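For reference, this is roughly how I compare the addresses in the buffer case (a simplified sketch, not my full application; the helper name check_addresses and the small result buffer are just for illustration):
#include <sycl/sycl.hpp>
#include <cstdint>
#include <iostream>

void check_addresses(sycl::queue &q, unsigned char *host_ptr, size_t count) {
    const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
    sycl::buffer<unsigned char, 1> buf(host_ptr, sycl::range<1>(count), props);

    // Write the device-side address into a small result buffer so it can be printed from the host.
    sycl::buffer<std::uintptr_t, 1> addr_buf{sycl::range<1>(1)};
    q.submit([&](sycl::handler &h) {
        sycl::accessor acc(buf, h, sycl::read_only);
        sycl::accessor addr(addr_buf, h, sycl::write_only, sycl::no_init);
        h.single_task([=]() {
            addr[0] = reinterpret_cast<std::uintptr_t>(&acc[0]);
        });
    });

    sycl::host_accessor addr_host(addr_buf, sycl::read_only);
    std::cout << "host pointer:        " << static_cast<void *>(host_ptr) << "\n"
              << "device-side pointer: 0x" << std::hex << addr_host[0] << std::dec << "\n";
}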
On my machine I have both an integrated and a discrete GPU; I use the integrated one and still see the issue.
Why does this not work with buffers?
Regards,
Nedim
Hi Nedim,
The SYCL specification of use_host_ptr defines only that the host pointer will be reused and no new host allocation will happen. The use_host_ptr property does not specify anything about the device pointer.
The zero-copy CPU-to-iGPU data transfer is a feature of OpenCL, and as such may only be expected when running SYCL code with the OpenCL backend. Indeed, I confirmed your observation: the malloc_host address remains the same on the device with OpenCL. However, running with Level Zero I see different addresses.
I believe the zero-copy feature ensures only that physical memory is shared between the CPU and iGPU while the virtual addresses may still differ. Different virtual addresses could map to the same physical memory. I think this is down to the driver implementation. Were you able to verify that the unwanted copy actually happens and impacts your application’s performance?
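If you want to repeat the backend comparison yourself, one way (assuming a recent oneAPI DPC++ compiler, where the ONEAPI_DEVICE_SELECTOR environment variable is available) is to force the backend and print which one the queue ended up on:
// Run e.g. with ONEAPI_DEVICE_SELECTOR=opencl:gpu or ONEAPI_DEVICE_SELECTOR=level_zero:gpu
sycl::queue q{sycl::gpu_selector_v};
std::cout << "backend: "
          << (q.get_backend() == sycl::backend::opencl ? "OpenCL" : "Level Zero / other")
          << std::endl;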
I personally think the article you linked is a bit misleading in this one sentence about iGPU memory address. Perhaps it could use a bit more explanation, or maybe shouldn’t mention this feature in this context.
Thanks,
Rafal
Hi Rafal,
Thank you for the detailed answer! In that case the article is misleading, since there is no mention of the OpenCL backend. I also found it odd that even the SYCL 1.2.1 specification (https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf) describes use_host_ptr in the Data access and storage section of the SYCL programming interface, which gives the "hope" that it should work in general.
I did a simple test, converting a color image to grayscale, and compared it against OpenCV's cvtColor method; OpenCV performs better (10x faster). Even though this might not be the best example for a comparison, it leads me to believe that there is a copy between the CPU and GPU.
I assume there is a copy at the end, when the result is copied back to the CPU. Is there more copying going on?
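For context, the comparison I did is roughly this (simplified; convertToGray is my SYCL conversion function, and out_cv / out_sycl are the two destination images):
// Rough timing comparison (simplified); OpenCV runs on the CPU, convertToGray on the GPU via SYCL.
auto t0 = std::chrono::steady_clock::now();
cv::cvtColor(frm, out_cv, cv::COLOR_BGR2GRAY);
auto t1 = std::chrono::steady_clock::now();
convertToGray(q, frm, out_sycl);
auto t2 = std::chrono::steady_clock::now();
std::cout << "OpenCV: "
          << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
          << " us, SYCL: "
          << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()
          << " us\n";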
Hi @nedimhadzic,
as Rafal mentioned, the use_host_ptr property really only enforces that the SYCL runtime won't make its own allocation internally to back the buffer on the host. Without it, SYCL might make its own backing allocation, for example to ensure that the size and alignment requirements for sharing with the device are met, regardless of what the user's allocation looks like.
If you use device-only buffers, i.e. ones without a host pointer, just a range argument, what sorts of results do you get? You might also be able to use shared USM allocations to get the behaviour you want.
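For example, something along these lines (an untested sketch with placeholder names; count would be frm.rows * frm.cols in your case):
// Buffer without a host pointer, just a range: the runtime is free to keep it device-side.
sycl::buffer<unsigned char, 1> device_only_buf{sycl::range<1>(count)};

// Or a shared USM allocation that both host and device can dereference directly.
unsigned char *shared_ptr = sycl::malloc_shared<unsigned char>(count, q);
// ... fill shared_ptr on the host, read/write it directly inside kernels ...
sycl::free(shared_ptr, q);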
Hi @duncan
I have a better understanding now, thanks!
The thing is that I see a big difference between the OpenCV convert-to-grayscale (which is done on the CPU) and SYCL: 100 microseconds vs 1300 microseconds. Right now I use buffers, since it is more convenient as I am taking the data from OpenCV structures. Here is the SYCL code:
const sycl::property_list props = {sycl::property::buffer::use_host_ptr()};
sycl::buffer<uchar, 1> image_in_buf(frm.data, sycl::range<1>(frm.rows * frm.cols), props);
sycl::buffer<uchar, 1> image_out_buf(out.data, sycl::range<1>(frm.rows * frm.cols));

size_t frameStep = frm.step;
size_t channels = frm.channels();
size_t columns = frm.cols;
sycl::range<2> num_items{(uint32_t)frm.rows, (uint32_t)frm.cols};

q.submit([&](sycl::handler &h) {
    sycl::accessor srcPtr(image_in_buf, h, sycl::read_only);
    sycl::accessor dstPtr(image_out_buf, h, sycl::write_only, sycl::no_init);
    h.parallel_for(num_items, [=](sycl::id<2> item) {
        int row = item[0];
        int col = item[1];
        int index = frameStep * row + col * channels;
        unsigned char b = srcPtr[index];
        unsigned char g = srcPtr[index + 1];
        unsigned char r = srcPtr[index + 2];
        dstPtr[columns * row + col] = (uchar)(0.299 * r + 0.587 * g + 0.114 * b);
    });
});
I would not expect such a big difference. Am I missing something here?
Looking at the sample, the out buffer doesn't have the property, so at a bare minimum there will be one copy from there (internal allocation → user pointer). If you're interested, you can use something like the OpenCL tracing library or the L0 tracing tool to see what calls SYCL is making to the underlying hardware, i.e. when it's making device allocations, when it's copying, and so on. You could also check the size and alignment of the OpenCV allocations to see if they're suitable for sharing with the device.
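Adapting your snippet, the change would be something like:
sycl::buffer<uchar, 1> image_out_buf(out.data, sycl::range<1>(frm.rows * frm.cols), props);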
I added the property, but it still does not help.
So I profiled it with the L0 tracing tool you suggested. Here is the log:
>>>> [1532590306] zeCommandListAppendMemoryCopy: hCommandList = 0x5cc3ad0 dstptr = 0xffffc001ffcf0000 srcptr = 0x42f8e00 size = 921600 hSignalEvent = 0x51b9f70 numWaitEvents = 0 phWaitEvents = 0
<<<< [1532830665] zeCommandListAppendMemoryCopy [234405 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1532838873] zeCommandListClose: hCommandList = 0x5cc3ad0
<<<< [1532841665] zeCommandListClose [1136 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1532845229] zeCommandQueueExecuteCommandLists: hCommandQueue = 0x26644c0 numCommandLists = 1 phCommandLists = 0x7ffdb3ca4fb0 (hCommandLists = 0x5cc3ad0) hFence = 0x25c2aa0
<<<< [1532880073] zeCommandQueueExecuteCommandLists [32759 ns] hCommandLists = 0x5cc3ad0 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1532889078] zeEventHostSynchronize: hEvent = 0x51b9f70 timeout = 18446744073709551615
<<<< [1533059555] zeEventHostSynchronize [167515 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533064701] zeFenceQueryStatus: hFence = 0x25c2aa0
<<<< [1533067147] zeFenceQueryStatus [957 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533071340] zeFenceReset: hFence = 0x25c2aa0
<<<< [1533073088] zeFenceReset [175 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533076450] zeCommandListReset: hCommandList = 0x5cc3ad0
<<<< [1533086136] zeCommandListReset [8192 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533095137] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 0 argSize = 8 pArgValue = 0x56d43a0
<<<< [1533097673] zeKernelSetArgumentValue [785 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533100666] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 1 argSize = 8 pArgValue = 0x56d43a8
<<<< [1533102159] zeKernelSetArgumentValue [32 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533104866] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 3 argSize = 8 pArgValue = 0
<<<< [1533107107] zeKernelSetArgumentValue [73 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533109204] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 5 argSize = 8 pArgValue = 0
<<<< [1533110861] zeKernelSetArgumentValue [36 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533112520] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 6 argSize = 8 pArgValue = 0x56d43f0
<<<< [1533113912] zeKernelSetArgumentValue [34 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533118305] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 2 argSize = 8 pArgValue = 0x1f477d8
<<<< [1533120771] zeKernelSetArgumentValue [993 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533123919] zeKernelSetArgumentValue: hKernel = 0x21cd460 argIndex = 4 argSize = 8 pArgValue = 0x4ce0268
<<<< [1533125370] zeKernelSetArgumentValue [74 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533128039] zeKernelSuggestGroupSize: hKernel = 0x21cd460 globalSizeX = 1280 globalSizeY = 720 globalSizeZ = 1 groupSizeX = 0x7ffdb3ca4e20 groupSizeY = 0x7ffdb3ca4e24 groupSizeZ = 0x7ffdb3ca4e28
<<<< [1533131049] zeKernelSuggestGroupSize [851 ns] groupSizeX = 256 groupSizeY = 2 groupSizeZ = 1 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533134691] zeKernelSetGroupSize: hKernel = 0x21cd460 groupSizeX = 256 groupSizeY = 2 groupSizeZ = 1
<<<< [1533136863] zeKernelSetGroupSize [464 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533140891] zeCommandListAppendLaunchKernel: hCommandList = 0x4ca40b0 hKernel = 0x21cd460 (_ZTSZZL13convertToGrayRN4sycl3_V15queueERN2cv3MatES5_ENKUlRNS0_7handlerEE_clES7_EUlNS0_2idILi2EEEE_) pLaunchFuncArgs = 0x7ffdb3ca4da0 {5, 360, 1} hSignalEvent = 0x51bb350 numWaitEvents = 0 phWaitEvents = 0
<<<< [1533147380] zeCommandListAppendLaunchKernel [2591 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533150374] zeCommandListClose: hCommandList = 0x4ca40b0
<<<< [1533151932] zeCommandListClose [200 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533154306] zeCommandQueueExecuteCommandLists: hCommandQueue = 0x26644c0 numCommandLists = 1 phCommandLists = 0x7ffdb3ca4c20 (hCommandLists = 0x4ca40b0) hFence = 0x28e6580
<<<< [1533170965] zeCommandQueueExecuteCommandLists [14894 ns] hCommandLists = 0x4ca40b0 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533183751] zeEventQueryStatus: hEvent = 0x51bb350
<<<< [1533185845] zeEventQueryStatus [588 ns] -> ZE_RESULT_NOT_READY(0x1)
>>>> [1533189524] zeCommandListAppendWaitOnEvents: hCommandList = 0x5cc3ad0 numEvents = 1 phEvents = 0x4cf23a0 (hEvents = 0x51bb350)
<<<< [1533192215] zeCommandListAppendWaitOnEvents [719 ns] hEvents = 0x51bb350 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533195225] zeCommandListAppendMemoryCopy: hCommandList = 0x5cc3ad0 dstptr = 0x8b20c80 srcptr = 0xffffc001ffbf0000 size = 921600 hSignalEvent = 0x2262cb0 numWaitEvents = 0 phWaitEvents = 0
<<<< [1533281798] zeCommandListAppendMemoryCopy [84193 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533285623] zeCommandListClose: hCommandList = 0x5cc3ad0
<<<< [1533287111] zeCommandListClose [200 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533289324] zeCommandQueueExecuteCommandLists: hCommandQueue = 0x26644c0 numCommandLists = 1 phCommandLists = 0x7ffdb3ca6390 (hCommandLists = 0x5cc3ad0) hFence = 0x25c2aa0
<<<< [1533306237] zeCommandQueueExecuteCommandLists [14966 ns] hCommandLists = 0x5cc3ad0 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533308845] zeEventHostSynchronize: hEvent = 0x2262cb0 timeout = 18446744073709551615
<<<< [1533777299] zeEventHostSynchronize [466497 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533781185] zeFenceQueryStatus: hFence = 0x25c2aa0
<<<< [1533782808] zeFenceQueryStatus [227 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533785258] zeFenceReset: hFence = 0x25c2aa0
<<<< [1533786108] zeFenceReset [54 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533788729] zeCommandListReset: hCommandList = 0x5cc3ad0
<<<< [1533793401] zeCommandListReset [3394 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533796258] zeFenceQueryStatus: hFence = 0x28e6580
<<<< [1533797626] zeFenceQueryStatus [106 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533800462] zeFenceReset: hFence = 0x28e6580
<<<< [1533801980] zeFenceReset [17 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533804014] zeCommandListReset: hCommandList = 0x4ca40b0
<<<< [1533805862] zeCommandListReset [309 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533813741] zeMemGetAllocProperties: hContext = 0x16446e0 ptr = 0xffffc001ffbf0000 pMemAllocProperties = 0x7ffdb3ca6160 phDevice = 0x7ffdb3ca6190 (hDevice = 0x52f4960)
<<<< [1533816279] zeMemGetAllocProperties [663 ns] hDevice = 0x5cc000 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533824273] zeEventHostReset: hEvent = 0x2262cb0
<<<< [1533826656] zeEventHostReset [773 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533834133] zeMemGetAllocProperties: hContext = 0x16446e0 ptr = 0xffffc001ffcf0000 pMemAllocProperties = 0x7ffdb3ca6160 phDevice = 0x7ffdb3ca6190 (hDevice = 0x161a7b0)
<<<< [1533835874] zeMemGetAllocProperties [76 ns] hDevice = 0x5cc000 -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533840952] zeEventHostReset: hEvent = 0x51bb350
<<<< [1533842216] zeEventHostReset [37 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [1533846102] zeEventHostReset: hEvent = 0x51b9f70
<<<< [1533847604] zeEventHostReset [148 ns] -> ZE_RESULT_SUCCESS(0x0)
I see a memory copy at the beginning and at the end.
I then tried using malloc_host and malloc_shared, which of course meant I had to copy the data values into the host-allocated pointer first. Here is the trace:
>>>> [2378636928] zeMemAllocHost: hContext = 0x184bca0 host_desc = 0x7ffd2025cc50 {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC(0x16) 0 0} size = 1048576 alignment = 8 pptr = 0x7ffd2025ccc0 (ptr = 0)
<<<< [2378949605] zeMemAllocHost [300494 ns] ptr = 0x7f6f0fe0e000 -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379441600] zeMemAllocShared: hContext = 0x184bca0 device_desc = 0x7ffd2025cd10 {ZE_STRUCTURE_TYPE_DEVICE_MEM_ALLOC_DESC(0x15) 0 0 0} host_desc = 0x7ffd2025cd30 {ZE_STRUCTURE_TYPE_HOST_MEM_ALLOC_DESC(0x16) 0 0} size = 921600 alignment = 8 hDevice = 0x7d2000 pptr = 0x7ffd2025cda0 (ptr = 0)
<<<< [2379503250] zeMemAllocShared [50956 ns] ptr = 0x7f6f0fd1d000 -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379534457] zeKernelSetArgumentValue: hKernel = 0x5503500 argIndex = 0 argSize = 8 pArgValue = 0x5e8ddd0
<<<< [2379539463] zeKernelSetArgumentValue [1867 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379543820] zeKernelSetArgumentValue: hKernel = 0x5503500 argIndex = 1 argSize = 8 pArgValue = 0x5e8ddd8
<<<< [2379545452] zeKernelSetArgumentValue [47 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379548950] zeKernelSetArgumentValue: hKernel = 0x5503500 argIndex = 2 argSize = 8 pArgValue = 0x5e8dde0
<<<< [2379558611] zeKernelSetArgumentValue [8188 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379560824] zeKernelSetArgumentValue: hKernel = 0x5503500 argIndex = 3 argSize = 8 pArgValue = 0x5e8dde8
<<<< [2379562749] zeKernelSetArgumentValue [356 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379564631] zeKernelSetArgumentValue: hKernel = 0x5503500 argIndex = 4 argSize = 8 pArgValue = 0x5e8ddf0
<<<< [2379566271] zeKernelSetArgumentValue [147 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379573109] zeKernelSuggestGroupSize: hKernel = 0x5503500 globalSizeX = 1280 globalSizeY = 720 globalSizeZ = 1 groupSizeX = 0x7ffd2025be80 groupSizeY = 0x7ffd2025be84 groupSizeZ = 0x7ffd2025be88
<<<< [2379577564] zeKernelSuggestGroupSize [2213 ns] groupSizeX = 256 groupSizeY = 2 groupSizeZ = 1 -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379581479] zeKernelSetGroupSize: hKernel = 0x5503500 groupSizeX = 256 groupSizeY = 2 groupSizeZ = 1
<<<< [2379585361] zeKernelSetGroupSize [1582 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379591514] zeFenceQueryStatus: hFence = 0x5ebb8d0
<<<< [2379595209] zeFenceQueryStatus [1892 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379599193] zeFenceReset: hFence = 0x5ebb8d0
<<<< [2379601109] zeFenceReset [475 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379604864] zeCommandListReset: hCommandList = 0x5d1cb40
<<<< [2379612672] zeCommandListReset [6331 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379619511] zeEventHostReset: hEvent = 0x54d9160
<<<< [2379622381] zeEventHostReset [1457 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379628528] zeCommandListAppendLaunchKernel: hCommandList = 0x5d1cb40 hKernel = 0x5503500 (_ZTSZZL13convertToGrayRN4sycl3_V15queueERN2cv3MatES5_ENKUlRNS0_7handlerEE_clES7_EUlNS0_2idILi2EEEE_) pLaunchFuncArgs = 0x7ffd2025be00 {5, 360, 1} hSignalEvent = 0x54d9160 numWaitEvents = 0 phWaitEvents = 0
<<<< [2379643219] zeCommandListAppendLaunchKernel [10361 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379647501] zeCommandListClose: hCommandList = 0x5d1cb40
<<<< [2379649793] zeCommandListClose [653 ns] -> ZE_RESULT_SUCCESS(0x0)
>>>> [2379652963] zeCommandQueueExecuteCommandLists: hCommandQueue = 0x5523500 numCommandLists = 1 phCommandLists = 0x7ffd2025bc80 (hCommandLists = 0x5d1cb40) hFence = 0x5ebb8d0
<<<< [2380138248] zeCommandQueueExecuteCommandLists [483306 ns] hCommandLists = 0x5d1cb40 -> ZE_RESULT_SUCCESS(0x0)
No memory copy now. However, the overall time is still about the same as with buffers, since the allocation of the host and shared memory now takes time instead.
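For reference, the USM variant is set up roughly like this (a simplified sketch; the kernel body is the same as in the buffer version above, and error handling is omitted):
size_t frameStep = frm.step;
size_t channels = frm.channels();
size_t columns = frm.cols;

// Host allocation for the input frame, shared allocation for the result.
unsigned char *in_ptr = sycl::malloc_host<unsigned char>(frm.rows * frm.step, q);
unsigned char *out_ptr = sycl::malloc_shared<unsigned char>(frm.rows * frm.cols, q);
std::memcpy(in_ptr, frm.data, frm.rows * frm.step);   // the extra copy I mentioned

q.parallel_for(sycl::range<2>(frm.rows, frm.cols), [=](sycl::id<2> item) {
    int row = item[0];
    int col = item[1];
    int index = frameStep * row + col * channels;
    unsigned char b = in_ptr[index];
    unsigned char g = in_ptr[index + 1];
    unsigned char r = in_ptr[index + 2];
    out_ptr[columns * row + col] = (uchar)(0.299 * r + 0.587 * g + 0.114 * b);
}).wait();

std::memcpy(out.data, out_ptr, frm.rows * frm.cols);  // commit the result back into the cv::Mat
sycl::free(in_ptr, q);
sycl::free(out_ptr, q);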
So, assuming I do not use any other library (like OpenCV) and instead read the image data directly into memory allocated with SYCL, it seems there would be no performance advantage unless I use the OpenCL backend?
Hi,
Just a quick reply for now, since my previous post was hidden for some reason (maybe because I posted the logs from the tracing tool). Until it hopefully appears: the missing property is not the issue; even with it, the execution time remains the same.
In short, the tracing tool shows two memory copies, one at the beginning and one at the end (I guess when committing the result back to the host). I then tried USM allocations, one host and one shared; there is no memory copy then, but the allocation time is there instead, so there is no overall benefit in execution time.
I can’t really do anything about the hidden comment, I’m afraid. I’d be really interested to see the trace! I might be able to ask someone to check what’s happening.
Here is the trace when using buffers: LINK
And here is the one using the USM allocations: LINK
I also tested with the OpenCL platform and I do see an improvement: there is no copy then, but OpenCV still performs better. I thought my image might be too small to show a difference, so I multiplied the image size by 10 just to simulate larger data. OpenCV is still about 4 times faster.