When does memory transfer happen in SYCL

atharva362 · 6 January 2022 11:18

Hello All

A basic question from my side. I was not able to get it clarified after reading from various online sources.

Suppose I have code like -

void first_differential_x(uchar* input, uchar* output, size_t M, size_t N, queue myQueue, size_t TILE_SIZE){

    buffer<uchar, 2> input_buffer{input, {M, N}};
    buffer<uchar, 2> output_buffer{output, {M, N}};

    auto local_range = range<2>(TILE_SIZE, TILE_SIZE);
    auto global_range = range<2>(M / TILE_SIZE + 1 , N / TILE_SIZE + 1) * local_range;

    auto launchParams = nd_range<2>(global_range, local_range);

    myQueue.submit([&input_buffer, &output_buffer, M, N, launchParams](handler& cgh){

        auto accessor_input = input_buffer.get_access<access::mode::read>(cgh);
        auto accessor_output = output_buffer.get_access<access::mode::write>(cgh);

        cgh.parallel_for<class dx>(launchParams, [accessor_input, accessor_output, M, N](nd_item<2> ndItem){
            auto x = ndItem.get_global_id(0);
            auto y = ndItem.get_global_id(1);

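            // The global range is rounded up past M x N and the stencil reads y + 1,
            // so skip work-items that would access memory out of bounds.
            if (x < M && y + 1 < N)
                accessor_output[x][y] = accessor_input[x][y] - accessor_input[x][y + 1];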

        });
    });
}

When does the memory transfer happen: when I create the buffers, or when .submit is called?

TIA

rod · 7 January 2022 09:44

Hi,
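Thanks for your question. The short answer is that you can't know exactly when the memory transfer happens, because no specific line of your code invokes it. Creating a buffer typically does not copy anything to the device by itself, and .submit only hands the command group to the runtime. The command group handler deals with scheduling: the accessors you request tell the SYCL runtime what data the kernel needs, and the runtime decides the best time to do the transfer.
The main things you can rely on are that the data (a) is on the device when a kernel starts execution and (b) is on the host when you need it there, for example through a host accessor or after the buffer is destroyed.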

atharva362 · 7 January 2022 12:03

Hello Rod

Thank you for your reply. My intention is to profile the run time (wall clock) of the kernel without taking into account the memory transfer from host to device and vice-versa. Any idea how I could do that?

rod · 7 January 2022 12:48

Hello,
Ok that makes sense.
There are a couple of options.
The easiest would be to use the profiling tools for your processor (usually supplied by the vendor), as they are likely to report the kernel time separately from the transfers.
If you want to do it manually, however, you can do explicit copies and time the kernel on its own.
I think this post will give you some hints, I’ll try to find an example for you.

rod · 7 January 2022 12:55

I think I found an example of how to do the copies before running the execution kernel.

github.com/zjin-lcf/HeCBench/blob/1cb6f2e5a2d142055fd6c51b15452d6e4d26c066/triad-sycl/triad2.cpp#L118

    // start submitting blocks of data of size elemsInBlock
    // overlap the computation of one block with the data
    // download for the next block and the results upload for
    // the previous block
    int crtIdx = 0;
    size_t globalWorkSize = elemsInBlock;

    int TH = Timer::Start();

    p.submit([&] (handler &cgh) {
        auto d_memA0_acc = d_memA0.get_access<sycl_write>(cgh, range<1>(elemsInBlock));
        cgh.copy(h_mem, d_memA0_acc);
    });
    p.submit([&] (handler &cgh) {
        auto d_memB0_acc = d_memB0.get_access<sycl_write>(cgh, range<1>(elemsInBlock));
        cgh.copy(h_mem, d_memB0_acc);
    });
    p.submit([&] (handler &cgh) {
        auto A = d_memA0.get_access<sycl_read>(cgh, range<1>(elemsInBlock));
        auto B = d_memB0.get_access<sycl_read>(cgh, range<1>(elemsInBlock));

This one copies the data in two separate commands, then runs the kernel in a third. I wouldn’t recommend this in general for development, as the scheduler knows the best time to do the transfers, but if you are specifically trying to get this information then it would be the simplest way.

atharva362 · 8 January 2022 07:57

Thank you so much @rod
