Unintentional copying

Hello,

I’ve written some code, shown below, that swaps two rows of a 2D buffer:

static void swap_row(const size_t matrix_size, sycl::queue &q, sycl::buffer<Floating, 2> &data_buf,
                     sycl::buffer<size_t, 1> &swap, const size_t diagonal) {
    q.submit([&](sycl::handler &h) {
        auto acc_data = data_buf.template get_access<sycl::access::mode::read_write>(h);
        // swap[0] holds the index of the row to exchange with row `diagonal`
        auto acc_swap = swap.template get_access<sycl::access::mode::read>(h, sycl::range<1>{1});
        h.parallel_for<class RowSwap<Floating, comps>>(sycl::range<1>{matrix_size + 1}, [=](sycl::id<1> id) {
            // one work-item per column: exchange the two rows' elements in that column
            auto global_id = id[0];
            auto tmp = acc_data[diagonal][global_id];
            acc_data[diagonal][global_id] = acc_data[acc_swap[0]][global_id];
            acc_data[acc_swap[0]][global_id] = tmp;
        });
    });
}

This is called matrix_size - 1 times; diagonal starts at 0 and increments after each call. The index of the row to swap with is recalculated before each call and written to swap[0].
I’ve got a couple of questions about how diagonal and matrix_size get into the kernel.

  1. Are the variables diagonal and matrix_size copied into the kernel with each execution?
  2. If 1 is true, could that be causing lots of pauses in my program while it waits for all the previous jobs to finish so the runtime can copy those values to the device?
  3. Could it possibly be faster to store diagonal and matrix_size in buffers and then access them that way?

Thanks,
Finlay

Hello Finlay,

1: Yes, diagonal and matrix_size behave as if they were copied at each kernel submission, because the lambda captures everything by copy ([=]). Even though the parameters are marked const, that doesn’t mean the values can’t change between calls, so they can’t be cached on the device and have to be copied each time. If you were using a header-based, CPU-only implementation, the compiler might be able to avoid the copy.
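
As a minimal illustration (not your code; the kernel name CaptureDemo, the out buffer and the header path are just for this example), the value the kernel sees is the one captured when the command group is submitted, exactly like an ordinary kernel argument:

#include <sycl/sycl.hpp> // or <CL/sycl.hpp>, depending on your implementation

int main() {
    sycl::queue q;
    size_t diagonal = 0;
    sycl::buffer<size_t, 1> out{sycl::range<1>{1}};
    q.submit([&](sycl::handler &h) {
        auto acc = out.get_access<sycl::access::mode::write>(h);
        // diagonal is captured by value right here and travels with the kernel
        h.single_task<class CaptureDemo>([=]() { acc[0] = diagonal; });
    });
    diagonal = 42; // has no effect on the kernel submitted above
    return 0;
}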

2: I don’t think copying these variables is an issue in itself, given that kernel submissions are already pretty expensive (especially on GPUs). Furthermore, you have a read_write accessor, which will serialise your kernel submissions. Finally, acc_swap will systematically have to copy data to the device too.

3: Using buffers won’t help either; it would be counterproductive, adding more machinery just to copy 16 bytes.

I would suggest two solutions:
1: You are already copying the whole matrix to the device with acc_data, but each launch only touches two rows of it. Maybe you could try to perform all the swaps in a single launch? Knowing the whole swap buffer, you could do the permutations column-wise: you’d end up with a single kernel submission and a for loop over the swaps inside the kernel (see the first sketch after this list).
2: If you want to avoid copies to the device, you could look into Unified Shared Memory, specifically malloc_host, if it is supported. The memory allocated resides on the host (your RAM) but is accessible via remote memory accesses from, say, a GPU. Data is only transferred when the device requests it and doesn’t use the device’s memory (see the second sketch below). That said, it’s probably not worth doing just for 16 bytes.
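
To make the first idea more concrete, here is a rough sketch in the same setting as your swap_row (Floating mirrors your template parameter, the kernel name AllRowSwaps and the swaps buffer are just for the example, and it assumes the whole permutation is already known up front, with swaps[n] being the row to exchange with row n):

static void swap_all_rows(const size_t matrix_size, sycl::queue &q,
                          sycl::buffer<Floating, 2> &data_buf, sycl::buffer<size_t, 1> &swaps) {
    q.submit([&](sycl::handler &h) {
        auto acc_data = data_buf.template get_access<sycl::access::mode::read_write>(h);
        auto acc_swaps = swaps.template get_access<sycl::access::mode::read>(h);
        h.parallel_for<class AllRowSwaps>(sycl::range<1>{matrix_size + 1}, [=](sycl::id<1> id) {
            const auto col = id[0];
            // each work-item owns one column and applies every swap to it, in order
            for (size_t n = 0; n + 1 < matrix_size; ++n) {
                const auto other = acc_swaps[n];
                auto tmp = acc_data[n][col];
                acc_data[n][col] = acc_data[other][col];
                acc_data[other][col] = tmp;
            }
        });
    });
}

And a rough sketch of the second idea, assuming your implementation and device support host USM allocations (sycl::malloc_host and sycl::free are SYCL 2020 USM calls; the function and kernel names are again just for the example):

static void swap_row_usm(const size_t matrix_size, sycl::queue &q,
                         sycl::buffer<Floating, 2> &data_buf, size_t *swap_idx, const size_t diagonal) {
    // swap_idx comes from sycl::malloc_host<size_t>(1, q); the host writes the
    // pivot row index straight into *swap_idx before each call, no extra buffer
    q.submit([&](sycl::handler &h) {
        auto acc_data = data_buf.template get_access<sycl::access::mode::read_write>(h);
        h.parallel_for<class RowSwapUSM>(sycl::range<1>{matrix_size + 1}, [=](sycl::id<1> id) {
            const auto col = id[0];
            const auto other = *swap_idx; // fetched from host memory when the device touches it
            auto tmp = acc_data[diagonal][col];
            acc_data[diagonal][col] = acc_data[other][col];
            acc_data[other][col] = tmp;
        });
    });
    // remember to sycl::free(swap_idx, q) once every submission using it has finished
}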

Hi Michel,
Thanks for the information. I decided to try loading diagonal and matrix_size into buffers and, as you predicted, it was slower than before.
As you mention in point 2, I’m pretty sure most of my program’s time is spent on kernel submissions, so I will look into minimizing the number of submissions.

Thank you for the suggestions.
Unfortunately I don’t know all the permutations up front. The whole program loops along the diagonal (n from 0 to matrix_size - 1) of the matrix, picks the greatest-magnitude value below diagonal n, swaps the rows so that the high-magnitude value is now in row n, then subtracts some fraction of the new row n from the rest of the matrix (Gaussian elimination).
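
Roughly, the host side looks like this (simplified; find_pivot_row and eliminate_below are stand-ins for my other kernel submissions):

for (size_t n = 0; n + 1 < matrix_size; ++n) {
    find_pivot_row(n, q, data_buf, swap);         // writes the chosen row index into swap[0]
    swap_row(matrix_size, q, data_buf, swap, n);  // the kernel from my first post
    eliminate_below(n, q, data_buf, matrix_size); // subtract a fraction of row n from the rest
}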

Hello,
I’m not really sure what kind of hardware you’re targeting. If you’re doing Gaussian elimination, it will be a little bit hard (though not impossible) to reduce the number of kernel submissions. It also depends on whether you’re optimising for Latency or Throughput.

For Latency, you could explore a tiling algorithm, which would divide the work more effectively across your whole device, but it will be harder to implement.

For Throughput, I would move to an nd_range kernel: each work-group “processes a matrix” and implements the whole Gaussian elimination algorithm, using work-group-wide barriers to keep the work-items in sync. You could also use local memory to “broadcast” the pivot across the work-group. That way you could execute the whole algorithm in a single kernel submission, roughly like the sketch below.
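
A very rough sketch of that shape, with plenty of assumptions (one work-group per matrix, local size equal to matrix_size + 1 columns, matrices packed into a 3D buffer, Floating mirroring your template parameter, the kernel name BatchedGauss just for the example, and the older nd_item::barrier form, newer SYCL prefers sycl::group_barrier):

static void batched_gaussian_elimination(const size_t num_matrices, const size_t matrix_size,
                                         sycl::queue &q, sycl::buffer<Floating, 3> &data_buf) {
    const size_t cols = matrix_size + 1; // matrix plus the extra column, as in your swap_row
    q.submit([&](sycl::handler &h) {
        auto data = data_buf.template get_access<sycl::access::mode::read_write>(h);
        // local memory used to broadcast the pivot row index and a snapshot of the pivot column
        sycl::accessor<size_t, 1, sycl::access::mode::read_write, sycl::access::target::local>
            pivot_row(sycl::range<1>{1}, h);
        sycl::accessor<Floating, 1, sycl::access::mode::read_write, sycl::access::target::local>
            pivot_col(sycl::range<1>{matrix_size}, h);
        h.parallel_for<class BatchedGauss>(
            sycl::nd_range<1>{sycl::range<1>{num_matrices * cols}, sycl::range<1>{cols}},
            [=](sycl::nd_item<1> it) {
                const size_t mat = it.get_group(0);    // one work-group per matrix
                const size_t col = it.get_local_id(0); // one work-item per column
                for (size_t n = 0; n + 1 < matrix_size; ++n) {
                    // 1. one work-item picks the largest-magnitude pivot on or below the diagonal
                    if (col == 0) {
                        size_t best = n;
                        for (size_t r = n + 1; r < matrix_size; ++r)
                            if (sycl::fabs(data[mat][r][n]) > sycl::fabs(data[mat][best][n]))
                                best = r;
                        pivot_row[0] = best;
                    }
                    it.barrier(sycl::access::fence_space::global_and_local);
                    // 2. every work-item swaps its own column of the two rows
                    const size_t p = pivot_row[0];
                    if (p != n) {
                        auto tmp = data[mat][n][col];
                        data[mat][n][col] = data[mat][p][col];
                        data[mat][p][col] = tmp;
                    }
                    it.barrier(sycl::access::fence_space::global_and_local);
                    // 3. snapshot the pivot column so the factors don't race with the updates below
                    if (col == 0)
                        for (size_t r = n + 1; r < matrix_size; ++r)
                            pivot_col[r] = data[mat][r][n];
                    it.barrier(sycl::access::fence_space::global_and_local);
                    // 4. eliminate below the diagonal, each work-item updating only its own column
                    for (size_t r = n + 1; r < matrix_size; ++r)
                        data[mat][r][col] -= (pivot_col[r] / data[mat][n][n]) * data[mat][n][col];
                    it.barrier(sycl::access::fence_space::global_and_local);
                }
            });
    });
}

With something like this, the pivot search, the row swap and the elimination all happen inside one submission per batch of matrices.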

The issue is that this technique won’t scale across a GPU for a single matrix: each matrix is only processed by a single streaming multiprocessor. To keep them all busy, you could submit a lot of matrices, each in its own work-group (hence the name, throughput). If you’re running on a CPU, you can run a single work-group across all your CPU threads, which means no transistors are left unused.


Future versions of SYCL will hopefully provide device-wide barriers, which would eventually allow you to run the computation in a single kernel submission, on a single matrix, across all the streaming multiprocessors of a GPU. But this requires forward-progress guarantees for work-groups that we cannot provide right now.

Hi Michel,
I don’t think I’ll have time to implement it in this project, but your idea for throughput is really interesting and could apply quite well to the task I’m doing. I’m simulating circuits (in quite a basic way) and components often have tolerances, so there would likely need to be some kind of exploration of that space.
Thanks for your help,
Finlay