SYCL problem with device pointer

Hi,

This is why you need to keep your buffers alive.
I wrote a simple example here which showcases the importance of scope.

Here is what it does:

  • allocate host memory : an array filled with values
  • initialize an empty buffer that will store the data on device
    (this one is created in the most outer SYCL scope)
  • The initial launch - >copies data to device memory
    (using a temp buffer to copy our host array to device mem)
  • Your simulation loop -> modifies the data that’s already in device memory
    (using device memory : no host to device mem copies)
  • To prove that data is on device (assuming the buffer is still in the program/SYCL scope):
    uses the modified data that’s already in device memory to write to another array
    (using a temp buffer to copy another host array to device mem)
    Note: Just to note for your case you don’t need another_host_data. You can move up (in the more outer scope) your host_data and write the results

Once (explicitly) copied onto device the memory is there until the buffer exits scope, so you need to design your system with this in mind (read up on C++ RAII and SYCL Memory Model)

I will also provide you the sample source code:

namespace kernels {
class copy_dev_mem_kernel {};
class modify_dev_mem_kernel {};
class use_dev_mem_kernel {};
}  // namespace kernels

static constexpr size_t num_elements = 32;

int main() {
  // your host data
  std::vector<int> host_data(num_elements);
  std::fill(host_data.begin(), host_data.end(), 10);
  // another host data to test if the original data is copied to device mem
  std::vector<int> another_host_data(num_elements);

  // begin SYCL scope
  try {
    sycl::queue queue(sycl::default_selector{});

    sycl::buffer<int, 1> dev_data_buf((sycl::range<1>(num_elements)));

    // copies data to device memory
    {
      sycl::buffer<int, 1> data_buf(host_data.data(),
                                    sycl::range<1>(num_elements));
                                    
      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto data_acc = data_buf.get_access<sycl::access::mode::read>(cgh);
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto copy_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          dev_data_acc[id] = data_acc[id];
        };

        cgh.parallel_for<kernels::copy_dev_mem_kernel>(
            index_space, copy_dev_mem_kernel_func);
      });
    }

    // modifies the data that's already in device memory
    {
      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::read_write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto modify_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          dev_data_acc[id] = dev_data_acc[id] * 2;
        };

        cgh.parallel_for<kernels::modify_dev_mem_kernel>(
            index_space, modify_dev_mem_kernel_func);
      });
    }

    // uses the modified data that's already in device memory
    {
      sycl::buffer<int, 1> another_data_buf(another_host_data.data(),
                                            sycl::range<1>(num_elements));

      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::read>(cgh);
        auto another_data_acc =
            another_data_buf.get_access<sycl::access::mode::write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto use_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          another_data_acc[id] = dev_data_acc[id];
        };

        cgh.parallel_for<kernels::use_dev_mem_kernel>(index_space,
                                                      use_dev_mem_kernel_func);
      });
    }
  } catch (const sycl::exception& e) {
    std::cerr << "Caught (synchronous) SYCL error:\n" << e.what() << std::endl;
  }
  // exit SYCL scope

  // print the values in another_host_data that were copied over from device mem
  auto print_data = [another_host_data]() -> void {
    std::cout << "Values: ";
    for (auto value : another_host_data) {
      std::cout << value << " ";
    }
    std::cout << std::endl;
  };
  print_data();

  return 0;
}

And output:
Values were initialized as 10 on host -> copied to device : still 10 -> modified on device : 20

Values: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 

This does not address your exact program (architecture design, etc.) but shows how to work with buffers in order to preserve device memory rather than doing host to device mem copies which as you have noted impact performance quite a lot considering larger memory transfers.

-Georgi