SYCL problem with device pointer

Hi,

Recently i have a problem about SYCL when trying to copy a large size of data explicitly from host to device and storing this device pointer for later usage.

The workflow is as follows:

  1. class Data->host_A copied explicitly from host to device, retrieve device pointer and store it in Data->device_A

  2. class Simulation launch SYCL queue and in the kernel function Data->device_A will be used

I wonder how can I use SYCL classes achieving this functionality? class Buffer or class PointerMapper? or any other classes?

Thanks in advance.
Tao

Hi Tao,

This is an interesting scenario.

IMO, you could make use of a virtual pointer in this case:

  • Virtual Pointer - software-managed virtual pointer that facilitates a raw pointer interface for SYCL buffers. The class provides a get_buffer method can be used to retrieve the SYCL buffer associated with it, and a get_offset to also retrieve the offset from the base address of the given buffer.

To get some idea of the usage of the vptr class, have a look at the following content in computecpp-sdk:

This should address your issue if my understanding of it is correct.

-Georgi

Hi, Georgi,

Thank you for your reply.

I checked the virtual pointer example, it seems like a solution.

In all examples, they allocates memory on device and initialize some values. Then the device data are copied back and read from host using host pointer.

But what if copying a vector from host to device, then doing some calculations on device, copying them back to host. How can I use virtual pointer in this case?

Best regards,

Tao

Hi,

I seem to have misunderstood you then. In this case, you can simply use sycl::buffer.
SYCL buffers have scoped lifetime. Where and when you will create them depends on where you need to access the memory.

Therefore, your system can make use of a standard buffer by enclosing it in the Simulation class, and as such it will be available throughout the lifetime of the Simulation object.

template <typename T>
struct Data {
  T* host_data;    // data from the host
  size_t size;     // size of the array
};

template <typename T>
class Simulation {
 public:
  Simulation() : m_queue{sycl::default_selector} { }

  void load(const Data<T>& data) {

    m_data = std::make_shared<...>(...);   // from data

    m_device_buf = std::make_shared<...>(...); // from m_data

    // load device data
    m_queue.submit([&](sycl::handler& cgh) {

      auto dev_data_ptr = m_dev_data_buf->template get_access<...>(cgh);

      << launch kernel: copy host data to sycl buffer >>

    });

  }

  void run() {

    // compute device data
    m_queue.submit([&](sycl::handler& cgh) {
    
      auto dev_data_ptr = m_dev_data_buf->template get_access<...>(cgh);

      << launch kernel: compute simulation >>
    
    });

  }

  void do_something() {

    << prepare - do something not necessarily device related >>

  }
 
 private:
  std::shared_ptr<sycl::buffer<T>> m_dev_data_buf;
  std::shared_ptr<Data<T>> m_data;

  sycl::queue m_queue;
};

You will have the memory stored inside the buffer on load() and that buffer will live until the instantiated Simulation object is destroyed / implicit call of the destructor. SYCL follows C++ RAII for object lifetime.

And in your main function or wherever you are instantiating the Simulation:

std::vector<float> arr(128);
Data data{arr.data(), arr.size()};

Simulation<float> sim{};
sim.load(data);     // launch kernel to allocate the data on the device
sim.do_something(); // preparations .. analyses - anything
sim.run();          // launch kernel using the already allocated device memory

This is just a design approach suggestion, the above code snippets are not complete or tested but should give an idea on what to try. Unlike CUDA or OpenCL, SYCL uses accessors as a way to access the data on the device which is stored in buffers. On exit of the scope of queue.submit, the computed data is copied onto the buffer object from the accessor and as long as your buffer object is still alive you can re-use it or “use it later”.

-Georgi

Hi,

In your snippets, the load(const Data& data) function of Simulation class,

void load(const Data& data) {

… …

  << launch kernel: copy host data to sycl buffer >>

}

How to copy host data to device explicitly?

If I haven’t misunderstood it, sycl::buffer copies data implicitly during access of buffer, in Simulation->run() function?

Best regards,

Tao

Hi,

The reason that I proposed this question is that I want to make sure the “implicitly copy” occurs in which step.

Since in my design,

Data class holds a large number of double values. There is also a vector of Data.

So I intend to firstly copy them to device using sycl::buffer, then create a vector restoring these buffers.

After that, Simulation class launches kernels over and over again using different buffers to do multiple computing simulations

If during each simulation, the access of buffer implicitly leads to memcpy from host to device, that will consume a lot.

Best regards,

Tao

Hi,

This is why you need to keep your buffers alive.
I wrote a simple example here which showcases the importance of scope.

Here is what it does:

  • allocate host memory : an array filled with values
  • initialize an empty buffer that will store the data on device
    (this one is created in the most outer SYCL scope)
  • The initial launch - >copies data to device memory
    (using a temp buffer to copy our host array to device mem)
  • Your simulation loop -> modifies the data that’s already in device memory
    (using device memory : no host to device mem copies)
  • To prove that data is on device (assuming the buffer is still in the program/SYCL scope):
    uses the modified data that’s already in device memory to write to another array
    (using a temp buffer to copy another host array to device mem)
    Note: Just to note for your case you don’t need another_host_data. You can move up (in the more outer scope) your host_data and write the results

Once (explicitly) copied onto device the memory is there until the buffer exits scope, so you need to design your system with this in mind (read up on C++ RAII and SYCL Memory Model)

I will also provide you the sample source code:

namespace kernels {
class copy_dev_mem_kernel {};
class modify_dev_mem_kernel {};
class use_dev_mem_kernel {};
}  // namespace kernels

static constexpr size_t num_elements = 32;

int main() {
  // your host data
  std::vector<int> host_data(num_elements);
  std::fill(host_data.begin(), host_data.end(), 10);
  // another host data to test if the original data is copied to device mem
  std::vector<int> another_host_data(num_elements);

  // begin SYCL scope
  try {
    sycl::queue queue(sycl::default_selector{});

    sycl::buffer<int, 1> dev_data_buf((sycl::range<1>(num_elements)));

    // copies data to device memory
    {
      sycl::buffer<int, 1> data_buf(host_data.data(),
                                    sycl::range<1>(num_elements));
                                    
      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto data_acc = data_buf.get_access<sycl::access::mode::read>(cgh);
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto copy_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          dev_data_acc[id] = data_acc[id];
        };

        cgh.parallel_for<kernels::copy_dev_mem_kernel>(
            index_space, copy_dev_mem_kernel_func);
      });
    }

    // modifies the data that's already in device memory
    {
      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::read_write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto modify_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          dev_data_acc[id] = dev_data_acc[id] * 2;
        };

        cgh.parallel_for<kernels::modify_dev_mem_kernel>(
            index_space, modify_dev_mem_kernel_func);
      });
    }

    // uses the modified data that's already in device memory
    {
      sycl::buffer<int, 1> another_data_buf(another_host_data.data(),
                                            sycl::range<1>(num_elements));

      // : command group scope
      queue.submit([&](sycl::handler& cgh) {
        auto dev_data_acc =
            dev_data_buf.get_access<sycl::access::mode::read>(cgh);
        auto another_data_acc =
            another_data_buf.get_access<sycl::access::mode::write>(cgh);
        auto index_space = sycl::nd_range<1>(sycl::range<1>(num_elements),
                                             sycl::range<1>(num_elements / 4));

        auto use_dev_mem_kernel_func = [=](sycl::nd_item<1> item) {
          auto id = item.get_global_id(0);
          another_data_acc[id] = dev_data_acc[id];
        };

        cgh.parallel_for<kernels::use_dev_mem_kernel>(index_space,
                                                      use_dev_mem_kernel_func);
      });
    }
  } catch (const sycl::exception& e) {
    std::cerr << "Caught (synchronous) SYCL error:\n" << e.what() << std::endl;
  }
  // exit SYCL scope

  // print the values in another_host_data that were copied over from device mem
  auto print_data = [another_host_data]() -> void {
    std::cout << "Values: ";
    for (auto value : another_host_data) {
      std::cout << value << " ";
    }
    std::cout << std::endl;
  };
  print_data();

  return 0;
}

And output:
Values were initialized as 10 on host -> copied to device : still 10 -> modified on device : 20

Values: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 

This does not address your exact program (architecture design, etc.) but shows how to work with buffers in order to preserve device memory rather than doing host to device mem copies which as you have noted impact performance quite a lot considering larger memory transfers.

-Georgi

Hi,

My previous suggestion introduced something like a control buffer (device buffer).
SYCL has facilities to actually do explicit copy within the command group. The copy function you can find in the SYCL under the SYCL functions for explicit memory operations section.

However, if you are running your parallel_for right after you copied onto device, you wouldn’t gain much from using this. I would suggest to try both ways : control (device) buffer as in my example or using sycl::handler::copy.

Here is a copy example I took directly from the SYCL specification for your convenience when browsing this forum topic.

const size_t nElems = 10u;

// Create a vector and fill it with values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
std::vector<int> v(nElems);
std::iota(std::begin(v), std::end(v), 0);

// Create a buffer with no associated user storage
buffer<int, 1> b{range<1>(nElems)};

// Create a queue
queue myQueue;

myQueue.submit([&](handler &cgh) {
  // Retrieve a ranged write accessor to a global buffer with access to the first half of the buffer
  accessor<int, 1, access::mode::write, access::target::global_buffer> acc(b, 
          range<1>(nElems / 2), id<1>(0));

  // Copy the first five elements of the vector into the buffer associated with the accessor
  cgh.copy(v.data(), acc);
});

Hope this helps! :slight_smile:

-Georgi