Utilizing Preallocated Device Memory

Hello Everyone,

I am attempting to write custom PyTorch kernels using SYCL and am noticing some slowdowns compared to the PyTorch CUDA backend. This is likely due to my design choices in SYCL (i.e., buffers instead of USM), so please let me know if you see any choices that may be hurting performance.

Specs for testing:

  • OS: Red Hat Enterprise Linux 8.10 (Ootpa)
  • CPU: AMD EPYC 7742
  • GPU: NVIDIA A100-SXM4-80GB
  • GPU Driver: 555.42.02
  • oneAPI Version: 2024.0
  • oneMKL Version: 0.4
    • Built for CUDA
  • Python Version: 3.9.6

This is a very simple example that calls the oneMKL GEMM function with the cuBLAS backend. I chose this example because I believe PyTorch also uses cuBLAS for simple matrix multiplications.

The SYCL code is as follows:

#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>
#include <iostream>

using namespace sycl;
namespace mkl = oneapi::mkl;

extern "C" void gemm_test(float* qq, int Q, float* kk, int K, int Embedded_D, float* out) {
        // Create the queue.
        queue q(gpu_selector_v);

        // Create the buffers.
        auto qQ = buffer(qq, range<1> { static_cast<size_t>(Q * Embedded_D) });
        auto kK = buffer(kk, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oO = buffer(out, range<1> { static_cast<size_t>(Q * K) });

        // Variables for the GEMM operation.
        int64_t m = Q;
        int64_t n = K;
        int64_t k = Embedded_D;

        int64_t ldq = k;
        int64_t ldk = k;
        int64_t ldt = n;

        float alpha = 1.0;
        float beta = 0.0;

        // The actual GEMM call.
        mkl::blas::column_major::gemm(
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },
                mkl::transpose::trans, 
                mkl::transpose::nontrans,
                n, 
                m, 
                k,  
                alpha, 
                kK,
                ldk,
                qQ,
                ldq,
                beta,
                oO, 
                ldt 
        );

        // Waiting for the operation to complete.
        q.wait();
}

This is compiled to a shared object that I then call from Python through the ctypes FFI. The compilation commands were:

  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -fPIC -I/home/nkt8/oneMKL_CUDA_backend/include gt.cpp -c
  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -fPIC -shared -o mylib.so gt.o -L/home/nkt8/oneMKL_CUDA_backend/lib -lonemkl -lonemkl_blas_cublas -lm

The code for benchmarking is as follows:

import time
import torch
from ctypes import CDLL, POINTER, c_float, c_int, cast, c_void_p

# Store the specified dimensions.
Q = 2048
K = 4096
D = 1024

# Set a default data type.
torch.set_default_dtype(torch.float32)

# Create random Q and K matrices on the GPU.
q = torch.rand([Q, D], device="cuda")
k = torch.rand([K, D], device="cuda")

# Time 250 iterations of the native PyTorch matrix multiplication.
start = time.monotonic()

for _ in range(250):
    out = q @ k.T
    torch.cuda.synchronize()

stop = time.monotonic()

print(f"Torch time: {stop - start} s")

# Create matrices of the same size on the CPU.
q_c = torch.rand([Q, D], device="cpu")
k_c = torch.rand([K, D], device="cpu")

# Time PyTorch with forced copies.
start = time.monotonic()

for _ in range(250):
    # Force copies to the GPU.
    q_c = q_c.cuda()
    k_c = k_c.cuda()

    # Multiply the copied (now on-GPU) tensors.
    out = q_c @ k_c.T
    torch.cuda.synchronize()

    # Force copies to the CPU.
    out = out.cpu()
    q_c = q_c.cpu()
    k_c = k_c.cpu()

stop = time.monotonic()

print(f"Torch time with copy: {stop - start} s")


# Shorthand float*
cust = POINTER(c_float)

# SYCL function testing.
lib = CDLL("/home/nkt8/SC_24/sycl/dev_test/mylib.so")["gemm_test"]
lib.restype = None
lib.argtypes = [cust, c_int, cust, c_int, c_int, cust]

# Preallocate the output on the GPU.
outer = torch.zeros([Q, K], device="cuda")

# Cast the raw GPU pointers to float*
q_in = cast(q.data_ptr(), cust)
k_in = cast(k.data_ptr(), cust)
o_in = cast(outer.data_ptr(), cust)

# This is just to verify that the pointer addresses are not identical.
print(hex(q.data_ptr())) # Should be on the GPU.
print(hex(q.cpu().numpy().ctypes.data)) # Should be on the CPU.

start = time.monotonic()

for _ in range(250):
    lib(q_in, Q, k_in, K, D, o_in)

stop = time.monotonic()

print(f"SYCL time (GPU pointers): {stop - start} s")

# A second example using memory allocated on the CPU instead of GPU

# Create the output buffer on the CPU.
outer = torch.zeros([Q, K], device="cpu")

# Cast the data pointers to float*
q_in = cast(q_c.data_ptr(), cust)
k_in = cast(k_c.data_ptr(), cust)
o_in = cast(outer.data_ptr(), cust)

# Verify that the data pointer matches that of the tensor "moved" to the CPU (no actual move should happen since it is already there).
print(hex(q_c.data_ptr()))
print(hex(q_c.cpu().numpy().ctypes.data))

start = time.monotonic()

for _ in range(250):
    lib(q_in, Q, k_in, K, D, o_in)

stop = time.monotonic()

print(f"SYCL time (CPU pointers): {stop - start} s")

print("The FFI finished.")

I tried to create suitably large tensors and ran each operation 250 times so we can get an idea of how the function behaves over many calls. I also made sure to synchronize the PyTorch operations to match queue.wait() on the SYCL side.

There are two variants of the PyTorch benchmark:

  1. Tensors created on the GPU and operated on there (they never leave device memory).
  2. Tensors created on the CPU, moved to the GPU, operated on, and then moved back to the CPU.

These two examples were created to try to match the two variants of pointers passed to the SYCL code:

  1. Tensors created on the GPU by PyTorch, with pointers to their data in device memory.
  2. Tensors created on the CPU by PyTorch, with pointers to their data in host memory.

The timing results are relatively consistent for PyTorch, but are more varied for SYCL. Here are the results for three different runs of the bench script (in seconds):

  1. PyTorch GPU only:
    a. 0.382 s
    b. 0.354 s
    c. 0.341 s
  2. PyTorch with copies:
    a. 7.491 s
    b. 7.134 s
    c. 7.040 s
  3. SYCL with GPU pointers:
    a. 3.058 s
    b. 7.802 s
    c. 8.006 s
  4. SYCL with CPU pointers:
    a. 5.275 s
    b. 14.630 s
    c. 5.851 s

The output from the SYCL code matches PyTorch for both the GPU- and CPU-pointer examples (the result just comes back in column-major order, which is not an issue). Even the fastest SYCL runs are roughly an order of magnitude slower than PyTorch without memory transfers. I thought this might be due to implicit memory transfers, which is why I added the PyTorch benchmark with forced copies to get an idea of the expected slowdown (even though we miss one transfer because of the preallocated output buffer).

Looking at the SYCL reference for buffers (https://github.com/KhronosGroup/SYCL_Reference/blob/main/source/iface/buffer.rst) I notice that all buffer constructors deal with host memory (or ranges and iterators). By using buffers, is SYCL implicitly copying GPU memory to the host before moving it back to the device for GEMM?

I initially avoided USM because the reference (https://github.com/KhronosGroup/SYCL_Reference/blob/main/source/iface/usm_allocations.rst) appears to show that it actually performs an allocation on the specified host/device.

So my main questions can be boiled down to:

  1. Is there a way to avoid my hypothesized memory transfers?
  2. Could we convert a pointer to USM device memory without having to allocate and copy?

Also, if anyone has any ideas about the inconsistencies in runtimes between the SYCL benchmarks, I’d love to hear them. (I’m wondering whether this could be related to loading the shared library dynamically on the Python side, but I am not positive.)

Thank you for reading my post! Please let me know if anything is unclear and I will elaborate.

Hi @ntomczak,

Thank you for the detailed question!

The first thing I noticed when scanning your code is that you create a new queue each time the function is called. Queue creation is a slow operation (and is expected to be), and this is likely to account for some of the difference between the SYCL and CUDA versions. As much as possible, I would recommend caching SYCL objects (they’re designed to be safe and cheap to copy around as much as needed); see the sketch below. The same goes for buffers: creating and destroying them all the time isn’t free.
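
For illustration, here is a minimal sketch of caching the queue (the cached_queue helper name is just a placeholder, and this assumes a single device and no special thread-safety requirements):

#include <sycl/sycl.hpp>

// Constructed once, on the first call, then reused for every later call,
// so repeated invocations do not pay the queue-construction cost again.
sycl::queue& cached_queue() {
    static sycl::queue q{ sycl::gpu_selector_v };
    return q;
}

gemm_test could then take its queue from cached_queue() instead of constructing a new one on every call, and the same idea applies to any buffers you can keep alive between calls.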

To your questions, I’ll try to answer with the approximate workings of each type of memory management. Buffers are lazy and, in general, will only allocate on-device when actually required to (generally the first time they’re used in a kernel submission). Because buffers are a high-level construct, they can also have host-side backing, to allow for complex data patterns (e.g. moving data from one device to another across backends). That being said, the buffer constructors that take host pointers, iterators and so on will allocate on the host, then when used will have a corresponding device allocation. The constructors which only take a range are (generally) device-only, will allocate on device when required, and will not normally have an associated host allocation (unless you do something that needs one, like the data movements mentioned).
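
To make the distinction concrete, here is a rough sketch of the two constructor styles (illustrative only, not taken from your code):

#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q{ sycl::gpu_selector_v };

    // Host-backed buffer: wraps existing host memory; the runtime will
    // typically copy it to a device allocation the first time a kernel uses
    // it, and may write back to the host pointer when the buffer is destroyed.
    std::vector<float> host_data(1024, 1.0f);
    sycl::buffer<float, 1> host_backed{ host_data.data(), sycl::range<1>{1024} };

    // Range-only buffer: no associated host pointer; the runtime allocates on
    // the device only when a kernel actually requires it.
    sycl::buffer<float, 1> device_only{ sycl::range<1>{1024} };

    q.submit([&](sycl::handler& cgh) {
        sycl::accessor in{ host_backed, cgh, sycl::read_only };
        sycl::accessor out{ device_only, cgh, sycl::write_only };
        cgh.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
            out[i] = 2.0f * in[i];
        });
    });
    q.wait();
}

The device_only buffer here never needs a host allocation unless something (like a host accessor) forces one.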

USM can be thought of as an equivalent of malloc(), save for the allocations being device-dereferenceable. I’m guessing, but your system likely also supports shared USM allocations, which are usable on both the host and the device.
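
As a rough sketch of how the two USM flavours look in practice (again, just illustrative):

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{ sycl::gpu_selector_v };

    // Device USM: dereferenceable only on the device, much like cudaMalloc
    // memory; data is moved with explicit memcpy calls.
    float* d = sycl::malloc_device<float>(1024, q);

    // Shared USM (where supported): a single pointer usable on both host and
    // device, with migration handled by the runtime.
    float* s = sycl::malloc_shared<float>(1024, q);

    for (int i = 0; i < 1024; ++i) s[i] = float(i);   // host-side write
    q.memcpy(d, s, 1024 * sizeof(float)).wait();      // copy to the device once

    q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
        d[i] *= 2.0f;                                 // device-side work
    }).wait();

    q.memcpy(s, d, 1024 * sizeof(float)).wait();      // copy back once at the end

    sycl::free(d, q);
    sycl::free(s, q);
}

Note that the single copy in and single copy out can bracket any number of kernel launches; the data stays resident on the device in between.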

Therefore: is there a way to avoid memory transfers? No, you must at the very least get the data from the host to the device and back again after the computation. However, it is possible to do no more than that, and to leave the data resident on the device should it be required at a later step. I don’t fully understand your second question, but perhaps my other answers have helped.

In short, my current advice is to allocate memory and create queues once, then reuse them for all subsequent launches, regardless of whether you use USM or buffers. Obviously that’s not always possible.

I hope this helps,
Duncan.


Hey @duncan,

Thanks for the great reply!

That’s a very sound suggestion about caching objects; I had not even considered it. I will look into doing so for the queues themselves, as they are the most likely to be reused in my case.

And I see, it sounds like we won’t be able to avoid those transfers in our case. That is good to know!

My second question was really about whether we could take a pointer to a region of device memory that had been allocated with cudaMalloc() and operate on it directly with SYCL. My phrasing above was not the best; I was wondering whether there is a way to convert such a pointer into a SYCL-supported object without going through the host first.

But it sounds like that may not currently be possible. Am I correct in this regard or did I misunderstand?

Thank you for your help!

Oh! So if using USM, you could do something like:

auto p = sycl::malloc_device<float>(1024, q);
p += 64;
q.submit([&](sycl::handler& cgh) {  // normal submission
  cgh.parallel_for(sycl::range<1>{960}, [=](sycl::id<1> i) {
    p[i] = 0.0f;  // indexes relative to the already-offset pointer
  });
});

You can do the same with buffers and accessors, though the APIs are obviously a bit different. It’s worth noting that when using the CUDA backend, I believe that malloc_device corresponds exactly to cudaMalloc.
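
For the buffer case, a rough equivalent would look something like this (illustrative only; here the offset is simply applied inside the kernel, though ranged accessors and sub-buffers can also express a sub-region):

#include <sycl/sycl.hpp>

int main() {
    sycl::queue q{ sycl::gpu_selector_v };
    sycl::buffer<float, 1> buf{ sycl::range<1>{1024} };
    const size_t offset = 64;  // analogous to "p += 64" on a USM pointer

    q.submit([&](sycl::handler& cgh) {
        sycl::accessor acc{ buf, cgh, sycl::write_only };
        cgh.parallel_for(sycl::range<1>{1024 - offset}, [=](sycl::id<1> i) {
            acc[i + offset] = 0.0f;  // only the offset sub-region is touched
        });
    });
    q.wait();
}

The net effect is the same as the USM snippet above: only the region starting 64 floats into the allocation is written.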

Hey @duncan, apologies for the late reply; I have been out of the country and missed getting back to this.

Firstly, thank you for the initial suggestion on caching; it led to some great improvements! For reference, on the same GEMM call repeated 250 times:

  • Without caching we measured 4.62 seconds
  • With caching we measured 1.49 seconds

I hadn’t initially realized that queue creation was so expensive, but this is great to know for future projects.

Second, regarding the code snippet you posted above, I don’t think I fully understand it. I see that you allocate a region of device memory of 1024 floats and store a pointer to it in p, then move that pointer ahead in memory by 64 floats. But I get confused at that point: what is the purpose of the queue submission? Is it that USM is also lazy and won’t be allocated until necessary? And what does the indexing of p[i] do to offset the pointer? (I was thinking that the incrementation of p before that would be the offset.)

The queue submit was just an example of then using that pointer, nothing more. I don’t fully remember the context of our posts, but I think I was trying to demonstrate that a USM pointer works exactly like a CUDA device pointer: you can pre-compute the offset and use it in a kernel as normal.


Gotcha, that makes sense and it should work for my use case. Thank you again for your help!
