Hello Everyone,
I am attempting to write custom PyTorch kernels using SYCL and am seeing significant slowdowns compared to the PyTorch CUDA backend. This is likely due to my design choices in SYCL (e.g., buffers instead of USM), so please let me know if you see anything that may be hurting performance.
Specs for testing:
- OS: Red Hat Enterprise Linux 8.10 (Ootpa)
- CPU: AMD EPYC 7742
- GPU: NVIDIA A100-SXM4-80GB
- GPU Driver: 555.42.02
- oneAPI Version: 2024.0
- oneMKL Version: 0.4 (built for the CUDA backend)
- Python Version: 3.9.6
Below is a very simple example that calls the oneMKL GEMM function with the cuBLAS backend. I chose this example because I believe PyTorch also uses cuBLAS for plain matrix multiplications.
The SYCL code is as follows:
#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>
#include <iostream>
using namespace sycl;
namespace mkl = oneapi::mkl;
extern "C" void gemm_test(float* qq, int Q, float* kk, int K, int Embedded_D, float* out) {
    // Create the queue.
    queue q(gpu_selector_v);
    // Create the buffers from the raw pointers.
    auto qQ = buffer(qq, range<1>{ static_cast<size_t>(Q * Embedded_D) });
    auto kK = buffer(kk, range<1>{ static_cast<size_t>(K * Embedded_D) });
    auto oO = buffer(out, range<1>{ static_cast<size_t>(Q * K) });
    // Variables for the GEMM operation.
    int64_t m = Q;
    int64_t n = K;
    int64_t k = Embedded_D;
    int64_t ldq = k;
    int64_t ldk = k;
    int64_t ldt = n;
    float alpha = 1.0;
    float beta = 0.0;
    // The actual GEMM call.
    mkl::blas::column_major::gemm(
        oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },
        mkl::transpose::trans,
        mkl::transpose::nontrans,
        n,
        m,
        k,
        alpha,
        kK,
        ldk,
        qQ,
        ldq,
        beta,
        oO,
        ldt
    );
    // Waiting for the operation to complete.
    q.wait();
}
This is compiled to a shared object that I then call from Python using the ctypes FFI. The compilation process was:
- icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -fPIC -I/home/nkt8/oneMKL_CUDA_backend/include gt.cpp -c
- icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 -fPIC -shared -o mylib.so gt.o -L/home/nkt8/oneMKL_CUDA_backend/lib -lonemkl -lonemkl_blas_cublas -lm
The code for benchmarking is as follows:
import time
import torch
from ctypes import CDLL, POINTER, c_float, c_int, cast, c_void_p
# Store the specified dimensions.
Q = 2048
K = 4096
D = 1024
# Set a default data type.
torch.set_default_dtype(torch.float32)
# Create random Q and K matrices on the GPU.
q = torch.rand([Q, D], device="cuda")
k = torch.rand([K, D], device="cuda")
# Time 250 iterations of the native PyTorch matrix multiplication.
start = time.monotonic()
for _ in range(250):
    out = q @ k.T
    torch.cuda.synchronize()
stop = time.monotonic()
print(f"Torch time: {stop - start} s")
# Create matrices of the same size on the CPU.
q_c = torch.rand([Q, D], device="cpu")
k_c = torch.rand([K, D], device="cpu")
# Time PyTorch with forced copies.
start = time.monotonic()
for _ in range(250):
    # Force copies to the GPU.
    q_c = q_c.cuda()
    k_c = k_c.cuda()
    out = q_c @ k_c.T
    torch.cuda.synchronize()
    # Force copies back to the CPU.
    out = out.cpu()
    q_c = q_c.cpu()
    k_c = k_c.cpu()
stop = time.monotonic()
print(f"Torch time with copy: {stop - start} s")
# Shorthand float*
cust = POINTER(c_float)
# SYCL function testing.
lib = CDLL("/home/nkt8/SC_24/sycl/dev_test/mylib.so")["gemm_test"]
lib.restype = None
lib.argtypes = [cust, c_int, cust, c_int, c_int, cust]
# Preallocate the output on the GPU.
outer = torch.zeros([Q, K], device="cuda")
# Cast the raw GPU pointers to float*
q_in = cast(q.data_ptr(), cust)
k_in = cast(k.data_ptr(), cust)
o_in = cast(outer.data_ptr(), cust)
# This is just to verify that the pointer addresses are not identical.
print(hex(q.data_ptr())) # Should be on the GPU.
print(hex(q.cpu().numpy().ctypes.data)) # Should be on the CPU.
start = time.monotonic()
for _ in range(250):
    lib(q_in, Q, k_in, K, D, o_in)
stop = time.monotonic()
print(f"SYCL time (GPU pointers): {stop - start} s")
# A second example using memory allocated on the CPU instead of GPU
# Create the output buffer on the CPU.
outer = torch.zeros([Q, K], device="cpu")
# Cast the data pointers to float*
q_in = cast(q_c.data_ptr(), cust)
k_in = cast(k_c.data_ptr(), cust)
o_in = cast(outer.data_ptr(), cust)
# Verify that the data pointer matches the .cpu() copy's pointer (the "move" should be a no-op since q_c is already on the CPU).
print(hex(q_c.data_ptr()))
print(hex(q_c.cpu().numpy().ctypes.data))
start = time.monotonic()
for _ in range(250):
    lib(q_in, Q, k_in, K, D, o_in)
stop = time.monotonic()
print(f"SYCL time (CPU pointers): {stop - start} s")
print("The FFI finished.")
I tried to create suitably large tensors and ran each operation 250 times so we can get an idea of how the function behaves over many calls. I also made sure to synchronize the PyTorch operations to match queue.wait() on the SYCL side.
There are two variants of the PyTorch benchmark:
- Tensors created on the GPU and operated on there (they never leave device memory).
- Tensors created on the CPU, moved to the GPU, operated on there, and then moved back to the CPU.
These two variants were created to match the two kinds of pointers passed to the SYCL code:
- Tensors created on the GPU by PyTorch and a pointer to their memory address in device memory.
- Tensors created on the CPU by PyTorch and a pointer to their memory address in host memory.
The timing results are relatively consistent for PyTorch, but are more varied for SYCL. Here are the results for three different runs of the bench script (in seconds):
- PyTorch GPU only:
  a. 0.382 s
  b. 0.354 s
  c. 0.341 s
- PyTorch with copies:
  a. 7.491 s
  b. 7.134 s
  c. 7.040 s
- SYCL with GPU pointers:
  a. 3.058 s
  b. 7.802 s
  c. 8.006 s
- SYCL with CPU pointers:
  a. 5.275 s
  b. 14.630 s
  c. 5.851 s
The output from the SYCL code matches PyTorch's in both the GPU- and CPU-pointer examples (it just comes back in column-major order, which is not an issue). Even the fastest SYCL runs are on the order of ~10-20x slower than PyTorch without memory transfers. I thought this might be due to implicit memory transfers, which is why I added the PyTorch benchmark with forced copies to get an idea of the expected slowdown (even though it misses one transfer because the output buffer is preallocated).
Looking at the SYCL reference for buffers (https://github.com/KhronosGroup/SYCL_Reference/blob/main/source/iface/buffer.rst) I notice that all buffer constructors deal with host memory (or ranges and iterators). By using buffers, is SYCL implicitly copying GPU memory to the host before moving it back to the device for GEMM?
I initially avoided USM because the reference (https://github.com/KhronosGroup/SYCL_Reference/blob/main/source/iface/usm_allocations.rst) appears to show that the USM allocation functions always perform a fresh allocation on the specified host or device, rather than adopting memory I already have. A rough sketch of what I understood the USM path to require follows.
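This is only a sketch (I have not compiled or benchmarked it); it assumes the incoming pointers are host pointers and that the pointer-based (USM) overload of oneapi::mkl::blas::column_major::gemm accepts the same backend_selector as the buffer overload, and gemm_test_usm is just a hypothetical name:
// Hypothetical USM variant: allocate device memory, copy the host data in,
// run GEMM on the USM pointers, then copy the result back out.
extern "C" void gemm_test_usm(float* qq, int Q, float* kk, int K, int Embedded_D, float* out) {
    queue q(gpu_selector_v);
    size_t q_elems = static_cast<size_t>(Q) * Embedded_D;
    size_t k_elems = static_cast<size_t>(K) * Embedded_D;
    size_t o_elems = static_cast<size_t>(Q) * K;
    // Fresh device allocations -- exactly the allocations I was hoping to avoid.
    float* d_q = sycl::malloc_device<float>(q_elems, q);
    float* d_k = sycl::malloc_device<float>(k_elems, q);
    float* d_o = sycl::malloc_device<float>(o_elems, q);
    // Explicit host-to-device copies.
    q.memcpy(d_q, qq, q_elems * sizeof(float));
    q.memcpy(d_k, kk, k_elems * sizeof(float));
    q.wait();
    int64_t m = Q, n = K, k = Embedded_D;
    float alpha = 1.0f, beta = 0.0f;
    // Same GEMM as before, but through the pointer-based (USM) overload.
    mkl::blas::column_major::gemm(
        oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },
        mkl::transpose::trans, mkl::transpose::nontrans,
        n, m, k, alpha,
        d_k, k,   // ldk = Embedded_D
        d_q, k,   // ldq = Embedded_D
        beta,
        d_o, n    // ldt = K
    ).wait();
    // Explicit device-to-host copy of the result.
    q.memcpy(out, d_o, o_elems * sizeof(float)).wait();
    sycl::free(d_q, q);
    sycl::free(d_k, q);
    sycl::free(d_o, q);
}
Those allocations and the two extra copies per call are exactly what I would like to avoid.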
So my main questions can be boiled down to:
- Is there a way to avoid my hypothesized memory transfers?
- Could we convert an existing device pointer to USM device memory without having to allocate and copy? (A sketch of what I have in mind is just below.)
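To make the second question concrete, this is the kind of call I am hoping is legal. It is only a sketch, and it rests on an assumption I cannot confirm from the reference: that a CUDA device pointer obtained from tensor.data_ptr() can be handed to the USM overload as-is, i.e. that it is usable in the context of the queue my library creates without any allocation or copy. gemm_test_nocopy is a hypothetical name.
// Hypothetical "no-copy" variant: treat the pointers PyTorch hands me as if
// they were already USM device allocations and pass them straight to GEMM.
// Assumption to confirm: a device pointer from tensor.data_ptr() is valid in
// this queue's context without any intermediate allocation or memcpy.
extern "C" void gemm_test_nocopy(float* d_qq, int Q, float* d_kk, int K, int Embedded_D, float* d_out) {
    queue q(gpu_selector_v);
    int64_t m = Q, n = K, k = Embedded_D;
    float alpha = 1.0f, beta = 0.0f;
    // Pointer-based (USM) overload of the same GEMM call: no buffers, no copies.
    auto done = mkl::blas::column_major::gemm(
        oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },
        mkl::transpose::trans, mkl::transpose::nontrans,
        n, m, k, alpha,
        d_kk, k,    // ldk = Embedded_D
        d_qq, k,    // ldq = Embedded_D
        beta,
        d_out, n    // ldt = K
    );
    done.wait();
}
If something like this is valid, it would also avoid constructing and destroying buffers on every call.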
Also, if anyone has ideas about the run-to-run inconsistency of the SYCL benchmarks, I'd love to hear them. (I wonder whether it could be due to calling into a dynamically loaded shared library from Python via ctypes, but I am not sure.)
Thank you for reading my post! Please let me know if anything is unclear and I will elaborate.