How do the AMD/Nvidia plugins work?

I am referring to this page when I say AMD plugin. I want to understand how this support works underneath. I have been reading the build instructions from intel/llvm, which also cover building for AMD/Nvidia devices. Is your plugin essentially automating this build?

I also see a reference to Unified Runtime in the above-linked instructions, which mentions Nvidia and AMD GPUs as well. How do your plugins relate to Unified Runtime and to Intel's Level Zero runtime?

I have not been able to find a single document covering all of this - so please excuse my ignorance. I am quite interested in platform-agnostic runtimes. My earlier work on portability can be found here: PortBLAS: Open Source BLAS that is Performance-Portable Across GPUs - Sasank's Blog

Hi @chsasank,

These are great questions. I'll try to answer them in a way that reflects how I think about the stack.

Unified Runtime is an abstraction layer for heterogeneous computing that allows the Intel SYCL implementation to communicate with multiple backend implementations. For example, in SYCL I might call queue::copy(src, dest, size); the SYCL runtime then calls something like urEnqueueMemCpyDeviceDevice (I don't know if this exact function exists, but it is a very plausible name for a UR function). At that point, Unified Runtime decides which backend library to translate the call into, be it CUDA, HIP, Level Zero or OpenCL. Codeplay distributes the plugins for CUDA and HIP.
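
To make that concrete, here is a minimal sketch from the application side. The UR entry points in the comments are plausible guesses at the mapping, in the same spirit as above, rather than names verified against the headers:

```cpp
// Minimal USM copy in SYCL. With DPC++, each runtime call below is
// lowered to a Unified Runtime entry point, and the selected adapter
// (CUDA, HIP, Level Zero or OpenCL) maps that onto the native driver.
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q; // backend chosen at runtime: Level Zero, CUDA, HIP or OpenCL
  constexpr size_t n = 1024;
  int *src = sycl::malloc_device<int>(n, q); // -> a UR USM-alloc entry point
  int *dst = sycl::malloc_device<int>(n, q);
  q.copy(src, dst, n).wait(); // -> a UR enqueue-memcpy entry point
  sycl::free(src, q);
  sycl::free(dst, q);
}
```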

There are other future plans for UR which might come to fruition, but this is its main purpose today. It's also worth noting that the current code looks a little different because of a pre-existing abstraction layer called PI (the Plugin Interface) that is still referenced in the code as we migrate from one to the other.

I hope this makes sense; please ask if there's anything I can clarify.

Duncan.

Thanks for the reply, @duncan.

If I understand the stack right:

DPC++ → SYCL runtime → Unified Runtime → Level Zero/HIP/CUDA.

I have read through some of the Unified Runtime code. Quite easy to understand, actually. So Unified Runtime takes in binaries/device-specific IR and abstracts out the host-side logic. If I generate these kernels/binaries using an alternative method, I can essentially use Unified Runtime as an abstraction over the device driver. Do I understand this right?
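
Concretely, I'm imagining something along these lines. This is only a rough sketch: I've taken the entry-point names from the UR spec as I read it, the exact signatures have been changing between releases, and the platform/context/queue setup is elided, so nothing here is verified:

```cpp
// Sketch: drive Unified Runtime directly with a device binary produced
// by some other toolchain (PTX for CUDA, a code object for HIP, or
// SPIR-V via urProgramCreateWithIL for Level Zero/OpenCL).
// Entry-point names follow the UR spec; check ur_api.h for the exact
// signatures in your release. Error handling is omitted for brevity
// (every call returns a ur_result_t).
#include <ur_api.h>

void launch_prebuilt_kernel(ur_context_handle_t ctx,
                            ur_device_handle_t dev,
                            ur_queue_handle_t queue,
                            const uint8_t *binary, size_t binary_size) {
  // Hand UR the pre-generated device code.
  ur_program_handle_t program;
  urProgramCreateWithBinary(ctx, dev, binary_size, binary,
                            /*pProperties=*/nullptr, &program);
  urProgramBuild(ctx, program, /*pOptions=*/nullptr);

  // "my_kernel" is a hypothetical entry point inside the binary.
  ur_kernel_handle_t kernel;
  urKernelCreate(program, "my_kernel", &kernel);

  // UR abstracts the driver-level launch across backends.
  const size_t offset[1] = {0};
  const size_t global[1] = {1024};
  urEnqueueKernelLaunch(queue, kernel, /*workDim=*/1, offset, global,
                        /*pLocalWorkSize=*/nullptr,
                        /*numEventsInWaitList=*/0, nullptr,
                        /*phEvent=*/nullptr);
}
```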

In theory, yeah, that would work. There are discussions in LLVM about UR becoming more of a part of upstream LLVM, which would presumably further enable this sort of project, though I would say that for the moment the easiest way to use it is probably through the SYCL interface.

See [RFC] Introducing `llvm-project/offload` - #31 by alycm - LLVM Project - LLVM Discussion Forums for more information, though obviously nothing there has been decided or actioned. But we're interested.

Thanks. I am interested in writing a front-end different from C++. While I love the abstractions in SYCL, I really wish they were available in other languages. Specifically, I am looking to make kernel fusion/optimization easy to do and thereby obviate the need for API-based programming.

I have read the paper/blog that moved some of the back end to MLIR. I wonder if I can target the sycl-mlir dialect to do high-level, device-independent code generation that performs kernel fusion, and leave the rest of the optimizations and the device-level compilation to the SYCL compiler.

This sounds like a really interesting idea! When you make progress, we'd be really interested to hear how you are getting on, as there are a lot of potentially good solutions out there still to be discovered for making fast GPU compute a reality.