AMD GPU SYCL Plugin urDeviceGetGlobalTimestamps Error

Hi,
I followed the steps from the website and configured everything; the example on the website works.
The p2p_access example from the intel/llvm repository also works with small changes:

$ ./a.out 
Running on devices:
0:      AMD Instinct MI210
1:      AMD Instinct MI210
PASS

However, my code, which previously worked with NVIDIA GPUs, runs perfectly fine with 1 GPU but raises this error message in the 2-GPU version:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_hip.so [ PluginVersion: 15.49.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 15.47.1 ]

UR HIP ERROR:
        Value:           400
        Name:            hipErrorInvalidHandle
        Description:     invalid resource handle
        Function:        urDeviceGetGlobalTimestamps
        Source Location: /tmp/tmp.GuKRlfBC0o/intel-llvm-mirror/build/_deps/unified-runtime-src/source/adapters/hip/device.cpp:989

terminate called after throwing an instance of 'ur_result_t'

I don’t know how to approach or debug this problem.

Thank you in advance

Hi @br-ko,
Which version of DPC++ are you using? Could you also try a recent open-source daily build from github.com/intel/llvm/releases? We made some changes to the handling of multiple devices in recent months. The error looks as if a call was made to the HIP backend with a wrong or invalid device context.

If that doesn’t help, would you be able to share a minimal reproducer for this issue? It looks like you’re using two AMD GPUs and are submitting a kernel to a queue with the enable_profiling property, but it would be helpful to see how you create your queues and submit your kernels.
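
For reference, the kind of pattern I have in mind is something along these lines (just a minimal sketch assuming two GPU devices and USM device allocations; the kernel and variable names are placeholders):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  auto devs = sycl::device::get_devices(sycl::info::device_type::gpu);

  // One profiling-enabled queue per device.
  sycl::queue q0{devs.at(0), sycl::property::queue::enable_profiling{}};
  sycl::queue q1{devs.at(1), sycl::property::queue::enable_profiling{}};

  // Submit a trivial kernel to the second device and read its profiling info.
  int *data = sycl::malloc_device<int>(1024, q1);
  auto e = q1.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
    data[i[0]] = static_cast<int>(i[0]);
  });
  e.wait();
  auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto end = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::cout << "kernel time: " << (end - start) * 1e-6 << " ms" << std::endl;
  sycl::free(data, q1);
}

Seeing where your code differs from a pattern like this would help narrow down which call produces the invalid handle.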

Thank you @rbielski, I will try the latest version from intel/llvm. I installed via the oneAPI DPC++ standalone compiler, version 2024.2.1.

I will give the intel/llvm build a try and put together a simple reproducing example.

Also, I am trying to double-check whether the documentation website is up to date on this point:

Is it still the case that AMD GPUs only support work-group-level atomics, unlike NVIDIA GPUs where device-wide and system-wide atomics are supported?

That is in fact outdated, thank you for flagging it. All atomic scopes should be supported, and we will update our documentation for the next release.

Apologies, it turns out the support was reverted due to some issues, so the documentation is correct at the moment. We plan to reinstate support for the device and system scopes in a future release. You can check which scopes a given context supports with get_info(), e.g. like this:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{};
  auto scopes{q.get_context().get_info<sycl::info::context::atomic_memory_scope_capabilities>()};
  std::cout << q.get_device().get_info<sycl::info::device::name>()
            << " supports " << scopes.size() << " atomic memory scopes:" << std::endl;
  for (sycl::memory_scope s : scopes) {
    switch (s) {
      case sycl::memory_scope::work_item : {
        std::cout << "  work_item" << std::endl;
        break;
      }
      case sycl::memory_scope::sub_group : {
        std::cout << "  sub_group" << std::endl;
        break;
      }
      case sycl::memory_scope::work_group : {
        std::cout << "  work_group" << std::endl;
        break;
      }
      case sycl::memory_scope::device : {
        std::cout << "  device" << std::endl;
        break;
      }
      case sycl::memory_scope::system : {
        std::cout << "  system" << std::endl;
        break;
      }
      default: {
        std::cout << "  unknown memory_scope value " << static_cast<int>(s) << std::endl;
        break;
      }
    }
  }
}

This indeed prints only:

AMD Instinct MI210 supports 3 atomic memory scopes:
  work_item
  sub_group
  work_group

in recent releases, but will also print device and system in the future once the support is added back.
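
Once they are re-enabled, using the wider scopes on the device side is just the standard SYCL 2020 atomic_ref. Here is a minimal sketch (the shared counter is only illustrative):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{};
  int *counter = sycl::malloc_shared<int>(1, q);
  *counter = 0;

  q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) {
    // Atomic increment with system scope on USM shared memory.
    sycl::atomic_ref<int, sycl::memory_order::relaxed,
                     sycl::memory_scope::system,
                     sycl::access::address_space::global_space>
        ref(*counter);
    ref.fetch_add(1);
  }).wait();

  std::cout << "counter = " << *counter << std::endl;
  sycl::free(counter, q);
}

The scope you request should of course be one of those reported by atomic_memory_scope_capabilities.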

Hi @rbielski, thank you for the detailed feedback. I think I found the commit that adds support for device- and system-level atomic operations. Is it this one? [HIP] Enable device and system memory scope for atomics on HIP · oneapi-src/unified-runtime@08b19b2 · GitHub

I am confused about how to pull this commit in when building from the intel/llvm repository, since system-wide atomics on the MI210 are crucial for our application.

Thank you again,

I forked unified-runtime and modified only the line I pointed out from the earlier PR, and now it reports device and system support, so that problem is solved.

Thanks!

Apologies,
I have now tracked the issue down in my code:

#include <sycl/sycl.hpp>
#include <algorithm>
#include <iostream>
#include <vector>

#define NUM_GPU 2

int main() {
  auto Devs = sycl::device::get_devices(sycl::info::device_type::gpu);

  // Create a profiling-enabled queue only for the devices required by the
  // model, not for all of them.
  std::vector<sycl::queue> Queues;
  std::transform(Devs.begin(), Devs.begin() + NUM_GPU, std::back_inserter(Queues),
                 [](const sycl::device &D) {
                   return sycl::queue{D, sycl::property::queue::enable_profiling{}};
                 });

  if (Devs.size() > 1) {
    if (!Devs[0].ext_oneapi_can_access_peer(
            Devs[1], sycl::ext::oneapi::peer_access::access_supported)) {
      std::cout << "P2P access is not supported by devices, exiting."
                << std::endl;
      return 0;
    }
  }

  std::cout << "\n----------------------------------------" << std::endl;
  std::cout << "Running on devices:" << std::endl;
  for (size_t i = 0; i < Queues.size(); i++) {
    std::cout << i << ":\t"
              << Queues[i].get_device().get_info<sycl::info::device::name>()
              << std::endl;
  }
  std::cout << "----------------------------------------" << std::endl;

  // Enables Devs[x] to access Devs[y] memory and vice versa.
  if (Devs.size() > 1) {
    for (auto gpuID_i = 0; gpuID_i < NUM_GPU; gpuID_i++) {
      for (auto gpuID_j = 0; gpuID_j < NUM_GPU; gpuID_j++) {
        if (gpuID_i != gpuID_j) {
          Devs[gpuID_i].ext_oneapi_enable_peer_access(Devs[gpuID_j]);
        }
      }
    }
  }

  auto scopes{Queues[0].get_context().get_info<sycl::info::context::atomic_memory_scope_capabilities>()};
  std::cout << Queues[0].get_device().get_info<sycl::info::device::name>()
            << " supports " << scopes.size() << " atomic memory scopes:" << std::endl;
  for (sycl::memory_scope s : scopes) {
    switch (s) {
      case sycl::memory_scope::work_item: {
        std::cout << "  work_item" << std::endl;
        break;
      }
      case sycl::memory_scope::sub_group: {
        std::cout << "  sub_group" << std::endl;
        break;
      }
      case sycl::memory_scope::work_group: {
        std::cout << "  work_group" << std::endl;
        break;
      }
      case sycl::memory_scope::device: {
        std::cout << "  device" << std::endl;
        break;
      }
      case sycl::memory_scope::system: {
        std::cout << "  system" << std::endl;
        break;
      }
      default: {
        std::cout << "  unknown memory_scope value " << static_cast<int>(s) << std::endl;
        break;
      }
    }
  }
}

When I delete the enable_profiling property it works, but otherwise it still gives the elapsedTime error. How could I solve this problem, @rbielski?

Using this (pre-built): Release DPC++ daily 2024-10-25 · intel/llvm · GitHub

Hi @br-ko, can you paste the actual error you see please? I am not sure we’ve run into this before. Are you saying that when profiling is enabled, running this code exits with an error? Which line causes the error?

Hi @duncan,

This is the output:

$ rm -rf a.out;clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a atomic_test.cpp ; ./a.out 
ZE_LOADER_DEBUG_TRACE:Using Loader Library Path: 
ZE_LOADER_DEBUG_TRACE:0 Drivers Discovered
<HIP>[ERROR]: 
UR HIP ERROR:
        Value:           400
        Name:            hipErrorInvalidHandle
        Description:     invalid resource handle
        Function:        getElapsedTime
        Source Location: /home/aac/llvm-nightly-2024-10-25/build/_deps/unified-runtime-src/source/adapters/hip/device.cpp:31


----------------------------------------
Running on devices:
0:      AMD Instinct MI210
1:      AMD Instinct MI210
----------------------------------------
AMD Instinct MI210 supports 5 atomic memory scopes:
  work_item
  sub_group
  work_group
  device
  system

This is the run command:

export DPCPP_HOME=/home/aac/
export PATH=$DPCPP_HOME/llvm-nightly-2024-10-25/build/bin:$PATH;export LD_LIBRARY_PATH=$DPCPP_HOME/llvm-nightly-2024-10-25/build/lib:$LD_LIBRARY_PATH
rm -rf a.out;clang++ -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a atomic_test.cpp ; ./a.out

With this snippet the program continues, but my actual code stops executing when I try to capture the profiling information with this function:

double GetExecutionTime(const event &e) {
  double start_k = e.get_profiling_info<info::event_profiling::command_start>();
  double end_k = e.get_profiling_info<info::event_profiling::command_end>();
  double kernel_time = (end_k - start_k) * 1e-6; // ns to ms
  return kernel_time;
}
...


           start_time[gpuID] = executeEvent[gpuID].get_profiling_info<info::event_profiling::command_start>();

EDIT: It works for 1 GPU but fails for 2 GPUs.

Thank you for the reproducer. We’re getting things ready to try this in a similar setup to yours and will let you know as soon as we know more.

Hi @br-ko,
that is indeed a runtime bug, many thanks for reporting this. It should be fixed in this PR:

Hi @rbielski,
Thank you for the quick fix! Have a nice weekend!