Atomic Operation with USM

br-ko · 13 May 2024 09:06

Hello,
We are using the NVIDIA A100 GPU with Codeplay plugin for SYCL. we are having hard time implementing the atomics since it gives wrong results. This is the basic version of the code we would like to implement where if the usm_updating_mask element is 1 , the index of it will be pushed to the usm_pipe

for( int tid =0; tid < no_of_nodes;tid++){
              char condition = usm_updating_mask[tid];
              if(condition){
                usm_pipe[iter] = tid;
                iter++; 
              }
            } 
          


          d_over[0] = iter;

So our initial attempt was this but our results are not valid :

    const size_t local_size = 16;  // Number of work-items per work-group
    const size_t global_size = ((no_of_nodes + local_size - 1) / local_size) * local_size;

    // Setup the range
    nd_range<1> range(global_size, local_size);

        auto e = q.parallel_for<class PipeGenerator>(range, [=](nd_item<1> item) [[intel::kernel_args_restrict]] {
          int tid = item.get_global_linear_id();
          sycl::atomic_ref<unsigned int, sycl::memory_order_relaxed,
        sycl::memory_scope_device,sycl::access::address_space::global_space>
        atomic_op_global(d_over[0]);
          if (tid < no_of_nodes) {
            
              char condition = usm_updating_mask[tid];
              if(condition){
                usm_pipe[atomic_op_global] = tid;
                atomic_op_global+=1; 
              }
            } 
        

        });

Is it valid to update atomic_op_global and also at the same time access it ?
Thank you

rbielski · 13 May 2024 11:02

Yes, it is valid to do this at the same time, but your code is not doing that. These are two separate operations:

usm_pipe[atomic_op_global] = tid;
atomic_op_global+=1;

The first line does a read, the second line does a read-write. There is no guarantee about ordering of these two calls between threads, so it will inevitably happen that two threads will read the same value in the first line and then will both atomically increment it in the second line. This will write to usm_pipe[n] twice concurrently (race condition) and result in the index jumping to n+2 (with the value n+1 never used).

What you’re looking for is:

usm_pipe[atomic_op_global.fetch_add(1)] = tid;

This results in a single atomic read-write, where the value is first used to index usm_pipe and then incremented.

Note that SYCL atomics closely mimic the C++ atomics. You code is equivalent to this pure C++ reproducer which has the same problem:

#include <thread>
#include <atomic>
#include <vector>
#include <cstdio>
#include <cassert>

int main() {
    constexpr static int N{100};
    int arr[N] = {};
    std::atomic_int i{0};
    std::vector<std::thread> threads;
    for (int t{0}; t<N; ++t) {
        threads.emplace_back([&i,&arr,t](){
            arr[i] = t+1;
            i+=1;
        });
    }
    for (std::thread& thread : threads) {
        thread.join();
    }
    for (int t{0}; t<N; ++t) {
        assert(arr[t]>0);
    }
}

This randomly fails the assertion in some runs. Analogously to the SYCL example, changing the thread lambda to:

arr[i.fetch_add(1)] = t+1;

makes this work reliably.

br-ko · 13 May 2024 21:55

Great! Thank you again for detailed explanation this clears the confusion and achieves the required results

Topic		Replies	Views
Atomics on AMD MI250x oneAPI for AMD GPUs	1	70	20 December 2024
Dose computeCpp support usm?	2	733	24 December 2020
Question about USM SYCL development	8	1395	1 June 2020
GPU optimisation of memory accesses	2	102	12 June 2024
Cuda_error_illegal_address oneAPI for NVIDIA GPUs	4	289	28 March 2024

Atomic Operation with USM

Related topics