Atomic Operation with USM

Hello,
We are using the NVIDIA A100 GPU with Codeplay plugin for SYCL. we are having hard time implementing the atomics since it gives wrong results. This is the basic version of the code we would like to implement where if the usm_updating_mask element is 1 , the index of it will be pushed to the usm_pipe

for( int tid =0; tid < no_of_nodes;tid++){
              char condition = usm_updating_mask[tid];
              if(condition){
                usm_pipe[iter] = tid;
                iter++; 
              }
            } 
          


          d_over[0] = iter;

So our initial attempt was this but our results are not valid :

    const size_t local_size = 16;  // Number of work-items per work-group
    const size_t global_size = ((no_of_nodes + local_size - 1) / local_size) * local_size;

    // Setup the range
    nd_range<1> range(global_size, local_size);

        auto e = q.parallel_for<class PipeGenerator>(range, [=](nd_item<1> item) [[intel::kernel_args_restrict]] {
          int tid = item.get_global_linear_id();
          sycl::atomic_ref<unsigned int, sycl::memory_order_relaxed,
        sycl::memory_scope_device,sycl::access::address_space::global_space>
        atomic_op_global(d_over[0]);
          if (tid < no_of_nodes) {
            
              char condition = usm_updating_mask[tid];
              if(condition){
                usm_pipe[atomic_op_global] = tid;
                atomic_op_global+=1; 
              }
            } 
        

        });

Is it valid to update atomic_op_global and also at the same time access it ?
Thank you

Yes, it is valid to do this at the same time, but your code is not doing that. These are two separate operations:

usm_pipe[atomic_op_global] = tid;
atomic_op_global+=1; 

The first line does a read, the second line does a read-write. There is no guarantee about ordering of these two calls between threads, so it will inevitably happen that two threads will read the same value in the first line and then will both atomically increment it in the second line. This will write to usm_pipe[n] twice concurrently (race condition) and result in the index jumping to n+2 (with the value n+1 never used).

What you’re looking for is:

usm_pipe[atomic_op_global.fetch_add(1)] = tid;

This results in a single atomic read-write, where the value is first used to index usm_pipe and then incremented.

Note that SYCL atomics closely mimic the C++ atomics. You code is equivalent to this pure C++ reproducer which has the same problem:

#include <thread>
#include <atomic>
#include <vector>
#include <cstdio>
#include <cassert>

int main() {
    constexpr static int N{100};
    int arr[N] = {};
    std::atomic_int i{0};
    std::vector<std::thread> threads;
    for (int t{0}; t<N; ++t) {
        threads.emplace_back([&i,&arr,t](){
            arr[i] = t+1;
            i+=1;
        });
    }
    for (std::thread& thread : threads) {
        thread.join();
    }
    for (int t{0}; t<N; ++t) {
        assert(arr[t]>0);
    }
}

This randomly fails the assertion in some runs. Analogously to the SYCL example, changing the thread lambda to:

arr[i.fetch_add(1)] = t+1;

makes this work reliably.

1 Like

Great! Thank you again for detailed explanation this clears the confusion and achieves the required results

1 Like