Incorrect behaviour of parallel_for_work_group

Hi,
I have encountered what I think is an incorrect implementation of the hierarchical parallel_for_work_group / parallel_for_work_item construct (SYCL Specification version 1.2.1 revision 5, section 4.8.5.3, Parallel For hierarchical invoke).

According to the specifications (and common sense) the code in the outer lambda should be executed once per work group:

The body of the outer parallel_for_work_group call consists of a lambda function or function object. The body of this function object contains code that is executed only once for the entire work-group. If the code has no side-effects and the compiler heuristic suggests that it is more efficient to do so, this code will be executed for each work-item.

Instead in my tests it looks like the code in the outer lambda is executed once per work item.

Here is an example showing this issue:

#include <SYCL/sycl.hpp>

#include <cstdio>
#include <iostream>

class count_groups;

int main() {
  // data size
  const int size = 10000;

#if defined SYCL_TARGET_HOST
  // select the host device
  auto device = cl::sycl::device(cl::sycl::host_selector());
#elif defined SYCL_TARGET_SPIR
  // select the default device
  auto device = cl::sycl::device(cl::sycl::default_selector());
#endif

  // get the number of compute units, threads per block, and determine the total number of threads
  size_t work_group_size = device.get_info<cl::sycl::info::device::max_work_group_size>();
  size_t num_work_groups = (size + work_group_size - 1) / work_group_size;
  std::cout << "Compute configuration for " << size << " elements:" << std::endl;
  std::cout << "  work group size:  " << work_group_size << std::endl;
  std::cout << "  number of groups: " << num_work_groups << std::endl;

  // construct a SYCL queue for the selected device
  auto queue = cl::sycl::queue(device);

  // SYCL buffers
  int counter = 0;
  auto counter_buf = cl::sycl::buffer<int>(& counter, 1);

  // submit the kernel to the queue
  queue.submit([&](cl::sycl::handler &cgh) {
    // access the SYCL buffers from the device kernel
    auto counter_d = counter_buf.get_access<cl::sycl::access::mode::atomic>(cgh);

    // launch the kernel
    cgh.parallel_for_work_group<class count_groups>(
        cl::sycl::range<1>{num_work_groups},
        cl::sycl::range<1>{work_group_size},
        [=](cl::sycl::group<1> group) {

      // print the id of all the groups
      printf("group id: %lu\n", group.get_id(0));

      // print the id of all the threads
      group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
        printf("global thread id: %zu\n", item.get_global_id(0));
      });

      cl::sycl::atomic_fetch_add(counter_d[0], 1);
    });
  });

  auto counter_h = counter_buf.get_access<cl::sycl::access::mode::read>();
  std::cout << "Number of group iterations: " << counter_h[0] << " (should be " << num_work_groups << ")" << std::endl;
}

Built with

/opt/ComputeCpp/bin/compute++ -sycl-driver -no-serial-memop -O2 -std=c++17 -g -I /opt/ComputeCpp/include -L /opt/ComputeCpp/lib -lComputeCpp -DSYCL_TARGET_HOST test.cc -o test-sycl-host

to force using the host “device”, it works as expected:

Compute configuration for 10000 elements:
  work group size:  1024
  number of groups: 10
group id: 0
global thread id: 0
global thread id: 1
global thread id: 2
...
global thread id: 1022
global thread id: 1023
group id: 1
global thread id: 1024
global thread id: 1025
...
global thread id: 10238
global thread id: 10239
Number of group iterations: 10 (should be 10)

Built with

/opt/ComputeCpp/bin/compute++ -sycl-driver -no-serial-memop -O2 -std=c++17 -g -I /opt/ComputeCpp/include -L /opt/ComputeCpp/lib -lComputeCpp -DSYCL_TARGET_SPIR test.cc -o test-sycl-spir

to run on an OpenCL device, it gives the unexpected result:

Compute configuration for 10000 elements:
  work group size:  256
  number of groups: 40
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
...
group id: 16
group id: 16
group id: 16
group id: 16
group id: 16
global thread id: 2336
global thread id: 2337
global thread id: 2338
global thread id: 2339
global thread id: 2340
...
global thread id: 9436
global thread id: 9437
global thread id: 9438
global thread id: 9439
Number of group iterations: 10240 (should be 40)

In fact, the result is 256 times the expected one, and each group id is printed 256 times, as many as there are work items:

cat device.log | grep 'group id:' | sort -V | uniq -c
    256 group id: 0
    256 group id: 1
    256 group id: 2
    256 group id: 3
    256 group id: 4
    256 group id: 5
    256 group id: 6
    256 group id: 7
    256 group id: 8
    256 group id: 9
    256 group id: 10
    256 group id: 11
    256 group id: 12
    256 group id: 13
    256 group id: 14
    256 group id: 15
    256 group id: 16
    256 group id: 17
    256 group id: 18
    256 group id: 19
    256 group id: 20
    256 group id: 21
    256 group id: 22
    256 group id: 23
    256 group id: 24
    256 group id: 25
    256 group id: 26
    256 group id: 27
    256 group id: 28
    256 group id: 29
    256 group id: 30
    256 group id: 31
    256 group id: 32
    256 group id: 33
    256 group id: 34
    256 group id: 35
    256 group id: 36
    256 group id: 37
    256 group id: 38
    256 group id: 39

I am currently using ComputeC++ Community Edition version 1.1.2 on Ubuntu 18.04, with the lates Intel OpenCL drivers:

$ computecpp_info 
********************************************************************************

ComputeCpp Info (CE 1.1.2)

SYCL 1.2.1 revision 3

********************************************************************************

Toolchain information:

GLIBC version: 2.27
GLIBCXX: 20160609
This version of libstdc++ is supported.

********************************************************************************


Device Info:

Discovered 2 devices matching:
  platform    : <any>
  device type : <any>

--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : UNTESTED - Untested OS
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 18.1.0.0920
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU 
--------------------------------------------------------------------------------
Device 1:

  Device is supported                     : UNTESTED - Untested OS
  CL_DEVICE_NAME                          : Intel(R) Gen9 HD Graphics NEO
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 19.19.12968
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU 

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v1.1.2/platform-support-notes

********************************************************************************

Please let me know if I can provide any further information.

Thank you,
.Andrea

Hi Andrea,
You have stumbled across a bug. Thanks for the detailed and clear repro case.
I’ve raised an internal issue (ref 847) to get this resolved. I’ll update you on any progress with it.
Thanks,
Rod.