Hi,
I have encountered what I think is an incorrect implementation of the hierarchical parallel_for_work_group / parallel_for_work_item construct (SYCL 1.2.1 specification, revision 5, section 4.8.5.3, "Parallel For hierarchical invoke").
According to the specifications (and common sense) the code in the outer lambda should be executed once per work group:
The body of the outer
parallel_for_work_group
call consists of a lambda function or function object. The body of this function object contains code that is executed only once for the entire work-group. If the code has no side-effects and the compiler heuristic suggests that it is more efficient to do so, this code will be executed for each work-item.
Instead, in my tests the code in the outer lambda appears to be executed once per work-item.
Here is an example showing this issue:
#include <SYCL/sycl.hpp>

#include <cstdio>
#include <iostream>

class count_groups;

int main() {
  // data size
  const int size = 10000;

#if defined SYCL_TARGET_HOST
  // select the host device
  auto device = cl::sycl::device(cl::sycl::host_selector());
#elif defined SYCL_TARGET_SPIR
  // select the default device
  auto device = cl::sycl::device(cl::sycl::default_selector());
#endif

  // get the maximum work group size and determine the number of work groups
  size_t work_group_size = device.get_info<cl::sycl::info::device::max_work_group_size>();
  size_t num_work_groups = (size + work_group_size - 1) / work_group_size;
  std::cout << "Compute configuration for " << size << " elements:" << std::endl;
  std::cout << "  work group size:  " << work_group_size << std::endl;
  std::cout << "  number of groups: " << num_work_groups << std::endl;

  // construct a SYCL queue for the selected device
  auto queue = cl::sycl::queue(device);

  // SYCL buffer backing the group counter
  int counter = 0;
  auto counter_buf = cl::sycl::buffer<int>(&counter, 1);

  // submit the kernel to the queue
  queue.submit([&](cl::sycl::handler &cgh) {
    // access the SYCL buffer from the device kernel
    auto counter_d = counter_buf.get_access<cl::sycl::access::mode::atomic>(cgh);
    // launch the kernel
    cgh.parallel_for_work_group<class count_groups>(
        cl::sycl::range<1>{num_work_groups},
        cl::sycl::range<1>{work_group_size},
        [=](cl::sycl::group<1> group) {
          // print the id of each group; this should happen once per work-group
          printf("group id: %zu\n", group.get_id(0));
          // print the id of each thread
          group.parallel_for_work_item([&](cl::sycl::h_item<1> item) {
            printf("global thread id: %zu\n", item.get_global_id(0));
          });
          // count how many times the work-group scope is executed
          cl::sycl::atomic_fetch_add(counter_d[0], 1);
        });
  });

  // the host accessor waits for the kernel to complete
  auto counter_h = counter_buf.get_access<cl::sycl::access::mode::read>();
  std::cout << "Number of group iterations: " << counter_h[0]
            << " (should be " << num_work_groups << ")" << std::endl;
}
Built with
/opt/ComputeCpp/bin/compute++ -sycl-driver -no-serial-memop -O2 -std=c++17 -g -I /opt/ComputeCpp/include -L /opt/ComputeCpp/lib -lComputeCpp -DSYCL_TARGET_HOST test.cc -o test-sycl-host
to force the use of the host “device”, it works as expected:
Compute configuration for 10000 elements:
work group size: 1024
number of groups: 10
group id: 0
global thread id: 0
global thread id: 1
global thread id: 2
...
global thread id: 1022
global thread id: 1023
group id: 1
global thread id: 1024
global thread id: 1025
...
global thread id: 10238
global thread id: 10239
Number of group iterations: 10 (should be 10)
Built with
/opt/ComputeCpp/bin/compute++ -sycl-driver -no-serial-memop -O2 -std=c++17 -g -I /opt/ComputeCpp/include -L /opt/ComputeCpp/lib -lComputeCpp -DSYCL_TARGET_SPIR test.cc -o test-sycl-spir
to run on an OpenCL device, it gives an unexpected result:
Compute configuration for 10000 elements:
work group size: 256
number of groups: 40
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
group id: 3
...
group id: 16
group id: 16
group id: 16
group id: 16
group id: 16
global thread id: 2336
global thread id: 2337
global thread id: 2338
global thread id: 2339
global thread id: 2340
...
global thread id: 9436
global thread id: 9437
global thread id: 9438
global thread id: 9439
Number of group iterations: 10240 (should be 40)
In fact, the result is 256 times the expected one, and each group id is printed 256 times, once per work-item in the group:
cat device.log | grep 'group id:' | sort -V | uniq -c
256 group id: 0
256 group id: 1
256 group id: 2
256 group id: 3
256 group id: 4
256 group id: 5
256 group id: 6
256 group id: 7
256 group id: 8
256 group id: 9
256 group id: 10
256 group id: 11
256 group id: 12
256 group id: 13
256 group id: 14
256 group id: 15
256 group id: 16
256 group id: 17
256 group id: 18
256 group id: 19
256 group id: 20
256 group id: 21
256 group id: 22
256 group id: 23
256 group id: 24
256 group id: 25
256 group id: 26
256 group id: 27
256 group id: 28
256 group id: 29
256 group id: 30
256 group id: 31
256 group id: 32
256 group id: 33
256 group id: 34
256 group id: 35
256 group id: 36
256 group id: 37
256 group id: 38
256 group id: 39
I am currently using ComputeCpp Community Edition version 1.1.2 on Ubuntu 18.04, with the latest Intel OpenCL drivers:
$ computecpp_info
********************************************************************************
ComputeCpp Info (CE 1.1.2)
SYCL 1.2.1 revision 3
********************************************************************************
Toolchain information:
GLIBC version: 2.27
GLIBCXX: 20160609
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 2 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : UNTESTED - Untested OS
CL_DEVICE_NAME : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
CL_DEVICE_VENDOR : Intel(R) Corporation
CL_DRIVER_VERSION : 18.1.0.0920
CL_DEVICE_TYPE : CL_DEVICE_TYPE_CPU
--------------------------------------------------------------------------------
Device 1:
Device is supported : UNTESTED - Untested OS
CL_DEVICE_NAME : Intel(R) Gen9 HD Graphics NEO
CL_DEVICE_VENDOR : Intel(R) Corporation
CL_DRIVER_VERSION : 19.19.12968
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v1.1.2/platform-support-notes
********************************************************************************
Please let me know if I can provide any further information.
Thank you,
.Andrea