Using nd_range with parallel_for does not work

Hello my SYCL friends, it’s me again,

I’ve been trying to launch a kernel using “nd_range”. I want to launch a kernel over an ND-range of 16×16 divided into work-groups of 4×4. The code below shows this:

#include <CL/sycl.hpp>
#include <iostream>

class orbital_function; // kernel name

int main() {

    constexpr int G = 16; // global range per dimension
    constexpr int L = 4;  // local (work-group) range per dimension
    float orbital[G * G];

    { // SYCL scope
        cl::sycl::gpu_selector device; // GPU device
        cl::sycl::queue Q{device};     // command queue with the selected device

        // Just for information: prints the device in use
        std::cout << "Running on "
                  << Q.get_device().get_info<cl::sycl::info::device::name>() << "\n";

        cl::sycl::buffer<float, 1> buffOrb{orbital, cl::sycl::range<1>(G * G)};

        Q.submit([&](cl::sycl::handler &cgh) {

            cgh.parallel_for<orbital_function>(
                cl::sycl::nd_range<2>(cl::sycl::range<2>{G, G}, cl::sycl::range<2>{L, L}),
                [=](cl::sycl::nd_item<2> index) {

                });

        });

    } // END SYCL scope

    return 0;
}

But when I run it, it aborts with this message:

Running on GeForce GTX 1060 with Max-Q Design
 terminate called after throwing an instance of 'cl::sycl::compile_program_error'
 Aborted (core dumped)

I compiled using:

compute++  -sycl-driver -sycl-target ptx64 work_items.cpp -I /usr/local/computecpp/include/ -lOpenCL -L /usr/local/computecpp/lib/ -lComputeCpp -o work_items.exe 

-Juan

Hi,

It’s not quite clear what the problem might be: compile_program_error indicates something might be wrong with the kernel, but the kernel is empty. I’d recommend adding error checking; you can look at some of the ComputeCpp SDK samples to see how that’s done.
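For reference, error checking in SYCL 1.2.1 is typically done by passing an async_handler to the queue and wrapping the submission in a try/catch; kernel build failures are reported asynchronously, so without a handler they surface only as the terminate you saw. A minimal sketch (the handler name is just for illustration):

```cpp
#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // An async_handler receives exceptions raised asynchronously,
  // e.g. kernel build failures reported by the runtime.
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (const std::exception_ptr &e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (const cl::sycl::exception &err) {
        std::cerr << "Async SYCL exception: " << err.what() << "\n";
      }
    }
  };

  try {
    cl::sycl::queue Q{cl::sycl::gpu_selector{}, exception_handler};
    Q.submit([&](cl::sycl::handler &cgh) { /* kernel here */ });
    Q.wait_and_throw(); // propagate asynchronous errors to the handler
  } catch (const cl::sycl::exception &err) {
    std::cerr << "Sync SYCL exception: " << err.what() << "\n";
    return 1;
  }
  return 0;
}
```

With this in place, a failed kernel build prints a message through the handler instead of aborting the process.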

Additionally, you can try enabling verbose output to get more information from the ComputeCpp runtime, see for example this post on how to enable it.

Have you observed different behavior depending on the ComputeCpp version used? Which version are you using at the moment?

Hello,

This is my ComputeCpp version:

Codeplay ComputeCpp - CE 2.0.0 Device Compiler

My problem arises when I use the

cl::sycl::nd_range

and

cl::sycl::nd_item

because when my kernel uses the

cl::sycl::range<1>(N), [=](cl::sycl::id<1> index)

everything works fine.

I’m gonna try what you mentioned before

-Juan

Hi,
Have you added error checking into your code? This will give a more specific error message that we can use to understand where the problem is happening. Peter has pointed to some advice there, but I would also recommend looking at the materials here, along with the sample code and accompanying video here.

Hello @peterzuzek @rod

I added the verbose mode and I obtained this:

ComputeCpp>  (CE 2.0.0)
ComputeCpp>  (Assuming platform supports SPIR-V)
ComputeCpp>  (Selected Platform: NVIDIA CUDA)
ComputeCpp>  (Selected Device: GeForce GTX 1060 with Max-Q Design)
Running on GeForce GTX 1060 with Max-Q Design
ComputeCpp> Error: [ComputeCpp:RT0100] Failed to build program
ComputeCpp> -> With OpenCL Code -9999: The Error string is not available
ComputeCpp> -> Triggered at: program_detail.cpp (567)
ComputeCpp>  (Build log for program 0x1638580 device 0 (size 69):
ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'


)
ComputeCpp> Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x1638580 device 0 (size 69):
ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'


>


)
ComputeCpp> -> With OpenCL Code -9999: The Error string is not available
ComputeCpp> -> Triggered at: program_detail.cpp (778)
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
Aborted (core dumped)

I added error checking as @rod suggested, but it produced no new messages.

Hello there.
The error is saying that the built-in function you are trying to use does not exist, so this code cannot execute on the NVIDIA GPU through OpenCL and PTX.

Not all OpenCL built-ins have been implemented for the ComputeCpp PTX implementation, which is why it is in experimental status, and we point developers towards the DPC++ for CUDA implementation we have done if they want to target NVIDIA GPUs.

One workaround, if you are determined to use ComputeCpp with this hardware, is to figure out which method call is triggering the error and find an alternative call that works.
Rod.

I’m going to try the DPC++ for CUDA. I really appreciate your help, Rod.

Additionally, I want to do what you mentioned: figure out which method call is triggering the error.

I’ll keep you informed.

Juan.

Hello @rod,

This is the source code I’ve been working on:

#include <CL/sycl.hpp>
#include <iostream>


class mat_mul;


// Example device selector, looks for an Intel GPU:
class intel_gpu_selector : public sycl::device_selector {
 public:
  int operator()(const cl::sycl::device& dev) const override {
    if (dev.has(cl::sycl::aspect::gpu)) {
      auto vendorName = dev.get_info<cl::sycl::info::device::vendor>();
      if (vendorName.find("Intel") != std::string::npos) {
        return 1;
      }
    }
    return -1;
  }
};

int main (){
    
    //auto Q = cl::sycl::queue{intel_gpu_selector{}}; //Intel GPU Works totally fine
    auto Q = cl::sycl::queue{cl::sycl::gpu_selector{}}; //Targeting NVIDIA GPU: here is the problem
    
    std::cout << "Chosen device: "  << Q.get_device().get_info<cl::sycl::info::device::name>()<<std::endl;
    std::cout<< "Max Work Group Size: "<< Q.get_device().get_info<cl::sycl::info::device::max_work_group_size>()<<std::endl;
    constexpr int N = 8;
    constexpr int M = 8;
    float c_mat[N*M];
    {
        //Buffer creation:
        cl::sycl::buffer<float, 1> buffc{c_mat, cl::sycl::range<1>{static_cast<size_t>(N*M)}};

        Q.submit([&](cl::sycl::handler &cgh){

            //Accessor creation:
            auto acc_matC = buffc.get_access<cl::sycl::access::mode::write>(cgh);

             // I think the problem is the use of "nd_range" with the NVIDIA GPU
            auto myRange = cl::sycl::nd_range<2>(cl::sycl::range<2> {8,8},cl::sycl::range<2> {4,4});
        
            cgh.parallel_for<mat_mul>(myRange,[=](cl::sycl::nd_item<2> it){
               
                 //NOTE: Even if I leave this part empty I got the same error********

                //Mapping  2D to 1D: 
                int work_item_indx = it.get_local_id(1)*it.get_local_range(0) + it.get_local_id(0);
                int work_group_indx = it.get_group(1)*it.get_group_range(0) + it.get_group(0);
                int index = (it.get_local_range(0)*it.get_local_range(1))*work_group_indx + work_item_indx;


                acc_matC[index] = index;

            });
        });
    }

    //Result:
    for(int i=0; i<N*M; i++)
        std::cout<<"matC = "<<c_mat[i]<<std::endl;


    return 0;
}

I added the verbose mode to my compilation and I obtained this:

ComputeCpp>  (CE 2.5.0 2021/04/23 )
ComputeCpp>  (Assuming platform supports SPIR-V )
ComputeCpp>  (Selected Platform: NVIDIA CUDA )
ComputeCpp>  (Selected Device: GeForce GTX 1060 with Max-Q Design )
Chosen device: GeForce GTX 1060 with Max-Q Design
Max Work Group Size: 1024
ComputeCpp>  (Creating program for binary nvptx64 )
ComputeCpp> Error: [ComputeCpp:RT0100] Failed to build program
ComputeCpp> -> With OpenCL Code -9999: The Error string is not available
ComputeCpp> -> Triggered at: program_detail.cpp (581)
ComputeCpp>  (Build log for program 0x22ab840 (size: 69) ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'


 )
ComputeCpp> Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 0x22ab840 (size: 69) ptxas fatal   : Unresolved extern function '_Z17get_global_offsetj'


>


 )
ComputeCpp> -> With OpenCL Code -9999: The Error string is not available
ComputeCpp> -> Triggered at: program_detail.cpp (792)
terminate called after throwing an instance of 'cl::sycl::compile_program_error'
Aborted (core dumped)

Compilation:

compute++ -sycl-driver -sycl-target ptx64 matmul.cpp -I /usr/local/computecpp25/include/ -lOpenCL -L /usr/local/computecpp25/lib/ -lComputeCpp -std=c++17 -o matmul.exe -DSYCL_LANGUAGE_VERSION=2020 -no-serial-memop

So, I did some research about this line:

ComputeCpp> -> With OpenCL Code -9999: The Error string is not available

and apparently the -9999 error comes from the use of the “clEnqueueNDRangeKernel()” OpenCL function.

Juan

There is no native support for a global offset in PTX, so this error is triggered by the use of nd_range. Your only workaround is to avoid using nd_range in this scenario.
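A minimal sketch of such a rewrite, assuming the buffer and accessor from your example (the kernel name is illustrative). It replaces the nd_range launch inside Q.submit with a basic parallel_for over a range<2>, the form you already confirmed works in the 1D case:

```cpp
// Same 8x8 launch as before, expressed as a basic parallel_for over a
// range<2>, avoiding nd_range (and thus the missing get_global_offset built-in).
auto myRange = cl::sycl::range<2>{8, 8};

cgh.parallel_for<class rewritten_kernel>(myRange, [=](cl::sycl::id<2> idx) {
  // Row-major linearisation of the 2D index; note this ordering differs
  // from the work-group-blocked ordering of the nd_range version.
  int index = idx[0] * 8 + idx[1];
  acc_matC[index] = index;
});
```

You lose the explicit work-group decomposition (and with it local memory and barriers), but for a kernel like this one, which only needs a unique index per work-item, the basic form is sufficient.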