Working example on Windows 10 + Nvidia

I tried to make a basic comparison application to see the difference in computation speed of a vector add-and-multiply on the CPU vs. an Nvidia GPU.

This is what I have: I take 2 x 250 * 1024 * 204 input numbers, add them, multiply by 0.3f, and write the result.

I tried this on the Intel built-in GPU, and on Nvidia using the compute++ -sycl -sycl-target ptx64 option.
However, the results confuse me on all targets:

  • first of all, on the CPU (using cpu_selector), it sometimes runs in 3-10 seconds (hot/cold) and sometimes reports:

      Run on CPU selection
      Vector adding 256000k elements
      Error: [ComputeCpp:RT0107] Failed to create program from binary
      Time: 1288ms
  • using the gpu_selector, it should run on the P600 (see below), but it always reports:

      error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 000002A1FAD58B60 device 0 (size 65):
      error   : Binary format for key='0', ident='' is not recognized

Even though I added: -sycl -sycl-target ptx64

Is there any example for Windows that actually works out of the box?

SYCL 1.2.1 revision 3


Device Info:

Discovered 3 devices matching:
platform    : <any>
device type : <any>
Device 0:

  Device is supported                     : NO - Device does not support SPIR
  CL_DEVICE_NAME                          : Quadro P600
  CL_DEVICE_VENDOR                        : NVIDIA Corporation
  CL_DRIVER_VERSION                       : 416.78
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
Device 1:

  Device is supported                     : UNTESTED - Device not tested on this OS
  CL_DEVICE_NAME                          : Intel(R) UHD Graphics 630
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       :
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
Device 2:

  Device is supported                     : UNTESTED - Device running untested driver
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       :
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:


I’m unsure why you are seeing these issues with your devices. What I would recommend is printing out the device being used, to validate that it is the one you are expecting. There’s some code that shows how to do this in the sample code here

For the issue you are seeing with the NVidia GPU, can you open the .sycl file that is generated during the build stage and check that it shows the correct instruction set?
At around line 34 you’ll see something like this:

unsigned char SYCL_hello_world_cpp_bin_spir64[] = {

This will show if it has been built for spir32, spir64 or ptx.
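As a side note, that check can be automated with a small standard-C++ helper that searches the generated file's text for the symbol suffix. This is only a sketch: `detectSyclTarget` is a made-up name, and the suffix list is taken from the symbol names shown in this thread, not from any ComputeCpp API.

```cpp
#include <initializer_list>
#include <string>

// Scan the text of a generated .sycl file for the target suffix embedded
// in the binary symbol name (e.g. "SYCL_main_cpp_bin_nvptx64").
std::string detectSyclTarget(const std::string& syclFileText)
{
    for (const char* suffix : { "_bin_spir64", "_bin_spir32", "_bin_nvptx64" })
    {
        if (syclFileText.find(suffix) != std::string::npos)
            return suffix + 5; // skip the "_bin_" prefix, leaving e.g. "nvptx64"
    }
    return "unknown";
}
```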

Let me know what you find out from there.

Thanks! That was exactly what I was looking for!
It turns out really weird things happen if you create more than one queue. Even if the first queue leaves scope before you make the second, it seems to mess up the device selection, or cause other weirdness; I’m getting inconsistent error messages.

It now works fine on:

  • Intel GPU
  • Native CPU
  • Host device

But still not on Nvidia. I checked the .sycl file, and it does contain:
unsigned char SYCL_main_cpp_bin_nvptx64[] = {…

However, running it on Nvidia result in:

D:\>"C:\Users\Jan Wilmans\source\repos\ComputeCpp SYCL C++1\x64\Release\ComputeCpp SYCL C++1.exe" 0
Allocate memory...
0) gpu: Quadro P600
Running on Quadro P600
Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 000001A8FDAB0F80 device 0 (size 65):
error   : Binary format for key='0', ident='' is not recognized>)

This is a very generic error message, I guess, so I have no idea why that is happening. However, no matter if I pass -sycl-target ptx64 or not, the error message is the same, so I think it’s erroring out before it gets to the point of actually executing anything on the GPU.



That should work. Do you have a code snippet that demonstrates how you are setting up the queues and device selectors?

yes, here it is:

#include <CL/sycl.hpp>
#include <iostream>
#include <chrono>
#include <memory>
#include <string>

using namespace std::chrono_literals;
using namespace cl::sycl;

static const int dataSize = 256 * 1024;

struct InData
{
	float dataA[dataSize];
	float dataB[dataSize];
};

struct OutData
{
	float data[dataSize];
};

void TestVectorAdd(queue& myQueue, InData& in, OutData& out)
{
	try
	{
		buffer<float, 1> inputBufferA(in.dataA, range<1>(dataSize));
		buffer<float, 1> inputBufferB(in.dataB, range<1>(dataSize));
		buffer<float, 1> outputBuffer(out.data, range<1>(dataSize));

		myQueue.submit([&](handler &cgh) {

			auto inputPtrA = inputBufferA.get_access<access::mode::read>(cgh);
			auto inputPtrB = inputBufferB.get_access<access::mode::read>(cgh);
			auto outputPtr = outputBuffer.get_access<access::mode::write>(cgh);

			cgh.parallel_for<class vector_add>(
				nd_range<3>(range<3>(16, 8, 8), range<3>(4, 2, 2)),
				[=](nd_item<3> item) {

					size_t idx = item.get_global_linear_id();
					outputPtr[idx] = cl::sycl::exp((cl::sycl::exp(inputPtrA[idx]) / cl::sycl::exp(0.7f)) + (cl::sycl::exp(inputPtrB[idx]) / cl::sycl::exp(0.4f)));
					outputPtr[1] = cl::sycl::exp((inputPtrA[idx] / 0.7f) + (inputPtrB[idx] / 0.4f)); // note: every work-item writes outputPtr[1], which is a data race
				});
		});
	}
	catch (const exception& e)
	{
		std::cout << e.what() << "\n";
	}
}
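A side note on the kernel itself, separate from the build error: the nd_range launches far fewer work-items than there are elements, so even when the kernel runs, most of the output buffer is never written. The arithmetic, as a compile-time sanity check (all numbers copied from the code above):

```cpp
constexpr int dataSize    = 256 * 1024; // elements per buffer (262144)
constexpr int globalItems = 16 * 8 * 8; // global nd_range size: 1024 work-items
constexpr int localItems  = 4 * 2 * 2;  // work-group size: 16

static_assert(globalItems % localItems == 0, "global size must be divisible by local size");
static_assert(dataSize / globalItems == 256, "only 1/256th of the output buffer is written");
```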

std::string to_string(info::device_type d)
{
	switch (d)
	{
	case info::device_type::cpu:
		return "cpu";
	case info::device_type::gpu:
		return "gpu";
	case info::device_type::accelerator:
		return "accelerator";
	case info::device_type::custom:
		return "custom";
	case info::device_type::host:
		return "host";
	default:
		return "<not_implemented>";
	}
}

int selectedId = 0;

class custom_selector : public device_selector {
public:
	custom_selector() : device_selector() {}

	int operator()(const device& device) const override
	{
		if (count == selectedId)
		{
			std::cout << count << ") " << to_string(device.get_info<info::device::device_type>());
			std::cout << ": " << device.get_info<info::device::name>() << "\n";
			return 100;
		}
		return 50 + count;
	}

	// note: count is never incremented, which is part of what makes this selector unreliable
	mutable int count = 0;
};

void testCustom()
{
	custom_selector selector;
	queue myQueue(selector);
	myQueue.submit([&](handler &cgh) {});
}

// -sycl-target ptx64

int main(int argc, char *argv[])
{
	if (argc > 1)
		selectedId = std::stoi(argv[1]);

	std::cout << "Allocate memory...\n";

	auto inputDynamic = std::make_unique<InData>();
	auto outputDynamic = std::make_unique<OutData>();
	InData& input = *inputDynamic;
	OutData& output = *outputDynamic;

	/* Initialize input data with values and output data with zeroes. */
	for (int i = 0; i < dataSize; i++)
	{
		input.dataA[i] = (float)i;
		input.dataB[i] = (float)(1024 - (i % 2014));
		output.data[i] = 0.0f;
	}

	custom_selector selector;

	{
		queue myQueue2(selector);	// this messes up the behaviour; commenting it out makes it work correctly
	}

	queue myQueue(selector);
	std::cout << "Running on " << myQueue.get_device().get_info<cl::sycl::info::device::name>() << "\n";

	auto tcpu1 = std::chrono::system_clock::now();

	for (int i = 0; i < 10000; ++i)
		TestVectorAdd(myQueue, input, output);

	auto tcpu2 = std::chrono::system_clock::now();
	std::cout << "\nTime: " << std::chrono::duration_cast<std::chrono::milliseconds>(tcpu2 - tcpu1).count() << "ms\n";
}

OK, I figured out why this second queue was behaving differently: I was messing it up with my hacky custom_selector. What I would like to know is the best way to list all devices and select exactly one from the command line.

Something like this should do the job:

std::vector<device> devices;
// Get the list of platforms
std::vector<platform> platforms = platform::get_platforms();
// Enumerate the devices on each platform
for (unsigned int i = 0; i < platforms.size(); i++)
{
    std::vector<device> plat_devices = platforms[i].get_devices();
    devices.insert(devices.end(), plat_devices.begin(), plat_devices.end());
}

Then parse the index from the command line and pick the corresponding entry from the devices vector.
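That parsing step can be sketched in plain C++; `pickIndex` is a made-up helper name for illustration, and the real code would index into the devices vector built above:

```cpp
#include <stdexcept>
#include <string>

// Parse a device index from an argv-style string and check it against the
// number of enumerated devices. Throws on malformed or out-of-range input.
int pickIndex(const std::string& arg, int deviceCount)
{
    int idx = std::stoi(arg); // throws std::invalid_argument on non-numeric input
    if (idx < 0 || idx >= deviceCount)
        throw std::out_of_range("device index out of range");
    return idx;
}
```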


That was what I was looking for! great help, thanks!

Weird thing: in the debugger I don’t see the CPU device, while running outside the debugger, I do see it?

Device 0:

  Device is supported                     : UNTESTED - Device running untested driver
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       :
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU

I’m not seeing the same behaviour. I’m invoking the debug version through Visual Studio and when I use:

for (auto dev : cl::sycl::device::get_devices())
    std::cout << dev.get_info<cl::sycl::info::device::name>() << "\n";

It prints out all the devices including the CPU.

Testing on a different machine at work, I don’t see it either. I will try to narrow it down when I get back home; at least now I know it’s not the expected behavior.

I have come across an issue when using CMake: the default CMake behavior is to use the “RelWithDebInfo” configuration when requesting Release builds. However, this triggers the Debug layout for STL types, while the API types coming out of ComputeCpp are proper Debug/Release types. Trying to query a std::vector<cl::sycl::XXX> with the wrong layout can result in cryptic behavior. Make sure your Debug build is using the Debug ComputeCpp libraries, and your Release build the Release ones.
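One way to make that Debug/Release pairing explicit in CMake is with generator expressions. This is only a sketch: the library file names and the ComputeCpp_DIR variable below are placeholders, so substitute whatever your ComputeCpp SDK actually ships.

```cmake
# Sketch only: "ComputeCpp.lib" / "ComputeCpp_d.lib" are placeholder names --
# check the lib/ directory of your ComputeCpp installation for the real ones.
target_link_libraries(myapp PRIVATE
    $<$<CONFIG:Debug>:${ComputeCpp_DIR}/lib/ComputeCpp_d.lib>
    $<$<NOT:$<CONFIG:Debug>>:${ComputeCpp_DIR}/lib/ComputeCpp.lib>)
```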
