Working example on Windows 10 + Nvidia

janwilmans · 5 April 2019 11:18

I tried to write a basic comparison application to measure the difference in speed of a vector add and multiply on the CPU versus an Nvidia GPU.

This is what I have: I take 2 x 250 * 1024 * 1024 input numbers, add them, multiply by 0.3f, and write the result.
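For reference, the computation described above can be sketched serially on the host in plain C++. This is only an illustration of the arithmetic, not the code from the post; the function name `add_then_scale` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not from the original post):
// out[i] = (a[i] + b[i]) * scale, computed serially on the CPU.
std::vector<float> add_then_scale(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  float scale = 0.3f)
{
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = (a[i] + b[i]) * scale;
    return out;
}
```

A serial version like this can serve as a baseline when checking the results a device produces.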

I tried this on the Intel built-in GPU, and on Nvidia using the compute++ -sycl -sycl-target ptx64 options.
However, the results confuse me on all targets.

First of all, on the CPU (using cpu_selector) it sometimes runs, in 3-10 seconds (hot/cold), and sometimes reports:

  Run on CPU selection
  Vector adding 256000k elements
  Error: [ComputeCpp:RT0107] Failed to create program from binary
  Time: 1288ms

Using the gpu_selector, it should run on the P600 (see below), but it always reports:

  error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 000002A1FAD58B60 device 0 (size 65):
  error   : Binary format for key='0', ident='' is not recognized

Even though I added: -sycl -sycl-target ptx64

Is there any example for Windows that actually works out of the box?

-----
SYCL 1.2.1 revision 3

********************************************************************************


Device Info:

Discovered 3 devices matching:
platform    : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:

  Device is supported                     : NO - Device does not support SPIR
  CL_DEVICE_NAME                          : Quadro P600
  CL_DEVICE_VENDOR                        : NVIDIA Corporation
  CL_DRIVER_VERSION                       : 416.78
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 1:

  Device is supported                     : UNTESTED - Device not tested on this OS
  CL_DEVICE_NAME                          : Intel(R) UHD Graphics 630
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 24.20.100.6287
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_GPU
--------------------------------------------------------------------------------
Device 2:

  Device is supported                     : UNTESTED - Device running untested driver
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 7.6.0.716
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU

If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v1.1.0/platform-support-notes

********************************************************************************

rod · 5 April 2019 14:26

I’m unsure why you are seeing these issues with your devices. I would recommend printing out the device being used, to confirm it is the one you expect. There’s some code that shows how to do this in the sample code here.

For the issue you are seeing with the Nvidia GPU, can you open the .sycl file that is generated during the build stage and check that it shows the correct instruction set?
At around line 34 you’ll see something like this:

unsigned char SYCL_hello_world_cpp_bin_spir64[] = {

This will show if it has been built for spir32, spir64 or ptx.
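The manual check described above could also be automated with a few lines of standard C++. This is a rough sketch that assumes the generated file declares its embedded binary on a line of the form `unsigned char ..._bin_<target>[] = {`, as in the line quoted here; the function name `find_embedded_target` is hypothetical, and the layout of the generated file may differ between ComputeCpp versions.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: scans the text of a generated .sycl file and returns
// whatever sits between the last "_bin_" and the following "[" on the first
// "unsigned char" line that has both, e.g. "spir64" or "nvptx64".
// Returns an empty string if no such line is found.
std::string find_embedded_target(const std::string& headerText)
{
    const std::string marker = "_bin_";
    std::istringstream in(headerText);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("unsigned char") == std::string::npos) continue;
        const std::string::size_type start = line.rfind(marker);
        if (start == std::string::npos) continue;
        const std::string::size_type begin = start + marker.size();
        const std::string::size_type end = line.find('[', begin);
        if (end == std::string::npos) continue;
        return line.substr(begin, end - begin);
    }
    return std::string();
}
```

To use it, read the .sycl file into a std::string (for example through std::ifstream and std::stringstream) and pass the contents in.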

Let me know what you find out from there.

janwilmans · 5 April 2019 16:38

Thanks! That was exactly what I was looking for!
It turns out really weird things happen if you create more than one queue. Even if the first queue leaves scope before you create the second, it seems to mess up the device selection, or cause other weirdness; I’m getting inconsistent error messages.

It now works fine on:

Intel GPU
Native CPU
Host device

But still not on Nvidia. I checked the .sycl file, and it does contain:
unsigned char SYCL_main_cpp_bin_nvptx64[] = {…

However, running it on Nvidia results in:

D:\>"C:\Users\Jan Wilmans\source\repos\ComputeCpp SYCL C++1\x64\Release\ComputeCpp SYCL C++1.exe" 0
Allocate memory...
0) gpu: Quadro P600
Running on Quadro P600
Error: [ComputeCpp:RT0100] Failed to build program (<Build log for program 000001A8FDAB0F80 device 0 (size 65):
error   : Binary format for key='0', ident='' is not recognized>)

This is a very generic error message, I guess, so I have no idea why it is happening. However, the error message is the same whether or not I pass -sycl-target ptx64, so I think it’s erroring out before it gets to the point of actually executing anything on the GPU.

Greetings,

Jan

rod · 5 April 2019 16:43

That should work. Do you have a code snippet that demonstrates how you are setting up the queues and device selectors?

janwilmans · 5 April 2019 16:56

Yes, here it is:

#include <CL/sycl.hpp>
#include <iostream>
#include <chrono>
#include <memory>  // std::make_unique
#include <string>

using namespace std::chrono_literals;
using namespace cl::sycl;

static const int dataSize = 256 * 1024;

struct InData
{
	float dataA[dataSize];
	float dataB[dataSize];
};

struct OutData
{
	float data[dataSize];
};

void TestVectorAdd(queue& myQueue, InData& in, OutData& out)
{
	try 
	{
		buffer<float, 1> inputBufferA(in.dataA, range<1>(dataSize));
		buffer<float, 1> inputBufferB(in.dataB, range<1>(dataSize));
		buffer<float, 1> outputBuffer(out.data, range<1>(dataSize));

		myQueue.submit([&](handler &cgh) {

			auto inputPtrA = inputBufferA.get_access<access::mode::read>(cgh);
			auto inputPtrB = inputBufferB.get_access<access::mode::read>(cgh);
			auto outputPtr = outputBuffer.get_access<access::mode::write>(cgh);

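			// Note: this nd_range launches 16 * 8 * 8 = 1024 work-items (work-groups of 4 * 2 * 2),
			// so only the first 1024 of the dataSize elements are processed.
			cgh.parallel_for<class vector_add>(
				nd_range<3>(range<3>(16, 8, 8), range<3>(4, 2, 2)),
				[=](nd_item<3> item) {

					size_t idx = item.get_global_linear_id();
					outputPtr[idx] = cl::sycl::exp((cl::sycl::exp(inputPtrA[idx]) / cl::sycl::exp(0.7f)) + (cl::sycl::exp(inputPtrB[idx]) / cl::sycl::exp(0.4f)));
					// Note: every work-item also writes element 1, so its final value is unspecified.
					outputPtr[1] = cl::sycl::exp((inputPtrA[idx] / 0.7f) + (inputPtrB[idx] / 0.4f));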
				});
		});
	}
	catch (const exception& e)
	{
		std::cout << e.what() << "\n";
		throw;
	}
}

std::string to_string(info::device_type d)
{
	switch (d)
	{
	case info::device_type::cpu:
		return "cpu";
	case info::device_type::gpu:
		return "gpu";
	case info::device_type::accelerator:
		return "accelerator";
	case info::device_type::custom:
		return "custom";
	case info::device_type::host:
		return "host";
	default:
		return "<not_implemented>";
	}
}

int selectedId = 0;

class custom_selector : public device_selector {
public:
	custom_selector() : device_selector() {}

	int operator()(const device& device) const override {

		if (count == selectedId)
		{
			std::cout << count << ") " << to_string(device.get_info<info::device::device_type>());
			std::cout << ": " << device.get_info<info::device::name>() << "\n";
			++count;
			return 100;
		}
		++count;
		return 50 + count;
	}
	mutable int count = 0;
};

void testCustom()
{
	custom_selector selector;
	queue myQueue(selector);
	myQueue.submit([&](handler &cgh) {});
}


// -sycl-target ptx64

int main(int argc, char *argv[]) {
	
	if (argc > 1)
	{
		selectedId = std::stoi(argv[1]);
	}

	std::cout << "Allocate memory...\n";

	auto inputDynamic = std::make_unique<InData>();
	auto outputDynamic = std::make_unique<OutData>();
	InData& input = *inputDynamic;
	OutData& output = *outputDynamic;

	/* Initialize input data with values and output data with zeroes. */
	for (int i = 0; i < dataSize; i++) {
		input.dataA[i] = (float)i;
		input.dataB[i] = (float)(1024 - (i % 2014));
		output.data[i] = 0.0f;
	}

	custom_selector selector;

	{
		queue myQueue2(selector);		// this messes up the behaviour; commenting it out makes it work correctly
	}

	queue myQueue(selector);
	std::cout << "Running on " << myQueue.get_device().get_info<cl::sycl::info::device::name>() << "\n";

	auto tcpu1 = std::chrono::system_clock::now();

	for (int i = 0; i < 10000; ++i)
		TestVectorAdd(myQueue, input, output);
	auto tcpu2 = std::chrono::system_clock::now();
	std::cout << "\nTime: " << std::chrono::duration_cast<std::chrono::milliseconds>(tcpu2 - tcpu1).count() << "ms\n";
}

janwilmans · 6 April 2019 09:49

OK, I figured out why the second queue was behaving differently: I was messing it up with my hacky custom_selector. What is the best way to list all devices and select exactly one from the command line?
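One plausible reading of why the selector misbehaves with a second queue is that its `count` member keeps growing across selections. The sketch below simulates that in plain C++, with no SYCL involved. It reuses the scoring logic of the custom_selector posted earlier, but `counting_selector` and `pick_device` are hypothetical stand-ins, and it assumes the runtime scores each device exactly once, in enumeration order, and keeps the highest score. A real runtime may call the selector in a different order or more often.

```cpp
#include <cassert>
#include <cstddef>

// Plain C++ stand-in for the custom_selector in the thread. The device
// argument is dropped because only the running call count affects the score.
struct counting_selector {
    int selectedId = 0;
    mutable int count = 0;  // never reset between selections

    int operator()() const {
        if (count == selectedId) {
            ++count;
            return 100;
        }
        ++count;
        return 50 + count;
    }
};

// Assumption: every device is scored once, in order, and the highest score wins.
// Returns the index of the winning device.
std::size_t pick_device(const counting_selector& selector, std::size_t deviceCount)
{
    assert(deviceCount > 0);
    std::size_t best = 0;
    int bestScore = -1;
    for (std::size_t i = 0; i < deviceCount; ++i) {
        const int score = selector();
        if (score > bestScore) {
            bestScore = score;
            best = i;
        }
    }
    return best;
}
```

Under these assumptions, with three devices and `selectedId` set to 0, the first call to `pick_device` returns index 0. A second call on the same selector object returns index 2, because `count` has already moved past `selectedId` and the fallback score rises with every call.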

rod · 8 April 2019 15:44

Something like this should do the job:

std::vector<device> devices;
// Get list of platforms
std::vector<platform> platforms = platform::get_platforms();
// Enumerate devices
for (unsigned int i = 0; i < platforms.size(); i++)
{
    std::vector<device> plat_devices = platforms[i].get_devices();
    devices.insert(devices.end(), plat_devices.begin(), plat_devices.end());
}

Then parse the int from the command line and use it to index into the devices vector.
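The parsing and lookup step could look roughly like the sketch below. It is written against a generic std::vector so that it compiles without SYCL; the helper names `parse_index` and `select_by_index` are hypothetical, not part of any SYCL or ComputeCpp API.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: parses an index from a command-line argument and
// checks it against the number of available entries. Throws
// std::invalid_argument or std::out_of_range on bad input.
std::size_t parse_index(const std::string& arg, std::size_t available)
{
    std::size_t consumed = 0;
    const unsigned long value = std::stoul(arg, &consumed);  // throws if no digits are found
    if (consumed != arg.size())
        throw std::invalid_argument("index has trailing characters: " + arg);
    if (value >= available)
        throw std::out_of_range("index out of range: " + arg);
    return static_cast<std::size_t>(value);
}

// Hypothetical helper: returns the element of items chosen by the argument.
template <typename T>
const T& select_by_index(const std::vector<T>& items, const std::string& arg)
{
    return items[parse_index(arg, items.size())];
}
```

With the devices vector built above, something like `queue myQueue(select_by_index(devices, argv[1]));` should then construct the queue on exactly that device, since SYCL 1.2.1 queues can be constructed from a device. Selecting this way avoids keeping any state in a device_selector.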

janwilmans · 8 April 2019 18:46

That was what I was looking for! Great help, thanks!

janwilmans · 8 April 2019 21:52

Weird thing: in the debugger I don’t see the CPU device, but when running outside the debugger I do see it.

Device 0:

  Device is supported                     : UNTESTED - Device running untested driver
  CL_DEVICE_NAME                          : Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
  CL_DEVICE_VENDOR                        : Intel(R) Corporation
  CL_DRIVER_VERSION                       : 18.1.0.0920
  CL_DEVICE_TYPE                          : CL_DEVICE_TYPE_CPU

rod · 9 April 2019 08:26

I’m not seeing the same behaviour. I’m invoking the debug version through Visual Studio and when I use:

for (auto dev : cl::sycl::device::get_devices())
    std::cout << dev.get_info<cl::sycl::info::device::name>();

It prints out all the devices including the CPU.

janwilmans · 9 April 2019 09:39

Testing on a different machine at work, I don’t see it either. I will try to narrow it down when I get back home; at least now I know it’s not the expected behavior.

Meteorhead · 19 April 2019 06:44

I have come across an issue when using CMake: the default CMake behavior is to use the “RelWithDebInfo” configuration when requesting Release builds. However, this triggers the Debug layout for STL types, while the API types coming out of ComputeCpp are proper Debug/Release types. Trying to query for a std::vector<cl::sycl::XXX> with the wrong layout can result in cryptic behavior. Make sure your Debug build uses the Debug ComputeCpp libraries and your Release build uses the Release ones.
