Windows version is several times slower than Linux version

Hello

First of all, thank you very much for providing a Windows version of the oneAPI plugin for NVIDIA GPUs!

I’ve tested the plugin with a GeForce 1080 on both Windows 10 and WSL (on the same computer). In runs on both platforms correctly, but the performance loss on Windows can be a factor 16.

My SYCL program is a simple vector addition:

static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

WSL output:

Vector Addition Tests

Serial on CPU in 64.83 ms

Parallel on CPU:                        100000000 in   73.74 ms, S = 0.88
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   85.99 ms, S = 0.75
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in  313.42 ms, S = 0.21
The two operations produce the same results: true

Windows output:

Vector Addition Tests

Serial on CPU in 59.6576 ms

Parallel on CPU:                        100000000 in   63.29 ms, S = 0.94
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   67.45 ms, S = 0.88
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in 5011.50 ms, S = 0.01
The two operations produce the same results: true

Do you have any ideas or hints what the problem is?

Chris

Hi @christoph.stamm,

The short answer is no, I have no idea why this might happen :sweat_smile: that being said, we’ve added a task to our board and will start by trying to reproduce your issue on one of our own machines.

Do you happen to have the full reproducer? It would be interesting to see the way you’re getting the timing, as well as things like the BlockSize and other details. Otherwise we can recreate those but it would be nice to eliminate any other sources of differences.

Many thanks,
Duncan.

#pragma once

#include <chrono>

/// <summary>
/// Stopwatch
/// Typical usages 
/// - re-usage of the same instance:  Start - Stop - GetElapsedTime ... Restart - Stop - GetElapsedTime 
/// - cummulative duration:           Start - Stop ... Start - Stop - GetElapsedTime
/// - long duration with split-times: Start - GetSplitTime - GetSplitTime - Stop - GetElapsedTime
/// - long duration with intervals  : Start - GetIntervalTime - GetIntervalTime - Stop - GetElapsedTime
/// </summary>
class Stopwatch {
	using Clock = std::chrono::high_resolution_clock;
	//using Clock = std::chrono::system_clock;

	Clock::time_point m_start;
	Clock::duration m_elapsed;
	bool m_isRunning;

public:
	Stopwatch()
		: m_elapsed{ 0 }
		, m_isRunning{ false }
	{}

	/// <summary>
	/// Start stopwatch. No effect if the stopwatch is already running.
	/// </summary>
	void Start() {
		if (!m_isRunning) {
			m_start = Clock::now();
			m_isRunning = true;
		}
	}

	/// <summary>
	/// Stop stopwatch. Updates elapsed time. No effect if the stopwatch isn't running.
	/// </summary>
	void Stop() {
		if (m_isRunning) {
			m_elapsed += Clock::now() - m_start;
			m_isRunning = false;
		}
	}

	/// <summary>
	/// Stops running stopwatch and resets elapsed time to zero.
	/// </summary>
	void Reset() {
		m_isRunning = false;
		m_elapsed = Clock::duration::zero();
	}

	/// <summary>
	/// Reset elapsed time and start stopwatch again.
	/// </summary>
	void Restart() {
		Reset();
		Start();
	}


	/// <summary>
	/// Return split time. No effect if the stopwatch isn't running.
	/// </summary>
	Clock::duration GetSplitTime() const {
		if (m_isRunning) {
			return Clock::now() - m_start;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return split time in seconds. No effect if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in milliseconds. No effect if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in nanoseconds. No effect if the stopwatch isn't running.
	/// </summary>
	long long GetSplitTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetSplitTime()).count();
	}


	/// <summary>
	/// Return interval time. No effect if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	Clock::duration GetIntervalTime() {
		if (m_isRunning) {
			const Clock::time_point start = Clock::now();
			const Clock::duration interval = start - m_start;

			m_elapsed += interval;
			m_start = start;
			return interval;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return interval time in seconds. No effect if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeSeconds() {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in milliseconds. No effect if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeMilliseconds() {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in nanoseconds. No effect if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	long long GetIntervalTimeNanoseconds() {
		return std::chrono::nanoseconds(GetIntervalTime()).count();
	}


	/// <summary>
	/// Return elapsed time since first start after reset.
	/// </summary>
	Clock::duration GetElapsedTime() const {
		if (m_isRunning) {
			return m_elapsed + Clock::now() - m_start;
		} else {
			return m_elapsed;
		}
	}
	/// <summary>
	/// Stop stopwatch and return elapsed time in seconds.
	/// </summary>
	double GetElapsedTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Stop stopwatch and return elapsed time in milliseconds.
	/// </summary>
	double GetElapsedTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Stop stopwatch and return elapsed time in nanoseconds.
	/// </summary>
	long long GetElapsedTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetElapsedTime()).count();
	}
};
#include <iostream>
#include <sstream>
#include <iomanip>
#include <algorithm>
#include <execution>
#include <vector>
#include <omp.h>
#include <random>
#include <sycl/sycl.hpp>

#include "Stopwatch.h"

using Vector = std::vector<float>;

constexpr int BlockSize = 20;

//////////////////////////////////////////////////////////////////////////////////////////////
// serial vector addition
static void vectorAddition(const Vector& a, const Vector& b, Vector& c) {
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with OMP
static void vectorAdditionOMP(const Vector& a, const Vector& b, Vector& c) {
	#pragma omp parallel
		#pragma omp master
		{
			std::cout << "number of processors = " << omp_get_num_procs() << std::endl;
			std::cout << "number of threads    = " << omp_get_num_threads() << std::endl;
		}

	#pragma omp parallel for default(none) shared(a, b, c, std::cout)
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with transform
static void vectorAdditionParallel(const Vector& a, const Vector& b, Vector& c) {
	std::transform(std::execution::par, a.begin(), a.end(), b.begin(), c.begin(), [](auto ai, auto bi) {
		return ai + bi;
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition
static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition (vectorized)
static void vectorAdditionSYCLvec(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	using VectorType = sycl::float4;
	constexpr int VectorSize = VectorType::size();

	const sycl::range<1> r(a.size()/VectorSize);
	const sycl::nd_range<1> ndr(r, BlockSize);	// workgroup size = BlockSize

	sycl::buffer<VectorType> aBuf((VectorType*)a.data(), r);
	sycl::buffer<VectorType> bBuf((VectorType*)b.data(), r);
	sycl::buffer<VectorType> cBuf((VectorType*)c.data(), r);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// Check and print results
static void check(const char text[], const Vector& ref, const Vector& result, double ts, double tp) {
	const double S = ts/tp;

	std::cout << std::setw(40) << std::left << text << result.size();
	std::cout << " in " << std::right << std::setw(7) << std::setprecision(2) << std::fixed << tp << " ms, S = " << S << std::endl;
	std::cout << std::boolalpha << "The two operations produce the same results: " << (ref == result) << std::endl;
}

//////////////////////////////////////////////////////////////////////////////////////////////
static void reset(Vector& v) {
	v.assign(v.size(), 0);
}

//////////////////////////////////////////////////////////////////////////////////////////////
static int BackendSelector(const sycl::device& dev) {
	return dev.get_backend() == sycl::backend::ext_oneapi_cuda;
}

//////////////////////////////////////////////////////////////////////////////////////////////
int main() {
	std::cout << "Vector Addition Tests" << std::endl;

	constexpr int N = 100'000'000;

	std::default_random_engine e;
	std::uniform_real_distribution<float> dist;

	// Create an exception handler for asynchronous SYCL exceptions
	auto exception_handler = [](sycl::exception_list e_list) {
		for (std::exception_ptr const& e : e_list) {
			try {
				std::rethrow_exception(e);
			} catch (std::exception const& e) {
#if _DEBUG
				std::cout << "Failure" << std::endl;
#endif
				std::terminate();
			}
		}
	};

	Vector a(N);
	Vector b(N);
	Vector r1(N);
	Vector r2(N);
	Stopwatch sw;

	for (int i = 0; i < N; ++i) {
		a[i] = dist(e);
		b[i] = dist(e);
	};

	sw.Start();
	vectorAddition(a, b, r1);
	sw.Stop();
	const double ts = sw.GetElapsedTimeMilliseconds();
	std::cout << std::endl << "Serial on CPU in " << ts << " ms" << std::endl;

	sw.Restart();
	vectorAdditionParallel(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("Parallel on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	sw.Restart();
	vectorAdditionOMP(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("OMP on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	// GPU processing: run-time is mainly memory-transfer time

	// Get the preferred device or another one if not available
	//auto selector = sycl::default_selector_v; // The default device selector will select the most performant device.
	//auto selector = sycl::aspect_selector(sycl::aspect::cpu); // uses the CPU as the underlying OpenCL device
	//auto selector = sycl::aspect_selector(sycl::aspect::gpu);
	auto selector = BackendSelector;
	sycl::queue q(selector, exception_handler);

	std::cout << std::endl << "SYCL on " << q.get_device().get_info<sycl::info::device::name>() << std::endl;
	std::cout << "native vector width for float is   : " << q.get_device().get_info<sycl::info::device::native_vector_width_float>() << std::endl;
	std::cout << "preferred vector width for float is: " << q.get_device().get_info<sycl::info::device::preferred_vector_width_float>() << std::endl;

	try {
		sw.Restart();
		vectorAdditionSYCL(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);

	try {
		sw.Restart();
		vectorAdditionSYCLvec(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU vectorized:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);
}

Fantastic, thank you very much!