Windows version is several times slower than Linux version

Hello

First of all, thank you very much for providing a Windows version of the oneAPI plugin for NVIDIA GPUs!

I’ve tested the plugin with a GeForce GTX 1080 on both Windows 10 and WSL (on the same computer). It runs correctly on both platforms, but on Windows the performance loss can be as large as a factor of 16.

My SYCL program is a simple vector addition:

static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

WSL output:

Vector Addition Tests

Serial on CPU in 64.83 ms

Parallel on CPU:                        100000000 in   73.74 ms, S = 0.88
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   85.99 ms, S = 0.75
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in  313.42 ms, S = 0.21
The two operations produce the same results: true

Windows output:

Vector Addition Tests

Serial on CPU in 59.6576 ms

Parallel on CPU:                        100000000 in   63.29 ms, S = 0.94
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   67.45 ms, S = 0.88
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in 5011.50 ms, S = 0.01
The two operations produce the same results: true

Do you have any ideas or hints what the problem is?

Chris

Hi @christoph.stamm,

The short answer is no, I have no idea why this might happen :sweat_smile: That being said, we’ve added a task to our board and will start by trying to reproduce your issue on one of our own machines.

Do you happen to have the full reproducer? It would be interesting to see the way you’re getting the timing, as well as things like the BlockSize and other details. Otherwise we can recreate those but it would be nice to eliminate any other sources of differences.

Many thanks,
Duncan.

#pragma once

#include <chrono>

/// <summary>
/// Stopwatch
/// Typical usages 
/// - re-use of the same instance:    Start - Stop - GetElapsedTime ... Restart - Stop - GetElapsedTime 
/// - cumulative duration:            Start - Stop ... Start - Stop - GetElapsedTime
/// - long duration with split-times: Start - GetSplitTime - GetSplitTime - Stop - GetElapsedTime
/// - long duration with intervals  : Start - GetIntervalTime - GetIntervalTime - Stop - GetElapsedTime
/// </summary>
class Stopwatch {
	using Clock = std::chrono::high_resolution_clock;
	//using Clock = std::chrono::system_clock;

	Clock::time_point m_start;
	Clock::duration m_elapsed;
	bool m_isRunning;

public:
	Stopwatch()
		: m_elapsed{ 0 }
		, m_isRunning{ false }
	{}

	/// <summary>
	/// Start stopwatch. No effect if the stopwatch is already running.
	/// </summary>
	void Start() {
		if (!m_isRunning) {
			m_start = Clock::now();
			m_isRunning = true;
		}
	}

	/// <summary>
	/// Stop stopwatch. Updates elapsed time. No effect if the stopwatch isn't running.
	/// </summary>
	void Stop() {
		if (m_isRunning) {
			m_elapsed += Clock::now() - m_start;
			m_isRunning = false;
		}
	}

	/// <summary>
	/// Stops running stopwatch and resets elapsed time to zero.
	/// </summary>
	void Reset() {
		m_isRunning = false;
		m_elapsed = Clock::duration::zero();
	}

	/// <summary>
	/// Reset elapsed time and start stopwatch again.
	/// </summary>
	void Restart() {
		Reset();
		Start();
	}


	/// <summary>
	/// Return split time, or zero if the stopwatch isn't running.
	/// </summary>
	Clock::duration GetSplitTime() const {
		if (m_isRunning) {
			return Clock::now() - m_start;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return split time in seconds, or zero if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in milliseconds, or zero if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in nanoseconds, or zero if the stopwatch isn't running.
	/// </summary>
	long long GetSplitTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetSplitTime()).count();
	}


	/// <summary>
	/// Return interval time, or zero (with no state change) if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	Clock::duration GetIntervalTime() {
		if (m_isRunning) {
			const Clock::time_point start = Clock::now();
			const Clock::duration interval = start - m_start;

			m_elapsed += interval;
			m_start = start;
			return interval;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return interval time in seconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeSeconds() {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in milliseconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeMilliseconds() {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in nanoseconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	long long GetIntervalTimeNanoseconds() {
		return std::chrono::nanoseconds(GetIntervalTime()).count();
	}


	/// <summary>
	/// Return the elapsed time accumulated since the last reset.
	/// </summary>
	Clock::duration GetElapsedTime() const {
		if (m_isRunning) {
			return m_elapsed + Clock::now() - m_start;
		} else {
			return m_elapsed;
		}
	}
	/// <summary>
	/// Return elapsed time in seconds.
	/// </summary>
	double GetElapsedTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Return elapsed time in milliseconds.
	/// </summary>
	double GetElapsedTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Return elapsed time in nanoseconds.
	/// </summary>
	long long GetElapsedTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetElapsedTime()).count();
	}
};
#include <iostream>
#include <sstream>
#include <iomanip>
#include <algorithm>
#include <execution>
#include <vector>
#include <omp.h>
#include <random>
#include <sycl/sycl.hpp>

#include "Stopwatch.h"

using Vector = std::vector<float>;

constexpr int BlockSize = 20;

//////////////////////////////////////////////////////////////////////////////////////////////
// serial vector addition
static void vectorAddition(const Vector& a, const Vector& b, Vector& c) {
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with OMP
static void vectorAdditionOMP(const Vector& a, const Vector& b, Vector& c) {
	#pragma omp parallel
		#pragma omp master
		{
			std::cout << "number of processors = " << omp_get_num_procs() << std::endl;
			std::cout << "number of threads    = " << omp_get_num_threads() << std::endl;
		}

	#pragma omp parallel for default(none) shared(a, b, c, std::cout)
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with transform
static void vectorAdditionParallel(const Vector& a, const Vector& b, Vector& c) {
	std::transform(std::execution::par, a.begin(), a.end(), b.begin(), c.begin(), [](auto ai, auto bi) {
		return ai + bi;
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition
static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition (vectorized)
static void vectorAdditionSYCLvec(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	using VectorType = sycl::float4;
	constexpr int VectorSize = VectorType::size();

	const sycl::range<1> r(a.size()/VectorSize);
	const sycl::nd_range<1> ndr(r, BlockSize);	// workgroup size = BlockSize

	sycl::buffer<VectorType> aBuf((VectorType*)a.data(), r);
	sycl::buffer<VectorType> bBuf((VectorType*)b.data(), r);
	sycl::buffer<VectorType> cBuf((VectorType*)c.data(), r);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// Check and print results
static void check(const char text[], const Vector& ref, const Vector& result, double ts, double tp) {
	const double S = ts/tp;

	std::cout << std::setw(40) << std::left << text << result.size();
	std::cout << " in " << std::right << std::setw(7) << std::setprecision(2) << std::fixed << tp << " ms, S = " << S << std::endl;
	std::cout << std::boolalpha << "The two operations produce the same results: " << (ref == result) << std::endl;
}

//////////////////////////////////////////////////////////////////////////////////////////////
static void reset(Vector& v) {
	v.assign(v.size(), 0);
}

//////////////////////////////////////////////////////////////////////////////////////////////
static int BackendSelector(const sycl::device& dev) {
	return dev.get_backend() == sycl::backend::ext_oneapi_cuda;
}

//////////////////////////////////////////////////////////////////////////////////////////////
int main() {
	std::cout << "Vector Addition Tests" << std::endl;

	constexpr int N = 100'000'000;

	std::default_random_engine e;
	std::uniform_real_distribution<float> dist;

	// Create an exception handler for asynchronous SYCL exceptions
	auto exception_handler = [](sycl::exception_list e_list) {
		for (std::exception_ptr const& e : e_list) {
			try {
				std::rethrow_exception(e);
			} catch (std::exception const& e) {
#if _DEBUG
				std::cout << "Failure" << std::endl;
#endif
				std::terminate();
			}
		}
	};

	Vector a(N);
	Vector b(N);
	Vector r1(N);
	Vector r2(N);
	Stopwatch sw;

	for (int i = 0; i < N; ++i) {
		a[i] = dist(e);
		b[i] = dist(e);
	}

	sw.Start();
	vectorAddition(a, b, r1);
	sw.Stop();
	const double ts = sw.GetElapsedTimeMilliseconds();
	std::cout << std::endl << "Serial on CPU in " << ts << " ms" << std::endl;

	sw.Restart();
	vectorAdditionParallel(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("Parallel on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	sw.Restart();
	vectorAdditionOMP(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("OMP on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	// GPU processing: run-time is mainly memory-transfer time

	// Get the preferred device or another one if not available
	//auto selector = sycl::default_selector_v; // The default device selector will select the most performant device.
	//auto selector = sycl::aspect_selector(sycl::aspect::cpu); // uses the CPU as the underlying OpenCL device
	//auto selector = sycl::aspect_selector(sycl::aspect::gpu);
	auto selector = BackendSelector;
	sycl::queue q(selector, exception_handler);

	std::cout << std::endl << "SYCL on " << q.get_device().get_info<sycl::info::device::name>() << std::endl;
	std::cout << "native vector width for float is   : " << q.get_device().get_info<sycl::info::device::native_vector_width_float>() << std::endl;
	std::cout << "preferred vector width for float is: " << q.get_device().get_info<sycl::info::device::preferred_vector_width_float>() << std::endl;

	try {
		sw.Restart();
		vectorAdditionSYCL(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);

	try {
		sw.Restart();
		vectorAdditionSYCLvec(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU vectorized:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);
}

Fantastic, thank you very much!

Hi @christoph.stamm,

I’m sorry it’s taken us a while to get to this, but I’ve been able to run the reproducer you posted on a Windows 10 machine with a 4060 Ti and cannot reproduce the results you saw. In short, building in VS Code with oneAPI 2025.0, I get:

SYCL on NVIDIA GeForce RTX 4060 Ti
native vector width for float is : 1
preferred vector width for float is: 1

GPU: 100000000 in 302.81 ms, S = 0.19
The two operations produce the same results: true

GPU vectorized: 100000000 in 297.66 ms, S = 0.19
The two operations produce the same results: true

Then, building with 2025.0.1 in WSL (an Ubuntu 24.04 container), I get:

SYCL on NVIDIA GeForce RTX 4060 Ti
native vector width for float is : 1
preferred vector width for float is: 1

GPU: 100000000 in 325.79 ms, S = 0.17
The two operations produce the same results: true

GPU vectorized: 100000000 in 311.84 ms, S = 0.18
The two operations produce the same results: true

So we’ve been unable to see the difference. This makes sense to me as Intel’s SYCL implementation generally doesn’t have too much platform-specific handling in it.

One thing I did notice is that the first run was sometimes slower than subsequent runs, so to get these numbers I ran the program a few times. Adding host-side loops in the code and discarding the first result should have a similar effect, just in case something is making the first kernel execution slower than it should be.

Duncan.


I would like to add another data point to this issue as I am consistently seeing slow execution of SYCL kernels on Nvidia GPUs under Windows.

I compiled Intel’s OneAPI-samples/…/vector-add examples, adding -fsycl-targets=nvptx64-nvidia-cuda to the compile and link flags. I used the Nsight Compute profiler to measure only the kernel and not the data transfer overhead:

ncu -f -o profile vector-add-buffers.exe 100000000

When I view the profiles I see the following results for kernel timing and code size:

  • vector-add-buffers : 667ms, 1484 loc, 38 registers
  • vector-add-usm : 93ms, 368 loc, 36 registers

For comparison, an equivalent CUDA kernel (compiled with CUDA nvcc & VC++ 2022) gives:

  • vector-add-cuda : 1.74ms, 17 loc, 16 registers

The CUDA kernel is achieving around 85% of the theoretical data transfer rate, while the SYCL kernels are running more than 50 times slower, and with much higher code sizes and register use. Using buffers instead of USM causes additional overhead.

Does this give any more hints as to what the problem might be?

Simon

Software & hardware configuration:

  • Windows 11 24H2
  • OneAPI C++ Essentials Bundle (2025.0.1.28)
  • OneAPI for NVIDIA GPUs (2025.0.0)
  • CUDA Toolkit 12.5
  • NVIDIA RTX A6000 (driver version: 571.96)

Hi @sjcooke,

Can you copy the compile command you have been using to build the code? We’ve not been able to replicate this internally so far, so there might be some difference based on the compiler invocation.

Hi @duncan,

I’m compiling with the following:

icx-cl /EHsc /O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda src\vector-add-usm.cpp

then running from an admin console with a vector length of 10^8:

ncu -f -o profile vector-add-usm.exe 100000000

Here is an example of the graphical view:

Thank you very much. We will reopen this issue and investigate further next week.

Duncan.

Hi @christoph.stamm and @sjcooke,
we found an issue with the icx-cl and icx compiler drivers on Windows: they don’t seem to apply optimisations when compiling device code for the CUDA backend. We originally couldn’t reproduce the problem because we were testing only with the icpx driver.

Could you try compiling your applications again, but with the icpx compiler driver? Let us know if this results in the expected performance.


Hi @rbielski
I have experienced the original issue as well. I would like to try compiling my program with icpx, but it is not obvious how to do that with CMake on Windows.
CMake tries to use MSVC-compatible flags, like /nologo, so simply changing CMAKE_CXX_COMPILER=icpx is not sufficient. Is there a guide to getting icpx to work with CMake on Windows?

Hi @rbielski,

Thank you for diagnosing the issue. Using icpx I now have the following:

  • vector-add-buffers : 1.75ms, 32 loc, 16 registers
  • vector-add-usm : 33ms, 20 loc, 16 registers

So vector-add-buffers is achieving performance equal to the reference CUDA benchmark.

For vector-add-usm the code size & register use is greatly reduced, but the execution time is reduced only by a factor of 3, still around 20 times slower than the reference.

I next tried changing the USM code to use malloc_device instead of malloc_shared and found that the performance issue is resolved:

  • vector-add-usm : 1.74ms, 20 loc, 16 registers (using malloc_device)

It looks like this is a separate issue with malloc_shared, and the original issue can be resolved as you determined.

For reference, I am compiling now using the following:

icpx -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda src\vector-add-usm.cpp -o vector-add-usm.exe

Hi @sjcooke,
many thanks for testing! I’m glad icpx provides you with good performance and can be used as a workaround while we’re working on fixing icx-cl / icx.

The performance difference between malloc_device and malloc_shared looks correct to me. Direct and managed allocations are very different solutions with different performance implications. One also has to be very careful when comparing different programming models like CUDA and SYCL, and make sure the compared concepts are equivalent. In this case, sycl::malloc_shared is the equivalent of cudaMallocManaged, so it cannot be compared to a program using cudaMalloc.

What happens with shared USM allocations (managed memory in CUDA terminology) is that the host-to-device memory transfer becomes part of the kernel execution, since it is triggered lazily on first access to the data inside the kernel. Therefore, your vector-add-usm profile is actually measuring memory transfer plus computation, not just the computation.

Hi @AytonDew,
Could you try with -DCMAKE_SYSTEM_NAME=Linux? This workaround probably has some limitations, but it worked for me with the oneAPI-samples vector-add after commenting out the CMAKE_FORCE_CXX_COMPILER(icx-cl) here:

I also had to rename the resulting binary from vector-add-buffers to vector-add-buffers.exe after building, but then it seems to run fine.

My commands were:

cmake -Bbuild -DCMAKE_CXX_COMPILER=icpx -G "NMake Makefiles" -DCMAKE_SYSTEM_NAME=Linux
cmake --build build
cd build
ren vector-add-buffers vector-add-buffers.exe
vector-add-buffers.exe

Hi @rbielski,

Thanks for clarifying. icpx is now working well for my application and I look forward to the future release.

Hi @rbielski

I’ve compiled my programs with icpx on Windows and now the performance is as expected.
I hope you can also find a solution for icx-cl and icx.

Thanks
Chris