Windows version is several times slower than Linux version

Hello

First of all, thank you very much for providing a Windows version of the oneAPI plugin for NVIDIA GPUs!

I’ve tested the plugin with a GeForce GTX 1080 on both Windows 10 and WSL (on the same computer). It runs correctly on both platforms, but on Windows the performance loss can be as large as a factor of 16.

My SYCL program is a simple vector addition:

static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

WSL output:

Vector Addition Tests

Serial on CPU in 64.83 ms

Parallel on CPU:                        100000000 in   73.74 ms, S = 0.88
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   85.99 ms, S = 0.75
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in  313.42 ms, S = 0.21
The two operations produce the same results: true

Windows output:

Vector Addition Tests

Serial on CPU in 59.6576 ms

Parallel on CPU:                        100000000 in   63.29 ms, S = 0.94
The two operations produce the same results: true
number of processors = 12
number of threads    = 12

OMP on CPU:                             100000000 in   67.45 ms, S = 0.88
The two operations produce the same results: true

SYCL on NVIDIA GeForce GTX 1080
native vector width for float is   : 1
preferred vector width for float is: 1

GPU:                                    100000000 in 5011.50 ms, S = 0.01
The two operations produce the same results: true

Do you have any ideas or hints what the problem is?

Chris

Hi @christoph.stamm,

The short answer is no, I have no idea why this might happen :sweat_smile: That being said, we’ve added a task to our board and will start by trying to reproduce your issue on one of our own machines.

Do you happen to have the full reproducer? It would be interesting to see the way you’re getting the timing, as well as things like the BlockSize and other details. Otherwise we can recreate those but it would be nice to eliminate any other sources of differences.

Many thanks,
Duncan.

#pragma once

#include <chrono>

/// <summary>
/// Stopwatch
/// Typical usages 
/// - re-use of the same instance:    Start - Stop - GetElapsedTime ... Restart - Stop - GetElapsedTime 
/// - cumulative duration:            Start - Stop ... Start - Stop - GetElapsedTime
/// - long duration with split-times: Start - GetSplitTime - GetSplitTime - Stop - GetElapsedTime
/// - long duration with intervals  : Start - GetIntervalTime - GetIntervalTime - Stop - GetElapsedTime
/// </summary>
class Stopwatch {
	using Clock = std::chrono::high_resolution_clock;
	//using Clock = std::chrono::system_clock;

	Clock::time_point m_start;
	Clock::duration m_elapsed;
	bool m_isRunning;

public:
	Stopwatch()
		: m_elapsed{ 0 }
		, m_isRunning{ false }
	{}

	/// <summary>
	/// Start stopwatch. No effect if the stopwatch is already running.
	/// </summary>
	void Start() {
		if (!m_isRunning) {
			m_start = Clock::now();
			m_isRunning = true;
		}
	}

	/// <summary>
	/// Stop stopwatch. Updates elapsed time. No effect if the stopwatch isn't running.
	/// </summary>
	void Stop() {
		if (m_isRunning) {
			m_elapsed += Clock::now() - m_start;
			m_isRunning = false;
		}
	}

	/// <summary>
	/// Stops running stopwatch and resets elapsed time to zero.
	/// </summary>
	void Reset() {
		m_isRunning = false;
		m_elapsed = Clock::duration::zero();
	}

	/// <summary>
	/// Reset elapsed time and start stopwatch again.
	/// </summary>
	void Restart() {
		Reset();
		Start();
	}


	/// <summary>
	/// Return split time, or zero if the stopwatch isn't running.
	/// </summary>
	Clock::duration GetSplitTime() const {
		if (m_isRunning) {
			return Clock::now() - m_start;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return split time in seconds, or zero if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in milliseconds, or zero if the stopwatch isn't running.
	/// </summary>
	double GetSplitTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetSplitTime()).count();
	}
	/// <summary>
	/// Return split time in nanoseconds, or zero if the stopwatch isn't running.
	/// </summary>
	long long GetSplitTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetSplitTime()).count();
	}


	/// <summary>
	/// Return interval time, or zero (with no state change) if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	Clock::duration GetIntervalTime() {
		if (m_isRunning) {
			const Clock::time_point start = Clock::now();
			const Clock::duration interval = start - m_start;

			m_elapsed += interval;
			m_start = start;
			return interval;
		} else {
			return Clock::duration::zero();
		}
	}
	/// <summary>
	/// Return interval time in seconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeSeconds() {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in milliseconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	double GetIntervalTimeMilliseconds() {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetIntervalTime()).count();
	}
	/// <summary>
	/// Return interval time in nanoseconds, or zero if the stopwatch isn't running.
	/// Combination of GetSplitTime - Stop - Start
	/// </summary>
	long long GetIntervalTimeNanoseconds() {
		return std::chrono::nanoseconds(GetIntervalTime()).count();
	}


	/// <summary>
	/// Return the elapsed time accumulated since the last reset.
	/// </summary>
	Clock::duration GetElapsedTime() const {
		if (m_isRunning) {
			return m_elapsed + Clock::now() - m_start;
		} else {
			return m_elapsed;
		}
	}
	/// <summary>
	/// Return elapsed time in seconds.
	/// </summary>
	double GetElapsedTimeSeconds() const {
		using sec = std::chrono::duration<double>;
		return std::chrono::duration_cast<sec>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Return elapsed time in milliseconds.
	/// </summary>
	double GetElapsedTimeMilliseconds() const {
		using ms = std::chrono::duration<double, std::milli>;
		return std::chrono::duration_cast<ms>(GetElapsedTime()).count();
	}
	/// <summary>
	/// Return elapsed time in nanoseconds.
	/// </summary>
	long long GetElapsedTimeNanoseconds() const {
		return std::chrono::nanoseconds(GetElapsedTime()).count();
	}
};
#include <iostream>
#include <sstream>
#include <iomanip>
#include <algorithm>
#include <execution>
#include <vector>
#include <omp.h>
#include <random>
#include <sycl/sycl.hpp>

#include "Stopwatch.h"

using Vector = std::vector<float>;

constexpr int BlockSize = 20;

//////////////////////////////////////////////////////////////////////////////////////////////
// serial vector addition
static void vectorAddition(const Vector& a, const Vector& b, Vector& c) {
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with OMP
static void vectorAdditionOMP(const Vector& a, const Vector& b, Vector& c) {
	#pragma omp parallel
		#pragma omp master
		{
			std::cout << "number of processors = " << omp_get_num_procs() << std::endl;
			std::cout << "number of threads    = " << omp_get_num_threads() << std::endl;
		}

	#pragma omp parallel for default(none) shared(a, b, c, std::cout)
	for (size_t i = 0; i < a.size(); ++i) {
		c[i] = a[i] + b[i];
	}
}

//////////////////////////////////////////////////////////////////////////////////////////////
// parallel vector addition with transform
static void vectorAdditionParallel(const Vector& a, const Vector& b, Vector& c) {
	std::transform(std::execution::par, a.begin(), a.end(), b.begin(), c.begin(), [](auto ai, auto bi) {
		return ai + bi;
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition
static void vectorAdditionSYCL(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	const sycl::nd_range<1> ndr(c.size(), BlockSize);	// workgroup size = BlockSize
	
	sycl::buffer aBuf(a);
	sycl::buffer bBuf(b);
	sycl::buffer cBuf(c);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// GPU vector addition (vectorized)
static void vectorAdditionSYCLvec(sycl::queue& q, const Vector& a, const Vector& b, Vector& c) {
	using VectorType = sycl::float4;
	constexpr int VectorSize = VectorType::size();

	const sycl::range<1> r(a.size()/VectorSize);
	const sycl::nd_range<1> ndr(r, BlockSize);	// workgroup size = BlockSize

	sycl::buffer<VectorType> aBuf((VectorType*)a.data(), r);
	sycl::buffer<VectorType> bBuf((VectorType*)b.data(), r);
	sycl::buffer<VectorType> cBuf((VectorType*)c.data(), r);

	q.submit([&](sycl::handler& h) {
		sycl::accessor aAcc(aBuf, h, sycl::read_only);
		sycl::accessor bAcc(bBuf, h, sycl::read_only);
		sycl::accessor cAcc(cBuf, h, sycl::write_only, sycl::no_init);

		h.parallel_for(ndr, [=](auto ii) { 
			const sycl::id<1> i = ii.get_global_id();

			cAcc[i] = aAcc[i] + bAcc[i]; 
		});
	});
}

//////////////////////////////////////////////////////////////////////////////////////////////
// Check and print results
static void check(const char text[], const Vector& ref, const Vector& result, double ts, double tp) {
	const double S = ts/tp;

	std::cout << std::setw(40) << std::left << text << result.size();
	std::cout << " in " << std::right << std::setw(7) << std::setprecision(2) << std::fixed << tp << " ms, S = " << S << std::endl;
	std::cout << std::boolalpha << "The two operations produce the same results: " << (ref == result) << std::endl;
}

//////////////////////////////////////////////////////////////////////////////////////////////
static void reset(Vector& v) {
	v.assign(v.size(), 0);
}

//////////////////////////////////////////////////////////////////////////////////////////////
static int BackendSelector(const sycl::device& dev) {
	return dev.get_backend() == sycl::backend::ext_oneapi_cuda;
}

//////////////////////////////////////////////////////////////////////////////////////////////
int main() {
	std::cout << "Vector Addition Tests" << std::endl;

	constexpr int N = 100'000'000;

	std::default_random_engine e;
	std::uniform_real_distribution<float> dist;

	// Create an exception handler for asynchronous SYCL exceptions
	auto exception_handler = [](sycl::exception_list e_list) {
		for (std::exception_ptr const& e : e_list) {
			try {
				std::rethrow_exception(e);
			} catch (std::exception const& e) {
#if _DEBUG
				std::cout << "Failure" << std::endl;
#endif
				std::terminate();
			}
		}
	};

	Vector a(N);
	Vector b(N);
	Vector r1(N);
	Vector r2(N);
	Stopwatch sw;

	for (int i = 0; i < N; ++i) {
		a[i] = dist(e);
		b[i] = dist(e);
	}

	sw.Start();
	vectorAddition(a, b, r1);
	sw.Stop();
	const double ts = sw.GetElapsedTimeMilliseconds();
	std::cout << std::endl << "Serial on CPU in " << ts << " ms" << std::endl;

	sw.Restart();
	vectorAdditionParallel(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("Parallel on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	sw.Restart();
	vectorAdditionOMP(a, b, r2);
	sw.Stop();
	std::cout << std::endl;
	check("OMP on CPU: ", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	reset(r2);

	// GPU processing: run-time is mainly memory-transfer time

	// Get the preferred device or another one if not available
	//auto selector = sycl::default_selector_v; // The default device selector will select the most performant device.
	//auto selector = sycl::aspect_selector(sycl::aspect::cpu); // uses the CPU as the underlying OpenCL device
	//auto selector = sycl::aspect_selector(sycl::aspect::gpu);
	auto selector = BackendSelector;
	sycl::queue q(selector, exception_handler);

	std::cout << std::endl << "SYCL on " << q.get_device().get_info<sycl::info::device::name>() << std::endl;
	std::cout << "native vector width for float is   : " << q.get_device().get_info<sycl::info::device::native_vector_width_float>() << std::endl;
	std::cout << "preferred vector width for float is: " << q.get_device().get_info<sycl::info::device::preferred_vector_width_float>() << std::endl;

	try {
		sw.Restart();
		vectorAdditionSYCL(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);

	try {
		sw.Restart();
		vectorAdditionSYCLvec(q, a, b, r2);
		q.wait(); // wait until compute tasks on GPU done
		sw.Stop();
		std::cout << std::endl;
		check("GPU vectorized:", r1, r2, ts, sw.GetElapsedTimeMilliseconds());
	} catch (const std::exception& e) {
		std::cout << "An exception is caught for vector add: " << e.what() << std::endl;
	}
	reset(r2);
}

Fantastic, thank you very much!

Hi @christoph.stamm,

I’m sorry it’s taken us a while to get to this, but I’ve been able to run the reproducer you posted on a Windows 10 machine with a 4060 Ti and cannot reproduce the results you saw. In short, building in VS Code with oneAPI 2025.0, I get:

SYCL on NVIDIA GeForce RTX 4060 Ti
native vector width for float is : 1
preferred vector width for float is: 1

GPU: 100000000 in 302.81 ms, S = 0.19
The two operations produce the same results: true

GPU vectorized: 100000000 in 297.66 ms, S = 0.19
The two operations produce the same results: true

Then, building with 2025.0.1 in WSL (an Ubuntu 24.04 container), I get:

SYCL on NVIDIA GeForce RTX 4060 Ti
native vector width for float is : 1
preferred vector width for float is: 1

GPU: 100000000 in 325.79 ms, S = 0.17
The two operations produce the same results: true

GPU vectorized: 100000000 in 311.84 ms, S = 0.18
The two operations produce the same results: true

So we’ve been unable to see the difference. This makes sense to me as Intel’s SYCL implementation generally doesn’t have too much platform-specific handling in it.

One thing I did notice is that the first run was sometimes slower than subsequent runs, so to get these numbers I ran the program a few times. Adding host-side loops in the code and discarding the first result should have a similar effect, just in case something is making the first kernel execution slower than it should be.

Duncan.


I would like to add another data point to this issue as I am consistently seeing slow execution of SYCL kernels on Nvidia GPUs under Windows.

I compiled Intel’s OneAPI-samples/…/vector-add examples, adding -fsycl-targets=nvptx64-nvidia-cuda to the compile and link flags. I used the Nsight Compute profiler to measure only the kernel and not the data transfer overhead:

ncu -f -o profile vector-add-buffers.exe 100000000

When I view the profiles I see the following results for kernel timing and code size:

  • vector-add-buffers : 667ms, 1484 loc, 38 registers
  • vector-add-usm : 93ms, 368 loc, 36 registers

For comparison, an equivalent CUDA kernel (compiled with CUDA nvcc & VC++ 2022) gives:

  • vector-add-cuda : 1.74ms, 17 loc, 16 registers

The CUDA kernel is achieving around 85% of the theoretical data transfer rate, while the SYCL kernels are running more than 50 times slower, and with much higher code sizes and register use. Using buffers instead of USM causes additional overhead.

Does this give any more hints as to what the problem might be?

Simon

Software & hardware configuration:

  • Windows 11 24H2
  • OneAPI C++ Essentials Bundle (2025.0.1.28)
  • OneAPI for NVIDIA GPUs (2025.0.0)
  • CUDA Toolkit 12.5
  • NVIDIA RTX A6000 (driver version: 571.96)

Hi @sjcooke,

Can you copy the compile command you have been using to build the code? We’ve not been able to replicate this internally so far, so there might be some difference based on the compiler invocation.

Hi @duncan,

I’m compiling with the following:

icx-cl /EHsc /O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda src\vector-add-usm.cpp

then running from an admin console with a vector length of 10^8:

ncu -f -o profile vector-add-usm.exe 100000000

Here is an example of the graphical view:

Thank you very much. We will reopen this issue and investigate further next week.

Duncan.

Hi @christoph.stamm and @sjcooke,
we found an issue with the icx-cl and icx compiler drivers on Windows: they don’t seem to apply optimisations when compiling device code for the CUDA backend. We originally couldn’t reproduce the problem because we were testing only with the icpx driver.

Could you try compiling your applications again, but with the icpx compiler driver? Let us know if this results in the expected performance.


Hi @rbielski
I have experienced the original issue as well. I would like to try compiling my program with icpx, but it is not obvious how to do that with CMake on Windows.
CMake tries to use MSVC-compatible flags, like /nologo, so simply changing CMAKE_CXX_COMPILER=icpx is not sufficient. Is there a guide to getting icpx to work with CMake on Windows?

Hi @rbielski,

Thank you for diagnosing the issue. Using icpx I now have the following:

  • vector-add-buffers : 1.75ms, 32 loc, 16 registers
  • vector-add-usm : 33ms, 20 loc, 16 registers

So vector-add-buffers is achieving performance equal to the reference CUDA benchmark.

For vector-add-usm the code size & register use is greatly reduced, but the execution time is reduced only by a factor of 3, still around 20 times slower than the reference.

I next tried changing the USM code to use malloc_device instead of malloc_shared and found that the performance issue is resolved:

  • vector-add-usm : 1.74ms, 20 loc, 16 registers (using malloc_device)

It looks like this is a separate issue with malloc_shared, and the original issue can be resolved as you determined.

For reference, I am compiling now using the following:

icpx -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda src\vector-add-usm.cpp -o vector-add-usm.exe

Hi @sjcooke,
many thanks for testing! I’m glad icpx provides you with good performance and can be used as a workaround while we’re working on fixing icx-cl / icx.

The performance difference between malloc_device and malloc_shared looks correct to me. Direct and managed allocations are very different solutions with different performance implications. One also has to be very careful when comparing different programming models like CUDA and SYCL, and make sure the compared concepts are equivalent. In this case, sycl::malloc_shared is the equivalent of cudaMallocManaged, so it cannot be compared to a program using cudaMalloc.

What happens with shared USM allocations (managed memory in CUDA terminology) is that the host-to-device memory transfer becomes part of the kernel execution, since it is triggered lazily on first access to the data inside the kernel. Therefore, your vector-add-usm profile is actually measuring memory transfer plus computation, not just the computation.

Hi @AytonDew,
Could you try with -DCMAKE_SYSTEM_NAME=Linux? This workaround probably has some limitations, but it worked for me with the oneAPI-samples vector-add after commenting out the CMAKE_FORCE_CXX_COMPILER(icx-cl) here:

I also had to rename the resulting binary from vector-add-buffers to vector-add-buffers.exe after building, but then it seems to run fine.

My commands were:

cmake -Bbuild -DCMAKE_CXX_COMPILER=icpx -G "NMake Makefiles" -DCMAKE_SYSTEM_NAME=Linux
cmake --build build
cd build
ren vector-add-buffers vector-add-buffers.exe
vector-add-buffers.exe

Hi @rbielski,

Thanks for clarifying. icpx is now working well for my application and I look forward to the future release.

Hi @rbielski

I’ve compiled my programs with icpx on Windows and now the performance is as expected.
I hope you can also find a solution for icx-cl and icx.

Thanks
Chris