Poor memory performance?

I am trying to develop a basic memory benchmark for testing purposes, and before setting it loose on the GPU I am running it on the CPU. Unfortunately the performance is very poor. The fundamental objects I am copying are:

struct bigThing{
  double v[32];
};

I allocate a read buffer and a write buffer of 1 GB each and copy one into the other with a threaded loop. Here is a simple OpenMP implementation:

int nthr = 8;
size_t fsize = 1024*1024*1024/sizeof(bigThing);  // number of elements in 1 GB
size_t work_per_thr = fsize / nthr;

omp_set_num_threads(nthr);

time.start();
#pragma omp parallel
{
    size_t off = omp_get_thread_num() * work_per_thr;
    for(size_t i=off; i < off + work_per_thr; i++){
        put[i] = buf[i];
    }
}
time.stop();
ns += time.elapsed();

This gets about 6.7 GB/s:

Device: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
Max compute units: 8
Max workgroup size: 8192
Avg time 0.149242s for size 1: 6.70052 GB/s

whereas what I believe to be the equivalent SYCL code,

{
  sycl::range<1> range(fsize);
  sycl::buffer<bigThing, 1> buf_b(buf, range);
  sycl::buffer<bigThing, 1> put_b(put, range);

  time.start();
  
  queue.submit([&] (sycl::handler& cgh) {
    auto buf_acc = buf_b.get_access<sycl::access::mode::read>(cgh);
    auto put_acc = put_b.get_access<sycl::access::mode::discard_write>(cgh);

    cgh.parallel_for<class mem_bench>(sycl::range<1>(nthr), [=](sycl::id<1> idx){
      size_t off = idx.get(0) * work_per_thr;
      for(size_t i=off; i < off + work_per_thr; i++){
        put_acc[i] = buf_acc[i];
      }
    });
  });

  queue.wait();
  
  time.stop();
  ns += time.elapsed();
}

only gets 0.8 GB/s

Running on:
Device: Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz
Max compute units: 8
Max workgroup size: 8192
Avg time 1.26331s for size 1: 0.791572 GB/s

The strange thing is that if I comment out the loop body (i.e. don’t actually do the copy!) it still takes the same amount of time to execute. It is only when I comment out the creation of the accessors that the time drops (to about 0.03s).

What I think is happening is that the read accessor is performing a full, single-threaded copy of my buffer upon instantiation, and the write accessor is performing a full, single-threaded copy back. Is this correct? Is there any way to get it to read and write directly from the host buffers?
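For reference, what I had in mind is something along the lines of the SYCL 1.2.1 use_host_ptr buffer property, in the hope that the runtime would then use my allocations directly rather than copying (I have not verified that ComputeCpp takes this path):

// Hint that the runtime may use the existing host allocations directly
// instead of creating, and copying to/from, its own storage.
sycl::property_list props{sycl::property::buffer::use_host_ptr()};
sycl::buffer<bigThing, 1> buf_b(buf, range, props);
sycl::buffer<bigThing, 1> put_b(put, range, props);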

Note that if I run this same code (but with 72 threads) on the built-in GPU, I get

Running on:
Device: Intel(R) Gen9 HD Graphics NEO
Max compute units: 72
Max workgroup size: 256
Avg time 1.19333s for size 1: 0.837993 GB/s

This is still much slower than I would expect, suggesting a similar issue is happening for the GPU side.

I would appreciate any help understanding this.

Note:

$ compute++ -v
Codeplay ComputeCpp - CE 1.1.1 Device Compiler - clang version 6.0.0 (git@git.office.codeplay.com:sycl/clang.git 5f3e9ee120701264b20a61ad4a3f71cb9bd130e2) (git@git.office.codeplay.com:sycl/llvm.git 891c236fd114c1665ca11403fc3e5b32550e7f61) (based on LLVM 6.0.0svn)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /<redacted>/ComputeCpp-CE-1.1.1-Ubuntu-16.04-x86_64/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.4.0
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/8
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/7.4.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

and Makefile contents:

main.sycl: main.C
	compute++ -c -no-serial-memop  -sycl $(CXXFLAGS)  main.C $(LDFLAGS)
sycl: main.sycl main.C
	g++ -std=c++11  $(CXXFLAGS) -include main.sycl main.C $(LDFLAGS) -o main_sycl.x
openmp: main.C
	g++ -std=c++11 -DUSE_OMP $(CXXFLAGS) -fopenmp main.C $(LDFLAGS) -o main_openmp.x

Could you please run your makefile with VERBOSE=1 and paste the result?

What I am looking for is the full command used to run compute++ to see if you are enabling optimisations.

For benchmarking purposes, you should run both compute++ and g++ with -O3. I strongly recommend you use the FindComputeCpp.cmake provided in our SDK, if you are able to use CMake, because it builds the full command line for compiling SYCL files.

Another thing to look at is BabelStream. It does pretty much what you are trying to do and the code is quite simple to follow.


This code does not offer a fair comparison for a few reasons.

The OpenMP code is pure CPU and no compilation happens at runtime.
SYCL does JIT compilation of the SPIR/SPIR-V output generated by ComputeCpp. This compilation happens when the binary is first executed, so measuring the first execution of the application includes this additional step. Time from the second iteration onwards, after a queue.wait() call.
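For example, a timing loop along these lines would exclude the JIT step from the measurement (just a sketch; run_copy stands in for the queue.submit call in your code):

run_copy(queue, buf_b, put_b);   // first run triggers the JIT compilation
queue.wait();

const int iters = 10;
double total_ns = 0.0;
for (int it = 0; it < iters; ++it) {
  time.start();
  run_copy(queue, buf_b, put_b); // subsequent runs reuse the compiled kernel
  queue.wait();
  time.stop();
  total_ns += time.elapsed();
}
// report total_ns / iters as the average time per copy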

It is possible to pre-build the kernels using program and build_with_kernel_type; there is an example in the BabelStream code referenced in the previous reply.
The code starts at this line: https://github.com/UoB-HPC/BabelStream/blob/master/SYCLStream.cpp#L72
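As a rough sketch of that approach, adapted to the kernel in your code (untested):

sycl::program prog(queue.get_context());
prog.build_with_kernel_type<class mem_bench>();   // JIT happens here, once

queue.submit([&] (sycl::handler& cgh) {
  auto buf_acc = buf_b.get_access<sycl::access::mode::read>(cgh);
  auto put_acc = put_b.get_access<sycl::access::mode::discard_write>(cgh);

  cgh.parallel_for<class mem_bench>(
      prog.get_kernel<class mem_bench>(),          // use the pre-built kernel
      sycl::range<1>(nthr), [=](sycl::id<1> idx){
        size_t off = idx.get(0) * work_per_thr;
        for(size_t i=off; i < off + work_per_thr; i++){
          put_acc[i] = buf_acc[i];
        }
      });
});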

The OpenMP code all happens in main memory, so no data transfer is performed. SYCL is required to transfer the data into device memory because that is how OpenCL 1.2 works. This also explains why commenting out the kernel body does not remove the transfer cost.

So you should also time the performance from the second iteration of the program using SYCL to get an equivalent performance comparison. As mentioned, the BabelStream code is a good place to look.

Thank you both for your responses. Enabling -O3 boosted the OpenMP implementation's performance to 7.5 GB/s but did not change the SYCL version's performance. Timing from the second iteration to account for the JIT compilation overhead improved the SYCL performance to 0.82 GB/s, but it is still far below the OpenMP version.

As you have explained, this is clearly due to a full copy of my 1 GB memory region occurring before kernel execution, and presumably another one on the way back. It seems this is a fundamental limitation of building on top of OpenCL. It would be nice if the developers would consider a means for SYCL to take advantage of unified virtual memory, where available, to avoid these bulk transfers.

The memory transfer exists for any heterogeneous platform (not just OpenCL) where devices do not have direct access to CPU memory, and code written for the CPU cannot be expected to run fast on those devices, especially if it is memory bound. Instead, the code has to be written so that any data needed by the kernel is transferred to the device efficiently, hiding the transfer latency where possible, for example with double buffering (see the sketch below). These platforms were designed to work most effectively on a GPU or a similar many-core processor.
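As a very rough sketch of the double-buffering idea, adapted to the copy in the original post and reusing fsize, buf, put and queue from it (the chunk count is illustrative and I have not benchmarked this):

#include <vector>

// Split the data into chunks, each with its own buffer, so the runtime's
// dependency graph is free to overlap the host-to-device transfer of one
// chunk with the kernel running on another.
const size_t nchunks = 8;
const size_t chunk   = fsize / nchunks;

std::vector<sycl::buffer<bigThing, 1>> in_bufs, out_bufs;
for (size_t c = 0; c < nchunks; ++c) {
  in_bufs.emplace_back (buf + c * chunk, sycl::range<1>(chunk));
  out_bufs.emplace_back(put + c * chunk, sycl::range<1>(chunk));
}

for (size_t c = 0; c < nchunks; ++c) {
  queue.submit([&] (sycl::handler& cgh) {
    auto in  = in_bufs[c].get_access<sycl::access::mode::read>(cgh);
    auto out = out_bufs[c].get_access<sycl::access::mode::discard_write>(cgh);
    cgh.parallel_for<class chunk_copy>(sycl::range<1>(chunk),
                                       [=](sycl::id<1> i){ out[i] = in[i]; });
  });
}
queue.wait();
// results are written back to 'put' when the out_bufs buffers are destroyed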

In terms of the initial JIT step, if the target platform is known then offline compilation can be used to avoid the JIT compilation time altogether.