I’ve recently started learning SYCL and I’m using CodePlay’s ComputeCpp CE 2.3 to build some sample proglets (that I cannot link since new users can only put two links in a post; links to the proglets can be found in the LLVM and pocl issue linked below). I’ve successfully managed to run sample (and the equivalent sample-select on both CPU and GPU devices without issues. sample-reduce implements a simple parallel reduction using local memory, and it works correctly on the CPU and on the Intel iGP, but when I try to run it on a CUDA GPU through pocl I get a segmentation fault in LLVM. Details about the crash have also been reported in the LLVM issue tracker and in the pocl issue tracker.
As detailed in the LLVM issue tracker in particular, the issue seems specifically related to the declaration of the local memory for the reduction kernel.
based on the other bug reports it’s likely this is an issue with POCL since the segfault happens at the point of calling clEnqueueNDRangeKernel. It’s possible compute++ produces invalid SPIR, though that seems a bit unlikely considering the sample seems to work on other devices.
One option to try is to produce SPIR-V instead of SPIR. I don’t know how well POCL supports SPIR-V, though, I could only find a mention for the CPU side in version 1.4. You can enable SPIR-V by passing -sycl-target spirv64 to compute++. Another option you could try is to enable optimizations for compute++, see if that helps.
A quick look at the code doesn’t reveal any problems. I’ve noticed a barrier in the code, we’ve seen issues with that on some devices in the past (repeatedly), the problem was with the OpenCL driver.
from my analysis I would guess that there are at least two issues at play here.
One issue is that LLVM is segfaulting on bad input, instead of failing gracefully. This is most definitely a LLVM bug, even if the SPIR it receives is invalid. The other issue is the invalid SPIR. It’s possible that pocl is producing the invalid SPIR, but I’ve actually tried to purge the post-processing it does of the SPIR it receives, so that it’s limited to the extraction of the kernel and call-graph, and I’m still getting the segfault. (Also, I’m afraid pocl doesn’t support SPIR-V yet.)
The interesting thing is that the segfault happens even when the lmem is not used at all: even if the body of reduce::operator() is commented out (null kernel) the segfault is still there, and in this case there’s really no transformation being done. Maybe pocl is mangling stuff up during the kernel callgraph extraction? I’ll update the pocl issue to mention this.
(OT: there was another issue I reported a couple of weeks ago, but it seems Akismet ate it up —any chance of recovering it or should I try resubmitting it?)
Just to confirm, did you try enabling optimizations for compute++? At least -O1, preferably -O2. Not sure if it will help, but there is a chance it might.
Hello @peterzuzek, thanks for your reply. I have tried with different optimization options, but it doesn’t seem to make a difference. I have forwarded in the mean time some additional information to the pocl developers, including the SPIR produced by computec++ so that they may analyze it. Judging from this comment it would seem that some kind of tricks are needed to support local memory, so this may be part of the reason for the failure, in addition to a possible discrepancy between the LLVM IR used. They’ll be looking into it, I’ll keep you posted when I have news.