CUDA_ERROR_UNKNOWN 999 when verifying installation using sycl-ls on NVIDIA GeForce RTX 3080 Ti

Hi!

I installed the CUDA SDK 12, oneAPI DPC++ and the oneAPI plugin for NVIDIA GPUs on Debian Linux 12.5. I can run nvidia-smi and it does not report any errors.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8             19W /  350W |       2MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

When I try to verify my installation, I see this:

$ sycl-ls
UR CUDA ERROR:
	Value:           999
	Name:            CUDA_ERROR_UNKNOWN
	Description:     unknown error
	Function:        operator()
	Source Location: /tmp/tmp.ybx1IaPQyS/intel-llvm-mirror/build/_deps/unified-runtime-src/source/adapters/cuda/platform.cpp:75

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]

Running with SYCL_PI_TRACE=1, I see:

$ SYCL_PI_TRACE=1 sycl-ls
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.39.1 ]

UR CUDA ERROR:
	Value:           999
	Name:            CUDA_ERROR_UNKNOWN
	Description:     unknown error
	Function:        operator()
	Source Location: /tmp/tmp.ybx1IaPQyS/intel-llvm-mirror/build/_deps/unified-runtime-src/source/adapters/cuda/platform.cpp:75

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]

So I take it the plugin loads correctly. This is where I ran out of ideas for troubleshooting the problem. I am pretty new to GPU programming, so I am looking for help.

Hi @andrsd,

This is quite an unusual error. That said, at the moment we test against CUDA 12.4 internally.
Did you install CUDA from NVIDIA directly? As far as I can tell from reading the packages, Debian stable should still be on CUDA 11. Are normal CUDA programs working as expected?

I’ve looked at the source file that is throwing the error, and it really is happening at the point where the adapter queries the devices available on the system, which implies that these driver calls are failing right out of the gate. Device enumeration is one of the first things that happens when running any SYCL program.
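
If you want to isolate whether the failure is below SYCL entirely, one quick check (just a sketch; it assumes python3 is available and uses ctypes to call the driver API directly, which is roughly the first thing the adapter does) is:

# Call cuInit(0) straight from the driver library; 0 means success, a non-zero
# code such as 999 would reproduce the CUDA_ERROR_UNKNOWN outside of SYCL
python3 -c "import ctypes; print('cuInit ->', ctypes.CDLL('libcuda.so.1').cuInit(0))"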

Yes, I followed NVIDIA's instructions for Linux.

The only thing I have tested so far is building libocca against CUDA, and its test suite passed with no problems. However, if I build it only against OpenCL, I see some failures, but that may not be relevant. I can try to run something CUDA-specific if you have any advice on what to try. This is literally my first CUDA installation, so I do not have anything else to go by at the moment.

Hi @andrsd,
you can test your CUDA installation with cuda-samples, e.g. like this:

# Download source files for the vector add example
wget https://github.com/NVIDIA/cuda-samples/raw/master/Samples/0_Introduction/vectorAdd/vectorAdd.cu
wget https://github.com/NVIDIA/cuda-samples/raw/master/Common/helper_cuda.h
wget https://github.com/NVIDIA/cuda-samples/raw/master/Common/helper_string.h
# Compile and run
nvcc -o vectorAdd vectorAdd.cu -I$PWD
./vectorAdd

Just to check - did you download the latest plugin version from developer.codeplay.com?

I tested the following Debian 12.5 Docker image with the latest CUDA, oneAPI toolkit and Codeplay plugin, and it worked for me:

FROM debian:12.5

# System prerequisites
RUN apt update && apt -y install gcc wget gpg-agent curl \
    && apt clean all

# Register external repos
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb \
    && dpkg -i cuda-keyring_1.1-1_all.deb \
    && wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
    | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null \
    && echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
    | tee /etc/apt/sources.list.d/oneAPI.list

# Install CUDA and DPC++
RUN apt update && apt -y install cuda-toolkit intel-oneapi-compiler-dpcpp-cpp \
    && apt clean all

# Install Codeplay's plugin
RUN curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2024.1.0&filters[]=12.0&filters[]=linux" \
    && bash oneapi-for-nvidia-gpus-2024.1.0-cuda-12.0-linux.sh -y
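
To build the image and run it with the GPU visible inside the container, something like this should work (a sketch; the image tag is just a placeholder, and it assumes the NVIDIA Container Toolkit is installed on the host so that --gpus all is available):

# Build the image from the Dockerfile above and start a container with GPU access
docker build -t oneapi-cuda-test .
docker run --gpus all -it oneapi-cuda-test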

Inside the container, I see:

$ SYCL_PI_TRACE=1 sycl-ls
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.39.1 ]
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i9-12900K OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3060 8.6 [CUDA 12.5]

so I believe this should work in general.

To check we’re on the same page with package versions, could you post the output of the following command from your system?

apt list --installed | grep 'gcc\|libstdc\|libnvidia\|cudart\|dpcpp-cpp'

Yes, I downloaded and installed oneapi-for-nvidia-gpus-2024.1.0-cuda-12.0-linux.sh. Sorry for not mentioning that in my first post.


Installed packages

$ apt list --installed | grep 'gcc\|libstdc\|libnvidia\|cudart\|dpcpp-cpp'
cuda-cudart-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic]
cuda-cudart-dev-12-5/unknown,now 12.5.39-1 amd64 [installed,automatic]
gcc-12-base/stable,now 12.2.0-14 amd64 [installed]
gcc-12/stable,now 12.2.0-14 amd64 [installed,automatic]
gcc/stable,now 4:12.2.0-3 amd64 [installed]
libgcc-12-dev/stable,now 12.2.0-14 amd64 [installed,automatic]
libgcc-s1/stable,now 12.2.0-14 amd64 [installed]
libnvidia-allocator1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-cfg1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-egl-gbm1/stable,now 1.1.0-2 amd64 [installed,automatic]
libnvidia-eglcore/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-encode1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-fbc1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-glcore/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-glvkspirv/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-gpucomp1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-ml1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-nvvm4/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-opticalflow1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-pkcs11/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-ptxjitcompiler1/unknown,now 555.42.02-1 amd64 [installed,automatic]
libnvidia-rtcore/unknown,now 555.42.02-1 amd64 [installed,automatic]
libstdc++-12-dev/stable,now 12.2.0-14 amd64 [installed,automatic]
libstdc++6/stable,now 12.2.0-14 amd64 [installed]
linux-compiler-gcc-12-x86/stable-security,now 6.1.90-1 amd64 [installed,automatic]

I tried the vectorAdd sample from the CUDA samples and I get this error:

$ ./vectorAdd
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code unknown error)!

So I am assuming CUDA is not correctly installed. I will try to reinstall using your instructions to see if that helps…

Also, thanks for trying to replicate my problem and sharing the steps. It is very helpful for a GPU rookie like myself.

If I run sycl-ls as root, I see this:

$ source /opt/intel/oneapi/setvars.sh --include-intel-llvm
$ SYCL_PI_TRACE=1 sycl-ls
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 14.39.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 14.39.1 ]
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2024.17.5.0.08_160000.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz OpenCL 3.0 (Build 0) [2024.17.5.0.08_160000.xmain-hotfix]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3080 Ti 8.6 [CUDA 12.5]

So I must be running into permission issues. What groups do I have to be in as a user? I manually added myself to the render group; I saw that in some troubleshooting guide, but obviously that is not enough. I am currently a member of the following groups:

cdrom floppy sudo audio dip video plugdev users render netdev bluetooth lpadmin scanner

Also, running the CUDA sample as root gives me:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

This indeed looks like a problem with file permissions in your CUDA installation. The groups video and render are the right ones. These are the permissions for my GPU device:

$ ls -l /dev/dri
total 0
drwxr-xr-x  2 root root        120 May 29 19:48 by-path
crw-rw----+ 1 root video  226,   0 May 29 19:48 card0
crw-rw----+ 1 root video  226,   1 May 29 19:48 card1
crw-rw----+ 1 root render 226, 128 May 29 19:48 renderD128
crw-rw----+ 1 root render 226, 129 May 29 19:48 renderD129
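
If your user is not in those groups yet, something like this should add it (a sketch; a full re-login is needed before the new membership takes effect):

# Check current group membership, then add the user to the video and render groups
groups
sudo usermod -aG video,render $USER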

I’m not sure the device permissions are what’s missing, because your nvidia-smi works fine. It might be the permissions on the CUDA runtime libraries (mine are readable by all users):

$ ls -l /usr/local/cuda/targets/x86_64-linux/lib/libcudart*
lrwxrwxrwx 1 root root      15 Apr 16 03:49 /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so -> libcudart.so.12
lrwxrwxrwx 1 root root      20 Apr 16 03:49 /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12 -> libcudart.so.12.5.39
-rw-r--r-- 1 root root  712032 Apr 16 03:49 /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12.5.39
-rw-r--r-- 1 root root 1429586 Apr 16 03:49 /usr/local/cuda/targets/x86_64-linux/lib/libcudart_static.a
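
If the permissions on yours differ, restoring world read access might be enough (just a sketch; adjust the path to match your CUDA installation):

# Make the CUDA runtime libraries readable by all users (X only affects directories
# and files that are already executable)
sudo chmod -R a+rX /usr/local/cuda/targets/x86_64-linux/lib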

I’m afraid that’s all the ideas I have for this one. If that doesn’t help, perhaps you could seek further information in the NVIDIA forums as this looks to be a CUDA installation problem?

I have good news: everything works now. I looked around the NVIDIA forums for permission problems and found some information about which permissions are expected where. Basically, I found the information you posted plus this:

$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 May 30 16:50 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 30 16:50 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 May 30 16:50 /dev/nvidia-modeset
crw-rw-rw- 1 root root 238,   0 May 30 16:54 /dev/nvidia-uvm
crw-rw-rw- 1 root root 238,   1 May 30 16:54 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 242, 1 May 30 17:20 nvidia-cap1
cr--r--r-- 1 root root 242, 2 May 30 17:20 nvidia-cap2

I see all of this on my system and I can run the CUDA sample. Running sycl-ls as a normal user now lists the GPU device, as it is supposed to.

I assume that something got pulled in by running apt install cuda-toolkit but needed a restart, which I did not do yesterday; I rebooted today. I also found that the nvidia-modprobe package should be responsible for “fixing” the permissions on the devices, but mine was installed before I started this thread, so that may not be related.

I wish I knew what really fixed my problem. If anybody else sees this error, my advice would be to start by checking the permissions on the relevant files and devices.
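
For anyone hitting the same thing, the checks that mattered for me were roughly these (the nvidia-modprobe invocation is an assumption based on its man page; on my system a reboot is what actually recreated the nodes):

# The NVIDIA device nodes should exist and be world read/write
ls -l /dev/nvidia* /dev/dri
# nvidia-modprobe (run as root) can recreate /dev/nvidia0 and /dev/nvidia-uvm
nvidia-modprobe -u -c=0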

Last but not least: thank you @rbielski and @duncan for your time helping me resolve this problem. Much appreciated!
