Clarification regarding the output of the sycl-ls command

Hello,

First of all, thanks for releasing the plugin for NVIDIA devices.

I would appreciate it if you could clarify a few points related to the output of the sycl-ls command.

  1. Specs:
  • CPU: 2 x AMD EPYC 7543
  • GPU: 8 x NVIDIA A100-SXM4 (510.47.03)
  • OS: CentOS Linux release 7.9.2009
  2. oneAPI and plugin:
  • Intel oneAPI Base Toolkit 2023.2.0
  • Plugin version: oneapi-for-nvidia-gpus-2023.2.0-cuda-12.0-linux.sh
  • Environment is setup with
    $ . setvars.sh --include-intel-llvm
  3. sycl-ls output
    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL™, Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
    [opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor 3.0 [2023.16.6.0.22_223734]

  4. sycl-ls output with CUDA_VISIBLE_DEVICES=0,1
    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL™, Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
    [opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor 3.0 [2023.16.6.0.22_223734]
    [ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-80GB 8.8 [CUDA 11.6]
    [ext_oneapi_cuda:gpu:1] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-80GB 8.8 [CUDA 11.6]

    I was able to list CUDA devices via CUDA_VISIBLE_DEVICES. However:

    • Only one EPYC is listed. My understanding was that each EPYC would be treated as a separate device.
    • What is the meaning of the sequence at the end of the device info, e.g. 2023.16.6.0.22_223734?
  5. As shown here for another node with a dual-EPYC config:
    [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL™, Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
    [opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor 3.0 [2023.16.6.0.22_223734]
    [opencl:cpu:2] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor 3.0 [2022.13.3.0.16_160000]
    [ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100 80GB PCIe 8.8 [CUDA 11.6]
    [ext_oneapi_cuda:gpu:1] NVIDIA CUDA BACKEND, NVIDIA A100 80GB PCIe 8.8 [CUDA 11.6]
    Here the two EPYC CPUs have different ‘time stamps’, i.e. 2023 vs. 2022.

Thanks for reading.
Regards.

Hi @vitduck,
regarding your question about the OpenCL CPU backend: a multi-core CPU, or multiple CPUs in a multi-socket machine, constitute a single CPU OpenCL device. You can control the affinity of the process to limit the visibility of cores through operating-system-specific APIs (e.g. the taskset command or the sched_setaffinity C API on Linux). OpenCL supports partitioning devices (Section 4.3 of the OpenCL 1.2 spec), so it should also be possible to use sycl::device::create_sub_devices (Section 4.6.4 of the SYCL 2020 spec) to select one of the two CPUs.
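
For illustration, here is a minimal sketch of such a partitioning, assuming the OpenCL CPU backend reports support for partitioning by NUMA affinity domain (the selector and queue usage are just examples, not a tested recipe):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::device cpu{sycl::cpu_selector_v};

  // Split the single OpenCL CPU device along NUMA boundaries,
  // e.g. one sub-device per socket on a dual-EPYC node.
  // Throws if the backend does not support this partition property.
  auto subs = cpu.create_sub_devices<
      sycl::info::partition_property::partition_by_affinity_domain>(
      sycl::info::partition_affinity_domain::numa);

  std::cout << "Created " << subs.size() << " sub-devices\n";

  // A queue bound to the first sub-device will only use its cores.
  sycl::queue q{subs.front()};
}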

Regarding the sycl-ls output formatting - the last part in the square brackets is the driver version. In your second case, I think this means that the OpenCL CPU backend is available through two installed driver versions, but both “opencl:cpu:1” and “opencl:cpu:2” represent the entire multi-socket CPU system.
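
If it helps, the bracketed string is simply what the runtime reports via the standard SYCL info query; a minimal sketch:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // Print the same name and driver version fields that sycl-ls shows.
  for (const auto &dev : sycl::device::get_devices()) {
    std::cout << dev.get_info<sycl::info::device::name>() << " ["
              << dev.get_info<sycl::info::device::driver_version>() << "]\n";
  }
}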

Cheers,
Rafal

Hi @rbielski,

Thanks for your clarification, especially regarding the usage of sycl::device::create_sub_devices.
I was able to build and run the sample code. BabelStream also reports respectable bandwidth.

I am still checking how the 2022 driver sneaked into the second case. But to reiterate, in case I misunderstood:

  • The OpenCL backend will consume all available resources, e.g. all NUMA nodes or all MIG instances.
  • To limit resources, there are two ways:
    • Partition devices along NUMA boundaries with sycl::device::create_sub_devices
    • Use taskset, e.g. taskset --cpu-list 0-7 ./test.x (a programmatic sketch follows below)
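
As a programmatic alternative to taskset, here is a minimal Linux-only sketch using the sched_setaffinity API mentioned above (the core range 0-7 is an arbitrary example; the pinning must happen before the SYCL runtime initialises):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <unistd.h>
#include <cstdio>

int main() {
  // Restrict this process to cores 0-7 so the OpenCL CPU device
  // only sees those cores.
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int core = 0; core < 8; ++core)
    CPU_SET(core, &set);

  if (sched_setaffinity(0, sizeof(set), &set) != 0) {
    std::perror("sched_setaffinity");
    return 1;
  }
  std::printf("Pinned PID %d to cores 0-7\n", static_cast<int>(getpid()));
  return 0;
}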

I hope you don’t mind some additional questions:

  • There used to be a host device. Was it removed recently?
  • Regarding CUDA version compatibility:
    • The 510.47.03 driver supports CUDA 11.6, according to the output of nvidia-smi.
    • The Codeplay plugin is built against CUDA 12.0.
    • If I load the CUDA 11.6 env module, both V100 and A100 work as expected.
    • If I load the CUDA 12.0 env module, the following error is generated:
Running on device: Tesla V100-SXM2-32GB
Vector size: 10000

PI CUDA ERROR:
	Value:           222
	Name:            CUDA_ERROR_UNSUPPORTED_PTX_VERSION
	Description:     the provided PTX was compiled with an unsupported toolchain.
	Function:        build_program
	Source Location: /root/intel-llvm-mirror/sycl/plugins/cuda/pi_cuda.cpp:776


PI CUDA ERROR:
	Value:           400
	Name:            CUDA_ERROR_INVALID_HANDLE
	Description:     invalid resource handle
	Function:        cuda_piProgramRelease
	Source Location: /root/intel-llvm-mirror/sycl/plugins/cuda/pi_cuda.cpp:3640

An exception is caught while adding two vectors.
terminate called after throwing an instance of 'sycl::_V1::compile_program_error'
  what():  The program was built for 1 devices
Build program log for 'Tesla V100-SXM2-32GB':
 -999 (Unknown PI error)
Aborted
  • The warning message when using CUDA 12.0 is:
$ icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda vector-add-usm.cpp 
icpx: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]
  • So despite being built against CUDA 12.0, should we use CUDA 11.8 for best compatibility?

Thanks.

Hi @vitduck,
yes, your understanding of the OpenCL CPU situation looks correct to me.

Indeed, the removal of the host device was one of the changes between the SYCL 1.2.1 and SYCL 2020 specifications. You can see more details on what changed here:
https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:what-changed-between

SYCL 2020 implementations can still provide CPU devices, and the Intel oneAPI toolkits do that with the OpenCL CPU backend. In future versions of the DPC++ compiler there will also be a native_cpu target available, which compiles SYCL code to standard C++ CPU code. Work is currently in progress in the open source:
https://github.com/intel/llvm/blob/sycl/sycl/doc/design/SYCLNativeCPU.md

Regarding your issue with incompatible PTX, it looks like your driver is too old for CUDA 12. Codeplay’s oneAPI plugin should work well with CUDA 11.6 and likely even older versions (though we have not tested them). However, the PTX code generated with CUDA 12 cannot be executed by the driver you have. According to CUDA documentation, CUDA 12 for Linux requires driver version 525.60.13 or newer:
https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility
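
If useful, the mismatch can be confirmed programmatically by comparing the CUDA version supported by the driver with the runtime version; a minimal sketch, assuming the CUDA toolkit headers and runtime library are available:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int driver_ver = 0, runtime_ver = 0;
  // Highest CUDA version the installed kernel driver supports.
  cudaDriverGetVersion(&driver_ver);
  // CUDA version of the runtime this binary was built against.
  cudaRuntimeGetVersion(&runtime_ver);
  // Versions are encoded as 1000*major + 10*minor, e.g. 11060 -> 11.6.
  std::printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
              driver_ver / 1000, (driver_ver % 1000) / 10,
              runtime_ver / 1000, (runtime_ver % 1000) / 10);
  return 0;
}

When the driver-supported version is lower than the toolkit that generated the PTX, errors like CUDA_ERROR_UNSUPPORTED_PTX_VERSION are expected.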

The warning:

icpx: warning: CUDA version is newer than the latest partially supported version 11.8 [-Wunknown-cuda-version]

can be safely ignored. It only means that in some corner cases DPC++ may be unable to leverage all the newest features of CUDA 12.0 like certain optimisations or new instructions, but it should still generate correct code. See also our Troubleshooting page:
https://developer.codeplay.com/products/oneapi/nvidia/2023.2.1/guides/troubleshooting#compiler-warning-cuda-version-is-newer-than-the-latest-supported-version

Hope this helps, I’m happy to answer more questions if you have any.
Cheers,
Rafal

@rbielski

The SYCLNativeCPU feature sounds even more useful than the original host device.
I will refer to the links you shared for details, but I now understand the issues clearly.

Thanks very much for the discussion.

Regards.