How to build PortBLAS for Intel GPUs

Hi,

I have been trying to compile portBLAS for Intel GPUs, but I have not been successful so far. I am stuck on this issue: Build fails: cannot import name 'generate_file' from 'py_gen' · Issue #496 · codeplaysoftware/portBLAS · GitHub

I also built it using Docker, but that build doesn't detect Intel GPUs despite using the dpcpp compiler. Please help me fix this.

Hi @chsasank,
I’m not able to reproduce this, I’m afraid. When I try to build portBLAS, the first step proceeds without error, and the file-generation command also runs without error when I invoke it outside of ninja.

You might need to run git submodule init; git submodule update before building portBLAS.
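In case it's useful, here's a tiny self-contained demo of that failure mode and the fix, using throwaway local repos in a temp dir as stand-ins for portBLAS and one of its submodules (nothing portBLAS-specific, just standard git):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

# Hypothetical stand-ins: "sub" plays a submodule, "main" plays portBLAS.
git init -q sub
echo content > sub/file.txt
git -C sub add file.txt
git -C sub -c user.email=t@t -c user.name=t commit -q -m init

git init -q main
git -C main -c protocol.file.allow=always submodule add "$tmp/sub" external >/dev/null 2>&1
git -C main -c user.email=t@t -c user.name=t commit -q -m "add submodule"

# A plain clone leaves the submodule directory empty, so a build would fail:
git clone -q "$tmp/main" plain
ls plain/external                   # empty

# The fix: init/update the submodules (or clone with --recursive instead):
git -C plain -c protocol.file.allow=always submodule update --init --recursive >/dev/null 2>&1
ls plain/external                   # file.txt
```

For a real checkout the equivalent is `git clone --recursive <repo>` up front, or `git submodule update --init --recursive` after the fact.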

I hope this helps,
Duncan.

Ah, there are submodules?! I needed to do a recursive clone, then. I did a standard clone and that’s why it failed. It is building now! I will post the results of my benchmarks in some time :slight_smile:

I have run portBLAS on an NVIDIA GTX 1650 Ti and an Intel Arc A770, and here are the benchmarks I have collected:

portBLAS GEMM:

gtx1650m/results.json
test_name, gflops
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 2261
BM_Gemm<float>/n/t/2048/2048/2048/buffer/real_time 2259
BM_Gemm<float>/t/n/2048/2048/2048/buffer/real_time 2211
BM_Gemm<float>/t/t/2048/2048/2048/buffer/real_time 2238
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2253
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 2221
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 2196
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 2233

cuBLAS GEMM:

gtx1650m/results_cublas.json
test_name, gflops
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2869
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 2916
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 1765
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 2126

Got around 75–80% of cuBLAS perf with portBLAS on the n/n and n/t layouts (portBLAS actually comes out ahead on t/n and t/t).
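(Checking the n/n USM ratio from the two tables above, using awk purely as a calculator:)

```shell
# Ratio of portBLAS to cuBLAS GEMM throughput for the n/n USM case
# (2253 and 2869 GFLOPS are taken from the tables above):
awk 'BEGIN { printf "%.1f%%\n", 100 * 2253 / 2869 }'
# prints 78.5%
```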

On the Arc A770:

portBLAS GEMM:

a770/results.json
test_name, gflops
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 8483
BM_Gemm<float>/n/t/2048/2048/2048/buffer/real_time 6046
BM_Gemm<float>/t/n/2048/2048/2048/buffer/real_time 4306
BM_Gemm<float>/t/t/2048/2048/2048/buffer/real_time 4472
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 8388
BM_Gemm<float>/n/n/5792/5792/5792/buffer/real_time 8218
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 8556
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 4973
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 3779
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 3988
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 8207
BM_Gemm<float>/n/n/5792/5792/5792/usm/real_time 8072

There’s no benchmark for oneMKL, so I used data from Intel’s PyTorch (which is backed by oneMKL):

benchmarking xpu using torch.float32
size, elapsed_time (s), tflops
1024, 0.00040395259857177733, 5.3161773326689445
1217, 0.00824732780456543, 0.43710771675698584
1448, 0.0006613254547119141, 9.181643834721442
1722, 0.0006860971450805664, 14.884828146021194
2048, 0.011794257164001464, 1.4566300314729907
2435, 0.0018853187561035157, 15.315885261587333
2896, 0.0030889034271240233, 15.726111035212236
3444, 0.005404114723205566, 15.118018205123926
4096, 0.008573794364929199, 16.030120110436652
4870, 0.015609407424926757, 14.798935008327769
5792, 0.02517538070678711, 15.43617197698357
6888, 0.04176149368286133, 15.650686326198912
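(A unit sanity check, since the script labels the column `tops`: the values match the standard 2·n³ FLOP count for an n×n GEMM, expressed in TFLOPS. For the n = 4096 row above:)

```shell
# TFLOPS = 2 * n^3 / elapsed_time / 1e12, using the n = 4096 row from the table
awk 'BEGIN { n = 4096; t = 0.008573794364929199;
             printf "%.2f\n", 2 * n^3 / t / 1.0e12 }'
# prints 16.03, matching the table
```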

As you can see, I am only reaching about 50% of what oneMKL achieves (roughly 8.4 vs 16 TFLOPS at n = 4096). Is there something I can do to improve this? By the way, here are the build commands for reference:

Nvidia: cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_ARCH=sm_75 -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_BUILD_TYPE=Release -DBUILD_CUBLAS_BENCHMARKS=ON
Intel: cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DTUNING_TARGET=INTEL_GPU -DCMAKE_BUILD_TYPE=Release

Hi @chsasank,

I don’t know if there would be a way to improve these results significantly. The team is continuing to work on the library, and the numbers you’re seeing are likely a reasonable result for the current code.

You could try the auto-tuner, as I don’t think it has been run for your hardware specifically; it might produce parameter combinations that perform better: portBLAS/tools/auto_tuner at master · codeplaysoftware/portBLAS · GitHub

The documentation mentions ComputeCpp, but you should be able to follow the dpcpp steps instead, and then run the tuner binaries. This will likely take a while as there are a lot of configurations to run through.

Hi,

I have been trying to build the auto-tuner, but CMake fails with the following error.

The command I used: cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET=spir64 -DTUNING_TARGET=INTEL_GPU -DCMAKE_BUILD_TYPE=Release

Error:

CMake Error in tools/auto_tuner/CMakeLists.txt:
  Imported target "DPCPP::DPCPP" includes non-existent path

    "/opt/intel/oneapi/2024.0/bin/compiler/../include/sycl"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.

The SYCL headers are at /opt/intel/oneapi/2024.0/include/sycl/ on my system, so I’m not sure what’s wrong.

Hi @chsasank,
This is a problem in the portBLAS FindDPCPP CMake module. The directory structure of the oneAPI toolkit installation changed between the 2023.x and 2024.x releases, and it looks like the portBLAS CMake doesn’t support the new structure when using the clang++ driver, which is located under /opt/intel/oneapi/compiler/2024.0/bin/compiler.

Could you please open an issue in the portBLAS GitHub repo reporting the command you used and the error? We’ll make sure someone looks into it soon.

As a workaround, this should work if you use the icpx compiler driver, which is located in /opt/intel/oneapi/compiler/2024.0/bin, so the relative path ../include/sycl resolves correctly. You can do that either by setting the CXX=icpx environment variable or by adding -DCMAKE_CXX_COMPILER=icpx to the cmake command.
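To make the path arithmetic concrete (`realpath -m` only normalizes the string; the paths need not exist on your machine):

```shell
# FindDPCPP derives the SYCL include dir as "<compiler dir>/../include/sycl".
# From clang++'s directory (bin/compiler) the relative path lands in the wrong place:
realpath -m /opt/intel/oneapi/compiler/2024.0/bin/compiler/../include/sycl
# -> /opt/intel/oneapi/compiler/2024.0/bin/include/sycl   (no such directory)

# From icpx's directory (bin) it resolves to the real header location:
realpath -m /opt/intel/oneapi/compiler/2024.0/bin/../include/sycl
# -> /opt/intel/oneapi/compiler/2024.0/include/sycl
```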

Thanks,
Rafal

Thanks for the quick response! I have raised an issue on GitHub: Autotuner Build fails · Issue #498 · codeplaysoftware/portBLAS · GitHub. Your suggested workaround does fix the issue.

It’s surprising that this issue didn’t show up when I built portBLAS itself without the auto-tuner.
