I have run portBLAS on an NVIDIA GTX 1650 Ti and an Intel Arc A770; here are the benchmarks I have collected.
portBLAS GEMM:
gtx1650m/results.json
test_name, GFLOP/s
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 2261
BM_Gemm<float>/n/t/2048/2048/2048/buffer/real_time 2259
BM_Gemm<float>/t/n/2048/2048/2048/buffer/real_time 2211
BM_Gemm<float>/t/t/2048/2048/2048/buffer/real_time 2238
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2253
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 2221
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 2196
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 2233
cuBLAS GEMM:
gtx1650m/results_cublas.json
test_name, GFLOP/s
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2869
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 2916
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 1765
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 2126
portBLAS reaches around 75-80% of cuBLAS performance here for the n/n and n/t cases (and it actually comes out ahead of cuBLAS on t/n).
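In case it's useful for reproducing the comparison, the GFLOP/s figures can be recomputed from the two Google Benchmark JSON files with something like the sketch below (it assumes the default Google Benchmark JSON layout, with real_time reported in nanoseconds; the file names are the ones above):

```python
import json
import re

# Matches e.g. "BM_Gemm<float>/n/t/2048/2048/2048/buffer/real_time"
NAME_RE = re.compile(r"BM_Gemm<float>/[nt]/[nt]/(\d+)/(\d+)/(\d+)/(?:buffer|usm)/real_time")

def gflops(path):
    """Return {benchmark_name: GFLOP/s} from a Google Benchmark JSON file."""
    with open(path) as f:
        benchmarks = json.load(f)["benchmarks"]
    out = {}
    for bench in benchmarks:
        match = NAME_RE.match(bench["name"])
        if match is None:
            continue
        m, n, k = (int(g) for g in match.groups())
        seconds = bench["real_time"] * 1e-9  # assumes time_unit == "ns"
        out[bench["name"]] = 2 * m * n * k / seconds / 1e9  # GEMM does ~2*m*n*k FLOPs
    return out

portblas = gflops("gtx1650m/results.json")
cublas = gflops("gtx1650m/results_cublas.json")
for name in sorted(portblas.keys() & cublas.keys()):
    print(f"{name}: {portblas[name] / cublas[name]:.0%} of cuBLAS")
```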
On the Arc A770:
portBLAS GEMM:
a770/results.json
test_name, GFLOP/s
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 8483
BM_Gemm<float>/n/t/2048/2048/2048/buffer/real_time 6046
BM_Gemm<float>/t/n/2048/2048/2048/buffer/real_time 4306
BM_Gemm<float>/t/t/2048/2048/2048/buffer/real_time 4472
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 8388
BM_Gemm<float>/n/n/5792/5792/5792/buffer/real_time 8218
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 8556
BM_Gemm<float>/n/t/2048/2048/2048/usm/real_time 4973
BM_Gemm<float>/t/n/2048/2048/2048/usm/real_time 3779
BM_Gemm<float>/t/t/2048/2048/2048/usm/real_time 3988
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 8207
BM_Gemm<float>/n/n/5792/5792/5792/usm/real_time 8072
There’s no oneMKL benchmark to compare against, so I used data from Intel’s PyTorch build (which is backed by oneMKL):
benchmarking xpu using torch.float32
size, elapsed_time (s), TFLOP/s
1024, 0.000404, 5.32
1217, 0.008247, 0.44
1448, 0.000661, 9.18
1722, 0.000686, 14.88
2048, 0.011794, 1.46
2435, 0.001885, 15.32
2896, 0.003089, 15.73
3444, 0.005404, 15.12
4096, 0.008574, 16.03
4870, 0.015609, 14.80
5792, 0.025175, 15.44
6888, 0.041761, 15.65
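The table above came from a simple timing loop around torch.matmul; a minimal sketch of that kind of measurement is below (it assumes intel_extension_for_pytorch provides the xpu device, and the bench_matmul helper is illustrative, not the exact script behind the numbers):

```python
import time
import torch
import intel_extension_for_pytorch  # noqa: F401 -- registers the "xpu" device

def bench_matmul(n, iters=10):
    # Illustrative helper: times an n x n float32 matmul on the xpu device.
    a = torch.randn(n, n, dtype=torch.float32, device="xpu")
    b = torch.randn(n, n, dtype=torch.float32, device="xpu")
    torch.matmul(a, b)        # warm-up (triggers kernel selection/compilation)
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.xpu.synchronize()   # wait for async kernels before stopping the clock
    elapsed = (time.perf_counter() - start) / iters
    return elapsed, 2 * n**3 / elapsed / 1e12  # GEMM does ~2*n^3 FLOPs

for n in (1024, 2048, 4096, 5792):
    elapsed, tflops = bench_matmul(n)
    print(f"{n}, {elapsed}, {tflops}")
```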
As you can see, I am only reaching about 50% of what oneMKL achieves (roughly 8.4 vs. 16.0 TFLOP/s at size 4096). Is there something I can do to improve this? For reference, here are the build commands:
NVIDIA: cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_ARCH=sm_75 -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_BUILD_TYPE=Release -DBUILD_CUBLAS_BENCHMARKS=ON
Intel: cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DTUNING_TARGET=INTEL_GPU -DCMAKE_BUILD_TYPE=Release