Native API Error with Shared Library

Hello everyone,

I’m currently attempting to call a shared library compiled with icpx from Python using the ctypes package’s CDLL FFI. I’ll try to give as much context about my specific problem as possible, but if you would like anything else, please let me know.

The system I am using is configured as follows:

  • OS: Red Hat Enterprise Linux 8.7 (Ootpa)
  • CPU: Intel Xeon Gold 6226
  • GPU: NVIDIA Tesla V100-PCIE-32GB
  • GPU Driver: 535.54.03
  • oneAPI Version: 2024.0
  • oneMKL Version: 0.4
    • Built for CUDA
  • Python Version: 3.9.6

Note: I have also tried this on an A100-PCIE-80GB; however, I can’t access that machine at the moment, so I can’t provide further details about it. The error raised there mirrors the one shown here.

As a minimal working example, I have the following code, which I compile to a standalone executable:

#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>

#include <iostream>

using namespace sycl;
namespace mkl = oneapi::mkl;

extern "C" void sdp_cublas_backend(float* qq, int Q, float* kk, int K, int Embedded_D,  float* out, float* vv, float* out_fin) {

        queue q(gpu_selector_v);
        std::cout << "The device is: " << q.get_device().get_info<info::device::name>() << "\n";

        auto qQ = buffer(qq, range<1> { static_cast<size_t>(Q * Embedded_D) });
        auto kK = buffer(kk, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oO = buffer(out, range<1> { static_cast<size_t>(Q * K) });
        auto vV = buffer(vv, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oF = buffer(out_fin, range<1> { static_cast<size_t>(Q * Embedded_D) });

        int64_t m = Q;
        int64_t n = K;
        int64_t k = Embedded_D;

        int64_t ldq = k;
        int64_t ldk = k;
        int64_t ldt = m;

        float alpha = 1.0;
        float beta = 0.0;

        mkl::blas::column_major::gemm(                                                  // The following expression outlines the matrix operation being performed: alpha * A * B + beta * C
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },       // queue
                mkl::transpose::trans,                                                  // transa
                mkl::transpose::nontrans,                                               // transb
                m,                                                                      // m
                n,                                                                      // n
                k,                                                                      // k
                alpha,                                                                  // alpha
                qQ,                                                                     // A (dimensions: k * m) -> transposed
                ldq,                                                                    // lda
                kK,                                                                     // B (dimensions: k * n) -> non-transposed
                ldk,                                                                    // ldb
                beta,                                                                   // beta
                oO,                                                                     // C (dimensions: m * n)
                ldt                                                                     // ldc
        );

        q.wait();

        std::vector<double> sum_exp(Q, 0.0);
        buffer bsum_exp(sum_exp);

        try { q.submit([&](handler& cgh) {

                        accessor t_acc(oO, cgh, read_only);
                        accessor bsum_exp_acc(bsum_exp, cgh, read_write, no_init);

                        cgh.parallel_for(sycl::range<1>{static_cast<size_t>(Q)}, [=](sycl::id<1> index) {

                                double temp = 0.0;
                                for (int i = 0; i < K; i++) {
                                        temp += sycl::exp(t_acc[index[0] + (Q * i)]);
                                }
                                bsum_exp_acc[index[0]] += temp;

                        });
                });

        } catch (sycl::exception & e) {

                       std::cout << e.what() << std::endl;

    }

}

int main(void) {

        int Q = 2;
        int K = 4;
        int ED = 3;

        float* q_a = new float[Q * ED];
        float* k_a = new float[K * ED];
        float* out_a = new float[Q * K];
        float* v_a = new float[K * ED];
        float* out_f = new float[Q * ED];

        for (int i = 0; i < (Q * ED); i++) {
                q_a[i] = i + 1;
                out_f[i] = 0;
        }
        for (int i = 0; i < (K * ED); i++) {
                k_a[i] = i + 1;
                v_a[i] = i + 1;
        }

        for (int i = 0; i < (Q * K); i++) {
                out_a[i] = 0;
        }

        sdp_cublas_backend(q_a, Q, k_a, K, ED, out_a, v_a, out_f);

        std::cout << "The function ran successfully." << std::endl;

        return 0;

}

I compile this with the following and am able to run everything with no issues:

  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 -I/scratch/users/nkt8/libs/include standalone_test.cpp -o sa_test -L/scratch/users/nkt8/libs/lib -lonemkl -lonemkl_blas_cublas

The Python script used to run the code discussed below is as follows:

from ctypes import CDLL, POINTER, c_float, c_int

# TESTING:
lib = CDLL("~/mylib.so")["testing_func"]
cust = POINTER(c_float)
lib.restype = None
lib.argtypes = None

lib()

print("The FFI finished.")

However, as soon as I attempt to convert this to a shared library and run this through my Python script, I get the following error:

  • Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME)

After searching for this error, I found the following page:

The actual code that creates this error is as follows:

#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>
#include <iostream>

using namespace sycl;
namespace mkl = oneapi::mkl;

extern "C" void sdp_cublas_backend(float* qq, int Q, float* kk, int K, int Embedded_D,  float* out, float* vv, float* out_fin) {
        queue q(gpu_selector_v);
        std::cout << "The device is: " << q.get_device().get_info<info::device::name>() << "\n";

        auto qQ = buffer(qq, range<1> { static_cast<size_t>(Q * Embedded_D) });
        auto kK = buffer(kk, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oO = buffer(out, range<1> { static_cast<size_t>(Q * K) });
        auto vV = buffer(vv, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oF = buffer(out_fin, range<1> { static_cast<size_t>(Q * Embedded_D) });

        int64_t m = Q;
        int64_t n = K;
        int64_t k = Embedded_D;

        int64_t ldq = k; 
        int64_t ldk = k; 
        int64_t ldt = m; 

        float alpha = 1.0;
        float beta = 0.0;

        mkl::blas::column_major::gemm(
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },       // queue
                mkl::transpose::trans,                                                  // transa
                mkl::transpose::nontrans,                                               // transb
                m,                                                                      // m
                n,                                                                      // n
                k,                                                                      // k
                alpha,                                                                  // alpha
                qQ,                                                                     // A (dimensions: k * m) -> transposed
                ldq,                                                                    // lda
                kK,                                                                     // B (dimensions: k * n) -> non-transposed
                ldk,                                                                    // ldb
                beta,                                                                   // beta
                oO,                                                                     // C (dimensions: m * n)
                ldt                                                                     // ldc
        );
        q.wait();

        std::vector<double> sum_exp(Q, 0.0);
        buffer bsum_exp(sum_exp);

        try { q.submit([&](handler& cgh) {

                        accessor t_acc(oO, cgh, read_only);
                        accessor bsum_exp_acc(bsum_exp, cgh, read_write, no_init);

                        cgh.parallel_for(sycl::range<1>{static_cast<size_t>(Q)}, [=](sycl::id<1> index) {

                                double temp = 0.0;
                                for (int i = 0; i < K; i++) {
                                        temp += sycl::exp(t_acc[index[0] + (Q * i)]);
                                }
                                bsum_exp_acc[index[0]] += temp;

                        });
                        std::cout << "We make it here." << std::endl;
                });

                std::cout << "We don't make it here." << std::endl;

        } catch (sycl::exception & e) {

                       std::cout << e.what() << std::endl;

    }

}

extern "C" void testing_func(void) {

        int Q = 2;
        int K = 4;
        int ED = 3;

        float* q_a = new float[Q * ED];
        float* k_a = new float[K * ED];
        float* out_a = new float[Q * K];
        float* v_a = new float[K * ED];
        float* out_f = new float[Q * ED];

        for (int i = 0; i < (Q * ED); i++) {
                q_a[i] = i + 1;
                out_f[i] = 0;
        }

        for (int i = 0; i < (K * ED); i++) {
                k_a[i] = i + 1;
                v_a[i] = i + 1;
        }

        for (int i = 0; i < (Q * K); i++) {
                out_a[i] = 0;
        }

        sdp_cublas_backend(q_a, Q, k_a, K, ED, out_a, v_a, out_f);

        std::cout << "The function completed 'successfully.'" << std::endl;
}

Within the try/catch section I have added some print statements so we can see where execution stops. “We make it here.” is printed from inside the queue submission, but “We don’t make it here.” never appears afterwards, so my guess is that the submission never completes properly. This was compiled and linked in the following manner:

  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 -fPIC -I/scratch/users/nkt8/libs/include par_for_ffi.cpp -c
  • icpx -shared -o mylib.so par_for_ffi.o -L/scratch/users/nkt8/libs/lib -lonemkl -lonemkl_blas_cublas

I also created a version without the parallel_for that appears to run without any errors:

#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>

#include <iostream>

using namespace sycl;
namespace mkl = oneapi::mkl;

extern "C" void sdp_cublas_backend(float* qq, int Q, float* kk, int K, int Embedded_D,  float* out, float* vv, float* out_fin) {

        queue q(gpu_selector_v);
        std::cout << "The device is: " << q.get_device().get_info<info::device::name>() << "\n";

        auto qQ = buffer(qq, range<1> { static_cast<size_t>(Q * Embedded_D) });
        auto kK = buffer(kk, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oO = buffer(out, range<1> { static_cast<size_t>(Q * K) });
        auto vV = buffer(vv, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oF = buffer(out_fin, range<1> { static_cast<size_t>(Q * Embedded_D) });

        int64_t m = Q;
        int64_t n = K;
        int64_t k = Embedded_D;

        int64_t ldq = k;
        int64_t ldk = k;
        int64_t ldt = m;

        float alpha = 1.0;
        float beta = 0.0;

        mkl::blas::column_major::gemm(
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },       // queue
                mkl::transpose::trans,                                                  // transa
                mkl::transpose::nontrans,                                               // transb
                m,                                                                      // m
                n,                                                                      // n
                k,                                                                      // k
                alpha,                                                                  // alpha
                qQ,                                                                     // A (dimensions: k * m) -> transposed
                ldq,                                                                    // lda
                kK,                                                                     // B (dimensions: k * n) -> non-transposed
                ldk,                                                                    // ldb
                beta,                                                                   // beta
                oO,                                                                     // C (dimensions: m * n)
                ldt                                                                     // ldc
        );

        q.wait();

}

extern "C" void testing_func(void) {

        int Q = 2;
        int K = 4;
        int ED = 3;

        float* q_a = new float[Q * ED];
        float* k_a = new float[K * ED];
        float* out_a = new float[Q * K];
        float* v_a = new float[K * ED];
        float* out_f = new float[Q * ED];

        for (int i = 0; i < (Q * ED); i++) {
                q_a[i] = i + 1;
                out_f[i] = 0;
        }

        for (int i = 0; i < (K * ED); i++) {
                k_a[i] = i + 1;
                v_a[i] = i + 1;
        }

        for (int i = 0; i < (Q * K); i++) {
                out_a[i] = 0;
        }

        sdp_cublas_backend(q_a, Q, k_a, K, ED, out_a, v_a, out_f);

        std::cout << "The function finished successfully." << std::endl;

}

This was compiled in the same manner as above and appears to have worked.

I’m not sure whether I’m creating the shared library incorrectly or have gone wrong somewhere else. Does anyone notice anything incorrect or questionable? Any feedback is welcome, and I can supply more information if needed.

Thank you!

Hi @ntomczak,

Thank you for the detailed report! Most things look normal, though I can’t say I’ve tried something as ambitious as this myself!

Can you try adding a kernel name to the parallel_for call in your code? Something like:

// outside function scope
class UniqueKernelName;

// inside the function...
h.parallel_for<UniqueKernelName>(...);

or whatever kernel name you want. My guess is that something isn’t being linked properly, and this might rectify that.

Let us know how you get on,
Duncan.

Hi @duncan ,

Thank you for your suggestion! I apologize for my delayed reply; I have been out for the last few days. I tried adding a kernel name: the standalone executable still works with the change, but I still get the same error as before when calling the shared library from Python. I’ve attached the slightly altered code in case I’m doing something incorrectly:

The additional kernel name is “UniqueSoftMax”

#include <sycl/sycl.hpp>
#include "oneapi/mkl/blas.hpp"
#include <vector>
#include <cmath>

#include <iostream>

using namespace sycl;
namespace mkl = oneapi::mkl;
class UniqueSoftMax;

extern "C" void sdp_cublas_backend(float* qq, int Q, float* kk, int K, int Embedded_D,  float* out, float* vv, float* out_fin) {

        queue q(gpu_selector_v);
        std::cout << "The device is: " << q.get_device().get_info<info::device::name>() << "\n";

        auto qQ = buffer(qq, range<1> { static_cast<size_t>(Q * Embedded_D) });
        auto kK = buffer(kk, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oO = buffer(out, range<1> { static_cast<size_t>(Q * K) });
        auto vV = buffer(vv, range<1> { static_cast<size_t>(K * Embedded_D) });
        auto oF = buffer(out_fin, range<1> { static_cast<size_t>(Q * Embedded_D) });

        int64_t m = Q;
        int64_t n = K;
        int64_t k = Embedded_D;

        int64_t ldq = k; 
        int64_t ldk = k; 
        int64_t ldt = m; 

        float alpha = 1.0;
        float beta = 0.0;

        mkl::blas::column_major::gemm(
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },       // queue
                mkl::transpose::trans,                                                  // transa
                mkl::transpose::nontrans,                                               // transb
                m,                                                                      // m
                n,                                                                      // n
                k,                                                                      // k
                alpha,                                                                  // alpha
                qQ,                                                                     // A (dimensions: k * m) -> transposed
                ldq,                                                                    // lda
                kK,                                                                     // B (dimensions: k * n) -> non-transposed
                ldk,                                                                    // ldb
                beta,                                                                   // beta
                oO,                                                                     // C (dimensions: m * n)
                ldt                                                                     // ldc
        );

        q.wait();

        std::vector<double> sum_exp(Q, 0.0);
        buffer bsum_exp(sum_exp);

        try { q.submit([&](handler& cgh) {

                        accessor t_acc(oO, cgh, read_only);
                        accessor bsum_exp_acc(bsum_exp, cgh, read_write, no_init);

                        cgh.parallel_for<UniqueSoftMax>(sycl::range<1>{static_cast<size_t>(Q)}, [=](sycl::id<1> index) {

                                double temp = 0.0;
                                for (int i = 0; i < K; i++) {
                                        temp += sycl::exp(t_acc[index[0] + (Q * i)]);
                                }
                                bsum_exp_acc[index[0]] += temp;

                        });
                        std::cout << "We make it here." << std::endl;
                });

                q.wait();

                std::cout << "We don't make it here." << std::endl;

        } catch (sycl::exception & e) {

                       std::cout << e.what() << std::endl;

    }

}

extern "C" void testing_func(void) {

        int Q = 2;
        int K = 4;
        int ED = 3;

        float* q_a = new float[Q * ED];
        float* k_a = new float[K * ED];
        float* out_a = new float[Q * K];
        float* v_a = new float[K * ED];
        float* out_f = new float[Q * ED];

        for (int i = 0; i < (Q * ED); i++) {
                q_a[i] = i + 1;
                out_f[i] = 0;
        }

        for (int i = 0; i < (K * ED); i++) {
                k_a[i] = i + 1;
                v_a[i] = i + 1;
        }

        for (int i = 0; i < (Q * K); i++) {
                out_a[i] = 0;
        }

        sdp_cublas_backend(q_a, Q, k_a, K, ED, out_a, v_a, out_f);

        std::cout << "The function completed 'successfully.'" << std::endl;
}

The one inconsistency with my previous example is that I am using an NVIDIA P100-PCIE-12GB here (just what the server had available at the moment; the driver version is identical to the example above). The only adjustment to the compilation method was changing the compute capability from 7.0 to 6.0.

Could there be some other linking approach that I should try? Thanks for your help!

@duncan quick update: I was reading around and came across this page (https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2023-2/use-sycl-shared-library-with-third-party.html) about using SYCL shared libraries, and noticed that it specifies the SYCL target architecture both when creating the object file and when linking.

So, the following compilation method worked for me (on the V100 for this specific example):

  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 -fPIC -I/scratch/users/nkt8/libs/include par_for_ffi.cpp -c

  • icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 -fPIC -shared -o mylib.so par_for_ffi.o -L/scratch/users/nkt8/libs/lib -lonemkl -lonemkl_blas_cublas -lm

This works with and without the kernel name identifier that you suggested above, so I will keep using the kernel name just to be safe.
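
For anyone following along, with the relinked library the C entry point can also be called directly from Python rather than going through testing_func. This is only a sketch that mirrors the inputs testing_func builds internally; the library path is a placeholder and the argtypes just follow the extern "C" declaration above.

import os
from ctypes import CDLL, POINTER, c_float, c_int

Q, K, ED = 2, 4, 3

# Placeholder path to the relinked shared library.
lib = CDLL(os.path.expanduser("~/mylib.so"))

# void sdp_cublas_backend(float* qq, int Q, float* kk, int K, int Embedded_D,
#                         float* out, float* vv, float* out_fin)
lib.sdp_cublas_backend.restype = None
lib.sdp_cublas_backend.argtypes = [
    POINTER(c_float), c_int,
    POINTER(c_float), c_int, c_int,
    POINTER(c_float),
    POINTER(c_float),
    POINTER(c_float),
]

# The same inputs that testing_func constructs, built on the Python side instead.
q_a   = (c_float * (Q * ED))(*range(1, Q * ED + 1))
k_a   = (c_float * (K * ED))(*range(1, K * ED + 1))
out_a = (c_float * (Q * K))()
v_a   = (c_float * (K * ED))(*range(1, K * ED + 1))
out_f = (c_float * (Q * ED))()

lib.sdp_cublas_backend(q_a, Q, k_a, K, ED, out_a, v_a, out_f)

print(list(out_a))  # the Q x K scores copied back when the SYCL buffer is destroyed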

I’ve tested this method (changing the compute capabilities between examples) on the following NVIDIA GPUs and it has worked perfectly:

  • P100-PCIE-12GB

  • RTX 2080 Ti

  • V100-SXM2-32GB

  • A100-SXM4-80GB

Thank you very much for your help working through this!


Hi @ntomczak,

That’s great, I’m glad it’s working! I’m sorry for not catching that the targets were missing from the linking command too; I really should have spotted it!

Duncan.


No worries, I’d seen that page before and missed it myself; those small details can be hard to spot. I appreciate your help!