Cannot profile SYCL kernel using host device

Hi!

I tried measuring the runtime of a kernel on the host device, but the events always return 0. These are the utility functions involved:

namespace util
{
    template <cl::sycl::info::event_profiling From,
              cl::sycl::info::event_profiling To,
              typename Dur = std::chrono::nanoseconds>
    auto get_duration(cl::sycl::event ev)
    {
        using namespace std::chrono;
        using cl::sycl::info::event_profiling;

        ev.wait_and_throw();
        
        // Use the From/To template parameters instead of hard-coding start/end.
        return duration_cast<Dur>(nanoseconds{ ev.get_profiling_info<To>() -
                                               ev.get_profiling_info<From>() } );
    }

    template <typename F>
    auto get_avarage_duration(F&& f, std::size_t count)
    {
        using duration_type = decltype(f());

        duration_type dur{0};

        for(std::size_t i = 0 ; i < count ; ++i)
            dur += f();
        
        return dur / count;
    }
}

And it is used like this:

auto reader_on_the_fly_init = util::get_avarage_duration([&]()
{
    return util::get_duration<cl::sycl::info::event_profiling::command_start,
                              cl::sycl::info::event_profiling::command_end,
                              std::chrono::microseconds>(queue.submit([&](cl::sycl::handler& cgh)
    {
        auto in  = buf_in.get_access<cl::sycl::access::mode::read>(cgh);
        auto out = buf_out.get_access<cl::sycl::access::mode::write>(cgh);

        cgh.parallel_for<kernels::reader_on_the_fly_init>(in.get_range(), [=](cl::sycl::item<2> item)
        {
            cl::sycl::float4 old = in[item];

            out[item] = some_calc(old);
        });
    }));
}, runs);

This seems to work for OpenCL device types, but not for host execution. Host enqueue in ComputeCpp 1.1.4 is synchronous (no problem with that), but the event does not hold any profiling info: all timestamps are 0. The SYCL 1.2.1 spec under chapter 3.11 says:

Any kernel enqueued to a host queue executes on the host device according to the same rules as the OpenCL devices.

IMHO, returning zeroed timestamps is non-conforming. I understand that “dispatching” kernels on the host is a flaky concept; however, the SYCL spec suggests that this scenario does not require any special handling, especially given that cl::sycl::device::get_info<cl::sycl::info::device::queue_profiling>() returns true for the host device. Either let users know that profiling on the host is not available, so they can resort to regular timers, or fix host profiling.

Hi,
I have taken a quote from this article on profiling:
“if running on the host device, the ComputeCpp implementation cannot provide us with any profiling information from the SYCL events (use standard host timer for that purpose instead).”
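A minimal sketch of what such a host-timer fallback might look like, assuming the callable blocks until the kernel has finished (for example by calling wait() on the event returned from queue.submit). The name get_host_duration is hypothetical and not part of ComputeCpp or SYCL; it only uses std::chrono::steady_clock from the standard library.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>

namespace util
{
    // Hypothetical helper (not from the original post or from ComputeCpp):
    // measures wall-clock time of a callable with a monotonic host clock
    // instead of relying on SYCL event profiling info.
    template <typename Dur = std::chrono::nanoseconds, typename F>
    Dur get_host_duration(F&& f)
    {
        const auto start = std::chrono::steady_clock::now();

        // The callable is assumed to return only once the work is done,
        // e.g. by waiting on the event returned from queue.submit().
        std::forward<F>(f)();

        const auto stop = std::chrono::steady_clock::now();

        return std::chrono::duration_cast<Dur>(stop - start);
    }
}
```

Since it returns a std::chrono duration, a lambda that calls util::get_host_duration<std::chrono::microseconds>([&]{ queue.submit(...).wait(); }) could presumably be passed to get_avarage_duration in the same way as the event-based version. Note that a host timer also includes submission and scheduling overhead, so the numbers are not directly comparable to command_start/command_end timestamps from an OpenCL device.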
Thanks,
Rod.