I have multiple kernels which are spawned in a flow graph on multiple devices. For load-balancing reasons, these kernels need to run on different devices chosen by an upper layer. Is there an example of how to select a device before a SYCL queue is set up? Figuratively, I am looking for this kind of flow:
if (cpu_device) {
cl::sycl::cpu_selector device_selector;
} else if (gpu_device) {
cl::sycl::gpu_selector device_selector;
} else {
cl::sycl::default_selector device_selector;
}
cl::sycl::queue sycl_queue(device_selector);
Thanks in advance.
Hi @farshad.akhbari, the simple answer is that the selectors are part of an inheritance hierarchy, so you could use a std::unique_ptr<sycl::device_selector> to store whatever selector you eventually choose. That said, it is almost always better to create queues once on application startup and reuse them if possible, so I would recommend creating and caching a queue per device when initialising your graph, then submitting work based on what device your upper layers require.
If you must create a queue each time, which I would strongly recommend against, you could also use a custom selector, like in some sample code we have on GitHub.
I hope this helps,
Duncan.
Hi Duncan,
The intent is really not to recreate queues but to set them up in the object’s constructor (the flow graph object, that is). I think the solution I am looking for is to use std::unique_ptr, but I am not sure how to use it properly. I see a null pointer at runtime when I try this. Is there a working example I can compare mine against?
Thanks,
Farshad.
You could do something like:
// requires <memory>; the pointee type is the base class, not a pointer to it
std::unique_ptr<sycl::device_selector> selector = std::make_unique<sycl::default_selector>();
if (use_cpu_device) {
selector = std::make_unique<sycl::cpu_selector>();
} // else if ... and so on
That said, if you’re putting that much logic into it, you’re honestly as well wrapping it all into a custom selector, much like the sample shows.
You mean the unique_ptr below? I am not certain that sycl::queue can dynamically find the right constructor:
error: no matching constructor for initialization of ‘cl::sycl::queue’
cl::sycl::queue sycl_queue(device_selector);
^ ~~~~~~~~~~~~~~~
/CL/sycl/queue.hpp:29:12: note: candidate constructor not viable: no known conversion from ‘std::unique_ptr<cl::sycl::device_selector *, std::default_delete<cl::sycl::device_selector *> >’ to ‘const cl::sycl::property_list’ for 1st argument
explicit queue(const property_list &propList = {})
It behaves just like a pointer, so you can dereference it to get the right type:
sycl::queue queue(*selector);
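For comparison, here is a minimal sketch of that pattern in plain C++. It deliberately uses stand-in types: Selector, CpuSelector, GpuSelector, DefaultSelector, Queue and DeviceKind are all hypothetical names, not SYCL classes, so the sketch compiles without a SYCL toolchain. With SYCL 1.2.1 the analogous base class would be cl::sycl::device_selector. The two points it tries to show are that the unique_ptr's template argument is the base class itself rather than a pointer to it, and that the pointer is dereferenced when the queue is constructed.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for the selector hierarchy (not SYCL types).
struct Selector {
  virtual ~Selector() = default;
  virtual std::string name() const = 0;
};
struct CpuSelector : Selector {
  std::string name() const override { return "cpu"; }
};
struct GpuSelector : Selector {
  std::string name() const override { return "gpu"; }
};
struct DefaultSelector : Selector {
  std::string name() const override { return "default"; }
};

// Hypothetical stand-in for a queue constructed from a selector reference.
struct Queue {
  explicit Queue(const Selector& s) : device(s.name()) {}
  std::string device;
};

enum class DeviceKind { Cpu, Gpu, Default };

// The pointee type is the base class, so any derived selector can be stored.
std::unique_ptr<Selector> make_selector(DeviceKind kind) {
  switch (kind) {
    case DeviceKind::Cpu: return std::make_unique<CpuSelector>();
    case DeviceKind::Gpu: return std::make_unique<GpuSelector>();
    default:              return std::make_unique<DefaultSelector>();
  }
}

// Dereferencing the unique_ptr yields a Selector&, which the constructor accepts.
Queue make_queue(DeviceKind kind) {
  std::unique_ptr<Selector> selector = make_selector(kind);
  return Queue(*selector);
}
```

Calling make_queue(DeviceKind::Gpu) in this sketch builds a queue from the GPU stand-in. Passing the unique_ptr itself instead of *selector would fail to compile, in the same way as the error quoted above.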
Duncan,
I need a complete solution. This is a serious DX issue. Can you send me a working sample code after verification?
Regards,
Farshad.
I’ve thought a little about your situation, and tried to come up with a different approach. If you are using every device on the system, I imagine it is easier to enumerate them and create a queue for each one directly, then use those queues in the lower layer.
auto platforms = sycl::platform::get_platforms();
std::vector<sycl::queue> queues;
for (auto plat : platforms) {
for (auto dev : plat.get_devices()) {
queues.push_back(sycl::queue{dev});
}
}
This will give you a vector of queues which you can distribute among these lower layers.
Otherwise, the selector example I linked in a previous reply will show you how to encapsulate the logic of which device to choose inside a single class that you can use in your lower layers.
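To illustrate the idea behind a custom selector without requiring a SYCL toolchain, here is a rough sketch in plain C++. In SYCL 1.2.1 a custom selector derives from cl::sycl::device_selector and overrides int operator()(const cl::sycl::device&) const; the runtime picks the device with the highest score and never picks one with a negative score. Everything below (DeviceType, Device, PreferTypeSelector, select_device) is a hypothetical stand-in that mimics that scoring rule; it is not the linked sample and not the SYCL API.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for a device and its type (not SYCL types).
enum class DeviceType { Cpu, Gpu, Accelerator };
struct Device {
  DeviceType type;
  std::string name;
};

// Scores each device: positive for the wanted type, negative otherwise.
class PreferTypeSelector {
 public:
  explicit PreferTypeSelector(DeviceType wanted) : wanted_(wanted) {}
  int operator()(const Device& d) const {
    return d.type == wanted_ ? 100 : -1;
  }

 private:
  DeviceType wanted_;
};

// Returns the highest-scoring device; devices with a negative score are skipped.
const Device& select_device(const std::vector<Device>& devices,
                            const PreferTypeSelector& score) {
  const Device* best = nullptr;
  int best_score = -1;
  for (const Device& d : devices) {
    const int s = score(d);
    if (s > best_score) {
      best = &d;
      best_score = s;
    }
  }
  if (best == nullptr) {
    throw std::runtime_error("no suitable device");
  }
  return *best;
}
```

The appeal of this shape is that all of the "which device?" logic lives in one small class, so the lower layers only need to be handed a selector (or, better, a queue that was already built from one).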
Oh my lord!!
There is a known platform. The CL capabilities of the platform are already known. The dilemma is how to allocate queues as the upper layer spawns work. Since we are at runtime, querying devices at this point is very costly because SYCL kernel execution is halfway down the pipeline, unless I move the query to the very beginning, which I can do (though that is a different challenge). At the FG point of entry, I need to send kernel A to the CPU, kernels B, C and D to the GPU, and another kernel to an FPGA device. I hope there is a solution that would help me set up the proper devices and their respective queues with minimal complexity. If there is setup and initialization time, I can move it up the stack to avoid added latency.
Given the above info, any pointers?
Regards,
Farshad.
Yes, that’s helpful. Queue creation is definitely one of the slower steps in a SYCL program from my experience. If you know exactly which devices will be executing the work, I would recommend moving the queue construction earlier. Could you store the queues as members of the lower layers? That way the queue will be ready to submit work to at the points you need it, and since you have a constrained number of devices to begin with it shouldn’t be too hard to do. I don’t think we’ve got any larger samples showing multiple device queues.
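As a rough sketch of what storing the queues as members could look like, the following plain C++ outline builds one queue per device in the constructor and only does a lookup when work is submitted. It uses stand-in types so that it compiles without SYCL: Queue, DeviceKind and FlowGraph are hypothetical names, and the stand-in Queue merely records kernel names. In real code each map entry would hold a sycl::queue built once from the appropriate selector or device.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class DeviceKind { Cpu, Gpu, Fpga };

// Hypothetical stand-in for a queue: it only records what was submitted.
struct Queue {
  explicit Queue(std::string device_name) : device(std::move(device_name)) {}
  void submit(const std::string& kernel) { submitted.push_back(kernel); }
  std::string device;
  std::vector<std::string> submitted;
};

class FlowGraph {
 public:
  // All queues are created once here, before any work is spawned.
  FlowGraph() {
    queues_.emplace(DeviceKind::Cpu, Queue{"cpu"});
    queues_.emplace(DeviceKind::Gpu, Queue{"gpu"});
    queues_.emplace(DeviceKind::Fpga, Queue{"fpga"});
  }

  // Hot path: a map lookup only, no device query and no queue construction.
  void run(DeviceKind kind, const std::string& kernel) {
    queues_.at(kind).submit(kernel);
  }

  const Queue& queue(DeviceKind kind) const { return queues_.at(kind); }

 private:
  std::map<DeviceKind, Queue> queues_;
};
```

With this layout, the mapping described earlier in the thread becomes a handful of calls such as graph.run(DeviceKind::Cpu, "A") and graph.run(DeviceKind::Gpu, "B"), and the setup cost is paid once when the FlowGraph object is constructed.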