Common optimisations in SYCL
Indexing
The SYCL Specification (Sec. 4.9.1) mandates that:
"When constructing multi-dimensional ids or ranges from integers, the elements are written such that the right-most element varies fastest in a linearization of the multi-dimensional space."
which follows the C++ array indexing convention. For this reason, in SYCL the right-most dimension maps to what CUDA and HIP commonly call the x-dimension. For example, a 3D workspace with Nwi_x, Nwi_y, Nwi_z work items per work group and Nwg_x, Nwg_y, Nwg_z work groups would be described by:
sycl::nd_range<3>{{Nwg_z*Nwi_z, Nwg_y*Nwi_y, Nwg_x*Nwi_x}, {Nwi_z, Nwi_y, Nwi_x}}
and a 3D memory array inside a kernel should be accessed with the corresponding global indices id_x, id_y, id_z as:
arr[id_z][id_y][id_x]
Failing to follow this convention may lead to performance issues due to non-coalesced global memory accesses or bank conflicts in local memory. Further details on linearization can be found in Sec. 3.11, Multi-dimensional objects and linearization of the SYCL Specification.
Compiling for a specific architecture
Compiling for a generic architecture means that:
- The SYCL runtime will need to compile the code from a generic intermediate representation into a specific device binary during execution (just-in-time compilation, JIT).
- The compiler cannot make use of newer and possibly more efficient instructions which are only available in newer architectures than the default oldest-supported one.
Compiling for a specific architecture ahead of time (AOT compilation) avoids both of these costs and can therefore bring performance benefits. DPC++ accepts specific architecture flags for SYCL device code through:
-Xsycl-target-backend=<target> --offload-arch=<arch>
where <target> is one of the targets specified in the -fsycl-targets flag and <arch> is the target architecture. Examples:
- -Xsycl-target-backend=nvptx64-nvidia-cuda --offload-arch=sm_80 for NVIDIA GPUs with CUDA Compute Capability 8.0, such as the NVIDIA A100
- -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a for AMD GPUs with the gfx90a architecture, such as the AMD Instinct MI210
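Putting the pieces together, a full AOT compile line for an NVIDIA A100 might look as follows (a sketch: the source file name app.cpp is a placeholder, and icpx is assumed as the DPC++ compiler driver; substitute your own):

```shell
# AOT-compile SYCL device code for CUDA Compute Capability 8.0 (e.g. A100).
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
     -Xsycl-target-backend=nvptx64-nvidia-cuda --offload-arch=sm_80 \
     app.cpp -o app
```
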
Inlining
The degree to which DPC++ automatically inlines functions is chosen to balance performance with compilation time across a range of devices. However, programmers can also add the always_inline or noinline attributes to specific functions to force or prevent their inlining, for example:
[[clang::always_inline]] void function(...) {
...
}
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(...);
});
});
We advise that due care is taken when manually marking functions inline. Be aware that whilst manually inlining a given function may lead to greater performance on some devices, other devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics in future releases.
Fast Math Builtins
The SYCL math builtins are defined to match the precision requirements of the equivalent OpenCL 1.2 math builtins, which may be unnecessarily precise for some applications, causing an avoidable loss of performance.
To address this, the SYCL specification provides a native version of a subset of the math functions (full list in Sec. 4.17.5, Math functions), which trades precision for performance. They are defined within the native namespace. For example, the native version of sycl::cos() is sycl::native::cos().
In general, if precision is not an issue, using the native variants may provide significant improvements, although do note that not all backends make use of the relaxed precision for all builtins.
Note that in DPC++ one of the effects of the -ffast-math compilation flag is to swap standard sycl:: math functions for the corresponding sycl::native:: ones where they are available. If there is no native version for a given math function, the -ffast-math flag has no effect on it.
Loop Unrolling
While the compiler will handle some loop unrolling automatically, it can sometimes be beneficial to help the compiler by manually tuning the unrolling of compute-intensive loops in the device code, for example by using the unrolling pragma as follows:
#pragma unroll <unroll factor>
for( ... ) {
...
}
We advise that due care is taken when applying manual unrolling; be aware that whilst manually unrolling a given loop may lead to greater performance on some devices, other devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics with each new release.
Alias Analysis
Alias Analysis can prove that two memory references do not alias each other, which may enable optimizations. By default, the compiler must assume that memory references do alias unless the alias analysis proves otherwise. It is also possible, however, to explicitly signal to the compiler that a memory reference inside the device code is not aliased. This can be achieved with dedicated mechanisms for the buffer/accessor and USM models.
For the former, one can add the no_alias property from the DPC++ oneapi extension to an accessor:
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc{..., sycl::ext::oneapi::accessor_property_list{sycl::ext::oneapi::no_alias}};
...
});
For the latter, the __restrict__ qualifier can be added to a pointer. Note that __restrict__ is non-standard C++ and may not behave consistently across SYCL implementations. For DPC++, only restrict-qualified parameters of device functions (functions called from within a SYCL kernel) are taken into account.
For example:
void function(int *__restrict__ ptr) {
...
}
...
int *ptr = sycl::malloc_device<int>(..., q);
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(ptr);
});
});
A more brute-force approach in DPC++ is to add the [[intel::kernel_args_restrict]] attribute to a kernel. This signals the compiler to ignore all possible alias dependencies inside the kernel between the USM pointers, or between the buffer accessors if that model is used.
Example (buffer/accessor model):
q.submit([&](sycl::handler &cgh) {
sycl::accessor in_accessor(in_buf, cgh, sycl::read_only);
sycl::accessor out_accessor(out_buf, cgh, sycl::write_only);
cgh.single_task<NoAliases>([=]() [[intel::kernel_args_restrict]] {
for (int i = 0; i < N; i++)
out_accessor[i] = in_accessor[i];
});
});