Performance FAQ and best practices

This topic collects common questions and advice on achieving the best performance with SYCL applications. If the information here does not help, or you need more specific advice, please open a new topic in the SYCL performance category. Additional guidance can be found in the Codeplay Developer portal pages about oneAPI for NVIDIA and AMD GPUs.

Getting started with performance optimisation

Avoid premature optimisation

First ensure the correctness of your application, then use profiling tools to gather information on potential optimisation opportunities. Rely on collected data and avoid making assumptions. Use target-specific profiling tools with SYCL applications:

  • NVIDIA Nsight Systems and Nsight Compute for the NVIDIA GPU backend (see the example below)
  • ROCProfiler for the AMD GPU backend
  • Intel VTune for Intel GPU and CPU backends
  • Other CPU profiling tools like perf or callgrind with CPU backends
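
For example, a minimal Nsight Systems invocation to capture an execution timeline of a SYCL application on the NVIDIA backend might look like the following (the binary name ./app and the report name are illustrative):

nsys profile -o report ./app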

Find the slowest portion of your application

Use profiling tools to find out whether your application spends most of its time executing a specific offloaded kernel, or perhaps transferring memory between the host and the device. It could also be suffering from inefficient synchronisation, e.g. unnecessarily waiting for a kernel to finish executing before scheduling an independent memory transfer which could overlap with the kernel. Use the overview type of analysis in the profiling tools to gather global statistics and see a timeline of your application’s execution.

Common optimisations in SYCL

Indexing

The SYCL Specification (Sec. 4.9.1) mandates that:

When constructing multi-dimensional ids or ranges from integers, the elements are written such that the right-most element varies fastest in a linearization of the multi-dimensional space.

which follows the C++ array indexing convention. For this reason, the right-most dimension in SYCL maps to what is commonly called the x-dimension in CUDA or HIP. For example, a 3D workspace with Nwi_x, Nwi_y, Nwi_z work items per work group and Nwg_x, Nwg_y, Nwg_z work groups would be described by:

sycl::nd_range<3>{{Nwg_z*Nwi_z, Nwg_y*Nwi_y, Nwg_x*Nwi_x}, {Nwi_z, Nwi_y, Nwi_x}}

and a 3D memory array inside a kernel should be accessed with the corresponding global indices id_x, id_y, id_z as:

arr[id_z][id_y][id_x]

Failing to follow this convention may lead to performance issues due to non-coalesced global memory accesses or bank conflicts in local memory. Further details on linearization can be found in Sec. 3.11, Multi-dimensional objects and linearization of the SYCL Specification.
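
As a minimal sketch (the queue q, the float pointers src and dst, and the extents Nz, Ny, Nx are illustrative), a kernel following this convention accesses a linearized 3D array so that the fastest-varying index id_x addresses contiguous memory:

constexpr size_t Nz = 4, Ny = 8, Nx = 64;

q.parallel_for(sycl::range<3>{Nz, Ny, Nx}, [=](sycl::id<3> idx) {
  const size_t id_z = idx[0], id_y = idx[1], id_x = idx[2];
  // Consecutive work items in the right-most dimension access consecutive
  // elements, giving coalesced global memory accesses.
  dst[(id_z * Ny + id_y) * Nx + id_x] = src[(id_z * Ny + id_y) * Nx + id_x];
});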

Compiling for a specific architecture

Compiling for a generic architecture means that:

  • The SYCL runtime will need to compile the code from a generic intermediate representation into a device-specific binary during execution (just-in-time compilation, JIT).
  • The compiler cannot make use of newer, possibly more efficient instructions which are only available on architectures newer than the default oldest-supported one.

For these reasons, compiling for a specific architecture ahead of time (AOT compilation) can bring performance benefits. DPC++ accepts specific architecture flags for SYCL device code through:

-Xsycl-target-backend=<target> --offload-arch=<arch>

where <target> is one of the targets specified in the -fsycl-targets flag and <arch> is the target architecture. Examples:

  • -Xsycl-target-backend=nvptx64-nvidia-cuda --offload-arch=sm_80 for NVIDIA GPUs with CUDA Compute Capability 8.0, such as the NVIDIA A100
  • -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a for AMD GPUs with the gfx90a architecture, such as the AMD Instinct MI210
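
As a sketch, a full compile invocation targeting an NVIDIA A100 with the DPC++ icpx driver might look like the following (the source and output file names are illustrative):

icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
  -Xsycl-target-backend=nvptx64-nvidia-cuda --offload-arch=sm_80 \
  app.cpp -o app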

Inlining

The degree to which DPC++ automatically inlines functions is chosen to balance performance with compilation time across a range of devices. However, programmers can also add the always_inline or noinline attributes to specific functions to force or prevent their inlining, for example:

[[clang::always_inline]] void function(...) {
  ...
}

...

q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(..., [=](...) {
    function(...);
  });
});

We advise taking due care when manually marking functions for inlining. Be aware that whilst manually inlining a given function may lead to greater performance on some devices, other devices may encounter drops in performance for the same code. We will continue to improve the compiler optimisation heuristics in future releases.

Fast Math Builtins

The SYCL math builtins are defined to match the precision requirements of the equivalent OpenCL 1.2 math builtins, which may be unnecessarily precise for some applications, causing an avoidable loss of performance.

To address this, the SYCL specification provides a native version of a subset of the math functions (full list in Section 4.17.5, Math functions), which trades precision for performance. They are defined within the native namespace. For example, the native version of sycl::cos() is sycl::native::cos().

In general, if precision is not an issue, using the native variants may provide significant improvements, although do note that not all backends make use of the relaxed precision for all builtins.

Note that in DPC++ one of the effects of the -ffast-math compilation flag is replacing standard sycl:: math functions with the corresponding sycl::native:: ones where available. If a given math function has no native version, the -ffast-math flag has no effect on it.
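
As a minimal sketch (the queue q, the float device pointers in and out, and the size n are illustrative), switching a kernel to a native builtin is a one-line change:

q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
  // sycl::cos(in[i]) meets the OpenCL 1.2 precision requirements;
  // the native variant trades precision for speed on some backends.
  out[i] = sycl::native::cos(in[i]);
});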

Loop Unrolling

While the compiler will handle some loop unrolling automatically, it can sometimes be beneficial to help it by manually tuning the unrolling of compute-intensive loops in the device code, for example by using the unrolling pragma as follows:

#pragma unroll <unroll factor>
for( ... ) {
 ...
}
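
For instance, a concrete (illustrative) use unrolling an inner product loop by a factor of 4:

#pragma unroll 4
for (int k = 0; k < 16; ++k) {
  sum += a[k] * b[k];
}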

We advise that due care is taken when applying manual unrolling; be aware that whilst manually unrolling a given loop may lead to greater performance on some devices, other devices may encounter drops in performance for the same code. We will continue to improve the compiler optimisation heuristics with each new release.

Alias Analysis

Alias analysis can prove that two memory references do not alias each other, which may enable further optimisations. By default, the compiler must assume that memory references alias unless the alias analysis proves otherwise. It is also possible, however, to explicitly signal to the compiler that a memory reference inside the device code is not aliased. This is achieved differently in the buffer/accessor and USM models.

For the former, one can add the no_alias property from the DPC++ oneAPI extension to an accessor:

q.submit([&](sycl::handler &cgh) {
  sycl::accessor acc{...,
      sycl::ext::oneapi::accessor_property_list{sycl::ext::oneapi::no_alias}};
  ...
});

For the latter, the __restrict__ qualifier can be added to a pointer. Note that __restrict__ is non-standard C++ and may not behave consistently across SYCL implementations. In DPC++, only restrict-qualified parameters of device functions (functions called from within a SYCL kernel) are taken into account.

For example:

void function(int *__restrict__ ptr) {
    ...
}

...
int *ptr = sycl::malloc_device<int>(..., q);
...
q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(..., [=](...) {
    function(ptr);
  });
});

A more brute-force approach in DPC++ is to add the [[intel::kernel_args_restrict]] attribute to a kernel. This signals the compiler to ignore all possible alias dependencies between the USM pointers, or buffer accessors if that model is used, inside the kernel.

Example (buffer/accessor model):

q.submit([&](sycl::handler &cgh) {
  sycl::accessor in_accessor(in_buf, cgh, sycl::read_only);
  sycl::accessor out_accessor(out_buf, cgh, sycl::write_only);
  cgh.single_task<class NoAliases>([=]() [[intel::kernel_args_restrict]] {
    for (int i = 0; i < N; i++)
      out_accessor[i] = in_accessor[i];
  });
});
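
A similar sketch for the USM model (the queue q, the size N, and the device pointers in and out, assumed allocated with sycl::malloc_device, are illustrative):

q.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) [[intel::kernel_args_restrict]] {
  // The compiler may assume the captured USM pointers do not alias.
  out[i] = in[i];
});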