Profiling with nsys total execution time

I am using an NVIDIA A100 GPU with the SYCL Codeplay plugin. Currently I am trying to get the total execution time of my kernels with the following:

nsys profile ./gpu/execution.gpu
nsys stats -r cuda_kern_exec_sum report1.nsys-rep 

When I check the printed report, it shows each of my kernels with its instance count and total execution time in nanoseconds:
A = 140,000,000 ns
B = 136,000,000 ns
C = 175,000,000 ns
D = 156,000,000 ns
The crucial part is that I have 4 kernels running in parallel, so I don’t want to inspect all the execution times by hand to figure out which one to report; at a later stage there will be tens of kernels, which would make this tedious.

Is there an option to just print the total execution time from the first kernel start to the last kernel finish?

Hi @br-ko,
this is an interesting question. Unfortunately, I am not aware of any tool that would give you this information directly. There are two approaches you could take for this.

Approach 1 - instrumenting your code

You could use the built-in SYCL profiling event information. You’d need to:

  • add the sycl::property::queue::enable_profiling property to your queue
  • capture every event returned by queue::submit for your kernel launches into a vector
  • call sycl::event::wait on that vector to make sure all kernels have finished (strictly speaking this is not needed, because according to the specification querying the start/end info waits for the kernel start/end to happen)
  • loop over the vector and query the info::event_profiling::command_start and command_end for each event to find the minimum and maximum timestamp.

See the event info specification. This approach uses only standard SYCL and therefore should work on any backend, not just CUDA.

Approach 2 - parsing raw nsys data

All the information is already available in the raw output of nsys, but there is no built-in command-line option to print it directly. You can export the data to several formats, including SQLite, JSON, or text, and write your own tool to parse it. I was able to obtain the time between the first kernel start and the last kernel finish in the entire program with these two lines:

nsys export -t text -o report1.txt report1.nsys-rep
grep -B9 '^  kernel {' report1.txt | grep 'start\|end' | awk 'BEGIN{s=1e12;e=0;}{s=(s<$2 ? s : $2);e=(e>$2 ? e : $2);}END{print (e-s)/1e6 " ms"}'

Breaking down the bash one-liner, this:

  • Finds kernel launches in the text file and prints the 9 preceding lines (which include the timestamps)
  • Filters the resulting text to only print timestamps
  • Runs an awk script to find the minimum and maximum value in the second column, which is the timestamp in nanoseconds. At the end, the script prints the difference, converted to milliseconds.

This solution is of course specific to nsys and its data formatting. I assume the formatting may change between nsys versions, so it’s not very robust, but it’s quick and “free” (one line of code). I tested it with nsys version 2023.4.4.54-234433681190v0.


Thank you for the detailed answer again. I tried approach 2 following your command-line instructions, and it worked like a charm on the newest version of nsys. I prefer approach 2 since it is more automated and handles parallel kernels better: in my case I assumed the 4 kernels ran in parallel, but inspecting the report in the GUI showed they were actually executed sequentially, so I could have gotten the wrong impression that my program had become faster.

Thank you again! 🙂