Profiling total execution time with nsys

Hi,
I am using an NVIDIA A100 GPU with the Codeplay SYCL plugin. Currently I am trying to get the total execution time of my kernels with the following:

nsys profile ./gpu/execution.gpu
nsys stats -r cuda_kern_exec_sum report1.nsys-rep 

When I check the printed report, it shows each of my kernels with its instance count and total execution time in nanoseconds:
A = 140,000,000 ns
B = 136,000,000 ns
C = 175,000,000 ns
D = 156,000,000 ns
The crucial part is that I have 4 kernels running in parallel, so I don’t want to look through all the execution times myself to figure out the largest one to report. At a later stage there will be tens of kernels, so that would be tough to deal with.

Is there an option to just print the total execution time from the first kernel start to the last kernel finish?

Hi @br-ko,
this is an interesting question. Unfortunately, I am not aware of any tool that would give you this information directly. There are two approaches you could take for this.

Approach 1 - instrumenting your code

You could use the built-in SYCL profiling event information. You’d need to:

  • add the sycl::property::queue::enable_profiling property to your queue
  • capture every event returned by queue::submit for your kernel launches into a vector
  • call sycl::event::wait on that vector to make sure all kernels have finished (strictly not needed, because querying the start/end info waits for the kernel start/end to happen, according to the specification)
  • loop over the vector and query the info::event_profiling::command_start and command_end for each event to find the minimum and maximum timestamp.

See the event info specification. This approach uses only standard SYCL and therefore should work on any backend, not just CUDA.
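
For illustration, a minimal sketch of this approach could look like the following (the kernel bodies, ranges and the number of submissions are placeholders; only the profiling logic follows the steps above):

#include <sycl/sycl.hpp>
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <vector>

int main() {
  // Enable profiling so events carry start/end timestamps.
  sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::enable_profiling{}};

  std::vector<sycl::event> events;

  // Placeholder submissions - replace with your real kernels.
  for (int k = 0; k < 4; ++k) {
    events.push_back(q.submit([&](sycl::handler& cgh) {
      cgh.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) { /* kernel body */ });
    }));
  }

  // Optional: the profiling queries below also block until the
  // respective start/end has happened.
  sycl::event::wait(events);

  uint64_t first_start = std::numeric_limits<uint64_t>::max();
  uint64_t last_end = 0;
  for (auto& e : events) {
    first_start = std::min(first_start,
        e.get_profiling_info<sycl::info::event_profiling::command_start>());
    last_end = std::max(last_end,
        e.get_profiling_info<sycl::info::event_profiling::command_end>());
  }

  std::cout << "First start to last end: " << (last_end - first_start) / 1e6 << " ms\n";
  return 0;
}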

Approach 2 - parsing raw nsys data

All the information is already available in the raw output of nsys; however, there is no command-line tool to obtain it directly. You can export the data to many formats, including sqlite, json or text, and write your own tool to parse it. I was able to obtain the time between when the first kernel starts and when the last kernel finishes in the entire program with these two lines:

nsys export -t text -o report1.txt report1.nsys-rep
grep -B9 '^  kernel {' report1.txt | grep 'start\|end' | awk 'BEGIN{s=1e12;e=0;}{s=(s<$2 ? s : $2);e=(e>$2 ? e : $2);}END{print (e-s)/1e6 " ms"}'

Breaking down the bash one-liner, this:

  • Finds kernel launches in the text file and prints the preceding 9 lines (which include timestamps)
  • Filters the resulting text to only print timestamps
  • Runs an awk script to find the minimum and maximum values in the second column, which is the timestamp in nanoseconds. The script prints the difference at the end, converted to milliseconds.

This solution is of course specific to nsys and its data formatting. I assume the formatting may change between nsys versions, so it’s not very flexible, but it’s quick and “free” (one line of code). I tested this using nsys version 2023.4.4.54-234433681190v0.

Thank you for the detailed answer again. I tried Approach 2 with your command-line instructions and it worked like a charm with the newest version of nsys. I prefer Approach 2 since it is more automated and handles parallel kernels better: for example, in my case I assumed the 4 kernels would run in parallel, but inspecting the report in the GUI showed me that they were actually executed sequentially, so I could have gotten the wrong impression that my program had become faster.

Thank you again! 🙂

Hi @rbielski ,

I noticed that in 2 consecutive runs (without rebuilding), Approach 2 showed execution times of 9.50 ms and 75 ms, whereas for the same runs the execution times captured by Approach 1 are 3.88 ms and 3.84 ms.

Looking at the nsys reports, I noticed the following:

1st Run:

 ** OS Runtime Summary (osrt_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)     Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  -------------  --------  -----------  ------------  ----------------------
     63.3    2,675,490,863         35  76,442,596.1  100,108,943.0   255,806  339,113,956  61,035,452.3  poll                  
     34.0    1,435,584,138        611   2,349,564.9    1,360,573.0     1,002   37,396,376   3,639,243.4  ioctl  

2nd Run:

 ** OS Runtime Summary (osrt_sum):

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)            Name         
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ----------------------
     78.7    1,494,585,725         24  62,274,405.2  47,293,382.0   275,764  230,479,141  58,195,470.8  poll                  
     16.2      307,073,339        607     505,886.9      35,768.0     1,002   33,598,222   1,838,594.1  ioctl     

So ioctl takes around 4x more time in the first run, which might be one of the reasons for this discrepancy between the execution times noted with Approach 2.

I would assume there is some potential stalling between kernel start/synchronisation, and there might be gaps between the starts of the iterations (I am iterating these 3 kernels 14 times).

This brings us back to the main question: what would be the approach here to make sure that the code does not show this kind of 8x difference in total elapsed time?

Thank you in advance for the help!

Edit: Runs 3 and 4 also resulted in quite different execution times with Approach 2: 22 ms and 35 ms.

Also, would it be a healthier approach to combine all these kernels into a single kernel?

Hi @br-ko,
the nsys “OS Runtime Summary (osrt_sum)” table shows what happens on the host during the entire duration of the program, including long before and long after any kernel execution. I wouldn’t worry about this report if you’re only interested in benchmarking the part of your program where kernels are executed. And even in that part, you usually wouldn’t be interested in measuring what happens on the host while kernels are executing. The host would normally be mostly just waiting for the kernels to finish (the poll function), though what happens between kernels may be relevant.

Your “approach 2” results seem less reproducible and always higher than “approach 1”, so I’m wondering if they’re correct. As I said, this solution is quite fragile, nsys version-dependent and more of a quick hack to get the info. I might have missed some cases or done something wrong in the bash one-liner, as I haven’t tested it thoroughly. It might be worth looking in the nsys GUI to see if the timeline confirms the numbers.

When it comes to reproducibility in general, it is expected that the first launch of every kernel comes with some overhead, as the binary has to be copied to the device and the driver state / various handles need to be initialised. To create a reliable benchmark, you would normally execute the entire workload once as a “warm-up” run not counted towards the measured time, and then execute and measure it in a loop a large number of times. This should all be within the same program to make sure everything remains initialised and cached in memory, both on the host and on the device.
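
As a rough sketch of that structure (run_workload here is just a placeholder for submitting your kernels and waiting for them to finish):

#include <chrono>
#include <iostream>

// Placeholder: submit all kernels and wait for them to complete.
void run_workload() {
  // ... queue.submit(...) calls and a final wait ...
}

int main() {
  run_workload();  // warm-up: JIT compilation, driver init, first-touch allocations - not timed

  constexpr int iterations = 100;
  const auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    run_workload();
  }
  const auto t1 = std::chrono::steady_clock::now();

  const double total_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  std::cout << "Average per iteration: " << total_ms / iterations << " ms\n";
  return 0;
}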

Another factor which may affect reproducibility is the configuration of the clock frequencies, both on host and on the device. For a truly reproducible environment, it would be good to make sure both CPU and GPU clock frequencies are fixed and not allowed to be adjusted dynamically by the system depending on the load (which is the default behaviour). Naturally, also make sure there are no other applications running on the machine which may steal some resources from your benchmarked application.

I think the answer to your question on merging the kernels is: possibly yes, but it depends. If they all run over the same range, do simple computations and no device-global synchronisation is needed (which is only possible by splitting kernels), then it would almost always be a good idea. However, there are cases where this may go wrong, e.g. you might run into register pressure issues or thread divergence. This all depends on the specifics of your code, so I cannot give a general ‘yes’ or ‘no’ answer.
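
For illustration only, a minimal sketch of what such a merge could look like for two independent element-wise kernels over the same range (the pointers and operations are made-up placeholders, assuming USM device allocations):

#include <sycl/sycl.hpp>

// Before: two separate kernel launches over the same range.
void separate(sycl::queue& q, float* a, float* b, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { a[i] = a[i] * 2.0f; });
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { b[i] = b[i] + 1.0f; });
}

// After: one fused kernel - fewer launches and less launch overhead, but each
// work-item now holds more live state, which can increase register pressure.
void fused(sycl::queue& q, float* a, float* b, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    a[i] = a[i] * 2.0f;
    b[i] = b[i] + 1.0f;
  });
}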

Hi @rbielski ,
Thank you for the explanation. The fluctuating execution times reported by the one-liner bash script are still valid, and I think I found the cause. My interpretation is:

Approach 1: focuses only on the raw kernel events, a kind of ideal case where there are no waits
Approach 2: takes the waiting times into account too

What I am struggling with is why this long wait time occurred out of nowhere in the 5th iteration. I measured it and it is 5 ms, so around half of the execution time is wait time, considering the iteration lasted 12 ms.

Also, is there a way to do this with GPU stream data dependencies rather than relying on the wait command? I don’t want my execution flow to be:

Kernel A → Wait → Kernel B → Wait → Kernel C

I ask because I found a way to implement a single-kernel version of my code, but it is now extremely slow (2 s execution time), which makes me think that you were right about the register pressure.