AMD MI210 vs MI300X performance difference

Hi,
I was experimenting with an MI210 and an MI300X, and noticed that the MI210 performs around 30% better than the MI300X for the same code.

I am using the latest nightly build from the intel/llvm repository and enabled XNACK via

export HSA_XNACK=1

offload-arch=gfx90a and offload-arch=gfx942 are the offload architectures passed to the compile command.

Is this known behaviour, or is there anything I could do to improve the MI300X performance?

Hi @br-ko,
are you testing both GPUs within the same host, or are the hosts different? Is your code maximising kernel execution time in the measurement, or is your timing also sensitive to host overheads when launching kernels?

Perhaps you could profile your code with rocprof --sys-trace ./your-app and check whether the kernel execution times differ between the two GPUs, and whether the kernels do indeed run faster on the MI210. We can try to debug further if you can confirm this is the case.

Hi @rbielski ,
The hosts have different CPUs since the systems are different. I am capturing the execution time by calling std::chrono::steady_clock::now() at the start and end, so the overall time includes the kernel launch overheads too. Would a different host affect the timings?

I will try running rocprof to investigate further, but I have currently lost access to the system with the MI210.

These are large and fast modern data-centre GPUs. Some existing benchmarks aren’t filling up such GPUs with enough work to make the kernel execution time the majority of the measurement. If the GPU can execute the kernel in a few microseconds and 50% of your measured time is spent in host operations before/after the kernel execution, it can easily introduce a 25% difference when comparing different CPUs. Things like CPU turbo boost also come into play when benchmarking the host part and can even cause big differences run-to-run on the same hardware.

Unless you’re specifically looking for performance in low-latency applications, we would advise ensuring that the benchmark computations are large enough to make the kernel time at least 90% of the total measured time.

Without knowing more about your specific benchmark it's hard to tell whether it is affected by this, but I'm sure the rocprof results will clarify it.