Beyond the limits

Authors: devtechprofile and user "blautemple" from PCGHX forum

When it comes to hardware, performance is always bounded by the weakest link in the chain. GPU limit and CPU limit are usually treated as clearly separable states, or at least that seems to be the general opinion when looking at the performance of a single component. In practice these states tend to flow into one another, but that should not change the methodology for testing GPUs and CPUs: when testing a particular component, the impact of all other components has to be removed from the equation as far as possible. Avoiding bottlenecks is the basis for being able to make relative statements about performance.

But what does bottleneck mean anyway, and what does it entail? Is it really enough to look at CPU performance under an absolute CPU limit, as we did in another article (CapFrameX - The Battle of the Giants - Blog), or is there more behind it? And what about the GPU limit?

We have taken a closer look at other aspects that can affect system performance under a GPU limit. What effect does HAGS (Hardware-Accelerated GPU Scheduling) have? What about the different platforms, Intel vs. AMD? What about PCIe Gen 4 versus Gen 3?

During the preparation and testing phases for this article, the idea came up to take a closer look at PCIe data transfer latency. This is how we ended up creating a small tool which we have used here.

Test systems

In this test, the current mainstream platforms from Intel and AMD compete against each other: Comet Lake on Socket 1200 and Vermeer on AM4. The systems are set up as follows:

Intel
CPU: Intel Core i9 10900K
Board: Asus ROG Maximus XII Hero
RAM: 32GB DDR4 4000MHz CL16-16-36-280-2T
GPU: ASUS ROG Strix GeForce RTX 3090 OC

AMD
CPU: AMD Ryzen 9 5900X
Board: Gigabyte X570 Aorus Master
RAM: 32GB DDR4 3800MHz CL16-15-30-280-1T
GPU: ASUS ROG Strix GeForce RTX 3090 OC

In addition, the subtimings were optimized on both systems, so a CPU limit should be largely eliminated from 1440p upward. All components (CPUs, GPU, RAM) were cooled with a powerful water cooling setup including a Mo-Ra 420.

Windows 10 version 20H2 (build 19042.685) was used, and the OS was configured identically on both systems. Game Mode was enabled.

Test scenarios

The following test scenarios are examined.

Intel
PCIe3 - HAGS On - 1440p
PCIe3 - HAGS On - 2160p
PCIe3 - HAGS Off - 1440p
PCIe3 - HAGS Off - 2160p

AMD
PCIe3 - HAGS On - 1440p
PCIe3 - HAGS On - 2160p
PCIe3 - HAGS Off - 1440p
PCIe3 - HAGS Off - 2160p
PCIe4 - HAGS On - 1440p
PCIe4 - HAGS On - 2160p
PCIe4 - HAGS Off - 1440p
PCIe4 - HAGS Off - 2160p

Benchmark suite

  • Borderlands 3
  • Control
  • F1 2020
  • Ghost Recon Breakpoint

The games were chosen randomly, the only requirement being that the GPU limit shows up from 1440p upward. The test scenes are based on the GPU benchmark suite from PCGH. Special thanks to PCGH for the high transparency of their reviews.

Results

1440p
[Chart: 1440p benchmark results]

Update (01/05/20): The Borderlands 3 results had to be corrected because Game Mode in Windows was not enabled on the Intel system.

The difference between HAGS on and off is quickly summarized: with the exception of Borderlands 3, there is simply no difference at all. The remaining deviations lie within measurement accuracy.

If you look at the PCIe Gen 3 and Gen 4 results of the AMD system, there is indeed a significant difference in some cases. Borderlands 3 hardly reacts, whereas a title like Control gains up to 8% in P1 (the 1% percentile). It should be noted that the 1% percentile is a fairly robust metric in terms of repeatability.
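To make the P1 figure more concrete, here is a minimal Python sketch of how a 1% percentile and a relative gain such as the 8% mentioned above could be computed from raw frametimes. It assumes that P1 is the 1st percentile of the per-frame FPS values and uses the simple nearest-rank method. The function names and the sample frametimes are hypothetical, and CapFrameX itself may define or interpolate the metric differently.

```python
import math


def percentile_fps(frametimes_ms, p):
    """Return the p-th percentile of per-frame FPS values (nearest-rank method).

    Assumption: the percentile is taken over FPS values derived from
    individual frametimes, so a low percentile reflects the slowest frames.
    """
    if not frametimes_ms:
        raise ValueError("need at least one frametime")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    # Convert every frametime (ms) into an instantaneous FPS value, ascending.
    fps_sorted = sorted(1000.0 / t for t in frametimes_ms)
    # Nearest rank: smallest value with at least p percent of samples at or below it.
    rank = max(1, math.ceil(p * len(fps_sorted) / 100.0))
    return fps_sorted[rank - 1]


def relative_gain_percent(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


# Hypothetical sample: 98 smooth frames and two slow ones (not measured data).
frametimes = [10.0] * 98 + [20.0, 25.0]
p1 = percentile_fps(frametimes, 1)
print(f"P1 = {p1:.1f} fps")

# Hypothetical P1 values for two configurations of the same system.
print(f"gain = {relative_gain_percent(100.0, 108.0):.1f} %")
```

Because the percentile depends on a whole group of slow frames rather than on the single worst one, it is less sensitive to one-off spikes than a minimum FPS value.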

What is interesting, however, is that the Intel system clearly outperforms the AMD system in Borderlands 3 and F1, despite the GPU limit and the apparent limitation of PCIe Gen 3. In the other titles, the 10900K is on par with or slightly ahead of the AMD system with PCIe Gen 4.

The sensor stats from CapFrameX reveal another interesting detail. While the GPU utilization is identical at about 100%, the power consumption of the graphics card on the Intel system is a bit higher.

4K
[Chart: 4K benchmark results]

At 4K, values continue to converge overall. HAGS still has no influence. Different PCIe configurations hardly show any differences. The Intel system is on par with the AMD system and can only distinguish itself measurably in Borderlands 3.

PCIe data transfer

The results inspired us to analyze the PCIe performance of the systems more closely. This resulted in a small tool that measures the transfer time of relatively small data packets, with packet sizes ranging from 1 KiB to 2 MiB. The smaller the data packet, the more latency matters; the larger the data packet, the more bandwidth matters. We can therefore expect the AMD system with PCIe Gen 4 to perform significantly better with the larger data packets.

It should be explicitly noted that the tool does not measure the pure latency of the PCIe connection. The values are too high for that, and the reason is not entirely clear yet. In essence, the benchmark is a copy from the host to the device and back to the host. The Intel system has an average transfer time of about 115 µs for the 1 KiB packets. According to a latency benchmark on GitHub for Linux, the average latency is 1500 ns = 1.5 µs. The benchmark used here is based on CUDA instructions, whereas the benchmark on GitHub uses BAR, which cannot be used yet under Windows due to some current hardware limitations. Even a further reduction of the packet size would hardly improve the best values. Presumably, overhead in the CUDA driver is responsible for the difference, but this is not certain and should be taken with a grain of salt. The values shown here should therefore be read not as absolute figures but relative to other systems.
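The interplay between latency and bandwidth can be illustrated with a simple linear cost model: transfer time = fixed overhead + size / bandwidth. The following Python sketch is only a thought model, not the measurement tool used for this article. The overheads (100 µs and 110 µs) are made-up values, and the bandwidths (16 and 32 GB/s) are rounded theoretical figures for x16 links of PCIe Gen 3 and Gen 4; effective throughput in practice is lower. All function names are hypothetical.

```python
def transfer_time_us(size_bytes, overhead_us, bandwidth_gb_s):
    """Linear cost model: fixed overhead plus size divided by bandwidth.

    1 GB/s equals 1e9 bytes/s, i.e. 1e3 bytes per microsecond.
    """
    return overhead_us + size_bytes / (bandwidth_gb_s * 1e3)


def crossover_bytes(overhead_a_us, bw_a_gb_s, overhead_b_us, bw_b_gb_s):
    """Packet size above which link B (more overhead, more bandwidth) wins."""
    if bw_b_gb_s <= bw_a_gb_s or overhead_b_us <= overhead_a_us:
        raise ValueError("link B must have higher overhead and higher bandwidth")
    per_byte_a = 1.0 / (bw_a_gb_s * 1e3)  # microseconds per byte on link A
    per_byte_b = 1.0 / (bw_b_gb_s * 1e3)  # microseconds per byte on link B
    return (overhead_b_us - overhead_a_us) / (per_byte_a - per_byte_b)


# Hypothetical parameters, NOT measured values from this article.
LOW_LATENCY_LINK = {"overhead_us": 100.0, "bandwidth_gb_s": 16.0}
HIGH_BANDWIDTH_LINK = {"overhead_us": 110.0, "bandwidth_gb_s": 32.0}

size = 1024  # start at 1 KiB and double up to 2 MiB, like the tool's range
while size <= 2 * 1024 * 1024:
    t_a = transfer_time_us(size, **LOW_LATENCY_LINK)
    t_b = transfer_time_us(size, **HIGH_BANDWIDTH_LINK)
    winner = "low latency" if t_a < t_b else "high bandwidth"
    print(f"{size // 1024:5d} KiB  {t_a:8.2f} us  {t_b:8.2f} us  {winner}")
    size *= 2

crossover = crossover_bytes(100.0, 16.0, 110.0, 32.0)
print(f"crossover at about {crossover / 1024:.1f} KiB")
```

In this model, the link with the lower fixed overhead wins for small packets and the link with the higher bandwidth wins for large ones. The crossover size depends entirely on the assumed parameters, so it should not be compared with the measured curves.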

The measurements distinguish between pageable and pinned (non-pageable) memory. Pinned memory cannot be swapped out to the hard disk, for example, and also offers performance advantages.

Pageable data transfer

[Chart: pageable data transfer times by packet size]

A note on the scaling of the times on the y-axis: the root of the values was taken so that differences in the smaller value ranges are easier to distinguish. The Intel system is significantly faster for the 1 KiB data packet, which could indicate better latencies. The higher bandwidth of the PCIe Gen 4 connection takes effect from 8 KiB upward.
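Why a root transformation helps can be shown with a few numbers. The sketch below assumes a square root (the article only says "root") and uses made-up transfer times; the helper names are hypothetical. It compares how much of a zero-based y-axis the gap between two small values occupies before and after the transformation.

```python
import math


def root_transform(values):
    """Square root of each value (assumption: the 'root' is a square root)."""
    return [math.sqrt(v) for v in values]


def axis_share(low, high, axis_max):
    """Fraction of a zero-based axis of length axis_max covered by high - low."""
    return (high - low) / axis_max


# Hypothetical transfer times in microseconds, NOT measured values.
times_us = [100.0, 121.0, 10000.0]
rooted = root_transform(times_us)

# Share of the axis taken up by the gap between the two smallest values.
linear_share = axis_share(times_us[0], times_us[1], max(times_us))
rooted_share = axis_share(rooted[0], rooted[1], max(rooted))

print(f"linear axis: {linear_share:.2%} of the axis height")
print(f"root axis:   {rooted_share:.2%} of the axis height")
```

The transformation compresses large values more strongly than small ones, so the same gap between two small values takes up a larger share of the chart.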

Pinned data transfer

[Chart: pinned data transfer times by packet size]

Here, too, the times were transformed with the root to better distinguish smaller values. The curves behave similarly to the pageable data transfer: the Intel system delivers better values for the smallest data packets, while the higher bandwidth of PCIe Gen 4 wins for the larger ones.

Pageable + Pinned untransformed

The untransformed values are given below in table form.

[Table: untransformed pageable and pinned transfer times]

Conclusion

The results may surprise some, but perhaps not everyone. As written in the introduction, CPU and GPU limits merge smoothly, yet CPU performance should not actually matter in the tests performed, since explicit attention was paid to a strong GPU limit. The PCIe data transfer test showed that the Intel system is faster with small data packets, which could explain the performance differences despite the strong GPU limit, although the actual cause remains unclear. Either way, it turns out that an Intel platform is ultimately still the better choice for GPU tests despite being limited to PCIe Gen 3.

This also leads to the realization that the CPU and platform may matter at higher resolutions as well, beyond the PCIe generation alone. Dividing performance cleanly into separate limiting states is not as trivial as one might have thought.

For the normal user, these differences are of course not relevant, since they are much too small. However, the question remains: what does Intel do differently, or better, than AMD, and what will this look like on the upcoming PCIe4 platforms?

Tags: AMD, Intel, GPU limit, CPU limit, PCIe, data transfer, HAGS, latency