The Mali-G710: Doubling up per-core performance

As a continuation of the Valhall GPU architecture, the cornerstone characteristics of the new G710’s execution engines are roughly the same as what we’ve covered in the past generation Mali-G77 and Mali-G78.

Amongst the larger changes we saw with Valhall was the shift from a wavefront/warp size of 8 towards 16, with dual datapaths (clusters) per execution engine, resulting in a 32 FMA/core design that we saw in the G77 and G78.

The ISA is said to have seen larger improvements and was designed with new modern APIs such as Vulkan in mind, although it’s always quite hard to quantify the impact such changes have on the overall performance and efficiency of a GPU.

What’s new in the Mali-G710 is the addition of a second execution engine, effectively doubling up on the compute performance per shader core of the Valhall architecture. In a sense, Arm here is re-adopting some of the scaling means that we had seen in past generation Mali architectures, such as when the Mali-G76 had three execution engines per shader core.
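To make the per-core arithmetic concrete, here is a minimal sketch using only the figures quoted above: a 16-wide warp, two clusters per execution engine, and one versus two execution engines per core. The function names, the 10-core configuration and the 1.0GHz clock in the usage line are our own placeholders for illustration and are not Arm-published figures.

```python
# Per-core FMA throughput derived from the figures quoted in the article.
WARP_WIDTH = 16          # lanes per cluster (Valhall warp size)
CLUSTERS_PER_ENGINE = 2  # dual datapaths per execution engine


def fma_per_clock_per_core(execution_engines: int) -> int:
    """FMA operations issued per clock by a single shader core."""
    return WARP_WIDTH * CLUSTERS_PER_ENGINE * execution_engines


def peak_gflops(execution_engines: int, cores: int, clock_ghz: float) -> float:
    """Theoretical peak GFLOPS, counting one FMA as two floating-point ops.

    `cores` and `clock_ghz` are hypothetical inputs chosen by the caller.
    """
    return fma_per_clock_per_core(execution_engines) * 2 * cores * clock_ghz


g78_fma = fma_per_clock_per_core(1)   # one execution engine per core
g710_fma = fma_per_clock_per_core(2)  # two execution engines per core

# Hypothetical 10-core part at 1.0GHz, purely for illustration.
example_gflops = peak_gflops(2, cores=10, clock_ghz=1.0)
```

With one execution engine the function reproduces the 32 FMA/core of the G77 and G78, and with two it gives 64 FMA/core for the G710. Any real device's peak figure depends on the core count and clocks a vendor actually ships.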

In the above slide, the “8x” and “4x” metrics refer to the throughput per cycle per core, and they show that other functional blocks of the GPU have also doubled in throughput to keep up with the doubled compute throughput of the execution engines.

The new G710 includes a brand-new texture unit that is now able to handle up to 8 bilinear texels per clock, and Arm has generally optimised the new design to be significantly more area efficient, giving the new TMU a +50% performance density advantage.

Within the execution engine, Arm continues to employ two processing units, or clusters of processing elements, and in that regard we don’t see that much difference between the generations. However, if we look deeper into the actual processing unit, there are changes to the blocks:

In the simplest terms, what we’re seeing is a shift from a single instance of 16-wide (warp-wide) processing elements and execution units to four instances of 4-wide execution units. The throughput between the designs doesn’t change, but the new microarchitecture gives more dedicated resources to the processing elements and allows for better structuring for better efficiency.
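As a rough illustration of why throughput stays the same, the sketch below splits the 16 lanes of a warp across either one 16-wide unit or four 4-wide units. The contiguous lane grouping and the `split_warp` helper are assumptions made for the example; Arm hasn't disclosed how lanes are actually assigned to the units.

```python
WARP_WIDTH = 16  # lanes in one Valhall warp


def split_warp(lanes: list[int], unit_width: int) -> list[list[int]]:
    """Partition warp lanes into contiguous groups, one group per execution unit.

    Contiguous grouping is an assumption for illustration only.
    """
    return [lanes[i:i + unit_width] for i in range(0, len(lanes), unit_width)]


warp = list(range(WARP_WIDTH))

old_units = split_warp(warp, unit_width=16)  # previous design: one 16-wide unit
new_units = split_warp(warp, unit_width=4)   # G710: four 4-wide units

# Lanes processed per clock is the sum over all units in either layout.
old_lanes_per_clock = sum(len(unit) for unit in old_units)
new_lanes_per_clock = sum(len(unit) for unit in new_units)
```

Both layouts cover the same 16 lanes per clock, so the difference lies in how resources are organised around the narrower units, and peak rate is unchanged.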

Overall, the new execution engine design doubles up the FMAs per clock per core, which is somewhat obvious, but also has the benefit of lowering the energy distribution within the shader core from the execution engine by 20%.

A further very large highlight of the G710 is the replacement of the traditional “Job Manager” with the new “Command Stream Frontend”, which handles the scheduling and processing of draw calls. The CSF introduces a new CPU of undisclosed nature, and for the first time also introduces a firmware layer to Mali GPUs.

The goal of the design is to achieve more flexible and scalable performance for more complex graphical workloads, while at the same time improving system CPU power efficiency by reducing driver overhead through a very lightweight submission path. It also simplifies support for API features such as state inheritance and secondary buffers, and helps with timing-sensitive applications such as VR or time-warp. Synchronisation events also greatly benefit from the move closer to the hardware and the reduction of latency that this enables.

The firmware is closely coupled to the hardware and handles requests from the host as well as command buffer completion notifications, reduces the overhead of things such as protected entry and exit, and even allows for emulation of API features that don’t yet exist in the hardware through additional instructions.

The new hardware has been redesigned from the ground up to keep up with modern content and to sustain the throughput of job submission into the other GPU units. Arm here claims that the new CSF allows for up to 5 million draw calls per second.
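To put the 5 million figure into per-frame terms, here is a quick back-of-the-envelope sketch. The only input taken from Arm is the claimed peak rate; the frame rates and helper names are our own choices for illustration, and real workloads will be limited by far more than the frontend.

```python
# Arm's claimed upper bound for the Command Stream Frontend.
CSF_DRAWCALLS_PER_SECOND = 5_000_000


def drawcalls_per_frame(fps: int) -> int:
    """Whole draw calls that fit into one frame at the claimed peak rate."""
    return CSF_DRAWCALLS_PER_SECOND // fps


def nanoseconds_per_drawcall() -> float:
    """Average time budget per draw call at the claimed peak rate."""
    return 1e9 / CSF_DRAWCALLS_PER_SECOND


budget_60fps = drawcalls_per_frame(60)    # illustrative frame rate
budget_120fps = drawcalls_per_frame(120)  # illustrative frame rate
ns_per_call = nanoseconds_per_drawcall()
```

At the claimed peak this works out to roughly 83 thousand draw calls per frame at 60fps, or about 200ns per draw call.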

Overall, the new G710 microarchitecture seems very interesting and in particular seems to want to address some API-overhead-related weaknesses of Arm’s Mali GPUs. How this plays out remains to be seen, but from the advertised performance and power efficiency gains of 20% this generation, it seems like a solid improvement, although these figures wouldn’t be quite sufficient to alter the competitive landscape in the mobile market.

The Mali-G610 is the same microarchitecture as the G710, only under a different name for configurations with fewer than 7 cores.




  • mode_13h - Thursday, May 27, 2021 - link

    > it will be a few years on top of that before we see any nvidia IP offered / licensed like ARM

    That already happened. It was never dependent on the ARM acquisition.

    > Just because they are acquiring ARM does not mean magically Nvidia's GPU IP will replace

    Agreed. It's a possibility, though. The G710-tier seems most at risk.
  • edzieba - Wednesday, May 26, 2021 - link

    How much of the time writing the article was spent on un-autocorrecting "Valhalla" to "Valhall"?
  • tkSteveFOX - Thursday, May 27, 2021 - link

    Whitechapel will have Mali along with the high-end MTK chip on 5nm coming in Q4.
    Overall, ARM CPU and GPU designs have hit a grind, as adopters can't use the most advanced node (TSMC 5nm) and had to settle for Sammy's inferior 5nm, leading to a generation of flagship chips that are hardly an improvement over TSMC's 7nm SD865/+/870 in real world tasks.
  • regsEx - Thursday, May 27, 2021 - link

    What will happen to Mali after merge with Nvidia?
  • Spunjji - Friday, May 28, 2021 - link

    It wouldn't surprise me if they kept it around for at least a while as the low-end GPU in their line-up - "just enough" for the customers who care more about area than performance, with GeForce being pushed out to the high-end.
  • mode_13h - Saturday, May 29, 2021 - link

  • 5j3rul3 - Thursday, May 27, 2021 - link

    When Mali can support HW Ray Tracing, MLSS, HDR Gaming?
  • Fedposter - Monday, June 28, 2021 - link

    Can't arm just make discrete gpus out of these like Intel did with the DG1?
  • vladx - Monday, September 6, 2021 - link

    ARM is an IP designer, not a hardware manufacturer
  • yeeeeman - Friday, June 24, 2022 - link

    when do we get the next gen unveil?
