Sunny Cove Microarchitecture: A Peek At the Back End

One of the best parts of any new microarchitecture announcement is learning how it all works and how it has changed. Intel has been shipping close variants of the Skylake core since it first launched in 2015, carrying through Kaby Lake, Coffee Lake, and the Coffee Lake refresh – so an update to this old core design is more than welcome. While I applaud Intel for putting something new in front of us, the level of information was unfortunately less than complete. For reasons of ‘we’ll discuss more nearer the launch’, Intel was only prepared to discuss the back end of the microarchitecture design in any detail. In core design parlance, this means everything beyond the dispatch and re-order queues.

Intel framed its microarchitecture updates into two different sections: general purpose performance increases and special purpose performance increases. The company described the general purpose updates essentially as raw IPC (instructions per clock) throughput or an increase in frequency. An increase in either of these two leads to a performance increase no matter the code, at least as long as it is compute bound. Frequency is often a function of the implementation and process, whereas IPC increases can come from cores being wider (more executing instructions per clock), deeper (more parallelism per clock), and smarter (better data delivery through the front end).
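As a back-of-the-envelope illustration of why either lever works, performance on compute-bound code scales with the product of IPC and frequency. This is a toy model with illustrative placeholder figures, not Intel's numbers:

```python
# Toy model: compute-bound performance ~ IPC x frequency.
# The gain figures below are illustrative placeholders, not Intel data.
def relative_performance(ipc_gain, freq_gain):
    """Combined speedup from independent IPC and frequency gains."""
    return (1 + ipc_gain) * (1 + freq_gain)

# A 10% IPC gain at the same clock gives a 10% uplift...
print(relative_performance(0.10, 0.0))             # 1.1
# ...and it stacks multiplicatively with a 5% clock bump.
print(round(relative_performance(0.10, 0.05), 3))  # 1.155
```

The multiplicative relationship is why a wider, smarter core and a process-driven frequency bump compound rather than merely add.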

The other type of performance increase is the special purpose kind, which means that certain workloads used in specific scenarios can be improved through additional acceleration methods, such as dedicated IP or specialized instructions.

Sunny Cove is set to have both.

On the back end of the microarchitecture design, Intel has made several changes, including increasing cache sizes, making the execution of the core wider, and increasing L1 store bandwidth.

Starting with the cache improvements, the Sunny Cove core back end has a 48 KB L1 data cache, upgraded from 32 KB. As a rule of thumb, cache misses decrease in proportion to the square root of the increase in capacity, so growing the cache by 1.5x should cut L1-D misses by a factor of √1.5 ≈ 1.22, roughly an 18% reduction. We are told that the L2 cache size has also been increased in both the Core and Xeon variants of Sunny Cove (up from 256 KB and 1 MB respectively), although exactly to what size will be revealed at a later date.
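As a sanity check on that square-root rule of thumb (a quick calculation of my own; misses scale roughly with 1/√size, and real workloads will vary):

```python
import math

# Rule of thumb: L1 miss rate scales with 1 / sqrt(cache size).
old_kb, new_kb = 32, 48
miss_factor = 1 / math.sqrt(new_kb / old_kb)  # ~0.816 of the old miss rate
reduction = 1 - miss_factor

print(f"Expected L1-D miss reduction: {reduction:.1%}")  # roughly 18%
```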

While not strictly back-end, the micro-op (uOp) cache has been increased from Skylake's 2048-entry design, although to exactly what size Intel is not saying at this point. The second-level TLB has also grown, which will help with virtual-to-physical translation of recently used addresses – a structure like this is normally enlarged when more translations need to be kept on hand, which suggests Intel identified a number of use cases where recent translations were being evicted shortly before being needed.

From the diagram we can see some other changes, such as the increase in execution ports from eight to ten, allowing more instructions to be issued from the schedulers at one time. Ports 4 and 9 are both linked to data store units, doubling the store data bandwidth, and the store AGU capability has also doubled, which the larger L1-D cache will help feed. Previously on Skylake there was a potential bottleneck when all three AGUs attempted a store in the same cycle, as there was only enough bandwidth for one store per clock; load performance is unaffected. There is also an allocation width increase, from four to five, which means that dispatch out of the re-order buffer can now hit five instructions per cycle rather than four – however, given that the Skylake decoders and uOp cache delivered four-to-five ops per cycle (depending on certain factors), it will be interesting to see how the front end has changed to keep that wider allocation fed.

There are also substantial changes in the execution ports between Skylake and Sunny Cove, which is best represented in a diagram and a table:

Back-End Execution Resources

Skylake                    | Port | Sunny Cove
---------------------------|------|----------------------------
Integer                    |      |
ALU / Shift / JMP          |  0   | ALU / LEA / Shift / JMP
ALU / LEA                  |  1   | ALU / LEA / Mul / iDIV
ALU / LEA / Mul            |  5   | ALU / LEA / MulHi
ALU / Shift / JMP          |  6   | ALU / LEA / Shift / JMP
Vector / FP                |      |
FMA / ALU / Shift / fpDIV  |  0   | FMA / ALU / Shift / fpDIV
FMA / ALU / Shift          |  1   | FMA / ALU / Shift / Shuffle
FMA / ALU / Shuffle        |  5   | FMA / ALU / Shuffle

The entries new or changed in the Sunny Cove core are the LEA units on all four integer ports, the iDIV unit on Port 1, the MulHi unit on Port 5 (replacing the plain Mul), and the Shuffle unit on FP Port 1.

We see that Intel has equipped the integer section of the core with more LEA units to help with memory address calculations, possibly to mitigate the performance loss from security mitigations that require frequent memory calculations, or to help with high-performance array code that uses constant offsets. Port 1 gets the Mul unit from Skylake's Port 5, probably for rebalancing purposes, but an integer divide (iDIV) unit is also added here. This is an interesting tweak, as we know that Cannon Lake has a dedicated 64-bit IDIV unit in its design as well: in that case it brings 64-bit integer division down from 97 clocks (as a mix of instructions) to 18 clocks, and this is likely something similar. Also on the integer side, Port 5's multiply unit has become a 'MulHi' unit, which in other implementations leaves the most significant half of the result in a register for further use; I'm not sure exactly what its role is in the Sunny Cove core though.

On the FP side, Intel has increased the shuffle resources, which we are told was because customers requested this functionality to help remove bottlenecks in their code. Intel didn't state the capabilities of the FMAs in the FP part of the core, however we do know that the core has an AVX-512 unit, so at least one of these FMAs should interface with it. Cannon Lake only has one 512-bit FMA, which most likely carries over here, while the Xeon Scalable version might get two. I would like to be proven wrong though.

Other updates to the core, as listed by Intel, include improvements to the branch predictor and a reduced 'effective load latency' (due to the TLB and L1-D changes), although it was pointed out that these improvements won't help every workload, and some might require new algorithms in order to use specific parts of the core.

We’re waiting to see what changes Intel has made on the front-end, which is where a lot of low-hanging fruit often lies for performance.

On top of the microarchitecture differences, there are new instructions for the Sunny Cove core to help speed up specialist compute tasks. With the AVX-512 unit in play, the new core will support IFMA instructions for big-number arithmetic, which are also useful in cryptography. Along the same lines, Vector-AES (supporting more AES operations concurrently) and Vector Carryless Multiply are also supported. SHA and SHA-NI instructions are also added, along with Galois Field instructions, which are fundamental building blocks in some elements of cryptography as well.
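To make the IFMA behaviour concrete, here is a pure-Python model of the per-lane semantics of the low-half 52-bit multiply-add (the real instruction operates on eight such 64-bit lanes at once; the helper name is mine, not an Intel API):

```python
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1

def ifma52_low(acc, b, c):
    """Model of one 64-bit lane of a low-half 52-bit fused multiply-add:
    multiply the low 52 bits of b and c, keep the low 52 bits of the
    104-bit product, and add that to the 64-bit accumulator."""
    product = (b & MASK52) * (c & MASK52)
    return (acc + (product & MASK52)) & MASK64

# Big-number libraries chain the low- and high-half variants to build
# multi-limb multiplies for cryptographic arithmetic.
print(hex(ifma52_low(1, 3, 5)))  # 0x10
```

A matching high-half operation returns the upper 52 bits of the same product, which is what makes wide multi-precision multiplication efficient.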

For Sunny Cove, there is also built-in support for larger memory capacities. Page tables for main memory now use a five-level design, up from four levels, supporting a linear address space of up to 57 bits and a physical address space of up to 52 bits. This means, according to Intel, that the server processors could theoretically support 4 TB of memory per socket. We will see Sunny Cove hit the market in the server space with Ice Lake-SP in 2020, according to Intel's previous Xeon roadmaps. For security, Sunny Cove also has access to Multi-Key Total Memory Encryption and User Mode Instruction Prevention.
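The jump from four- to five-level paging is easiest to see in the raw numbers. This is a quick calculation of my own using the quoted bit widths (the 48-bit figure is the standard x86-64 four-level limit; note the 52-bit physical ceiling is a theoretical maximum, well above the practical per-socket figure Intel quoted):

```python
# Address space sizes implied by the paging bit widths.
TiB = 2 ** 40
PiB = 2 ** 50

linear_4level = 2 ** 48  # four-level paging: 48-bit linear addresses
linear_5level = 2 ** 57  # five-level paging: 57-bit linear addresses
physical      = 2 ** 52  # 52-bit physical addresses

print(linear_4level // TiB)  # 256 TiB of linear address space before
print(linear_5level // PiB)  # 128 PiB after
print(physical // PiB)       # 4 PiB theoretical physical ceiling
```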

So where does this leave us in terms of Sunny Cove predicted performance compared to Skylake at a clock for clock level? Unfortunately we don’t have enough information. There is only enough to whet the appetite. Intel wants you (and me) to come back for more in 2019.

149 Comments

  • CajunArson - Wednesday, December 12, 2018 - link

There's nothing whatsoever revolutionary about "chiplets". A 10-year-old Core 2 Quad used exactly the same technology that AMD calls "chiplets" in 2019, and AMD fantards like you even insulted the Core 2 Quad for doing it.

Maybe you should actually read the article about what an active interposer can do vs. wiring standard hunks of silicon to a PCB in literally the same way it was done in the 1970s, before you run around acting like AMD is the only company to ever innovate anything.
    Reply
  • sgeocla - Wednesday, December 12, 2018 - link

I've been reading articles about Intel's 10nm promises for years now. And then we got failed laptop chips and low-power PC boxes to appease 'mass production' status and not get sued by investors for false claims.
Forgive me if I withhold my drooling until Intel actually delivers something that does not require industrial chillers. BTW, where are those 28-core HEDT chips anyway?
    Reply
  • Targon - Wednesday, December 12, 2018 - link

    There is always a point where the WHEN something is used sets a new direction. Multi-CPU in a workstation/server moving to a single processor with multiple cores was a big shift. Moving from two cores linked together when cache coherency was a big problem to a single dual-core without an interposer was a better way to go. It all comes down to if there is a performance boost or degradation as a result of the technology/implementation.

    With that said, a single CPU with 64 cores is fairly significant, and keeping the performance from being horrible with that many cores is the reason AMD has been praised. Price/performance for the server market and such.

    For a long time, Intel was seen as the king when it came to clock speeds and performance, but Intel hasn't had a significant boost to IPC in over three years. Intel has also been promising 10nm for three years, and still no sign of it, with the promise of holidays 2019.

So, Intel still has nothing. They have vague promises of ways they will improve performance, but it remains to be seen if the performance will actually be better if 10nm slips again. On the flip side, AMD clearly has significant performance boosts coming from Ryzen 3rd generation in 2019 (March/April being when many expect it). 7nm from AMD isn't a "will they?" question; it isn't even a "when?", with CES in one month and, with it, the answers. IPC improvements due to design improvements not related to chiplets at all would be good, as well as higher clock speeds. So, there is potential for 30+ percent higher performance in one generation.

    Yes, I don't expect AMD to deliver huge performance jumps again for years, but we may see things such as Gen-Z support, going beyond two memory channels for the mainstream Ryzen chips when the next socket comes out in 2020/2021, and other things that may boost system/platform performance while AMD figures out how to get more CPU performance.

    Intel is still trying to do things the same way, just faster. Faster CPU, faster links to individual devices, fabric on a system level will be Intel trying to reinvent what AMD has been working toward.

    I will also note again that some things are not always about being new, but are more about presentation and implementation. Palm really popularized the idea of apps that users could install on a small portable device(PDA), but Apple popularized it with the iPhone. In some cases, the implementation really is good, and will get the respect of the industry, in other cases, you see that something is clearly a case of following the lead of another player.

So, in the PC industry, is Intel leading the way with innovations, or is AMD in the driver's seat?
    Reply
  • iwod - Thursday, December 13, 2018 - link

No one insulted the Core 2 Quad for doing it, and neither did AMD. But Intel did insult AMD and went full force bad-mouthing AMD. Reply
  • Spunjji - Thursday, December 13, 2018 - link

    Using a term like "fantard" straight-up devalues your argument, but the blatantly false statement about the C2Q using "exactly the same technology" seals the deal.

    Chiplets refers to the CPU being divided into multiple sections (cores and un-core) on a single package using dedicated interconnects. It's not at all the same technology as having two discrete CPUs joined by the FSB on a single package. Both are novel approaches to particular problems, although the C2Q (and Pentium D before it) were criticized for their inefficiency by using the FSB for inter-core communication. We don't know how "chiplets" will pan out yet, so the jury's out.

    Bash the fans for talking nonsense all you want, but maybe don't sink to their level.
    Reply
  • edzieba - Wednesday, December 12, 2018 - link

    If you think through-package interconnects compare to through-silicon interconnects, then I have some HBM on DIMMs to sell you. Reply
  • Spunjji - Thursday, December 13, 2018 - link

    Noice. :D Reply
  • III-V - Wednesday, December 12, 2018 - link

    I love how everyone thinks AMD is the pioneer with chiplets. They're not. That would be Marvell.

    And Intel themselves has been hinting that it's a good way to go, looking at their EMIB solution.

    But AMD fan boys are a special breed of stupid...
    Reply
  • sgeocla - Wednesday, December 12, 2018 - link

The electric car was pioneered more than a hundred years ago.
It's one thing to pioneer something and a whole different thing to actually develop it into something that is affordable to millions and drags the whole industry forward.

If you think pioneering is all there is to it, I have hundreds of graphene battery designs you should invest your narrow-minded life savings into.
    Reply
  • evernessince - Wednesday, December 12, 2018 - link

    You have some issues buddy. How about not being toxic next time. Reply
