The Front End: Branch Prediction

Bulldozer's branch prediction units have been described in many articles. Most insiders agree that Bulldozer's decoupled branch predictor is a step forward from the K10's multi-level predictor. A better predictor might reduce the branch misprediction rate from 5 to 4%, but that is not the end of the story. Here's a quick rundown of the branch prediction capabilities of various CPU architectures.

Branch Prediction
Architecture Branch Misprediction Penalty
AMD K10 (Barcelona, Magny-Cours) 12 cycles
AMD Bulldozer 20 cycles
Pentium 4 (NetBurst) 20 cycles
Core 2 (Conroe, Penryn) 15 cycles
Nehalem 17 cycles
Sandy Bridge 14-17 cycles

The numbers above show the minimum branch misprediction penalty, and the fact is that the Bulldozer architecture has a branch misprediction penalty that is 66% higher than the previous generation. That means that the branch prediction of Bulldozer must correctly predict 40% of the pesky branches that were mispredicted by the K10 to compensate (at the same clock). Unfortunately, that kind of massive branch prediction improvement is almost impossible to achieve.

Quite a few people have commented that Bulldozer is AMD's version of Intel's Pentium 4: it has a long pipeline, with high branch misprediction penalties, and it's built for high clock speeds that it cannot achieve. The table above seems to reinforce that impression, but the resemblance between Bulldozer and NetBurst is very superficial.

The minimum branch prediction penalty of the Bulldozer chip is indeed in the same range as Pentium 4. However, the maximum penalty could be a horrifying 100 cycles or more on the P4, while it's a lot lower on Bulldozer. In most common scenarios, the Bulldozer's branch misprediction penalty will be below 30 cycles.

Secondly, the Pentium 4's pipeline was 28 ("Willamette") to 39 ("Prescott") cycles. Bulldozer's pipeline is deep, but it's not that deep. The exact number is not known, but it's in the lower twenties. Really, Bulldozer's pipeline length is not that much higher than Intel's Nehalem or Sandy Bridge architectures (around 16 to 19 stages). The big difference is that the introduction of the µop cache (about 6KB) in Sandy Bridge can reduce the typical branch misprediction to 14 cycles. Only when the instruction is not found in the µop cache and must be fetched from the L1 data cache will the branch misprediction penalty increase to about 17 cycles. So on average, even if the efficiency of Bulldozer's and Sandy Bridge's branch predictors is more or less the same, Sandy Bridge will suffer a lot less from mispredictions.

The Front End: Shared Decoders

Quite a few reviewers, including our own Anand, have pointed out that two integer cores in Bulldozer share four decoders, while two integer cores in the older “K10” architecture each get three decoders. Two K10 cores thus have six decoders, while two Bulldozer cores only have four. Considering that the complexity of the x86 ISA leads to power hungry decoders, reducing the power by roughly 1/3 (e.g. four decoders instead of six for dual-core) with a small single-threaded performance hit is a good trade off if you want to fit 16 of these integer cores in a power envelope of 115W. Instead of 48 decoders, Bulldozer tries to get by with just 32.

The single-threaded performance disadvantage of sharing four decoders between two integer cores could have been lessened somewhat by x86 fusion (test + jump and CMP + jump; Intel calls this macro-op fusion) in the pre-decoding stages. Intel first introduced this with their “Core” architecture back in 2006. If you are confused by macro-ops and micro-ops fusion, take a look here.

However AMD decided to introduce this kind of fusion in Bulldozer later in the decoding pipeline than Intel, where x86 branch fusion is already present in the predecoding phases. The result is that the decoding bandwidth of all Intel CPUs since Nehalem has been up to five (!) x86-64 instructions, while x86 branch fusion does not increase the maximum decode rate of a Bulldozer module.

This is no trifle, as on average this kind of x86 fusion can happen once every ten x86 instructions. So why did AMD let this chance to improve the effective decoding rate pass even if that meant creating a bottleneck in some applications? The most likely reason is that doing this prior to decoding increases the complexity of the chip, and thus the power consumption. Even if AMD's version of x86 branch fusion does not increase the decoding bandwidth, it still offers advantages:

  • Increased dispatch bandwidth
  • Reduced scheduler queue occupancy
  • Faster branch misprediction recovery

The first two increase performance without any extra (or very minimal) power consumption, the last one increases performance and reduces power consumption. AMD preferred to get more cores in the same power envelop over higher decode bandwidth and thus single-threaded performance.

Mark of Hardware.fr compared the performance of a four module CPU with only one core per module enabled with the standard configuration (two integer cores per module). Lightly threaded games were 3-5% faster, which is the first indication that the front end might be something of a bottleneck for some high IPC workloads, but not a big one.

The Memory Subsystem

One of the most important features of Intel's Core architecture was its speculative out-of-order memory pipeline. It gave the Core architecture a massive improvement in many integer benchmarks over the K8, which had a strictly in order pipeline. Barcelona improved this a bit by bringing the K10 to the level of the much older PIII architecture: out of order, but not speculative. Bulldozer now finally has memory disambiguation, a feature which Intel introduced in 2006 in their Core architecture, but there's more to the story.

Bulldozer can have up to 33% more memory instructions in flight, and each module (two integer stores) can do four load/stores per cycle. It's clear that AMD’s engineers have invested heavily in Memory Level Parallelism (MLP). Considering that MLP is often the most important bottleneck in server workloads, this is yet another sign that Bulldozer is targeted at the server world. In this particular area of its architecture, Bulldozer can even beat the Westmere Intel CPUs: two threads on top of the current Intel architecture have only 2 load/stores available and have to share the L1 data cache bandwidth.

The memory controller is improved too, as you can see in the stream benchmark results below. For details about our Stream binary, check here.

Stream Triad

Bandwidth is 25% higher than Barcelona, while the clock speed of the RAM modules has only increased by 20%. Clearly, the Orochi die of the Opteron 6200 has a better memory controller than the Opteron 6100. The memory controller and load/store units are among the strongest parts of the Bulldozer architecture.

Bulldozer, Still a Mystery Setting Expectations: the Back End
POST A COMMENT

84 Comments

View All Comments

  • Zoomer - Thursday, May 31, 2012 - link

    True. It's probably better out way back then, but synthesized, than to come out maybe next year with all their lovingly fully customized, hand placed transistors. That's if they don't go bankrupt first.

    wolfman3k5's probably going to call nVidia, 3dfx, ATi (then), most FPGA program design houses, etc, lazy, too.
    Reply
  • misiu_mp - Monday, June 11, 2012 - link

    A large margin of error means that you have a lot of space to make errors with little consequence.
    You meant of course that engineers have small margins of error in their work.
    Reply
  • 500MM - Wednesday, May 30, 2012 - link

    http://images.anandtech.com/graphs/graph5057/42770...

    If lower was better, AMD would have one kickass CPU. The caption is wrong.
    Reply
  • JohanAnandtech - Friday, June 1, 2012 - link

    Fixed, thx! Reply
  • weebnuts - Wednesday, May 30, 2012 - link

    The problem with all these benchmarks is that most organizations are going to be using this is Xen or Vmware uses. The idea is that with more cores, you can run more VM's especially if you are trying to implement Virtual Desktops. How do the processors compare when you are loading the server to 80-90% capacity with lots of VM's? That's a real world comparison I want to see. Reply
  • Iketh - Wednesday, May 30, 2012 - link

    I was dying for information like this. Thank you!

    And as for that quote on the first page by Iketh, that guy is a genious!! :D
    Reply
  • Aone - Thursday, May 31, 2012 - link

    1) Maybe i missed something but, Should "Higher is better" be for "Data Cache hitrate", i.e. opposite to cache misses?

    2) And on the chart "L2 Cache hitrate", is it correct that "Opteron 6276" tag is shown on first line while "Opteron 6174" on the last line? I thought Opteron 6174 was faster in MS SQL than Opteron 6276.
    Reply
  • mrdcook - Thursday, May 31, 2012 - link

    There are a few new instructions in Bulldozer's architecture that, for certain specific computations, can make it 10X faster than Intel. For example, FMA. An FMA does a multiply and then an add as one instruction, rounding only once. Combining the multiply and the add isn't such a big deal (and in many cases can even be counter-productive), but rounding only once is very important in some cases.

    For example, assume you have 3 digits of accuracy and want to calculate (1.23 * 2.31 - 2.84). Without FMA, you calculate Round(1.23 * 2.31) = 2.84, then you calculate Round(2.84 - 2.84) = 0. With FMA, you calculate 1.23 * 2.31 = 2.8413, then you calculate Round(2.8413 - 2.84) = 0.0013. While that may seem contrived (it was!), the difference is significant in certain simulations and calculations.

    When doing math, computers have a very specific level of accuracy -- a certain number of digits of precision. If you want your simulations to come out right, you have to take these limits into account. Learning how to account for the computer's rounding errors is a bit of a black art.

    Mathematicians design algorithms in terms of matrix multiplications and dot products, and if you translate those algorithms directly into computer multiplications and additions, you tend to end up with a lot of cancellation errors like the example given above. You can hire a computer science grad student to rework your algorithm to not lose accuracy, but that is expensive and has to be done for every new algorithm. Or you can use an FMA for the dot products and the matrix multiplications (the high-accuracy dot product and matrix multiplication libraries already do this).

    FMA in software is slow. Single-precision emulated FMA isn't too bad since you can use double-precision to help with the hardest bits of the emulation. The result is that you can do one fmaf in about 4X the amount of time it would take to do a single a*b+c. However, SSE2 allows you to do 4 a*b+c at a time, so emulated single-precision FMA ends up being about 15X slower than optimized SSE2 non-fused multiply-add. Double-precision is harder, taking about 10 times longer than a single a*b+c, so it ends up being 20X slower than non-fused multiply-add.

    Admittedly, the target market for FMA is probably smaller than a breadbox, but those who need it really need it. And as it becomes more common, it'll only become more important. For now, since only Bulldozer has it, nobody is going to care.
    Reply
  • BaronMatrix - Thursday, May 31, 2012 - link

    There are admittedly only two viable X86 licensees in America and one of them sucks... Reply
  • shodanshok - Thursday, May 31, 2012 - link

    Hi Johan,
    first of all, let me thank you for your wonderful analysis on Bulldozer architecture. I read it with great interest.

    However, I think that you left out a very important thing to mention: L1/L2 cache read/write bandwidth. Especially for L2, while latency is an important thing, throughput can be an even more crucial one.

    The key point is that Bulldozer has an write-through L1 cache, so all L1 writes are more or less immediately broadcasted to L2 cache. Some small writes can be effectively cached inside a write-back combining buffer called Write Combining Cache (WCC), but this cache is only 4KB in size per the entire module. So, streaming writes will immediatly fill the WCC and bring down L1 cache speed to L2 levels.

    This can really hamper CPU performance. Obviously, AMD went this road for some understandable reasons, however, the WCC is really too small to cache much data and the L2 is way too slow to efficiently serve L1 write requests.

    This bring us to another point: L2 cache is slow. Comparing this with the super-fast (but much smaller) L2 Intel cache, it has no hope; it is more or less at Intel's L3 level.

    Here you can find my analysis of AMD Bulldozer architecture: http://www.ilsistemista.net/index.php/hardware-ana...">AMD Bulldozer analysis
    Note that, while I collected and normalized data from multiple web site, I left very clear what was the original reference (so that you can easily verify my data).

    Thanks.
    Reply

Log in

Don't have an account? Sign up now