The Intel Broadwell Xeon E3 v4 Review: 95W, 65W and 35W with eDRAM
by Ian Cutress on August 26, 2015 9:00 AM EST
Our Broadwell coverage on the desktop has included reviews of the two consumer processors and a breakdown of IPC gains from generation to generation. One issue surrounding Broadwell on consumer platforms was that the top quad-core model was rated at up to one third less power than previous Intel quad-core processors. Specifically, Broadwell is 65W against 84-95W in past generations. This puts Broadwell’s out-of-the-box peak performance at a TDP (and frequency) disadvantage. However, in a somewhat under-the-radar launch, Intel also released a series of Broadwell Xeons under the E3-12xx v4 line. We sourced three socketed models, the E3-1285 v4 at 95W, the E3-1285L v4 at 65W and the E3-1265L v4 at 35W, to get a better view of Broadwell's scaling across different power requirements.
Broadwell Xeon Overview
In almost every sense of the word, the launch of Broadwell in a socketed format has been fairly muted. For low-power mobile platforms at 4W and 15W, Broadwell was promoted heavily and the architecture has had many design wins; but for the desktop only two socketed consumer parts were launched. To that end, Intel performed only post-launch sampling for review websites, resulting in many users learning about the performance well after the official launch (AnandTech had you covered on day one!). Even now, several weeks later, the i7-5775C and i5-5675C are both hard to source in several regions. Complicating matters, Intel’s platform after Broadwell, Skylake, was launched soon after in early August with a bigger focus on gaming and end-user experiences, alongside the announcement that the Skylake Xeon family would extend into dedicated mobile processors. This has the effect of consigning Broadwell on the desktop to obscurity, whether intentionally or not (cue the conspiracy theorists).
The crumb of comfort in Broadwell is its use of 128MB of eDRAM. This acts as a fully associative last level victim cache (or L4) for the processor, and speeds up certain workloads that are memory dependent and subject to L3 cache misses when data has been previously evicted. The eDRAM is connected over a narrow double-pumped serial interface capable of delivering 50GB/s in each direction (100GB/s aggregate). Access latency after a miss in the L3 cache is 30 - 32ns, nicely in between an L3 and a main memory access. The major benefit in our testing was to the integrated graphics, giving Intel the best integrated graphics on a socketed platform where money is no object. Some reviews also saw that the eDRAM helped in discrete graphics gaming as well, although the effect was small and highly game dependent (but this raises other issues regarding higher performance on lower power/frequency processors with more on-package memory). The main downside of the eDRAM, however, is that it is CPU resident, not visible from the system agent to the DRAM, and thus only accessible to CPU/GPU workloads rather than accelerating data over the system IO.
What went further under the radar was the launch of the Broadwell Xeons for the business and server market. We reported on the launch, but there was seemingly nothing front facing about the marketing of these processors, suggesting Intel might be keeping them as a pure business-to-business product. All bar one of these processors support the eDRAM as well as several Xeon-specific features. Back with Haswell, Intel launched a single soldered (BGA) Xeon with eDRAM. By having three socketed variants (LGA) for Broadwell at ‘launch’, Intel satisfies business customers that want to upgrade from Haswell E3 v3 Xeons, and also gives business and server environments access to that eDRAM. One of the cited uses for it is active memory databases, where a larger chunk of faster memory closer to the processing cores results in fewer cache misses.
All the Broadwell Xeons are quad core with hyperthreading, with all bar one having Iris Pro P6300 (the professional version of Iris Pro 6200) on 48 EUs/GT3e. The exception is a single soldered part, which has the eDRAM disabled. (Note that the E3-1284L v3 is listed by CPU-World but not currently listed at ark.intel.com.) Aside from this, the models differ solely on the basis of processor frequency, graphics frequency and thermal design limits. It is interesting to note the differences between the E3-1285 v4 and the E3-1285L v4. Sitting at 95W and 65W respectively, that 30W difference in TDP is represented by only a 100 MHz difference in the base frequency. This is relatively odd, and suggests that the 65W part, the E3-1285L v4, is a better off-the-wafer part with preferred frequency/voltage characteristics, yet it also costs over $100 (~20%) less. This plays a significant part in our testing.
The eDRAM stands out compared to previous Xeons, although at the expense of 2MB of L3 cache compared to previous high end quad core models (or i7 equivalents). Some microprocessor analysts have said that the loss of 2MB of L3 is not that important when backed up by 128MB of a fast L4 type of cache, on the basis that the bandwidth of this L4 is 50GB/s and up before you hit main memory.
In a recent external podcast, David Kanter mentioned that for a multiple increase in cache size (e.g. 3x), cache misses decrease on average by a factor equal to the square root of that multiple (e.g. √3, or about 1.73, leaving roughly 58% of the original misses). So the move from 8MB of last level cache to 128MB + 6MB, despite the minor increase in latency moving out to eDRAM, is an effective 16x increase, reducing cache misses by a factor of four. This means, to quote, ‘if you have eight cache misses per thousand, you are now down to around two’. I take this to refer to a regular user workload, but in a higher throughput environment it could mean the difference between 2% and 0.5% cache misses out to main memory. Because the move out to main memory carries such a latency and bandwidth penalty compared to an on-package transfer between the CPU and L3/L4, even a small decrease in cache misses has performance potential when used in the right context. Anand quoted Intel’s Tom Piazza back in our Haswell eDRAM review about the size of the eDRAM: it was stated that 32MB should be enough, but it was doubled and then doubled again just to make sure, in the spirit of ‘go big or go home’. This has knock-on performance effects.
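To put numbers on the square-root rule of thumb described above, here is a minimal sketch. The function names are illustrative only, and the rule itself is an average heuristic rather than a guarantee for any particular workload.

```python
import math


def miss_reduction_factor(old_cache_mb: float, new_cache_mb: float) -> float:
    """Square-root rule of thumb: misses fall by sqrt(capacity ratio)."""
    return math.sqrt(new_cache_mb / old_cache_mb)


def misses_after_resize(misses: float, old_cache_mb: float, new_cache_mb: float) -> float:
    """Scale a miss count (e.g. misses per thousand accesses) by the rule of thumb."""
    return misses / miss_reduction_factor(old_cache_mb, new_cache_mb)


if __name__ == "__main__":
    # Tripling the cache: factor of sqrt(3)
    print(f"3x cache -> misses divided by {miss_reduction_factor(1, 3):.2f}")

    # 8MB of L3 on the older parts versus 6MB L3 + 128MB eDRAM on Broadwell
    old_mb, new_mb = 8, 6 + 128
    factor = miss_reduction_factor(old_mb, new_mb)
    print(f"{old_mb}MB -> {new_mb}MB: misses divided by {factor:.2f}")
    print(f"8 misses per thousand -> {misses_after_resize(8, old_mb, new_mb):.2f}")
```

Using the exact 134MB figure rather than a rounded 16x gives a factor slightly above four, which is consistent with the ‘eight down to around two’ quote.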
Users upgrading to Broadwell Xeons from Haswell (or those purchasing new systems outright) will get this eDRAM benefit at a lower cost than previous Xeons: the E3-1285L v3 from the Haswell architecture was launched at a price of $774, compared with the E3-1285L v4 which is $445. For the difference, the Broadwell processor comes with eDRAM and substantially better integrated graphics, all within the same thermal design. At 95W, the price moves from $662 to $556, a much smaller difference. This suggests that on Haswell, the lower power model was harder to produce, whereas with Broadwell that burden shifts to frequency.
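The generational price gap can be worked through with a few lines of arithmetic. This is only a sketch using the four launch prices quoted above; the dictionary layout and function name are illustrative.

```python
def price_drop(old_usd: float, new_usd: float) -> tuple[float, float]:
    """Return the absolute saving in dollars and the saving as a percentage."""
    saving = old_usd - new_usd
    return saving, 100.0 * saving / old_usd


# (Haswell launch price, Broadwell launch price) per power class, from the text
launch_prices_usd = {
    "65W": (774, 445),
    "95W": (662, 556),
}

if __name__ == "__main__":
    for tdp_class, (haswell, broadwell) in launch_prices_usd.items():
        saving, percent = price_drop(haswell, broadwell)
        print(f"{tdp_class}: ${haswell} -> ${broadwell}, "
              f"saving ${saving} ({percent:.1f}%)")
```

The 65W class drops by roughly 42%, against roughly 16% at 95W, which is the asymmetry the paragraph above points to.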
Graphics Virtualization and Upgrades
One of the benefits of the Broadwell Xeons with eDRAM lies in Intel's graphics virtualization technology (GVT). This affords three modes of operation:
These virtualization techniques allow data centers to essentially apply an accelerant to each VM, depending on how much each workload benefits from the GPU. With it being directly included in the CPU, no additional hardware is needed. Obviously this makes more sense when each virtual machine requires infrequent access to the integrated graphics, but for everything else, Intel is set to launch its Valley Vista platform, which will place three of these CPUs on an add-in PCIe card.
At IDF San Francisco this year, an announcement passed almost everyone by. Intel described an add-in card coming in Q4 2015 that features three Broadwell-H E3 Xeon processors on a single PCB, each with Iris Pro graphics.
Valley Vista is designed to allow for high density, workload specific work, in particular AVC transcoding. Aside from the slide above, there have been no real details as to how this card will work: whether there is a PCIe switch for communication, whether it runs in a virtualized layer, how the card is powered, or whether each of the processors on the card will have a fixed amount of DRAM associated with it. So far Supermicro has announced in a press release that one of its Xeon Phi platforms is suitable for the cards when they launch later this year. What we do know, however, is that Broadwell is not fully HEVC accelerated, so the utility in Valley Vista is most likely to be with AVC encode/decode.
As with previous socket drop-ins on the professional line, Intel is promoting the use of its C226 chipset - for our testing, we used an equivalent Z97 platform which worked as well.
This provides an as-is scenario, with sixteen lanes of PCIe 3.0 from the processor, two channels of DDR3/L-1600 memory with ECC support, a DMI 2.0 x4 link equivalent to 4 GB/s, and up to six native USB 3.0 and SATA 6 Gbps ports, depending on the high-speed IO configuration used in the chipset in conjunction with its eight PCIe lanes.
There isn’t much else to say here: we have covered Broadwell on the desktop and the differences are spelled out for end users, despite the current lack of direct availability in certain markets. These are Xeon processors, so no overclocking here, but the main parallel we should be making is between the E3-1285 v4 at 95W and the E3-1276 v3 at 84W. The E3-1276 v3 has some extra frequency (peaks at 4 GHz) and extra L3 cache, but the E3-1285 v4 has eDRAM.
Compared to Johan’s in-depth server reviews, the focus for the testing in this piece is primarily on workstation environments. Because we did not get a 95W ‘consumer’ based Broadwell for comparison, gaming tests were also performed. Unfortunately the Linux based server tests we typically use were not performed due to a spectacular failing of our Ubuntu LiveCD with these processors, even though it worked with the non-Xeon counterparts. We’re still trying to figure this one out, but we suspect it is a driver related issue. While in no way similar, in its stead we have SPECviewperf 12 on Windows with a discrete GPU (its typical use case) as an additional angle of comparison.
A side note to those who have recently asked: we are in the process of looking into appropriate repeatable compilation benchmarks and VM environment comparisons. Ideally we are aiming to finalize a series of tests that can be one-click batched and processed within a reasonable testing timeframe. These will not be ready until mid-September at the earliest due to other commitments, but when available we will try to run a number of past systems to acquire appropriate comparative data. To add comments, suggestions or preferences on the tests, please email firstname.lastname@example.org.
Motherboard | MSI Z97A Gaming 6
CPU Cooler | Cooler Master Nepton 140XL
Power Supply | OCZ 1250W Gold ZX Series
Memory | G.Skill RipjawsZ 4x4 GB DDR3-1866 9-11-11 Kit
Video Cards | ASUS GTX 980 Strix 4GB; MSI GTX 770 Lightning 2GB (1150/1202 Boost); ASUS R7 240 2GB
Storage | Crucial MX200 1TB
Case | Open Test Bed
Operating System | Windows 7 64-bit SP1
The dynamics of CPU Turbo modes, both Intel and AMD, can cause concern in environments with a variable threaded workload. There is also the added issue of keeping motherboard behavior consistent, depending on how the motherboard manufacturer wants to add in its own boosting technologies over the ones that Intel would prefer it used. In order to remain consistent, we implement an OS-level unique high performance mode on all the CPUs we test, which should override any motherboard manufacturer performance mode.
All of our benchmark results can also be found in our benchmark engine, Bench.
Many thanks to...
We must thank the following companies for kindly providing hardware for our test bed:
Thank you to AMD for providing us with the R9 290X 4GB GPUs.
Thank you to ASUS for providing us with GTX 980 Strix GPUs and the R7 240 DDR3 GPU.
Thank you to ASRock and ASUS for providing us with some IO testing kit.
Thank you to Cooler Master for providing us with Nepton 140XL CLCs.
Thank you to Corsair for providing us with an AX1200i PSU.
Thank you to Crucial for providing us with MX200 SSDs.
Thank you to G.Skill and Corsair for providing us with memory.
Thank you to MSI for providing us with the GTX 770 Lightning GPUs.
Thank you to OCZ for providing us with PSUs.
Thank you to Rosewill for providing us with PSUs and RK-9100 keyboards.
Load Delta Power Consumption
Power consumption was tested on the system while in a single GTX 770 configuration with a wall meter connected to the OCZ 1250W power supply. This power supply is Gold rated and, as I am in the UK on a 230-240 V supply, this leads to ~75% efficiency above 50W and 90%+ efficiency at 250W, suitable for both idle and multi-GPU loading. This method of power reading allows us to compare the power management of the UEFI and the board's ability to supply components with power under load, as well as the power delta from idle to CPU loading. All results include typical PSU losses due to efficiency.
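Because the wall readings include PSU losses, it can help to see roughly how a wall-side delta maps to power delivered to the components. The sketch below is an illustration only: it assumes the two efficiency points quoted above (~75% at 50W, 90% at 250W) apply to wall power and interpolates linearly between them, which is a simplification and not a measured efficiency curve. The 60W and 150W inputs are hypothetical readings, not results from this review.

```python
def psu_efficiency(wall_watts: float) -> float:
    """Assumed efficiency: linear between (50W, 0.75) and (250W, 0.90), clamped outside."""
    low_w, low_eff = 50.0, 0.75
    high_w, high_eff = 250.0, 0.90
    if wall_watts <= low_w:
        return low_eff
    if wall_watts >= high_w:
        return high_eff
    fraction = (wall_watts - low_w) / (high_w - low_w)
    return low_eff + fraction * (high_eff - low_eff)


def component_power(wall_watts: float) -> float:
    """Estimated power reaching the components after PSU losses."""
    return wall_watts * psu_efficiency(wall_watts)


def load_delta(idle_wall_watts: float, load_wall_watts: float) -> tuple[float, float]:
    """Return (wall-side delta, estimated component-side delta) in watts."""
    wall_delta = load_wall_watts - idle_wall_watts
    dc_delta = component_power(load_wall_watts) - component_power(idle_wall_watts)
    return wall_delta, dc_delta


if __name__ == "__main__":
    # Hypothetical wall readings, for illustration only
    wall_delta, dc_delta = load_delta(idle_wall_watts=60.0, load_wall_watts=150.0)
    print(f"Wall delta: {wall_delta:.1f}W, estimated component delta: {dc_delta:.1f}W")
```

The point of the exercise is that the wall-side delta is not directly comparable to a processor's TDP without accounting for where on the efficiency curve the idle and load readings sit.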
Power numbers are typically difficult to gauge as they depend on the stock voltage of the processor and how aggressive the motherboard wants to be in order to ensure stability. Thinking from the point of view of the motherboard manufacturer, they are more likely to overvolt a Xeon processor to ensure that stability rather than deal with any unstable platforms. As a result, we get an odd scenario where the 35W processor is almost hitting double its rated power consumption at load, and the 65W is also above its mark, but the 95W is below. To put an angle on this, the 110W we see on the i7-6700K was on one motherboard, but on others we have seen 76W as well as 84W. Without having access to the BIOS DVFS tables for each processor, it is difficult to explain mismatched data such as this.