Qualcomm Demos 48-Core Centriq 2400 Server SoC in Action, Begins Sampling
by Anton Shilov on December 16, 2016 6:00 PM EST

Qualcomm this month demonstrated its 48-core Centriq 2400 SoC in action and announced that it had started to sample its first server processor with select customers. The live showcase is an important milestone for the SoC because it proves that the part is functional and is on track for commercialization in the second half of next year.
Qualcomm announced plans to enter the server market more than two years ago, in November 2014, but the first rumors about the company’s intentions to develop server CPUs emerged long before that. In fact, as one of the largest designers of ARM-based SoCs for mobile devices, Qualcomm was well prepared to move beyond smartphones and tablets. However, while it is not easy to develop a custom ARMv8 processor core and build a server-grade SoC, building an ecosystem around such a chip is even more complicated in a world where ARM-based servers are still used only in isolated cases. From the very start, Qualcomm has been serious not only about the processors themselves but also about the ecosystem and third-party support (Facebook was one of the first companies to back Qualcomm’s server efforts). In 2015, Qualcomm teamed up with Xilinx and Mellanox to ensure that its server SoCs are compatible with FPGA-based accelerators and data-center connectivity solutions (the fruits of this partnership will likely emerge in 2018 at the earliest). The company then released a development platform featuring its custom 24-core ARMv8 SoC and made it available to customers and various partners among ISVs, IHVs and so on. Earlier this year the company co-founded the CCIX consortium to standardize a coherent interconnect for special-purpose accelerators in data centers and to make certain that its processors can support them. Taking into account all the evangelization and preparation work that Qualcomm has disclosed so far, it is evident that the company is very serious about its server business.
From the hardware standpoint, Qualcomm’s initial server platform will rely on the company’s Centriq 2400-series family of microprocessors, which will be made using a 10 nm FinFET fabrication process in the second half of next year. Qualcomm does not name the exact manufacturing technology, but the timeframe points to either Samsung’s performance-optimized 10LPP or TSMC’s CLN10FF (keep in mind that TSMC has a lot of experience fabbing large chips, and a 48-core SoC is not going to be small). The key element of the Centriq 2400 will be Qualcomm’s custom ARMv8-compliant 64-bit core code-named Falkor. Qualcomm has yet to disclose details about Falkor, but the important thing here is that this core was purpose-built for data-center applications, which means that it will likely be faster than the cores used inside the company’s mobile SoCs when running appropriate workloads. Qualcomm currently keeps the particulars of its cores under wraps, but it is logical to expect the developer to increase the frequency potential of the Falkor cores (versus mobile ones), add support for an L3 cache, and make other tweaks to maximize their performance. The SoCs do not support multi-threading or multi-socket (SMP) configurations, hence boxes based on the Centriq 2400-series will be single-socket machines able to handle up to 48 threads. The core count is an obvious promotional point that Qualcomm is going to use against competing offerings, and it is naturally going to capitalize on the fact that it takes two Intel multi-core CPUs to offer the same number of physical cores. Another advantage of the Qualcomm Centriq over rivals could be the integration of various I/O components (storage, network, basic graphics, etc.) that are currently provided by a PCH or other chips, but that is something the company has yet to confirm.
From the platform point of view, Qualcomm follows ARM’s guidelines for servers, which is why machines running the Centriq 2400-series SoC will be compliant with ARM’s Server Base System Architecture (SBSA) and Server Base Boot Requirements (SBBR). The former is not a mandatory specification, but it defines an architecture that developers of OSes, hypervisors, software and firmware can rely on. As a result, servers compliant with the SBSA promise to support more software and hardware components out of the box, an important thing for high-volume products. Apart from giant cloud companies like Amazon, Facebook, Google and Microsoft that develop their own software (and who are evaluating Centriq CPUs), Qualcomm targets traditional server OEMs like Quanta or Wiwynn (a subsidiary of Wistron) with the Centriq, and for these companies software compatibility matters a lot. That said, Qualcomm’s primary targets are the large cloud companies; server makers have not yet received their Centriq samples.
During the presentation, Qualcomm demonstrated Centriq 2400-based 1U single-socket (1P) servers running Apache Spark, Hadoop on Linux, and Java: a typical set of server software. No performance numbers were shared, and the company did not open up the boxes so as not to disclose any further information about the platform (i.e., the number of DDR memory channels, type of cooling, supported storage options, etc.).
Qualcomm intends to start selling its Centriq 2400-series processors in the second half of next year. It typically takes developers of server platforms about a year to polish their designs before they can ship, so by the usual rules commercial Centriq 2400-based machines would not be expected until well after the chips become available. But since Qualcomm wants to address operators of cloud data centers first, and companies like Facebook and Google design and build their own servers, these customers do not have to extensively test the chips across many applications; they just need to make sure the chips can run their software stacks.
As for the server world outside of the cloud giants, it remains to be seen whether the broader industry will bite on Qualcomm’s server platform, given the lukewarm welcome ARMv8 servers have received in general. For these markets, performance, compatibility, and longevity are all critical factors in adopting a new platform.
Related Reading:
- Evaluating Futuremark's Servermark VDI on the Supermicro SYS-5028D-TN4T
- New GIGABYTE Server Motherboards Show Xeon D Round 2
- AMD Exits Dense Microserver Business, Ends SeaMicro Brand
Source: Qualcomm
88 Comments
MrSpadge - Monday, December 19, 2016
Side note: Samsung had some Exynos chips on their 32 nm process.
deltaFx2 - Saturday, December 17, 2016
"You'll note that frequency decreases in Intel processors are core count increases". Your observation is correct; your conclusion is entirely wrong; ARM, SPARC, Power etc will all face this bottleneck. To understand this, you have to first realize that CPUs ship with a TDP. Imagine a 4 core CPU at 95W TDP vs. an 8-core CPU at 95W TDP. The 8-core is a bigger die (more transistors), and is doing more work. Simple physics dictates that at the same voltage and frequency, the 8-core will burn more power. But, we've set the TDP at 95W (cooling solutions get more expensive), so the 8-core drops voltage and frequency to meet this TDP. Many memory-bound or I/O bound workloads do not scale with frequency, so the trade-off can be a net win for many applications. (It may be possible to overclock the 8-core to that of the 4-core with fancy cooling. Don't count on it though. It's possible that the 8-core is a binned part that leaks too much at high V, or is unstable at high f).As to the x86 vs ARM myth... the "efficiency" of ARM is a canard at least as far as high performance CPUs are concerned. Sure it increases the die size and possibly a smidgen more power, but in a high performance CPU, the decoder is tiny compared to the size of the core itself, and the massive caches. Most of the power burnt is in the out-of-order engine (scheduler, load-store) which isn't drastically different in x86 vs ARM.
Also recall that x86 displaced supposedly superior RISC processors in servers (SPARC, Itanium, Alpha, POWER, PA-RISC, etc.) at a time when the decoder was a larger fraction of the die. Also, if single-threaded performance didn't matter, AMD's Bulldozer and derivatives would be ruling the roost. The Bulldozer family's single-threaded performance is still higher than that of any ARM vendor, and Excavator went a long way in addressing the power. AMD has less than 1% market share in servers. Even Sun's Niagara, which was in-order, was forced to move to OoO to address weak single-threaded performance. That ought to tell you something.
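A minimal numeric sketch of the TDP argument in the comment above, using the classic dynamic-power relation P ≈ N·C·V²·f. All capacitance, voltage and frequency values here are invented purely for illustration, not taken from any real part:

```python
def dynamic_power(cores, cap_per_core, voltage, freq_ghz):
    """Classic CMOS dynamic power: P ~ N * C * V^2 * f (arbitrary illustrative units)."""
    return cores * cap_per_core * voltage**2 * freq_ghz

TDP = 95.0   # both hypothetical parts ship with the same 95 W budget
CAP = 5.0    # made-up per-core switched capacitance

# 4-core part at its shipping voltage/frequency point:
print(dynamic_power(4, CAP, 1.20, 3.30))   # ~95 W -> fits the budget

# The same voltage and frequency with 8 cores blows straight through the TDP:
print(dynamic_power(8, CAP, 1.20, 3.30))   # ~190 W -> not shippable

# So the 8-core part ships at a lower voltage *and* frequency point:
print(dynamic_power(8, CAP, 1.05, 2.15))   # ~95 W -> back under the budget
```

The point is simply that the frequency drop on higher-core-count parts falls out of a fixed power budget, not out of anything specific to the x86 ISA.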
pkese - Sunday, December 18, 2016
I think you are right. The thing is that x86 instructions being so complex to decode, forced Intel to store decoded instructions into a dedicated μop cache. That not only saves power (for fetching these and decoding these instructions) but also skips some steps in the pipeline when caches get hit.ARM on the other hand, having a more simple instruction set, can decode instructions on the fly, so it doesn't need μop cache. But then it still needs to fetch and decode them at each cycle. When you multiply that with somewhat worse code density for ARM (i.e. more bytes per each ARM instruction as compared to x86), you probably start wasting picojoules big time.
Wilco1 - Sunday, December 18, 2016
No, you're both wrong. A micro-op cache helps x86 power and performance, but it adds a lot of extra complexity and design time, and it will still be more power hungry than a RISC CPU that can decode instructions in just one cycle (see the Cortex-A73 article). It's also incorrect to suggest that the only differences are in the decoder: the ISA permeates throughout the CPU; think partial flag and register writes, memory ordering, microcoding support, the stack engine, the extra load pipeline to support load+op, unaligned access/fetch, the better branch handling required if you don't have conditional execution or an efficient CSEL, etc.

Note that the code density of Thumb-2 is far better than x86, and the density of AArch64 is far better than x64. The idea that CISC has smaller code size is a myth, just like "the ISA doesn't matter".
deltaFx2 - Sunday, December 18, 2016
@Wilco1: You have no idea what you're talking about. Let's start with the x86 ISA vs A64 (not AArch32). The chief complexity of x86 decode is variable instruction length, not CISC. Most typical code paths use x86's simpler instructions (including ld-op) that do not crack into more than 2 ops.

The A73 is a wimpy 2-wide OoO core at a relatively low fmax. Nobody in their right mind would use that thing in a server. Apple's A* cores are a far better example, as they are 6-wide. Their minimum mispredict penalty is 9-10 cycles (down from 14 earlier) and they max out at 2.3 GHz. Does that sound like one stage for decode? Hint: complexity is a super-linear, often quadratic, function of machine width.
"density of AArch64 is far better than x64": I'm searching for a word that describes the excrement of a bovine, but it escapes me. Some third-party data please.You mean to say that an ISA that has constant 4-byte inst, and no ld-op/ld-op-store or complex ops has better code density than x86? The average instruction length of an x86 instruction is a little over 2 bytes for integer workloads. AVX* may be longer than 4-bytes but it does 256-bit/512-bit arithmetic; ARMV8 tops out at 128 bit.
x86 has a CSEL instruction. It's called cmovcc, and it's been there since before AArch64 was a glint in ARM's eye. As have POWER and every other ISA worth its salt.
Stack engine: it pre-computes SP-relative pushes and pops so that you don't need to execute them in the scheduler; a power feature. AArch64 also has a stack, as does any CPU ever designed.
Partial register writes: apart from the legacy 8/16-bit forms (AL, AH, AX), x86 zeroes out the upper bits of RAX when EAX is written (same as ARM's W0->X0). Nothing's partial. At any rate, partial writes only create a false dependency.
Memory ordering: This is a minor quibble. It's a little more hardware complexity vs. expecting s/w to put DSBs in their code (which is expensive for performance. And people may go to town over it because debugging is a nightmare vs strong models). SPARC also uses TSO, so it's not just x86.
Aarch64 supports unaligned loads/stores. There is no extra pipeline needed for ld-op in x86 if you crack it into uops. Where are you coming up with all this?
Fun fact: Intel dispatches ld-op as 1 op with a max dispatch width of 4. So to get the same dispatch rate on ARM, you'd need to dispatch 8. Even Apple's A10 is only 6-wide.
Aarch64 is simpler than x86 but it's not simple by any means. It has ld-op (ld2/ld3/ld4), op-store (st2/st3/st4), load-pair, store-pair, arithmetic with shifts, predicated arithmetic, etc. Just to name a few.
Now on to Thumb/Aarch32:
* Yeah, thumb code is dense, but thumb code is 32-bit. No server would use it.
* Thumb is variable-length. See note above about variable length decode. A slightly simpler version of the x86 problem, but it exists.
* Aarch32 has this fun instruction called push and pop that writes/reads the entire architected register file to/from the stack. So 1 instruction -> 15 loads. Clearly not a straight decode. There are more, but one example should suffice.
* Aarch32 can predicate any instruction. Thumb makes it even fancier by using a prefix (an IT prefix) to predicate up to 4 instructions. Remember instruction prefixes in x86?
*Aarch32 has partial flag writes.
*Aarch32 Neon has partial register writes. S0,S1,S2,S3 = D0,D1 = Q0. In fact, A57 "deals" with this problem by stalling dispatch when it detects this case, and waits until the offending instruction retires.
Plenty of inefficiencies in Aarch32, too many to go over. ARM server vendors (AMCC, Cavium) weren't even supporting it. Apple does, AFAIK.
uopCache: This is a real overhead for x86. That said, big picture: In a server part, well over half the die is uncore. The area savings from the simpler ISA are tiny when you consider I$ footprint blowup. (Yes, it's real. No, you can't mix Aarch32 and A64 in the same program).
ISA does not matter, and has not mattered. x86 has matched and surpassed the performance of entrenched RISC players. Intel's own foray into a better RISC ISA (Itanium) has been an abject failure.
name99 - Monday, December 19, 2016
(("density of AArch64 is far better than x64": I'm searching for a word that describes the excrement of a bovine, but it escapes me.))You do yourself no favors by talking this way. You make some good points but they are lost when you insist on nonsense like the above.
https://people.eecs.berkeley.edu/~krste/papers/EEC...
on page 62 is a table proving the point. If you bother to read the entire thesis, there is also a section describing WHY x86 is not as dense as you seem to think.
Likewise, the issue with AArch64 (and with any instruction set in the 21st century) is not the silly 1980s things you are concerned with (arithmetic with shifts, load pair, and so on). The 1980s RISC concerns were as much as anything about transistor count, but no one cares about transistor counts these days. What matters today (and was of course a constant concern during the design of AArch64) is state that bleeds from one instruction to another, things like poorly chosen flags and "execution modes", and for the most part AArch64 doesn't have those. Of course, if you insist on being silly, there remains some of that stuff in the AArch32 portion of ARMv8, but that portion is obviously going to be tossed soon. (It has already been tossed by at least one ARM server architecture, will be tossed by Apple any day now, and eventually Android will follow.)
Similarly, you don't seem to understand the issues with stacks. The point is that x86 uses a lot of PUSH and POP type instructions, and these are "naturally serializing" in that you can't "naturally" run two or more of them in a single cycle because they change implicit state. (Each one generates, e.g., a store instruction, an AGEN, and a "change the value of the SP".)
Modern code (and that includes AArch64 code) does not use pushes and pops: it keeps the SP fixed during a stack frame and reads/writes relative to that FIXED stack pointer. So there are no push/pop type instructions and no implicit changing of the SP as content is read from or written to the stack.
Which means there is also no need for the hassle of a stack engine to deal with the backwardness of using push and pop to manipulate the stack.
And to call Itanium a RISC ISA shows a lack of knowledge of RISC and/or Itanium that is breathtaking.
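A minimal decode-time model of the stack engine both commenters refer to. The op encoding, register names and word size below are invented for illustration and do not describe any specific CPU's implementation; the idea is simply that the implicit SP updates from push/pop are folded into a running offset at decode, so they never occupy the out-of-order scheduler, and the real SP is only materialized when an instruction actually needs its value:

```python
def stack_engine(ops, word=8):
    """Decode-time model: fold the implicit SP updates of push/pop into a
    running offset ('delta') so no ALU micro-op is needed for them; the
    real SP is only re-materialized when an instruction needs its value."""
    uops, delta = [], 0
    for op in ops:
        if op[0] == "push":              # x86 push: SP -= word, then store at new SP
            delta -= word
            uops.append(("store", op[1], f"[SP{delta:+d}]"))
        elif op[0] == "pop":             # x86 pop: load from SP, then SP += word
            uops.append(("load", op[1], f"[SP{delta:+d}]"))
            delta += word
        else:                            # anything that reads/writes SP directly
            if delta:                    # materialize the accumulated offset once
                uops.append(("sync_sp", f"SP += {delta}"))
                delta = 0
            uops.append(op)
    return uops

# Two pushes, one pop, then an instruction that needs the real SP:
for uop in stack_engine([("push", "rbp"), ("push", "rbx"),
                         ("pop", "rbx"), ("uses_sp",)]):
    print(uop)
```

In this toy trace the three pushes/pops become plain stores/loads with precomputed offsets, and only the final SP-consuming instruction triggers a single "SP += -8" update.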
Kevin G - Monday, December 19, 2016
@name99

Interesting choice of compiler flags in that paper for testing. They weren't set for minimum binary size, which would have been a more interesting comparison point (-Os instead of -O2), nor for the most aggressive optimization (-O3). I get why they didn't choose -O3, since it can inline certain function calls and thus bloat size, but the other flag (-Os) seems worth enabling to see what it does for code size. If anything, some of the vectorization could have increased binary size on x86, as vector instructions tend to be on the larger side in x86.
The other thing that stands out is the relatively large standard deviation on x86-64 code size. x86 binaries certainly can be smaller than ARMv7 or ARMv8 but there is no guarantee. x86 does beat out traditional MIPS and it would have been nice to see PowerPC included as another data point for comparison.
Additionally, it would have been nice to see a comparison with icc instead of gcc for binary size. Granted, this changes the comparison by introducing a compiler variable, but I suspect that Intel's own compiler is capable of producing smaller code than gcc.
In closing, that paper has valuable real data for comparison but I wouldn't call it the last word on x86 code size.
name99 - Monday, December 19, 2016
When you've gone from "x86 is unequivocally denser" to "well, sometimes, under the right circumstances, x86 is denser", this is no longer a dispositive issue and so of no relevance to the conversation.
deltaFx2 - Tuesday, December 20, 2016
I owe Wilco an apology for the strong statement. However, the data you presented still does not corroborate his statement that "density of AArch64 is far better than x64". It appears to be within the same error margin. I already conceded in another post that the answer is "it depends"; you can't make a blanket statement like that, though.

And I was lazy in calling Itanium RISC. I meant designed on RISC principles (I know, it's a VLIW-like processor that goes beyond vanilla VLIW). Intel did do a RISC CPU too at some point that also failed (i960?).
Legacy x86 probably uses a lot of push-pop. Plenty of modern x86 code does what you say (frame-pointer or SP relative loads). But fair enough, if you're not using Aarch32, you won't see a big performance benefit from it other than saving on the SP-relative Agen latency.
I'm being silly by arguing that Aarch64 isn't as "simple" as people think? Load pair, store pair, loads/stores with pre or post indexing, the entire ld1/ld2/ld3/ld4 and their store variants, instructions that may modify flags, just to name a few? C'mon. RISC as a concept has been somewhat nebulous over the years, but I don't believe instructions with multiple destinations (some implicit in the pre/post indexing or flags case) were ever considered RISC. These are multi-op instructions that need a sequencer. BTW, I believe an ISA is a tool to solve a problem, not a religious exercise. ARM is perfectly justified in its ISA choices, but it's not RISC other than the fact that it is fixed instruction-length and load-store (for the most part). Mostly RISC, except when it's not, then?
* ld2/ld3/ld4 etc. are limited gather instructions. ARM's canonical example is using ld3 to unpack interleaved RGB channels in, say, a .bmp into vectors of R, G, and B. st* does the opposite.
Wilco1 - Wednesday, December 21, 2016
Let me post some data for the GCC test in SPEC, built with the latest GCC 7.0 using -O3 -fno-tree-vectorize -fomit-frame-pointer. Of course it is only one data point, but it is fairly representative of a lot of integer (server) code. The x64 text+rodata size is 3853865 bytes, AArch64 is 3492616 (9.4% smaller). Interestingly, the pure text size is 2.7% smaller on x64, but the rodata is 92% larger; my guess is that AArch64 inlines more immediates and switch statements as code, while x64 prefers using tables.

The average x64 instruction size in GCC is 4.0 bytes (text size for integer code without vectorization). As for load+op, there are 101K loads but only 5300 load+op instructions (i.e. 5.2% of loads, or 0.7% of all instructions). Half of those are CMP (the majority with zero); the rest is mostly ADD and OR (so with a compare+branch instruction, load+op has practically no benefit). There are a total of 2520 cmovs in 733K instructions.
As for RISC, the original principles were fixed-length instructions/simple decode (no microcode), a large register file, load/store, few simple addressing modes, mostly orthogonal 2-read/1-write instructions, and no unaligned accesses. At the time this enabled very fast pipelined designs with caches on a single chip. ARM used an even smaller design and transistor budget, focusing on a low-cost CPU that maximized memory bandwidth (the ARM2 was much faster than x86 and 68k designs of the time). I believe the original purist approach taken by MIPS, SPARC and Alpha was insane, and led to initial implementations poisoning the ISA with things like a division step, register windows, no byte/halfword accesses, no unaligned accesses, no interlocks, delayed branches, etc. Today, being purist about RISC doesn't make sense at all.
As for AArch64, load/store pair can be done as 2 accesses on simpler cores, so it's just a codesize gain, while faster cores will do a single memory access to increase bandwidth. Pre/post-indexing is generally for free on in-order cores while on OoO cores it helps decode and rename (getting 2 micro-ops from a single instruction). LD2/3/4 are indeed quite complex, but considered important enough to enable more vectorization.
The key is that the complex instructions are not so complex that they complicate things too much and slow down the CPU. In fact they enable implementations (both high-end and low-end) to gain various advantages (code size, decode width, rename width, L1 bandwidth, power, etc.). And this is very much the real idea behind RISC: keeping things as simple as possible and only adding complexity when there is a quantifiable benefit that outweighs the cost.
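For readers who want to reproduce this kind of code-size comparison, here is a rough sketch using standard binutils tools driven from Python. The binary names and the cross-objdump name are placeholders/assumptions (your toolchain names may differ), and dividing .text size by the instruction count only approximates average instruction length, since padding and literal pools are counted too:

```python
import re
import subprocess

def section_size(binary, section):
    """Pull one section's size (bytes) from `size -A` (SysV-style) output."""
    out = subprocess.run(["size", "-A", binary],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        parts = line.split()
        if parts and parts[0] == section:
            return int(parts[1])
    return 0

def instruction_count(binary, objdump="objdump"):
    """Count disassembled instruction lines in `objdump -d` output."""
    out = subprocess.run([objdump, "-d", binary],
                         capture_output=True, text=True, check=True).stdout
    # Instruction lines look like "  401000:\t55 ...\tpush ..."; label lines do not.
    return sum(1 for line in out.splitlines()
               if re.match(r"^\s+[0-9a-f]+:\t", line))

# Placeholder binary names; the AArch64 build needs a matching cross objdump.
for binary, objdump in [("gcc_x86_64", "objdump"),
                        ("gcc_aarch64", "aarch64-linux-gnu-objdump")]:
    text = section_size(binary, ".text")
    rodata = section_size(binary, ".rodata")
    insns = instruction_count(binary, objdump)
    print(f"{binary}: text+rodata = {text + rodata} bytes, "
          f"~{text / insns:.2f} bytes/instruction over {insns} instructions")
```

Running something along these lines on the same source built for both targets is how text/rodata splits and average instruction lengths like the ones quoted above can be derived.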