NVIDIA's Tegra 3 Launched: Architecture Revealed
by Anand Lal Shimpi on November 9, 2011 12:34 AM EST
NVIDIA is finally officially launching the next-generation SoC it originally announced in February of this year at MWC. Previously known under the code name Kal-El, the official name is Tegra 3 and we'll see it in at least one product before the end of the year.
Like Tegra 2 before it, NVIDIA's Tegra 3 is an SoC aimed at both smartphones and tablets built on TSMC's 40nm LPG process. Die size has almost doubled from 49mm^2 to somewhere in the 80mm^2 range.
The Tegra 3 design is unique in the industry as it is the first to put four ARM Cortex A9s on a chip aimed at the bulk of the high-end Android market. NVIDIA's competitors have focused on ramping up the performance of their dual-core solutions either through higher clocks (Samsung Exynos) or through higher performing microarchitectures (Qualcomm Krait, ARM Cortex A15). While other companies have announced quad-core ARM-based solutions, Tegra 3 will likely be the first (and only) to ship in an Android tablet and smartphone in 2011 - 2012.
NVIDIA will eventually focus on improving per-core performance with subsequent iterations of the Tegra family (perhaps starting with Wayne in 2013), but until then Tegra 3 attempts to increase performance by exploiting thread level parallelism in Android.
GPU performance also gets a boost thanks to a larger and more efficient GPU in Tegra 3, but first let's talk about the CPU.
Tegra 3's Four Five Cores
The Cortex A9 implementation in Tegra 3 is an improvement over Tegra 2; each core now includes full NEON support via an ARM MPE (Media Processing Engine). Tegra 2 lacked any support for NEON instructions in order to keep die size small.
NVIDIA's Tegra 2 die
NVIDIA's Tegra 3 die, A9 cores highlighted in yellow
L1 and L2 cache sizes remain unchanged. Each core has a 32KB/32KB L1 and all four share a 1MB L2 cache. Doubling core count over Tegra 2 without a corresponding increase in L2 cache size is a bit troubling, but it does indicate that NVIDIA doesn't expect the majority of use cases to saturate all four cores. L2 cache latency is 2 cycles faster on Tegra 3 than on Tegra 2, while L1 cache latencies haven't changed. NVIDIA isn't commenting on L2 frequencies at this point.
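To put the cache concern in numbers, here is a back-of-the-envelope sketch in Python. It assumes the shared L2 is divided evenly among the cores, which is a simplification: a shared cache is not partitioned this way in practice, and the function name is made up for illustration.

```python
# Per-core share of the shared 1MB L2, using only the figures quoted above.
L2_KB = 1024  # 1MB shared L2, unchanged from Tegra 2 to Tegra 3


def l2_share_kb(cores: int) -> float:
    """Even split of the shared L2 across `cores` cores (a simplification)."""
    if cores < 1:
        raise ValueError("need at least one core")
    return L2_KB / cores


print(l2_share_kb(2))  # Tegra 2: two cores sharing the L2
print(l2_share_kb(4))  # Tegra 3: four cores sharing the L2
```

With all four cores busy, each one has half the L2 capacity to itself that a Tegra 2 core had, which is why the unchanged cache size matters mostly for heavily threaded workloads.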
The A9s in Tegra 3 can run at a higher max frequency than those in Tegra 2. With 1 core active, the max clock is 1.4GHz (up from 1.0GHz in the original Tegra 2 SoC). With more than one core active however the max clock is 1.3GHz. Each core can be power gated in Tegra 3, which wasn't the case in Tegra 2. This should allow for lightly threaded workloads to execute on Tegra 3 in the same power envelope as Tegra 2. It's only in those applications that fully utilize more than two cores that you'll see Tegra 3 drawing more power than its predecessor.
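The clock limits above boil down to a simple rule. The Python sketch below encodes only the figures quoted in this article; the function name is hypothetical and this is not NVIDIA's actual frequency governor.

```python
# Max clock of Tegra 3's main A9 cluster as a function of active cores,
# per the figures quoted above. Illustrative only.
MAIN_CLUSTER_CORES = 4


def max_clock_ghz(active_cores: int) -> float:
    """Return the max per-core clock for the given number of active cores."""
    if not 1 <= active_cores <= MAIN_CLUSTER_CORES:
        raise ValueError("active_cores must be between 1 and 4")
    # A single active core may run at 1.4GHz; two or more are capped at 1.3GHz.
    return 1.4 if active_cores == 1 else 1.3


for n in range(1, MAIN_CLUSTER_CORES + 1):
    print(n, "active core(s):", max_clock_ghz(n), "GHz")
```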
The increase in clock speed and the integration of MPE should improve performance a bit over Tegra 2 based designs, but obviously the real hope for performance improvement comes from using all four of Tegra 3's main cores. Android is already well threaded so we should see gains in workloads such as web page rendering.
It's an interesting situation that NVIDIA finds itself in. Tegra 3 will show its biggest performance advantage in applications that can utilize all four cores, yet it will be most power efficient in applications that use as few cores as possible.
There's of course a fifth Cortex A9 on Tegra 3, limited to a maximum clock speed of 500MHz and built using LP transistors like the rest of the chip (and unlike the four-core A9 cluster). NVIDIA intends for this companion core to be used for the processing of background tasks, for example when your phone is locked and in your pocket. In light use cases where the companion core is active, the four high performance A9s will be power gated and overall power consumption should be tangibly lower than Tegra 2.
Despite Tegra 3 featuring a total of five Cortex A9 cores, only four can be active at one time. Furthermore, the companion core cannot be active alongside any of the high performance A9s. Either the companion core is enabled and the quad-core cluster disabled or the opposite.
NVIDIA handles all of the core juggling through its own firmware. Depending on the level of performance Android requests, NVIDIA will either enable the companion core or one or more of the four remaining A9s. The transition should be seamless to the OS and as all of the cores are equally capable, any apps you're running shouldn't know the difference between them.
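The constraints described above can be modeled in a few lines of Python. This is a minimal sketch, not NVIDIA's firmware: the `select_cores` policy, its inputs (`demand_mhz`, `runnable_threads`) and the decision rule are all assumptions made up for illustration. Only the 500MHz companion limit, the four-core main cluster and the mutual exclusion between the two come from the article.

```python
from dataclasses import dataclass

COMPANION_MAX_MHZ = 500  # from the article
MAIN_CORES = 4           # from the article


@dataclass(frozen=True)
class CoreState:
    companion_on: bool
    main_cores_on: int

    def __post_init__(self):
        if not 0 <= self.main_cores_on <= MAIN_CORES:
            raise ValueError("main_cores_on must be between 0 and 4")
        # The companion core never runs alongside the main cluster.
        if self.companion_on and self.main_cores_on > 0:
            raise ValueError("companion core and main cluster are mutually exclusive")


def select_cores(demand_mhz: int, runnable_threads: int) -> CoreState:
    """Hypothetical policy: use the companion core for light, single-threaded load."""
    if demand_mhz <= COMPANION_MAX_MHZ and runnable_threads <= 1:
        return CoreState(companion_on=True, main_cores_on=0)
    # Otherwise power up one main core per runnable thread, capped at four.
    cores = max(1, min(runnable_threads, MAIN_CORES))
    return CoreState(companion_on=False, main_cores_on=cores)


print(select_cores(demand_mhz=300, runnable_threads=1))
print(select_cores(demand_mhz=1200, runnable_threads=8))
```

Putting the exclusion rule in the state type rather than in the policy means any policy, however it decides, cannot produce a configuration the hardware does not allow.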
94 Comments
MamiyaOtaru - Friday, November 11, 2011 - link
what i liked about sound*storm* and what has me using cmedia now is DDL
B3an - Wednesday, November 9, 2011 - link
Just a little thing, but the Transformer Prime has an IPS+ display, not a typical IPS display which you have listed. Asus claims the + version is 1.5x brighter than a normal IPS display.
I'm impressed by the specs of the Prime, in literally EVERY single way (possibly apart from the GPU) the Prime is better than the iPad 2.... thinner, lighter, better display (apparently), higher res too, twice as much RAM, SD slot, and more than twice as many cores that are each also clocked higher...
If it's this good in the real world then I'll be impressed that Asus could afford to make such a product and keep it at the same price as the iPad.
name99 - Wednesday, November 9, 2011 - link
"in literally EVERY single way ... better than the iPad 2"
You sure about that? You know, for a FACT, that the flash is faster? That the WiFi supports 5GHz and is faster? That there is the same range of sensors (including, eg, magnetometer, accelerometer, gyro, proximity sensor, light sensor, and a dozen I've forgotten) and that every one of them is better than on iPad2?
There is a HUGE amount to iPad2 that people seem to forget because it's just hiding there under the covers, it doesn't advertise itself.
ncb1010 - Saturday, November 12, 2011 - link
Yes, the Prime has a magnetometer, a gyro, a compass and a light sensor according to theverge.com (ex-Engadget staff). The base iPad model is missing a key sensor (GPS), but this includes it at the same price point. The iPad has no flash in any sense of the word (Adobe Flash or camera flash), while the Transformer has both Adobe Flash and a camera flash, so I really don't see how it has faster flash (do you mean shutter speed?). Besides, of all the specs we know on the camera, it looks to be a lot better than the ones Apple put in there to upsell people on the iPad 3. As far as a proximity sensor, what would be the purpose of it? The purpose in the iPad is to detect when the custom cover on the iPad is put on and removed. The pros of the Optimus Prime hardware-wise are numerous, while the iPad has some theoretical benefits just because we don't know every single detail on the Prime. You are grasping at straws here.
AuDioFreaK39 - Wednesday, November 9, 2011 - link
Quick question for Tegra 3 architecture engineers: Is the "companion core" identified as Core 0 or Core 4? Thanks in advance.
Draiko - Wednesday, November 9, 2011 - link
Good question. I'd love a solid answer myself but from the core demo video, it looks like it's core 0.
Anonymous Blowhard - Wednesday, November 9, 2011 - link
IANAD (I Am Not A Developer) but I'm betting it's actually still tagged as 0, with lower-level firmware switching as to whether or not "core 0" is the companion core or a full core.
Remember that the companion core cannot be run at the same time as the full cores, so it's likely that when the demand-based switching kicks in, "companion core 0" is spun down, "full core 0" is spun up, and the rest of "full core 1/2/3" come online as well.
Since this is happening at the firmware/lower level vis-a-vis x86 "Turbo Core" it will be transparent to the OS.
/but that's, like, just my opinion man
eddman - Wednesday, November 9, 2011 - link
I agree with Anonymous Blowhard. The OS can't see all 5 cores at the same time, so companion core would be 0 when it's enabled.
mythun.chandra - Wednesday, November 9, 2011 - link
Core 0 :)
allingm - Thursday, November 10, 2011 - link
While you guys are probably right, and it is probably just core 0, there is the possibility that it's cores 0 - 4. All 4 threads could simply be run on the one core and this would make it seamless to the OS, which seems to be what Nvidia suggests.