NVIDIA Teases Xavier, a High-Performance ARM SoC for Drive PX & AI
by Ryan Smith on September 28, 2016 7:45 AM EST

Ever since NVIDIA bowed out of the highly competitive (and high pressure) market for mobile ARM SoCs, there has been quite a bit of speculation over what would happen with NVIDIA's SoC business. With the company enjoying a good degree of success with projects like the Drive system and Jetson, signs have pointed towards NVIDIA continuing their SoC efforts. But in what direction they would go remained a mystery, as the public roadmap ended with the current-generation Parker SoC. However we finally have an answer to that, and the answer is Xavier.
At NVIDIA's GTC Europe 2016 conference this morning, the company teased just a bit of information on its next-generation Tegra SoC, which it is calling Xavier (ed: in keeping with comic book codenames, this is Professor Xavier of the X-Men). Details on the chip are light – the chip won't even sample until over a year from now – but NVIDIA has laid out just enough information to make it clear that the Tegra group has left mobile behind for good, and that the company is now focused on high-performance SoCs for cars and other devices further up the power/performance spectrum.
NVIDIA ARM SoCs
                      | Xavier                    | Parker                                    | Erista (Tegra X1)
CPU                   | 8x NVIDIA Custom ARM      | 2x NVIDIA Denver + 4x ARM Cortex-A57      | 4x ARM Cortex-A57 + 4x ARM Cortex-A53
GPU                   | Volta, 512 CUDA Cores     | Pascal, 256 CUDA Cores                    | Maxwell, 256 CUDA Cores
Memory                | ?                         | LPDDR4, 128-bit Bus                       | LPDDR3, 64-bit Bus
Video Processing      | 7680x4320 Encode & Decode | 3840x2160p60 Decode, 3840x2160p60 Encode | 3840x2160p60 Decode, 3840x2160p30 Encode
Transistors           | 7B                        | ?                                         | ?
Manufacturing Process | TSMC 16nm FinFET+         | TSMC 16nm FinFET+                         | TSMC 20nm Planar
So what's Xavier? In a nutshell, it's the next generation of Tegra, done bigger and badder. NVIDIA is essentially aiming to capture much of the complete Drive PX 2 system's computational power (2x SoC + 2x dGPU) on a single SoC. This SoC will have 7 billion transistors – about as many as a GP104 GPU – and will be built on TSMC's 16nm FinFET+ process. (To put this in perspective, at GP104-like transistor density, we'd be looking at an SoC nearly 300mm² in size.)
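For those curious how that estimate shakes out, the back-of-the-envelope math is simple. A rough sketch, assuming GP104's commonly cited figures of roughly 7.2 billion transistors in a ~314mm² die:

```python
# Rough die-size estimate for Xavier, assuming GP104-like transistor density.
# GP104 (GTX 1080): ~7.2B transistors in a ~314 mm^2 die (commonly cited figures).
gp104_transistors = 7.2e9
gp104_area_mm2 = 314.0

density_per_mm2 = gp104_transistors / gp104_area_mm2  # ~22.9M transistors per mm^2

xavier_transistors = 7e9
xavier_area_mm2 = xavier_transistors / density_per_mm2

print(f"Implied Xavier die size: {xavier_area_mm2:.0f} mm^2")  # ~305 mm^2
```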
Under the hood NVIDIA has revealed just a bit of information about what to expect. The CPU will be composed of 8 custom ARM cores. The name "Denver" wasn't used in this presentation, so at this point it's anyone's guess whether this is Denver 3 or another new design altogether. Meanwhile on the GPU side, we'll be looking at a Volta-generation design with 512 CUDA Cores. Unfortunately we don't know anything substantial about Volta at this time; the architecture was pushed back on NVIDIA's previous roadmaps in favor of Pascal, and as Pascal only launched in the last few months, NVIDIA hasn't said anything further about its successor.
Meanwhile NVIDIA's performance expectations for Xavier are significant. As mentioned before, the company wants to condense much of Drive PX 2 into a single chip. With Xavier, NVIDIA wants to get to 20 Deep Learning Tera-Ops (DL TOPS), a metric for measuring 8-bit integer operations. 20 DL TOPS happens to be what Drive PX 2 can hit, and about 43% of what NVIDIA's flagship Tesla P40 can offer in a 250W card. And perhaps more surprising still, NVIDIA wants to do this all at 20W – 1 DL TOPS per watt – a fraction of what the complete Drive PX 2 system consumes, and a lofty goal given that Xavier is slated for the same 16nm process as Pascal and the Drive PX 2's various processors.
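To show where those figures come from, here is a quick sketch of the arithmetic, assuming the commonly quoted ~47 INT8 TOPS peak for the 250W Tesla P40:

```python
# Back-of-the-envelope comparison of NVIDIA's stated Xavier targets against
# the Tesla P40 (assumed: ~47 INT8 TOPS peak at a 250W TDP).
xavier_tops, xavier_watts = 20.0, 20.0
p40_tops, p40_watts = 47.0, 250.0

print(f"Xavier vs. P40 throughput: {xavier_tops / p40_tops:.0%}")        # ~43%
print(f"Xavier efficiency: {xavier_tops / xavier_watts:.2f} DL TOPS/W")  # 1.00
print(f"P40 efficiency:    {p40_tops / p40_watts:.2f} DL TOPS/W")        # ~0.19
```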
NVIDIA’s envisioned application for Xavier, as you might expect, is focused on further ramping up their automotive business. They are pitching Xavier as an “AI Supercomputer” in relation to its planned high INT8 performance, which in turn is a key component of fast neural network inferencing. What NVIDIA is essentially proposing then is a beast of an inference processor, one that unlike their Tesla discrete GPUs can function on a stand-alone basis. Coupled with this will be some new computer vision hardware to feed Xavier, including a pair of 8K video processors and what NVIDIA is calling a “new computer vision accelerator.”
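As a point of reference, the 8-bit integer operations counted by the DL TOPS metric are the multiply-accumulates at the heart of quantized neural network inferencing. Below is a minimal Python sketch of the kind of 4-way INT8 dot product that GPU instructions such as Pascal's DP4A perform in a single cycle (counted as 8 operations):

```python
import numpy as np

def int8_dot4(a4, b4, acc):
    """4-way INT8 dot product with 32-bit accumulate: 4 multiplies + 4 adds = 8 ops."""
    return acc + int(np.dot(a4.astype(np.int32), b4.astype(np.int32)))

a = np.array([12, -7, 33, 5], dtype=np.int8)
b = np.array([-4, 90, 2, 17], dtype=np.int8)
print(int8_dot4(a, b, 0))  # -527
```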
Wrapping things up, as we mentioned before, Xavier is a far-future product for NVIDIA. While the company is teasing it today, the SoC won't begin sampling until Q4 of 2017, which in turn implies that volume shipments won't happen until 2018. That said, with their new focus on the automotive market, NVIDIA has shifted from an industry of agile competitors and cut-throat competition to one where their customers would like as much of a heads-up as possible. So these kinds of early announcements are likely to become par for the course for NVIDIA.
35 Comments
MrSpadge - Wednesday, September 28, 2016 - link
7 billion transistors for "just" 512 CUDA cores, and 43% the performance of P40? Assuming Pascal CUDA cores this would imply a clock speed of ~4.9 GHz. Nope, I'm pretty sure the additional transistors went either into significantly redesigned SMs (wider CUDA cores) or the accelerators are performing some magic. Quite interesting!
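(For reference, a rough sketch of the arithmetic behind that ~4.9 GHz figure, assuming Pascal-style cores each executing one 8-op DP4A per clock:)

```python
# Implied clock for 20 DL TOPS from 512 Pascal-style CUDA cores,
# assuming one DP4A (4 INT8 multiplies + 4 adds = 8 ops) per core per clock.
target_ops_per_s = 20e12
cores = 512
ops_per_core_per_clock = 8

implied_clock_ghz = target_ops_per_s / (cores * ops_per_core_per_clock) / 1e9
print(f"Implied clock: {implied_clock_ghz:.1f} GHz")  # ~4.9 GHz
```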
Vatharian - Wednesday, September 28, 2016 - link
I vote for magic. In fact, PMoS (Probability Manipulation on Silicon) could be the next big thing in the automotive industry. Besides increasing the apparent performance of the AI behind the wheel, it would help reduce danger in case of collision.
Qwertilot - Wednesday, September 28, 2016 - link
Don't forget the 8 ARM-based cores, which will presumably give a fair chunk of performance.
name99 - Wednesday, September 28, 2016 - link
Only if they are hooked up to a decent memory system. That has been conspicuously lacking in mobile ARM systems so far. Look at the usual ARM scaling --- 4+4 systems like A57+A53 or A72+A53 get about 3x the throughput of a single core.
It's not enough just to have a wider bus, you need more parallelism throughout the memory system, which in turn means you need multiple (at least two) memory controllers, memory buses, and physical memory slivers of silicon. But if there's one thing manufacturers hate, it's adding more pins and more physical slivers of silicon. So I fully expect this to be as crippled as all its predecessors, able to run those three cores for a mighty 3x throughput compared to one core.
As the saying goes, amateurs talk core count, professionals talk memory subsystem.
londedoganet - Wednesday, September 28, 2016 - link
I think you'll have to wait for the nVidia Maximoff (as in Wanda) to see a PMoS implementation. ;-)
hahmed330 - Wednesday, September 28, 2016 - link
I think they have quite definitely increased operations per clock... If you have 4 ops per clock at 2.7GHz on 512 cores, that gives you 5.5 Teraflops or 22.1184 Deep Learning Tera-Ops. All of this at 20 watts, which includes the CPUs, would be just insanely amazing.
Krysto - Wednesday, September 28, 2016 - link
Which by the way is exactly the performance Tesla P4 gets for 75W of power. Strange that Ryan chose to compare Xavier with "43% of P40's performance", when the P4 comparison was much more apt.
p1esk - Wednesday, September 28, 2016 - link
He compared it to P40 because it's currently the fastest Nvidia card when it comes to INT8 performance.
Vatharian - Wednesday, September 28, 2016 - link
Jokes aside, a big part of the performance is probably tailoring the cores for integer operations. Abandoning compatibility with heavy FP would give space for heavy parallelization within a single core, especially with such a high transistor count.
Samus - Wednesday, September 28, 2016 - link
Volta is their first microarchitecture to be built from the ground up for 16nm, so it's likely to have even wider optimization over Pascal than Pascal had over Maxwell.
I'm still as surprised as you are though. 7 billion transistors with 512 cores could mean some epic sized caches too.