NVIDIA @ ICML 2015: CUDA 7.5, cuDNN 3, & DIGITS 2 Announced
by Ryan Smith on July 7, 2015 4:00 AM EST

Taking place this week in Lille, France is the 2015 International Conference on Machine Learning, or ICML. Now in its 32nd year, the annual event is one of the major international conferences focused on machine learning. Coinciding with the conference are a number of machine learning announcements, and with NVIDIA now heavily invested in machine learning as part of their 2015 Maxwell and Tegra X1 initiatives – with a specific focus on deep neural networks – NVIDIA is at the show this year to make some announcements of their own.
All told, NVIDIA is announcing new releases for three of their major software libraries/environments: CUDA, cuDNN, and DIGITS. While NVIDIA is primarily in the business of selling hardware, the company has for some time now focused on the larger GPU compute ecosystem as a whole as a key to their success. Putting together useful and important libraries for developers helps to make GPU development easier and to attract developer interest from other platforms. Today's announcements in turn are Maxwell and FP16-centric, with NVIDIA laying the groundwork for neural networks and other half-precision compute tasks which the company believes will be important going forward. Though the company so far has only a single product with a higher-performance FP16 mode – Tegra X1 – it has more than subtly hinted that the next-generation Pascal GPUs will incorporate similar functionality, giving NVIDIA all the more reason to get the software out in advance.
CUDA 7.5
Starting things off we have CUDA 7.5, which is now available as a release candidate. The latest update for NVIDIA's GPU compute platform is a smaller release, as one would expect for a half-version update, and is primarily focused on laying the API groundwork for FP16. To that end CUDA 7.5 introduces proper support for FP16 data, and while non-Tegra GPUs still don't receive a compute performance benefit from using FP16 data, they do benefit from reduced memory pressure. So for the moment NVIDIA is enabling this feature so that developers can take advantage of the reduced memory bandwidth needs and/or fit larger datasets in the same amount of GPU memory.
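As a quick illustration of the pattern this enables – a sketch of our own rather than NVIDIA sample code, with the haxpy kernel name being our invention – CUDA 7.5's new cuda_fp16.h header supplies a half type plus conversion intrinsics, so a kernel can store data at 16 bits while still doing its math at 32 bits:

```cpp
#include <cuda_fp16.h>

// y = a*x + y ("haxpy"), with x and y stored as 16-bit halves in GPU memory.
__global__ void haxpy(int n, float a, const half *x, half *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Loads and stores move half as many bytes as an FP32 version would,
        // but the arithmetic itself still runs through the FP32 ALUs.
        float xf = __half2float(x[i]);
        float yf = __half2float(y[i]);
        y[i] = __float2half(a * xf + yf);
    }
}
```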
Meanwhile CUDA 7.5 is also introducing new instruction level profiling support. NVIDIA’s existing profiling tools (e.g. Visual Profiler) already go fairly deep, but now the company is looking to go one step further in helping developers identify specific code segments and instructions that may be holding back performance.
cuDNN 3
NVIDIA's second software announcement of the day is the latest version of the CUDA Deep Neural Network library (cuDNN), NVIDIA's collection of GPU-accelerated neural networking functions, which is now up to version 3. Going hand-in-hand with CUDA 7.5, a big focus of cuDNN 3 is support for FP16 data formats on existing NVIDIA GPUs, in order to allow for more efficient memory and memory bandwidth utilization, and ultimately larger data sets.
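To give a sense of what this means in practice – the following is our own sketch against the cuDNN v3 API, not NVIDIA sample code, with error checking omitted for brevity – opting into FP16 storage amounts to describing tensors with the library's half-precision data type:

```cpp
#include <cudnn.h>

int main()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);

    // A batch of 128 three-channel 224x224 images, stored as FP16.
    // Relative to CUDNN_DATA_FLOAT this halves the memory footprint and
    // the bandwidth consumed whenever the tensor is read or written.
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                               128, 3, 224, 224);

    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    return 0;
}
```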
Meanwhile, separate from the FP16 optimizations, cuDNN 3 also includes optimized routines for Maxwell GPUs to speed up overall performance. NVIDIA tells us that FFT convolutions and 2D convolutions have both been added as optimized functions here, and the company is touting up to a 2x increase in neural network training performance on Maxwell GPUs.
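Tying the two announcements together, frameworks also don't need to hand-pick the new routines. In another sketch of ours (arbitrary example dimensions, error checking again omitted), cuDNN can be asked to choose the fastest algorithm for a given convolution, and on a Maxwell GPU that choice may now resolve to the new FFT path:

```cpp
#include <cudnn.h>
#include <cstdio>

int main()
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;
    cudnnCreateTensorDescriptor(&xDesc);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnCreateFilterDescriptor(&wDesc);
    cudnnCreateConvolutionDescriptor(&convDesc);

    // 64 5x5 filters over a 128x3x32x32 input batch, stride 1, 'same' padding.
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               128, 3, 32, 32);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, 64, 3, 5, 5);
    cudnnSetConvolution2dDescriptor(convDesc, 2, 2, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               128, 64, 32, 32);

    // Let cuDNN pick the fastest algorithm with no workspace limit; on a
    // Maxwell GPU this can come back as CUDNN_CONVOLUTION_FWD_ALGO_FFT.
    cudnnConvolutionFwdAlgo_t algo;
    cudnnGetConvolutionForwardAlgorithm(handle, xDesc, wDesc, convDesc, yDesc,
                                        CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
                                        0, &algo);
    printf("cuDNN chose algorithm %d (FFT is %d)\n",
           (int)algo, (int)CUDNN_CONVOLUTION_FWD_ALGO_FFT);

    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroy(handle);
    return 0;
}
```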
DIGITS 2
Finally, built on top of CUDA and cuDNN is DIGITS, NVIDIA's middleware for deep learning GPU training. First introduced back in March at the 2015 GPU Technology Conference, NVIDIA is rapidly iterating on the software with version 2 of the package. DIGITS, in a nutshell, is NVIDIA's higher-level neural network software for general scientists and researchers (as opposed to programmers), offering a more complete neural network training system for users who may not be accomplished computer programmers or neural network researchers.
NVIDIA® DIGITS™ Deep Learning GPU Training System
DIGITS 2 in turn introduces support for training neural networks over multiple GPUs, going hand-in-hand with NVIDIA's previously announced DIGITS DevBox (which is built from 4 GTX Titan Xs). All things considered the performance gains from using multiple GPUs are not all that spectacular – NVIDIA is touting just a 2x performance increase in going from 1 to 4 GPUs – though for performance-bound training this nonetheless helps. Looking at NVIDIA's own data, scaling from 1 to 2 GPUs is rather good, but scaling from 2 to 4 GPUs is where the gains slow down, presumably due to a combination of bus traffic and synchronization issues over a larger number of GPUs. Though on that note, it does make me curious whether the Pascal GPUs and their NVLink buses will improve multi-GPU scaling at all in this scenario.
In any case, the preview release of DIGITS 2 is now available from NVIDIA, though the company has not stated when a final version will be made available.
Source: NVIDIA
p1esk - Tuesday, July 7, 2015
FP64 performance is being reduced because NVIDIA currently focuses on a single application: deep learning. For deep learning, 16-bit precision is enough.

Ryan Smith - Tuesday, July 7, 2015
"if Nvidia is concentrating on half-precision workload performance, does that mean 32 and 64 bit performance will be worse?"No. They way they're accomplishing it is by packing two FP16 instructions inside a single FP32 instruction as a Vec2. Currently FP32 ALUs can only run a single FP16 instruction, so this allows a great increase in FP16 performance without adding any more real hardware.
p1esk - Tuesday, July 7, 2015
Ryan, are you implying that a single FP32 ALU will be capable of executing two 16-bit FP operations simultaneously? I don't see how this is possible. Care to explain?

MrSpadge - Wednesday, July 8, 2015
You have to wire things more flexibly; that's why such units need more space & transistors.

p1esk - Wednesday, July 8, 2015
LOL, I'd like more info than "wire things more flexibly".

KhanTengri - Wednesday, July 8, 2015
The NVIDIA DIGITS DevBox uses an ASUS X99-E WS motherboard and supports 4-way GPU configurations (x16/x16/x16/x16) via multiplexing - the CPU provides only 40 PCIe lanes. To achieve better scaling results, another motherboard (dual Xeon) would be preferable ... an ASUS Z10PE-D8 WS or Supermicro X10DRG-Q.