It would have been better if this wasn’t just a science project

Big Blue was one of the system designers that caught the accelerator bug early on, emphatically declaring that in the long run, all types of high-performance computing would have some kind of acceleration – that is, some kind of specialized ASIC to which the CPU offloads its math.

Perhaps IBM is re-learning some lessons from that early HPC era a decade and a half ago, when it created the PowerXCell vector math accelerator and used it in the petaflops-capable “Roadrunner” supercomputer at Los Alamos National Laboratory, and is applying those lessons to the modern age of artificial intelligence.

One can hope, at least just to keep things interesting in the AI arena, that the company takes itself as seriously in at least some form of HPC (which AI training most certainly is) as its IBM Research arm appears to be doing with the new AI acceleration unit it has unveiled.

Not many details behind IBM Research’s AIU have been revealed, and so far all anyone has to go by is some history of IBM’s matrix and vector math units (which are by no means computational slouches), their use of mixed precision, and a blog post talking specifically about the AIU.

The AIU unveiled by IBM Research will be based on a 5 nm process and presumably manufactured by Samsung, which is IBM’s foundry partner for the 7 nm “Cirrus” Power10 processors in its enterprise servers and the Telum processors in its System z16 mainframes. The Power10 chips contain very powerful matrix and vector math units that are an evolution of designs IBM has been using for decades, but the Telum chip uses IBM Research’s third-generation AI Core as its on-chip accelerator for AI inference and low-precision AI training.

The initial AI Core chip, announced in 2018, could do half-precision FP16 math with single-precision FP32 accumulate, and it was instrumental in IBM’s push toward even lower-precision data and processing for neural networks. After creating the AI accelerator for the Telum z16 processor, which we detailed back in August 2021, IBM Research has taken that AI accelerator as a building block and scaled it up on a single device.

Let’s review the AI accelerator on the Telum chip before getting into the new AIU.


On the z16 chip, the AI accelerator consists of 128 processor tiles (PT), likely arranged in a two-dimensional systolic array configuration with a 4 x 4 x 8 arrangement, but IBM hasn’t been clear about that. This systolic array supports FP16 matrix math (and mixed-precision variants) on FP32-accumulating floating-point units. It is explicitly designed to support the matrix and convolution math in machine learning – including not just inference but also the low-precision training that IBM anticipates may happen on enterprise platforms. We think it might also support the FP8 quarter-precision format for AI training and inference, in addition to the INT2 and INT4 formats for AI inference that we saw in the experimental four-core AI Core chip unveiled by IBM Research in January 2021 for compact and edge devices. The Telum AI accelerator also contains 32 complex function (CF) tiles, which support FP16 and FP32 SIMD instructions and are optimized for activation functions and complex operations. The list of supported special functions includes:

  • LSTM activation
  • GRU activation
  • Fused matrix multiply, bias op
  • Fused matrix multiply (broadcast)
  • Batch normalization
  • Fused convolution, bias add, ReLU
  • Max pool 2D
  • Average pool 2D
  • Softmax
  • ReLU
  • Tanh
  • Sigmoid
  • Add
  • Subtract
  • Multiply
  • Divide
  • Min
  • Max
  • Log
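The mixed-precision scheme behind these tiles – FP16 operands with a wider accumulator – can be sketched behaviorally in a few lines of Python. This is our illustrative model of the general technique, not IBM’s implementation; it uses the standard library’s `struct` half-precision format to round values to FP16, with a Python float standing in for the FP32 accumulator:

```python
import struct

def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mixed_precision_dot(a, b):
    """Behavioral sketch of FP16 multiply with wide accumulate.

    Operands are rounded to FP16 before each multiply, but the running
    sum is kept in wider precision (a Python float standing in for the
    FP32 accumulator), which is how this style of processor tile avoids
    losing precision over long accumulations.
    """
    acc = 0.0  # wide accumulator (stands in for FP32)
    for x, y in zip(a, b):
        acc += fp16(x) * fp16(y)  # each partial product uses FP16 inputs
    return acc

# Values exactly representable in FP16 come through unchanged:
print(mixed_precision_dot([1.0, 2.0, 3.0], [0.5, 0.25, 2.0]))  # 7.0
```

The key design point is that only the multiplies run at low precision; the accumulation stays wide, so rounding error does not compound across the length of the dot product.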

The prefetch and write-back units link into the z16 core ring and the L2 cache, and also link into a scratchpad, which in turn links to the AI cores through a data mover and formatting unit that, as the name suggests, formats the data so it can be run through the matrix math unit to do inference and get a result. The prefetcher can read data from the scratchpad at more than 120 GB/s and can store data into the scratchpad at more than 80 GB/s; the data mover can pull data from and push data to the PT and CF cores in the AI unit at 600 GB/s.

On System z16 iron, IBM’s Snap ML framework and Microsoft Azure’s ONNX framework are in production, and Google’s TensorFlow framework has been in open beta for the past two months.

Now imagine that you copied this AI accelerator from the Telum chip and pasted it into a design 34 times:


These 34 cores and their uncore regions for storage, interconnecting the cores, and linking to the outside system have a total of 23 billion transistors. (IBM says there are 32 cores in the AIU, but you can clearly count 34 cores, so we think two of them are there to increase chip yield on devices with 32 usable cores.)

Telum z16 processors weigh in at 5 GHz, but the AIU isn’t likely to run at anything close to that speed.

If you look at the AIU die, it has sixteen I/O controllers of some sort, which are probably generic SerDes that can be used for either memory or I/O (as IBM did with its OpenCAPI interfaces for I/O and memory on the Power10 chip). There also seem to be eight banks of Samsung LPDDR5 memory on the package, which would be a total of 48 GB of memory and provide about 43 GB/s of total bandwidth. If those SerDes are all memory controllers, the memory could be doubled to 96 GB of capacity and 86 GB/s of total bandwidth.
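The memory arithmetic above is a back-of-the-envelope estimate, and it is easy to check. The bank count comes from the die shot; the per-bank capacity and bandwidth figures are our assumptions, not IBM disclosures:

```python
# Back-of-the-envelope math for the AIU package memory, based on our
# reading of the die shot (eight LPDDR5 banks; per-bank capacity and
# bandwidth are our assumptions, not IBM disclosures).
banks = 8
gb_per_bank = 6.0            # 8 x 6 GB = 48 GB total
gbps_per_bank = 43.0 / 8     # ~5.4 GB/s each, implied by 43 GB/s total

capacity_gb = banks * gb_per_bank
bandwidth_gbps = banks * gbps_per_bank

# If all sixteen SerDes blocks turned out to be memory controllers,
# both capacity and bandwidth would double:
print(capacity_gb, bandwidth_gbps)          # 48.0 43.0
print(2 * capacity_gb, 2 * bandwidth_gbps)  # 96.0 86.0
```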

The controller complex at the top of the AIU die is likely a PCI-Express 4.0 controller, but we hope it is a PCI-Express 5.0 controller with CXL protocol support built in.

IBM hasn’t said what kind of performance to expect from the AIU, but we can make some guesses. Back in January 2021, the four-core AI Core chip debuted at the ISSCC chip conference; etched by Samsung at 7 nm and running at 1.6 GHz, it delivered 25.6 teraflops of FP8 training and 102.4 teraops of INT4 inference.

The AIU has 34 cores with 32 of them active, so its performance should be 8X that, assuming the clock speed stays the same (whatever that is), along with 8X the on-chip cache. That works out to 204.8 teraflops for AI training in FP8 and 819.2 teraops for AI inference in INT4, with 64 MB of on-chip cache, in something south of a 400 watt power envelope if implemented at 7 nm. But IBM is implementing it with Samsung at 5 nm, and that probably puts the AIU at around 275 watts.
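Our scaling estimate is simple multiplication from the published four-core numbers, sketched here with the assumptions labeled (the 8 MB base cache figure is our inference from the 8X scaling, and the clock speed is assumed unchanged):

```python
# Scale the four-core AI Core chip's ISSCC numbers up to the AIU's
# 32 active cores, assuming the undisclosed clock speed is unchanged.
# The 8 MB base cache figure is our inference from the 8X scaling.
base_cores, aiu_active_cores = 4, 32
scale = aiu_active_cores // base_cores   # 8X

fp8_training_teraflops = 25.6 * scale    # FP8 AI training
int4_inference_teraops = 102.4 * scale   # INT4 AI inference
on_chip_cache_mb = 8 * scale

print(fp8_training_teraflops, int4_inference_teraops, on_chip_cache_mb)
# 204.8 819.2 64
```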

By comparison, the 350 watt PCI-Express 5.0 version of Nvidia’s “Hopper” GH100 GPU delivers 2 TB/s of bandwidth across 80 GB of HBM3 memory and 3.03 petaflops of FP8 AI training performance with sparsity support.

IBM Research will need AI cores. Lots of AI cores.
