In collaboration with researchers at the National Energy Technology Laboratory (NETL), Cerebras showed that a single wafer-scale Cerebras CS-1 can outperform one of the fastest supercomputers in the US by more than 200X. The problem was to solve a large, sparse, structured system of linear equations of the sort that arises in modeling physical phenomena, such as fluid dynamics, using a finite-volume method on a regular three-dimensional mesh. Solving these equations is fundamental to such efforts as forecasting the weather; finding the best shape for an airplane’s wing; predicting the temperatures and the radiation levels in a nuclear power plant; modeling combustion in a coal-burning power plant; and making pictures of the layers of sedimentary rock in places likely to contain oil and gas.
Three key factors enabled this speedup:
The memory performance of the CS-1.
The high bandwidth and low latency of the CS-1’s on-wafer communication fabric.
A processor architecture optimized for high-bandwidth computing.
We will be presenting this work at the supercomputing conference SC20 this month. A paper describing it has already been uploaded to arXiv. This work opens the door for major breakthroughs in scientific computing performance. For certain classes of supercomputing problems, wafer-scale computation overcomes current barriers, enabling real-time and faster-than-real-time performance, as well as applications that would otherwise be precluded by the failure of strong scaling in current high-performance computing systems.
HPC on the CS-1
Cerebras has built the world’s largest chip. At 72 square inches (462 cm²), it is the largest square that can be cut from a 300 mm wafer.
The chip is about 60 times the size of a large conventional chip like a CPU or GPU. It was built to provide a much-needed breakthrough in computer performance for deep learning. We have delivered CS-1 systems to customers around the world, where they are providing an otherwise impossible speed boost to leading-edge AI applications in fields ranging from drug design and astronomy to particle physics and supply chain optimization, to name just a few.
In addition to this, we have done something a bit surprising. We used the CS-1 to do sparse linear algebra of the kind typically used in computational physics and other scientific applications. Working with colleagues at NETL, a Department of Energy research center in West Virginia, we took a key component of their software for modeling fluid bed combustion in power plants and implemented it on the Cerebras CS-1.
The result? Using the wafer, we achieved performance more than 200 times faster than that of NETL’s Joule 2.0 supercomputer, a cluster with 84,000 CPU cores. The Joule supercomputer is the 24th fastest in the U.S. and the 82nd fastest in the world. In this post we explain what we did and how we did it, as well as the specific aspects of the wafer that made it possible. We also discuss some possible implications of this work.
HPC Model of a Power Plant
Imagine the interior of a power plant’s combustion chamber, which is a rectangular space with some height, width, and depth. We want to know the details of temperature, chemical concentrations, and fluid (air) movement throughout the chamber, over time, as fuel burns. Now think about subdividing the space into little rectangular cells—like stacks of sugar cubes. Let’s say that we have filled the chamber with a 600 × 600 × 1500 packing of 540 million little cells. We will model the physics by keeping track of the relevant quantities, like temperature, fluid velocity, partial pressure of oxygen, etc., with one average value in each cell. This is called a finite volume method.
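To make that concrete, here is a minimal Python sketch (our illustration only; the grid shape comes from the text, while the variable names and shrunken dimensions are assumptions for readability):

```python
import numpy as np

# Grid shape from the text: 600 x 600 x 1500 cells (540 million).
# We shrink each dimension here so the sketch runs in ordinary memory
# (60 * 60 * 150 = 540 thousand cells).
nx, ny, nz = 60, 60, 150

# Finite volume: one averaged value per cell for each physical quantity.
temperature = np.zeros((nx, ny, nz))
velocity = np.zeros((nx, ny, nz, 3))  # three velocity components per cell
oxygen = np.zeros((nx, ny, nz))       # e.g., partial pressure of oxygen

# When assembling the linear system, cell (i, j, k) is usually mapped
# to a single flat row index:
def flat_index(i, j, k):
    return i + nx * (j + ny * k)
```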
We use the nonlinear Navier-Stokes equations that describe the motion of the fluid to set up a system of linear equations that captures the fluid’s instantaneous dynamics. The solution to the linear equations gives updated physical quantities in each cell for use in the next time-step. In this way, the computational model simulates the operation of the power plant, starting from some known initial state, as time moves forward.
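Schematically, every time-step assembles a linear system from the current state and solves it. The toy Python below substitutes a simple implicit 1-D diffusion model for the real combustion physics, purely to show that solve-per-time-step structure:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Stand-in physics: implicit time steps of 1-D diffusion. Each step
# turns the current state into the right-hand side b and solves A x = b,
# mirroring the solve-per-time-step structure of the combustion model.
n = 100
alpha = 0.1  # dt / dx**2, the implicit diffusion coefficient

main = (1 + 2 * alpha) * np.ones(n)
off = -alpha * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

u = np.exp(-((np.linspace(0, 1, n) - 0.5) ** 2) / 0.01)  # initial state
for step in range(10):
    u = spla.spsolve(A, u)  # next state; at scale an iterative solver is used
```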
Most computation time on a supercomputer goes toward solving systems of linear equations. In the language of computational science and linear algebra, we solve Ax = b, where A is a given square matrix, b is a given vector, and x is a vector of unknowns. In computational science, there may be billions of equations and unknowns. The matrix A is quite special: it is exceedingly sparse, with only a handful of nonzero elements per row or column.
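To see just how sparse A is, this short SciPy sketch (ours, not NETL’s code) assembles the classic 7-point stencil matrix for a small 3-D grid and confirms that no row holds more than seven nonzeros:

```python
import numpy as np
import scipy.sparse as sp

# 7-point stencil on an n x n x n grid: each cell couples only to itself
# and its six face neighbors, so rows of A have at most 7 nonzeros.
n = 20
I = sp.identity(n)
T = sp.diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n))

# 3-D Laplacian as a Kronecker sum, acting on n**3 = 8000 unknowns.
A = (sp.kron(sp.kron(T, I), I)
     + sp.kron(sp.kron(I, T), I)
     + sp.kron(sp.kron(I, I), T)).tocsr()

nonzeros_per_row = np.diff(A.indptr)
print(A.shape, nonzeros_per_row.max())  # (8000, 8000) 7
```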
Because of the sparsity and size of the matrix, iterative methods are frequently used to solve these systems. Often, the specific iterative method used is from the class of Krylov subspace methods. NETL uses one of these, called the Biconjugate Gradient Stabilized method, or BiCGSTAB.
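SciPy ships an implementation of this method, so a minimal illustration (on a small made-up test system, not NETL’s) looks like this:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import bicgstab

# Solve A x = b with BiCGSTAB on a small, diagonally dominant
# tridiagonal test matrix (a toy system, not NETL's).
n = 1000
A = sp.diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = bicgstab(A, b)  # info == 0 signals convergence
print(info, np.linalg.norm(A @ x - b))
```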
Wafer-Scale Supercomputer
The Cerebras CS-1 is the world’s first wafer-scale computer system. It is 26 inches tall and fits in a standard data center rack. It is powered by a single Cerebras wafer (Figure 1). All processing, memory, and core-to-core communication occur on the wafer. The wafer holds almost 400,000 individual processor cores, each with its own private memory, and each with a network router. The cores form a square mesh. Each router connects to the routers of the four nearest cores in the mesh. The cores share nothing; they communicate via messages sent through the mesh.
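As a loose illustration of what that topology supports (our sketch, not the CS-1’s actual programming interface), here is one mesh-wide nearest-neighbor exchange written in NumPy:

```python
import numpy as np

# Toy picture of nearest-neighbor communication on a 2-D mesh (not the
# actual CS-1 programming model): every point is updated using only the
# values held by its four mesh neighbors.
grid = np.random.rand(8, 8)
north = np.roll(grid, 1, axis=0)
south = np.roll(grid, -1, axis=0)
west = np.roll(grid, 1, axis=1)
east = np.roll(grid, -1, axis=1)

# One exchange-and-update step. (np.roll wraps around at the edges;
# a real mesh would treat boundary cores separately.)
grid = 0.25 * (north + south + east + west)
```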