Main Content

FPGA-based real-time fractal generation. Fully pipelined, dynamic resource allocation, up to 18000 MMUL/s. Float matrix math on J1B CPU.

Fractal generation is a popular FPGA design exercise. For example, Stanford university used it as lab assignment already in 2002. It must have been around that time that I saw a similar demo on some trade fair, and had the itch to try my hands on it ever since.

Well, I got around to it eventually.

It took a long time.

An FPGA free-time project adds one unique design challenge: It needs to be fun all along the way.

Sometimes, the way to the finish line gets straight, obvious and boring. Like hiking versus commuting, it’s not just about getting from A to B in the most efficient manner… So you throw in a new idea to make it interesting again. Rinse and repeat.

The “fun factor driven requirements management” eventually evolved along those lines:

Real-time calculation
Full HD resolution (1920x1080) at 60 Hz
Use the FPGA sensibly: The resulting implementation comes close to one operation per multiplier per clock cycle. That’s 18 billion multiplications per second on the 35-size Artix with a USB bus power budget of ~2 Watts.
Perform dynamic resource allocation. The fractals algorithm is somewhat unusual as the required number of iterations varies between points. Compared to simply setting a fixed number of iterations, complexity increases substantially (a random number of results may appear in one clock cycle, results are unordered) but so does performance.
Limit to 18-bit multiplications because it is the native width for Xilinx 6/7 series DSP48 blocks. It is straightforward to increase the internal bitwidth for higher resolution, but resource usage skyrockets.
Be (reasonably) vendor-independent. I decided to go for the open-source “J1B” soft-core CPU instead of e.g. Microblaze MCS, which would have been very straightforward.
Instead of the original “gforth” toolchain for J1B, I decided to use my own simple compiler / assembler “forthytwo.exe” which got cleaned up along the way.
Floating point math for performance-uncritical code on the CPU.
“For when you gaze long into the rabbit hole, the rabbit hole gazes back into you”.
My own “minimal” float implementation doesn’t try to be as refined or safe as IEEE 754, but is small and does a great job so far.
A CPU Bootloader on plain UART (meaning no proprietary Xilinx JTAG). The included bootloader implements robust synchronization and efficient binary upload.
No esoteric tools, be able to run on Windows (again, Linux would be easier). On a clean Windows PC, the build system can be set up by installing MinGW (developer settings), Vivado and Verilator. See my install notes for the latter. Use e.g. Teraterm with the bootloader.
Batteries-included project, intended for reuse by deleting what is not needed (maybe this is more on the microcontroller side, as the fractals part is quite problem-specific)
Ready/valid design pattern notes
The calculation engine relies heavily on the valid/ready handshaking paradigm, which is used consistently throughout the chain.

Here, it is quite essential, for a simple reason: The 200 MHz clock rate of the fractal generator is less than two times the VGA pixel rate. Any “sub-optimal” handshaking scheme that required an idle clock cycle to recover would break the architecture.

Ready / valid combinational path problem
In a typical processing block, data moves in lock-step through a sequence of registers. When cascading any number of such blocks via ready-/valid interfaces, the “ready” signals form a combinational path from the end to the beginning of the chain. This can make timing closure difficult or impossible. The problem is clear to see when considering what happens when the end of the processing chain signals “not ready” (to accept data): The data over the whole length of the pipeline has nowhere to go, therefore the whole chain must be stopped within a single clock cycle.

The solution is to divide the chain into multiple segments using FIFOs (2 slots is enough). There’s a catch: I could design an “optimized” FIFO that will accept data even if full, as long as an element is taken from the output in the same clock cycle. This “optimization” would introduce exactly the combinational path the FIFO is supposed to break, thus it would be useless for decoupling the combinational chain. In other words, the input of the FIFO may not use the output-side “ready” signal combinationally.

Ready / valid flow control
Maybe this is obvious, but the data flow in a ready-/valid chain can be stalled at any point simply by inserting a block that forces both valid and ready signal to zero (not asserted). This block may be combinational and may depend on an observed data value. This pattern is used to stop the calculation engine from running too far ahead of the monitor’s electron beam.

RTL implementation
Clock domain
There are three clock domains:

The calculation engine (most of the design) at 200 MHz. This could be pushed higher but would make the design more difficult to work with. Right now, there is no reason for doing so, and the chip already runs quite hot.
The VGA monitor signals at a pixel frequency of 148.5 MHz
The J1B CPU at 100 MHz, because its critical path is too slow for 200 MHz operation.”

Link to article