Introduction
One day I thought about the performance gap between the first Intel processor and modern machines. Of course, we can try to do some estimations empirically – we know clock rate and how the pipeline is organized and what features intel 4004 CPU has (but it would not be standard FLOPS, because there was no embedded support for float numbers yet). But there are few details: architecture bit width (only 4 bits in comparison with modern 64 bits!), very limited instruction set (it’s missing even basic logical operators like AND or XOR) and peripheral limitations (ROM/RAM accesses).
So I decided to research the subject in practice. After some thinking, I chose π number calculation as a benchmark. After all, even ENIAC did that (in 1949) and achieved a new record for the amount of calculated digits1.

Hardware limitations
Usually, we chose hardware, based on our goals. But in that case, we need to choose an algorithm, based on restrictions that come with intel 4004. So what do we have?

CPU is very basic and its instruction set has very few ALU operations – addition/subtraction of 4-bit operands, inversion (NOT operator), rotation left/right. And … that’s all, folks. No multiplications, division or any other logical operators.

Program counter register is 12-bit wide, so we can address 2^12 bytes (4 KiB). Most instructions are 1-byte instructions, but some of them can take 2 bytes of ROM. It means that our program should not be too long. That’s a pity because we would have to implement a lot of missing arithmetical stuff by ourselves (there is no π calculation algorithm that involves only additions and subtractions).

Each MCS-4 RAM chip (4002) has 4 registers with 20 characters (4-bit). Additionally, there were 4 variants of the chip (actually just two metal options with special pin usage to extend it to 4) and the chip is activated when a data bus contains the corresponding chip number 2. CPU can control up to 8 banks (with a simple 3-to-8 demultiplexer, without it – just 4 banks). So the amount of memory that we can use is 8 banks 4 chips 4 registers 20 characters 4 bits = 10240 bits.

Ugh, not that much…

Algorithm
There are plenty of formulas to calculate π. But based on our limitations (mostly RAM) I chose the spigot algorithm of Stan Wagon and Stanley Rabinowitz. It’s simple enough for implementation – only integer division and we don’t need bignum arithmetic (as for Machin-like formulas). So what is the idea behind that algorithm in simple words (if you don’t want to read their wonderful paper)?”