magic is real and its in a C compiler

2026-02-02

A while back I was obsessed with low level programming, I wrote my own little kernel that went nowhere and I learnt all kinds of assembly. During that phase, I learned that unless you are an ffmpeg dev there is no need for you to be hand rolling assembly. Modern compiler do some insane optimizations that seem incomprehensible, is it some kind of math trick or something specific to the hardware architecture? you never even know, but usually its a math trick.

I really wanted to explore optimizations, and the end result was the I was somewhat able to write compiler grade assembly for a project euler problem, specifically the first one. The problem statement is quite simple.
Pasted image 20260202004840.png

even a novice should be able to solve this with much ease, all you do is check if it is a multiple of 3 or 5 and we add.

N = 1000
sum = 0
for i in range(N):
    if i%3 == 0 or i%5 == 0:
        sum+=i

print(sum)

In human-grade assembly, the solution would look something like this:

    .file   "euler1.s"
    .text
    .globl  main
    .type   main, @function

main:
    pushq   %rbp
    movq    %rsp, %rbp

    xorl    %ecx, %ecx        # i = 0
    xorl    %ebx, %ebx        # sum = 0

.loop:
    cmpl    $1000, %ecx
    jge .done

    # check i % 3 == 0
    movl    %ecx, %eax
    cdq
    movl    $3, %esi
    idivl   %esi
    testl   %edx, %edx
    je  .add

    # check i % 5 == 0
    movl    %ecx, %eax
    cdq
    movl    $5, %esi
    idivl   %esi
    testl   %edx, %edx
    jne .next

.add:
    addl    %ecx, %ebx

.next:
    incl    %ecx
    jmp .loop

.done:
    movl    %ebx, %eax        # return sum
    popq    %rbp
    ret

And while this may seem completely unreadable, for anyone familiar with the syntax and the CPU architecture, this bit of code is very intuitive.

what the assembly does

main: it is the entry, what did you expect?
pushq %rbp and movq %rsp, %rbp to setup stack
xorl to initialize a variable to zero
.loop: implements a loop, jge for the end condition
movl %ecx, %eax, cdq, and idivl compute the remainder:
This is where the core of the program lies, you move i into %eax, then sign-extend for division, then you load the divisor into %esi and do your idivl which stores the remainder in %edx
we do this for both 5 and 3
then we add and increment loop counter
once its done, exit.

This bit of assembly is very readable and human-friendly, however if you wrote the equivalent C program and used a modern C compiler, you'd get something incomprehensible.

What is the GCC/Clang magic?

Well the first magic for me was how it avoids using idiv and uses a completely different approach to compute remainder. idiv is usually going to take 20-30 cycles on most modern CPUs, however a multiplication and a bit shift is nearly instant with 3-5 cycles. But how the fuck do we compute remainder with a multiplication and a shift instruction. Well, like this:

imulq $1431655766, %rax
shrq  $32, %rax

That is a magic constant, for any constant divisor d, the compiler will compute a magic number, such that upon multiplication and shift by 32, it returns your quotient.

magic = floor(2^32 / d)
q ≈ (n * magic) >> 32

For some of you wizards this maybe basics of assembly optimizations, but for me this opened a whole new world of optimizations. It gets even more weird, it tries to squeeze out every bit of value. That was mostly it for the project euler problem but here are some other interesting optimizations I have come across.

The even divisor magic spell

Assume your C code has something like this:

int q = n / 6;

The compiler will factor the divisor and rewrite it as

q = (n / 3) >> 1;

and then we use the magic constant again to speed up

magic_3 = floor(2^32 / 3) = 1431655766

In the end we get something much difficult to read, but also significantly faster:

movsxd  %edi, %rax              # rax = n
imulq   $1431655766, %rax, %rax # rax = n * magic_3
shrq    $32, %rax               # rax ≈ n / 3
sarl    $1, %eax                # rax = (n / 3) >> 1 = n / 6

I find it fascinating that compilers can go this far just for basic optimizations.

Since I was so fascinated by this, I decided to look into the compiler source code to see how they did it. And I want to share you a simple snippet from gcc/expmed.cc

unsigned HOST_WIDE_INT
choose_multiplier (unsigned HOST_WIDE_INT d, int n, int precision,
                   unsigned HOST_WIDE_INT *multiplier_ptr,
                   int *post_shift_ptr)
{
  int lgup, post_shift;
  int pow1, pow2;

  /* lgup = ceil(log2(d)) */
  lgup = ceil_log2 (d);
  gcc_assert (lgup <= n);
  gcc_assert (lgup <= precision);

  pow1 = n + lgup;
  pow2 = n + lgup - precision;

  /* mlow = 2^(n + lgup) / d */
  wide_int val = wi::set_bit_in_zero (pow1,
                                      HOST_BITS_PER_DOUBLE_INT);
  wide_int mlow = wi::udiv_trunc (val, d);

  /* mhigh = (2^(n + lgup) + 2^(n + lgup - precision)) / d */
  val |= wi::set_bit_in_zero (pow2,
                              HOST_BITS_PER_DOUBLE_INT);
  wide_int mhigh = wi::udiv_trunc (val, d);

  /* Reduce to lowest terms */
  for (post_shift = lgup; post_shift > 0; post_shift--)
    {
      unsigned HOST_WIDE_INT ml_lo
        = wi::extract_uhwi (mlow, 1,
                            HOST_BITS_PER_WIDE_INT);
      unsigned HOST_WIDE_INT mh_lo
        = wi::extract_uhwi (mhigh, 1,
                            HOST_BITS_PER_WIDE_INT);
      if (ml_lo >= mh_lo)
        break;
      mlow = wi::uhwi (ml_lo,
                       HOST_BITS_PER_DOUBLE_INT);
      mhigh = wi::uhwi (mh_lo,
                        HOST_BITS_PER_DOUBLE_INT);
    }

  *post_shift_ptr = post_shift;

  if (n < HOST_BITS_PER_WIDE_INT)
    {
      unsigned HOST_WIDE_INT mask
        = (HOST_WIDE_INT_1U << n) - 1;
      *multiplier_ptr = mhigh.to_uhwi () & mask;
      return mhigh.to_uhwi () > mask;
    }
  else
    {
      *multiplier_ptr = mhigh.to_uhwi ();
      return wi::extract_uhwi (mhigh,
                               HOST_BITS_PER_WIDE_INT,
                               1);
    }
}

What this magic spell is doing:

computing a fixed point approximation to 1/d

mlow  = 2^(n + lgup) / d
mhigh = (2^(n + lgup) + 2^(n+lgup-precision)) / d

this is just shrinking the candidate range

mlow <= exact(2^(n+lgup) / d) < mhigh

then it finds a multipler m and a right shift such that

x/d == (x * m) >> post_shift

This thing alone was pretty difficult for me to read through, I still dont understand how rest of the magic has been implemented, maybe the GNU coding standards are just ass making everything unreadable or maybe the compiler just implements advanced algorithms that I am unable to comprehend. Regardless, a modern C compiler is truly a marvel of human engineering.