heaven is real and it runs on a microkernel

2026-02-03

If you take any interest in operating systems, you probably have heard at least one person that keeps reminding you how superior a micro-kernel is. Today that person will be me. But instead of dumping theory and yap, I will show you exactly why I find it the superior architecture.

I spent a lot of time figuring out how exactly I am going to demonstrate this, so I needed my test subjects, I considered the following operating systems and kernels :

linux
serentiyOS
sel4
xv6
minix
helios
and a few more

And I decided to use xv6-riscv, the monolithic scapegoat and sel4, my microkernel beloved for my demonstration. But what am I going to demonstrate? and for what? why?

what, why and how

I have always found the idea that a kernel indeed can be something very tiny and that all your drivers can run in user space. As someone who's understanding of an operating system was defined by UNIX-like operating systems that are monolithic in nature, this idea was fascinating to me. The reason why I am so attached to this concept is because I have had to suffer watching my operating system crash due to a faulty driver or a non vital hardware malfunction. I watched countless blue screens of deaths and countless kernel panics throughout my hacking journey, now for normal people this probably does not matter, as they can always just reboot. But frankly, if you know exactly what you are doing, and your system crashes because of something that you didn't write, but assumed was correct can really ruin your day. Week even.

So I want to implement something that is low level which your user-space program will use, I want to implement this in both xv6-riscv and sel4. The difference will be that the low level thing will be a kernel syscall in the monolithic implementation and a user-space program in a microkernel. The low level thing will be intentionally faulty, however the difference is, since xv6-riscv is making a system call, the kernel panics, but in sel4 just the user-space crashes. That's it, that's the demonstration. Before I get into the details, I just want to say, I only recently learned the details of sel4, so my implementations may not be "the right way to do things", but it proves my point so whatever.

the low-level thing
Here I will introduce you xmap. Think of it as a user-space VMA and a mapper. It should allow a process to:

reserve virtual memory regions without immediately allocating physical memory
map physical pages to those regions on demand
unmap regions to free the physical backing

the thing

typedef struct {
    uint64 start;
    uint64 length;
    int    prot;
} xmap_region_t;

xmap_region_t xmap_reserve(uint64 size);
int xmap_map(xmap_region_t r, int backing);
int xmap_unmap(xmap_region_t r);

how I implemented this in sel4
sel4 already has a very low-level VM system, which is capability-based, normally to map you need to juggle around capabilities, page tables and untyped memory. This turned out to be harder than I thought, because of course this was my first time dealing with capabilities first hand. The docs were surprisingly good for something so low-level.

I want a program that implements a virtual memory abstraction, which is kinda faulty, in userspace. I decided to call it xmap for no reason at all. I start with some of the core data structures, which are simple for the sake of demonstrations.

xmap_region_t: represents a reserved vm region
vma_entry_t: tracks virtual memory reservations
frame_entry_t: tracks physical frames backing virtual pages
paging_struct_t: tracks paging structs created on demand

Now time for the sel4 experience, which gave me alot of trouble. We start by bootstrapping and setup capability allocations, this is just boilerplate stuff, this is where it really starts,

static void xmap_init(void) {
    if (xmap_initialized) return;

    bootinfo = platsupport_get_bootinfo();
    if (!bootinfo) return;

    next_free_slot = bootinfo->empty.start;
    current_untyped_idx = 0;
    memset(vma_table, 0, sizeof(vma_table));
    memset(frame_table, 0, sizeof(frame_table));

    xmap_initialized = 1;
}

There is nothing special happening in here, it just starts with a clean VMA and frame tables, now this is where the true implementation lies, and this will be the faulty thing that will crash.

int xmap_map(xmap_region_t *r, int prot) {
    if (!r || r->length == 0) return -1;
    if (!xmap_initialized) xmap_init();

    int vma_slot = -1;
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (vma_table[i].in_use && vma_table[i].start == r->start) {
            vma_slot = i; break;
        }
    }
    if (vma_slot < 0) return -1;

    /* BUG: No check for already mapped. Calling xmap_map() twice leaks frames. */

    size_t num_pages = r->length / PGSIZE;

    for (size_t i = 0; i < num_pages; i++) {
        uintptr_t vaddr = r->start + (i * PGSIZE);

        int frame_slot = -1;
        for (int j = 0; j < MAX_PAGES; j++) {
            if (!frame_table[j].in_use) { frame_slot = j; break; }
        }
        if (frame_slot < 0) return -1;

        seL4_CPtr frame_cap = allocate_frame_cap();
        if (frame_cap == seL4_CapNull) return -1;

        map_frame_at_vaddr(frame_cap, vaddr, prot);

        frame_table[frame_slot].frame_cap = frame_cap;
        frame_table[frame_slot].vaddr = vaddr;
        frame_table[frame_slot].in_use = 1;
    }

    r->prot = prot;
    vma_table[vma_slot].mapped = 1;

    return 0;
}

All it does it take a reserved region r and map physical frames to it, It assigns actual memory and sets access permissions.

The big deal here is that when you call xmap_map(&region, prot) it will allocate new physical frames, it maps them into the virtual addresses in the region, then we mark it mark it as mapped. But we do something stupid here, we never actually check if the region was already mapped, so we allocate new frame in the same virtual address, the frames never get freed and wowe look at that we have our very own memory leak.

This was just some implementation details for sel4, now I want to explain what I did for xv6-riscv

xmap as a syscall for xv6-riscv

Adding a syscall to xv6-riscv is something I have done before, so this was fairly trivial compared to implementing xmap for sel4, this operating system has a special place in my heart for teaching me kernel development. I want to talk about exactly how I went about doing this but I also want to keep it short.

So I add the traps for the new syscall, usual copy paste stuff, then I define the structures for this. Since it is a kernel syscall, I had to hack in a bunch of new things for this to work, so basically I introduce VMA, fairly trivial

vma_init: just setting metadata of the entries
vma_find_free: returns the index of the first free p->vmas (need it for xmap_reserve)
vma_find_by_addr: look up a VMA by its starting address (need it for mapping and unmapping)
vma_overlaps: checks if a new VMA will overlap with existing VMA

Then I implement the xmap system call functions. The implementations are not 1:1 equivalent of course, but they do the same thing. It even comes with the same bug !!!

int
xmap_map(xmap_region_t r, int backing)
{
    struct proc *p = myproc();
    char *pa;
    uint64 va;

    if(r.start == 0 || r.length == 0) {
        return -1;
    }

    if(r.start % PGSIZE != 0 || r.length % PGSIZE != 0) {
        return -1;
    }

    int slot = vma_find_by_addr(p, r.start);
    if(slot < 0) {
        return -1;
    }

    // BUG: Missing check if already mapped (should check p->vmas[slot].prot != 0)
    // This allows double-mapping which will leak memory

    int perm = backing | PTE_U;

    uint64 npages = r.length / PGSIZE;
    for(uint64 i = 0; i < npages; i++) {
        va = r.start + i * PGSIZE;

        pa = kalloc();
        if(pa == 0) {
            if(i > 0) {
                uvmunmap(p->pagetable, r.start, i, 1);
            }
            return -1;
        }

        memset(pa, 0, PGSIZE);

        if(mappages(p->pagetable, va, PGSIZE, (uint64)pa, perm) != 0) {
            kfree(pa);
            if(i > 0) {
                uvmunmap(p->pagetable, r.start, i, 1);
            }
            return -1;
        }
    }

    p->vmas[slot].prot = backing;

    return 0;
}

So basically in both of the xmap implementations, I never check if its already mapped.

the system level implementations are done, now what?
We write a custom allocator on top of it, of course, what else? At this point I have implemented the low level thing, of course, one of them runs on kernel another in userland. Both of them are faulty. Now lets pretend I don't know that they are faulty, lets pretend I am an innocent userland hacker who assumes the correctness of the xmap implementations.

the innocent userland developer experience
Now, the innocent userland developer writes his arena allocator, correctly. He uses the xmap_map with the assumption that it is a correct implementation and wont crash on him.

int
alloc_init(void)
{
    if(allocator.initialized) {
        return 0;
    }

    if(xmap_reserve(ARENA_SIZE, &allocator.region) < 0) {
        printf("alloc_init: xmap_reserve failed\n");
        return -1;
    }

    if(xmap_map(&allocator.region, XMAP_READ | XMAP_WRITE) < 0) {
        printf("alloc_init: xmap_map failed\n");
        return -1;
    }

    unsigned long npages = ARENA_SIZE / PGSIZE;
    allocator.free_list = (free_page_t *)allocator.region.start;

    free_page_t *current = allocator.free_list;
    for(unsigned long i = 0; i < npages - 1; i++) {
        free_page_t *next_page = (free_page_t *)((char *)current + PGSIZE);
        current->next = next_page;
        current = next_page;
    }
    current->next = 0;

    allocator.initialized = 1;
    printf("Allocator initialized: %ld pages available\n", npages);

    return 0;
}

And now, he tries doing something with his new alloc function.

    void *p1 = page_alloc();
    void *p2 = page_alloc();

    printf("Allocated: %p, %p\n", p1, p2);

    char *c1 = (char *)p1;
    char *c2 = (char *)p2;
    c1[0] = 'A';
    c2[0] = 'B';

    printf("Data: %c, %c\n", c1[0], c2[0]);

    page_free(p1);
    page_free(p2);

    printf("Freed pages\n\n");

It works perfectly, Just like how he expected. But our userland developer is a hacker and wants to play around with things, he wants to learn, he wants to fuck around and find out!

so he tries this,

    xmap_region_t region;
    xmap_reserve(PGSIZE * 2, &region);
    xmap_map(&region, XMAP_READ | XMAP_WRITE);
    printf("First map OK\n");
    xmap_map(&region, XMAP_READ | XMAP_WRITE);
    printf("Second map OK (leaked first mapping)\n");

He compiles his program, not sure what exactly would happen, usually when he tries playing around, he is greeted with a kind SIGSEGV, but that doesnt matter, he can always find the trace with gdb and fix it, so decided to run it anyways.

$ xmaptest
=== xmap allocator demo ===

Allocator initialized: 64 pages available
Allocated: 0x0000003FEFFBE000, 0x0000003FEFFBF000
Data: A, B
Freed pages

Triggering double-map bug...
First map OK
panic: mappages: remap

Unfortunately for our innocent userland developer, he is using an operating system with a monolithic kernel, so this crashed his entire operating system and now he is forced to reboot.

what if he was using a microkernel
I spent way too much time on this, but I learned alot about sel4 so it was worth it. So the way I demonstrate this for sel4 is by basically using their hello world tutorial repo and hacking it to include the xmap, the alloc and the userland program in a single C file. I should have done this properly by setting up a userspace with a simple shell or whatever but instead I just "simulate" userland by using threads.

static void worker_thread_main(void) {
    printf("[WORKER] Started\n");

    void *pages[5];
    for (int i = 0; i < 5; i++) {
        pages[i] = page_alloc();
        if (pages[i]) ((int *)pages[i])[0] = i * 100;
    }
    printf("[WORKER] Allocated 5 pages\n");

    for (int i = 0; i < 5; i++) page_free(pages[i]);
    printf("[WORKER] Freed pages\n");

    xmap_region_t bug_region = xmap_reserve(PGSIZE * 4);
    xmap_map(&bug_region, PROT_READ | PROT_WRITE);
    int *ptr = (int *)bug_region.start;
    ptr[0] = 0xDEADBEEF;
    printf("[WORKER] First map: wrote 0xDEADBEEF\n");

    printf("[WORKER] Second map (BUG: leaks 4 frames)\n");
    xmap_map(&bug_region, PROT_READ | PROT_WRITE);
    printf("[WORKER] Value now: 0x%x (old data lost)\n", ptr[0]);

    printf("[WORKER] Crashing...\n");
    volatile int *bad = (volatile int *)0x0;
    *bad = 42;

    while (1) seL4_Yield();
}

This is my worker thread, which will crash because xmap_map is faulty. The main thread creates this worker and reports that it faulted, this little report is actually just the program humble bragging about how it is still alive, unlike how the xv6 equivalent died.

and yet it lives

=== seL4 xmap Fault Isolation Demo ===

[MAIN] Worker thread created
[MAIN] Waiting for worker fault...
[WORKER] Started

[MAIN] Worker faulted (type=5)
[MAIN] Main thread still alive - fault isolation works!
[MAIN] Kernel stable. Demo complete.

This was my attempt as demonstrating why implementing the same thing in different kernel architectures can lead to completely different experiences in userland when theres a faulty low level program that you have no control over. sel4isolates the fault to my worker, and since the fault was in a userland program completely isolated from the kernel, the kernel does not fucking care that it crashed, rest of the system can keep on running like usual, but when a system call is faulty, like in the xv6-riscv implementations, your kernel panics. And I fucking hate that.

If the gods ever design an operating system, you can be sure it will be a microkernel.

FULL SOURCE CODE HERE

You can try to play around with it, maybe fix the bugs or something.