Kernel: the Mechanical Maestro

Tags: Kernel • Linux • Race Condition • Privilege Escalation

This report explores the fundamentals of kernel exploitation, the motivations behind targeting the kernel, and the mechanisms by which attackers exploit vulnerabilities. To situate these concepts in the context, it includes a detailed case study of Dirty COW (CVE-2016-5195), a high-profile Linux kernel vulnerability that allowed unprivileged users to modify read-only files by exploiting a race condition in the kernel's copy-on-write mechanism. Through this case study, the report demonstrates how kernel-level vulnerabilities can be leveraged, what makes them particularly dangerous, and the importance of modern mitigation strategies in defending against such attacks.

// What actually is the Kernel?

At the risk of oversimplifying: the kernel is a consultant that programs go to when they want to do something; it does everything from allocating memory, scheduling which process gets the CPU, deciding whether a file can be opened or a socket can be created to translating high-level calls into hardware instructions. From the perspective of a user process, it is the final authority. If the kernel refuses, the action does not happen. If the kernel allows, it happens with the full backing of the machine.

The Job

Process Scheduling: decides which thread runs next, on which CPU, and under what priority.
Memory Management: carves up virtual address space, manages page tables, enforces isolation, handles copy-on-write, and decides when to page, map, or kill.
System Call Interface: the contract surface between untrusted user code and trusted kernel services. syscall/sysenter in; validated, privileged actions out.
Driver Mediation: turns messy, vendor-specific hardware behavior into stable abstractions: DMA, MMIO, IRP/IOCTL.
Policy Enforcement: access checks, capabilities, SELinux/AppArmor/LSM hooks on Linux; tokens/ACLs/Integrity Levels on Windows.
IPC & Filesystems: pipes, sockets, shared memory, VFS; the boring miracles that make multitasking possible.

Modes & Rings

At the hardware level this authority is enforced by privilege rings. Most of what you run: browsers, editors, games; lives in Ring 3, or user mode, where direct access to hardware is forbidden. The kernel sits in Ring 0, or kernel mode, with full power to manipulate memory tables, interrupt handling, and hardware I/O. The boundary between the two is the system call interface: the carefully controlled entry point through which untrusted user code asks for trusted services. Crossing that boundary is routine (every time a process opens a file or sends a network packet it happens) but it is also perilous. A flaw in validation here turns into a direct bridge from unprivileged to absolute control.

What the Kernel is Not

It is important not to stretch the definition too far (a common conceptual mistake). The kernel is not the absolute lowest layer of the system. Beneath it lie the boot firmware, which initializes the machine, and sometimes a hypervisor that virtualizes hardware for guest systems. These sit conceptually at "Ring -1", with their own specialized threats (such as bootkits; a topic I'll review in depth in a future article), that may or may not be delivered through kernel exploitation. Nor is the kernel every daemon or background service your operating system launches. Its definition is very precise: the trusted core of the OS and its loadable modules or drivers, executing with the highest privilege and answerable to no higher authority. Drivers deserve special mention because while they extend the kernel's reach to hardware, they also expand its attack surface (signed or not, they often carry bugs of their own).

Why this Architectural Split Exists

Separating user mode from kernel mode brings two essential benefits: safety and performance. On the safety side, user processes can crash, leak memory, or spin out of control without dragging the whole system down. On the performance side, the kernel keeps exclusive authority over critical resources, preventing the anarchy that would follow if every application could directly drive the disk or scribble over memory. Returning to the maestro analogy, the kernel ensures that even if one instrument falters, the orchestra keeps playing in harmony anyway.

// Why Attack the Kernel?

Despite obviously being a prime target, I think it would be good to explain exactly why do attackers go after the kernel through 4 main lens.

Persistence

The first reason is persistence. Malware in user space lives at the mercy of reboots, updates, and endpoint detection agents that can quarantine or kill its processes. Kernel-level implants, however, survive across system restarts, often embedding themselves as drivers or patched system components. With the right registry keys or module-loading hooks, they can quietly reload every time the machine powers up.

Escalation

The second is privilege escalation. A low-privileged foothold (code execution as a common user) can be elevated directly to SYSTEM or root if the kernel is vulnerable. Suddenly an attacker who could only read a directory can now write in it, disabling protections, tampering with security logs or even creating new root users. So, with kernel expolits, the privilege escalation does not happen incrementally, as it is mostly done; it happens binarrily.

Stealth

The third is stealth. Once in the kernel, malicious code has the ability to hide its tracks with far more sophistication than in userland. Processes can be made invisible to task managers, files can be hidden from explorers, and hooks can alter what monitoring agents "see".

Escape

And finally, there is escape. Many environments today rely on sandboxes, virtual machines, or containers to confine potentially dangerous code. The kernel, however, is the ultimate boundary. Exploiting it can allow malicious code to break free of its sandbox, reaching the host operating system or neighboring workloads. In cloud and multi-tenant contexts, that kind of breakout turns from a local compromise into a systemic one.

// How are Kernels Exploited

Main vulnerability families

Stack/Heap Buffer Overflows

One classic flaw is the buffer overflow, where the kernel reads or writes more data than a buffer was designed to hold. In kernel space, the consequences are more dire than in user space: overflows can smash critical system-control structures, redirect execution, or even give the attacker direct access to arbitrary addresses.

Use-After-Free (UAF)

Another is the use-after-free, where memory that has been released is still accessible through a dangling pointer. If an attacker can allocate new data into that same memory region, they can hijack the kernel's control flow or overwrite sensitive fields.

Race Conditions (TOCTOU)

Race conditions are equally dangerous. A common form is time-of-check to time-of-use (TOCTOU), where the kernel validates a resource at one moment (t1, for example) and then acts on it at another (t3), leaving a window (t2) for an attacker to swap the already checked file for any other (which won't be checked before execution).

Null Pointer Dereferences

It sounds trivial, but if the kernel dereferences a null pointer and attackers can map page zero (or close to it), they can smuggle their own structures into the dereference path. Modern mitigations often forbid mapping null pages, but this class has produced some infamous exploits historically.

Logic Bugs

Not all bugs are memory-related. Sometimes the kernel simply gets its logic wrong (failing to enforce a privilege check, or trusting input from user space that should never have been trusted). These mistakes may be less flashy than memory corruption, but they can sometimes be just as effective.

From vulnerability to Exploit

The immediate goal of kernel exploitation is not always necessarily to drop a full blown rootkit in one shot, but oftentimes to acquire the building blocks, commonly being: the ability to read/write arbitrarily in memory (I/O Primitives), or to control execution flow. With those, the attacker can then chain steps together: overwrite security tokens or credential structures, replace function pointers, disable defenses, and finally pivot into SYSTEM.

Exploitation Chain Flowchart

// Mitigations (and How Attackers Bypass Them)

If kernel exploitation is such a powerful weapon, why don't we see every attacker dropping a rootkit every other week? The answer is mitigations. Modern operating systems ship with a plethora of defenses specifically designed to make kernel exploitation harder, riskier, and noisier.

Mitigation Examples

Let's walk through some of the key ones and see how they are usually bypassed:

Kernel Address Space Layout Randomization (KASLR)

KASLR scrambles where the kernel and its structures live in memory, denying attackers the ability to hardcode addresses of functions and sensitive structures. If you don't know where the kernel is or will be, you can't reliably redirect execution from another program into it.

Bypass: KASLR is only as strong as the absence of leaks. A single leak of a function pointer, stack address, or symbol table can de-randomize the layout entirely (as nothing in computer science is completely random). Attackers often chain an info leak bug with a memory corruption bug to gain precision.

SMEP and SMAP (Supervisor Mode Execution/Access Prevention)

Intel introduced Supervisor Mode Execution Prevention (SMEP) and Supervisor Mode Access Prevention (SMAP) to limit what code in Ring 0 can do. SMEP blocks the kernel from executing instructions in user space, while SMAP prevents it from directly reading or writing user space memory. These are meant to stop an attacker from simply jumping back into userland shellcode or shoving fake data into kernel decision paths.

Bypass: Attackers respond with Return-Oriented Programming (ROP), stitching together legitimate snippets of kernel code ("gadgets") into a payload that never leaves Ring 0 into userland. Alternatively, some kernel exploits focus on obtaining arbitrary read/write primitives in kernel memory iself, so that all malicious logic remains in kernel space without triggering SMEP/SMAP checks.

PatchGuard and Kernel Integrity Checks

On Windows, PatchGuard (Kernel Patch Protection) tries to defend against persistent tampering. It periodically verifies that certain sensitive structures (like system call tables or interrupt descriptors) haven't been altered. If it detects unauthorized changes, it forces a system crash. On Linux, different distributions rely on LSMs (Linux Security Modules), integrity frameworks, or third-party monitoring.

Bypass: Ironically, many attackers sidestep PatchGuard entirely by abusing signed but vulnerable drivers. If Microsoft or a hardware vendor signs a buggy driver, attackers can load it with full kernel privileges and use it as a trampoline for exploitation. Since the driver is trusted by default, PatchGuard doesn't complain; even if the driver is doing insane memory operations. This "bring your own vulnerable driver" (BYOVD) technique is actually one of the most popular real-world bypasses.

Control-Flow Integrity (CFI) and Shadow Stacks

More recent defenses attempt to prevent attackers from arbitrarily hijacking execution. Control-Flow Integrity (CFI) enforces that indirect calls and jumps land only on legitimate targets. Meanwhile, shadow stacks keep a protected copy of return addresses to prevent ROP.

Bypass: CFI itself is very difficult to bypass; but the catch is that it depends on how strictly the compiler enforces the rules. If the kernel allows multiple valid targets, an attacker might still redirect execution to a gadget that's technically legal but malicious in context. Shadow stacks also don't stop data-only attacks (such as modifying security tokens or privilege fields) since those never rely on hijacking control flow.

// Case Study: Dirty COW (CVE-2016-5195)

Few kernel vulnerabilities have captured public attention the way Dirty COW did. Discovered in 2016 but present in Linux since at least 2007 (!!), this bug was nicknamed after its root cause: a critical flaw in the kernel's Copy-On-Write (COW) mechanism.

It was simple, reliable, cross-distro, and devastating: an unprivileged local user could write to read-only files. That meant overwriting critical system files (like /etc/passwd) without root permissions, and then escalating to full root access.

The Code

dirty-cow-code — Dirty COW exploit code example taken from this public repo (comments and imports stripped)

Simply put, this program races two actions against a MAP_PRIVATE (copy-on-write) mapping of a read-only file: one thread continuously calls madvise(..., MADV_DONTNEED) to cause the kernel to drop/repopulate pages, while another thread repeatedly writes new bytes into the process's memory by writing to /proc/self/mem at the mapping address. Because of a race in the kernel's handling of COW / pagecache and madvise, the userland write sometimes ends up contaminating the backing file (the pagecache) and later gets written back to disk (allowing an unprivileged user to change a file they do not have write permission to).

main()

madviseThread(void *arg)

dirty-cow-madviseThread — **`madvise(map, 100, MADV_DONTNEED)`** tells the kernel the pages starting at map (100 bytes) are not needed, so that the kernel will drop them from memory (invalidate the mapping) so future accesses will re-fault and repopulate pages from the file/pagecache (if this doesn't happen, we end up not triggering the race condition, as the writing on the file will be done only one time).

The thread calls madvise in loop to continuously create windows where the kernel is discarding/repopulating pages. That increases the probability of hitting the tiny timing window required by the race.

**Effect:** `MADV_DONTNEED` causes the kernel to drop the current physical page(s) for that virtual range so the next access will get fresh pages (page miss -> get fresh pages -> page miss -> ...).

procselfmemThread(void *arg)

dirty-cow-procThread — **`/proc/self/mem`** is a special file that provides access to the process's own address space. Seeking to a virtual address and writing will cause the kernel to write those bytes into the process memory at that address.

The thread repeatedly seeks to the virtual address map and writes the attacker string there.

Because map is mapped `MAP_PRIVATE` and `PROT_READ`, userland writes to map normally would be blocked or would trigger copy-on-write behavior; but writing through /proc/self/mem is a kernel-mediated path that can modify the process memory under certain conditions.

**Effect:** this thread is the “write” half of the race; here it tries to place data into the mapped virtual address at the exact moment the kernel is performing the madvise/COW activity.

TLDR

dirty-cow-flowchart — Here, we have to keep in sight that the order does not matter too much because the threads are running simmultaneously and *literally a hundred million times each*

Program maps file f into memory using `map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0)`.

Program writes into a copy of the file we mapped into memory`map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0)`.

Kernel page-misses (because we told him to drop the first 100 bytes of the mapped file in memory) here `madvise(map,100,MADV_DONTNEED)` and goes back to the memory to map it again (because we wanted to write in it at the third step).

The write is happening while the kernel is juggling page ownership / COW logic; the kernel executes privileged code that ends up committing user-supplied bytes to the backing file. That privileged code path did not correctly enforce the semantics or hold locks, enabling the unprivileged change.

Simplified Timeline of Events

Initial state: MAP_PRIVATE mapping points to a page in the pagecache that represents the file contents.
Attacker calls /proc/self/mem write: this triggers a kernel code path that writes into the process's address space at map. Under normal semantics a write to a private mapping should create a private copy (COW) so the backing file is not changed.
Concurrently madvise(MADV_DONTNEED): this call causes the kernel to drop the page and (depending on exact timing and internal locking) re-populate/reconnect the mapping to a pagecache page.
Buggy interleaving: due to insufficient synchronization in the kernel's handling of the pagecache / COW / madvise combination, the write that should have ended up in a private anonymous copy sometimes instead gets applied to the pagecache page (i.e., the very page that backs the file). Once the pagecache is dirtied, it may be written back to disk, altering the underlying file even though the user has no file write permission.
Result: the disk file's contents are changed by kernel writeback, enabling modification of read-only files (e.g., replacing root's files) by an unprivileged local user.

Thanks for reading. See ya.

// Back

To Blog

To Main Page