Beyond `malloc(size)`: A Deep Dive into glibc's Heap Internals
Every C programmer starts with `malloc`. It’s one of the first functions we learn for dynamic memory management. We call it, we get a pointer, and we move on. But what *really* happens under the hood when you request a piece of memory? What intricate machinery grinds into action to carve out a few bytes from the vast expanse of your system's memory?
This isn’t just an academic question. Understanding malloc's internals is crucial for high-performance computing, debugging complex memory corruption bugs, and grasping the fundamentals of system-level security.
This document peels back the layers of the GNU C Library's malloc implementation, known as ptmalloc2. We’ll move from high-level concepts to the bits and bytes, and we’ll use the GDB debugger to see it all happen in real-time.
The Grand Hierarchy: Arenas, Heaps, and Chunks
To manage memory efficiently, especially in multi-threaded applications, ptmalloc2 organizes the heap into a strict hierarchy. Think of it like a massive, automated parking garage.
- Arena: The entire parking garage building. Each thread gets its own arena to reduce contention. You don't want threads fighting over the same resources, and arenas are the first line of defense against that.
- Heap: A floor within the parking garage. It's a contiguous block of memory managed by a single arena.
- Chunk: An individual parking spot. This is the smallest unit of memory that `malloc` deals with. Your requested memory, plus some metadata, lives inside a chunk.
A helpful mental model is a diagram showing this relationship: An Arena is the building, Heaps are the floors, and Chunks are the individual parking slots.
Arenas: The Memory Traffic Controllers
There are two types of arenas:
- Main Arena: A singleton, created for the main thread of the process.
- Thread Arenas: Created for any subsequent threads.
The system places a hard limit on the number of arenas to prevent resource exhaustion: by default, eight times the number of CPU cores on 64-bit systems (twice the core count on 32-bit). If a new thread starts and the arena limit is reached, it must share an existing arena, waiting on its lock if necessary. This is a critical performance detail—an excessive number of short-lived threads can lead to contention and blocking, not in your application logic, but deep within the memory allocator itself.
How Heaps Grow: sbrk vs. mmap
An arena's heap isn't static; it grows on demand. The mechanism for growth depends on the arena type:
- The Main Arena uses the `sbrk` system call to grow its single heap. `sbrk` works by moving the "program break"—the end of the process's data segment—upwards. It's like annexing adjacent, unused land.
- Thread Arenas use the `mmap` system call. This is more flexible, as it can map new, non-contiguous pages of memory from anywhere in the virtual address space to serve as heaps for the arena. This is why a thread arena can manage multiple heap segments.
The Building Block: The malloc_chunk
Every allocation, every free block, is a chunk. The brilliance of ptmalloc2 lies in the metadata stored within each chunk. This metadata is packed into the malloc_chunk struct.
An allocated chunk is simple: a size header followed by the user's data. A free chunk, however, is repurposed to become a node in a linked list.
| Field | Usage in a Free Chunk |
|---|---|
| `mchunk_prev_size` | Size of the previous adjacent chunk (only valid if that chunk is free). |
| `mchunk_size` | Size of the current chunk, including metadata. |
| `fd` | Forward pointer (points to the next free chunk in a bin). |
| `bk` | Backward pointer (points to the previous free chunk in a bin). |
The Secret in the Size Field: Metadata Flags
Since chunk sizes are always aligned to a multiple of 8 or 16 bytes, the last three bits of the mchunk_size field are unused for the size itself. They are repurposed as flags: |A|M|P|.
- P (PREV_INUSE): If this bit is `1`, the previous adjacent chunk is allocated. If it's `0`, the previous chunk is free, and we can find its size by reading the `mchunk_prev_size` field. This allows for coalescing—merging adjacent free chunks into a single, larger one.
- M (IS_MMAPPED): If `1`, this chunk was allocated via a direct `mmap` call (for very large requests) and is not part of a heap.
- A (NON_MAIN_ARENA): If `1`, this chunk belongs to a thread arena. If `0`, it belongs to the main arena.
The Filing System: How free Organizes Chunks into Bins
When you free() a chunk, it doesn't just vanish. It gets placed into a "bin," which is essentially a free list. When you next call malloc, the allocator checks these bins first before asking the OS for more memory. This is the core of fast and efficient memory reuse.
There are five categories of bins.
1. Tcache Bins (Thread-Local Cache)
This is the first and fastest place the allocator looks. The tcache, introduced in glibc 2.26, is designed for speed in multi-threaded programs.
- Per-Thread: Each thread has its own tcache.
- Structure: An array of 64 singly-linked lists, each for a specific small chunk size.
- No Locking: Because it's thread-local, placing a chunk into the tcache or retrieving one requires no memory locking. This dramatically reduces contention and is a huge performance win.
- LIFO: It works like a stack. The last chunk freed is the first one to be allocated again.
If the tcache is empty for the requested size, malloc moves on to the main heap bins.
2. Fast Bins
These hold small, recently freed chunks. Like the tcache, they operate in a LIFO manner. The key feature of fast bins is that chunks within them are not coalesced. The allocator accepts some memory fragmentation in exchange for speed, assuming these small chunks will be needed again very soon.
3. Unsorted Bin
This is a temporary holding area. When a chunk that's too large for the fast bins is freed, it lands here. The unsorted bin acts as a sort of "last-chance cache." During the next malloc call, the allocator will process the chunks in this bin, either using one to satisfy the request directly or sorting it into the correct small or large bin for later use.
4. Small and Large Bins
These are the final destinations for sorted free chunks.
- Small Bins: Contain chunks of less than 1024 bytes on 64-bit systems. Each bin holds chunks of a single, fixed size. They are doubly-linked and operate FIFO (First-In, First-Out).
- Large Bins: Hold chunks larger than what small bins handle. Each bin holds a range of sizes, and chunks within are sorted by size. This allows `malloc` to find the best-fit chunk rather than just the first-fit.
5. The Last Resort: The Top Chunk
If malloc has searched all the bins and found nothing suitable, it turns to the Top Chunk (also known as the "wilderness"). This is the special chunk at the very end of the heap: the boundary between the used and unused parts of the heap. malloc will simply carve the requested memory out of the Top Chunk. If the Top Chunk isn't big enough, sbrk or mmap is called to extend it, and the allocation proceeds.
Let's See It in Action: A GDB Demonstration
Theory is great, but seeing is believing. We'll use a simple C program and the GDB debugger with the GEF extension to visualize these structures.
You can find the source code for this example here: RootSprout/heaps
Here's the code (heap.c):
```c
#include <stdio.h>
#include <stdlib.h>

int main() {
    printf("Welcome to the heap. Let's allocate some chunks.\n");

    void *a = malloc(30);
    void *b = malloc(30);
    void *c = malloc(30);
    printf("Allocated a: %p\n", a);
    printf("Allocated b: %p\n", b);
    printf("Allocated c: %p\n", c);

    // Freeing in a specific order to observe LIFO in tcache
    free(a);
    free(b);
    free(c);
    printf("Freed a, b, and c.\n");

    // Re-allocate to see tcache in action
    void *d = malloc(30);
    void *e = malloc(30);
    printf("Re-allocated d: %p (should match c's address)\n", d);
    printf("Re-allocated e: %p (should match b's address)\n", e);

    return 0;
}
```
Compile it with debug symbols (gcc -g heap.c -o heap) and let's fire up GDB.
Step 1: Start GDB and inspect the initial state
```bash
gdb ./heap
gef➤ break main
gef➤ run
```
The program is now paused at the start. Let's check the arenas.
```bash
gef➤ heap arenas
```
You'll see the main arena has been created for our main thread.
Step 2: Allocate three chunks

Use the `next` command (`n`) to step past the three `malloc` calls. Now, let's inspect the chunks on the heap.

```bash
gef➤ heap chunks
```
You will see three allocated chunks of size 0x30 (48 bytes each on a typical 64-bit system: the 30-byte request plus the 8-byte size header, rounded up to 16-byte alignment) and the ever-present Top Chunk.
Step 3: Free the chunks and inspect the tcache

Step past the three `free` calls. Where did the memory go? Let's check the bins.

```bash
gef➤ heap bins tcache
```
The output will be stunningly clear:

```
[ TCACHE (per-thread cache) ]
Tcachebins[idx=1, size=0x30] count=3 ← Chunk(addr=0x...) ← Chunk(addr=0x...) ← Chunk(addr=0x...)
```
Our three freed chunks are now sitting in the tcache for the size class, ready for immediate reuse. Notice the order: c was freed last, so it's at the head of the list.
Step 4: Re-allocate and prove reuse

Now, step over the next two `malloc` calls for `d` and `e`. Let's check the final addresses printed by the program. The output will show that `d` was allocated at `c`'s old address, and `e` was allocated at `b`'s old address.
This works exactly as expected. The call to malloc(30) for d was served instantly from the head of the tcache. This is the LIFO behavior of the tcache in action—no kernel interaction, no locking, just raw speed.
Conclusion

From a simple function call emerges a sophisticated system of arenas, heaps, chunks, and bins, all working in concert to manage memory safely and efficiently. We've seen how ptmalloc2 prioritizes speed, especially in multi-threaded contexts with thread-local caches, while also providing robust strategies for managing fragmentation with its various bins.
Understanding this system isn't just trivia. It's the foundation for writing better, faster, and more secure code. The next time you see a memory-related bug, a performance bottleneck, or a security vulnerability, you'll have a much deeper appreciation for the complex dance happening just beneath the surface of that simple call to malloc.