Threading vs Async vs Event Loop
Problem
When your server needs to handle 10,000 concurrent connections, you must choose a concurrency model. The wrong choice wastes CPU, memory, and latency budget. The right choice depends on your workload type (CPU-bound vs I/O-bound) and language runtime.
Common question: "Should I use threads, async/await, or an event loop?"
Answer: "It depends on whether you're CPU-bound or I/O-bound,
and what your language's concurrency primitives actually do."The three models are not competing alternatives — they are complementary tools with different cost structures.
Why It Matters (Latency, Throughput, Cost)
The C10K problem (1999, still relevant):
In 1999, Dan Kegel identified that serving 10,000 concurrent clients with thread-per-connection required 10,000 OS threads. At 1–8MB per thread stack: 10GB–80GB RAM just for stacks. Context switching 10,000 threads: thousands of microseconds per operation.
Modern event-loop and async solutions serve 100,000+ concurrent connections on a single core with megabytes of RAM.
Cost comparison at 10,000 connections:
Thread-per-connection (10,000 threads):
Stack memory: 10,000 × 1MB = 10GB
Context switches: ~50,000/second × 10μs = 500ms CPU/second (50% of a core!)
Async/coroutine (10,000 tasks):
Stack equivalent: 10,000 × 2KB = 20MB (500× less)
Context switches: ~50,000/second × 0.1μs = 5ms CPU/second (0.5% of a core)
Savings: 500× less memory, 100× less CPU overhead for I/O-bound workloads.Mental Model
Threading (OS kernel manages):
┌─────────────────────────────────────┐
│ OS Scheduler │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │Thr1 │ │Thr2 │ │Thr3 │ │Thr4 │ │
│ │(run)│ │(wait│ │(wait│ │(run)│ │
│ │ │ │ I/O)│ │lock)│ │ │ │
│ └─────┘ └─────┘ └─────┘ └─────┘ │
│ CPU1 CPU2 │
└─────────────────────────────────────┘
Preemptive: OS can interrupt any thread at any time
Async/Event Loop (application manages):
┌─────────────────────────────────────┐
│ Event Loop (single thread) │
│ │
│ event_queue: [sock1_readable, │
│ sock3_writable, │
│ timer_expired] │
│ │
│ while True: │
│ event = poll_io() # epoll() │
│ callback = handlers[event] │
│ callback() # runs fast │
└─────────────────────────────────────┘
Cooperative: code explicitly yields controlUnderlying Theory (OS / CN / DSA / Math Linkage)
OS Threads: Full Stack Machine
An OS thread is a complete CPU execution context:
- Stack: 1–8MB per thread (default: Linux 8MB, can be reduced with
ulimit -sorpthread_attr_setstacksize) - Registers: RSP (stack pointer), RIP (instruction pointer), 16 general-purpose registers
- TLB entries: Thread's code references virtual addresses; TLB caches virtual→physical mappings
- Kernel resources: Thread Control Block (TCB) in kernel space
Context switch cost:
Context switch steps:
1. Save current thread's registers to TCB (~50 ns)
2. Load next thread's registers from TCB (~50 ns)
3. TLB flush (if different process, or sometimes same) (~200 ns)
4. Cache warmup for new thread's working set (~500 ns - 2μs)
─────────────────────────────────────────────────────────────────
Total: 1-10μs per context switch
At 10,000 threads switching at 1ms intervals:
Switches/second = 10,000 × 1,000 = 10,000,000
CPU time = 10,000,000 × 10μs = 100 seconds of CPU/second
(impossible — this is why thread-per-connection doesn't scale)Green Threads / Goroutines: User-Space Scheduler
Go's goroutines are multiplexed M goroutines onto N OS threads (M:N threading):
Go runtime scheduler:
┌─────────────────────────────────────────────┐
│ M1 (OS thread) M2 (OS thread) │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ P1 (proc) │ │ P2 (proc) │ │
│ │ run queue: │ │ run queue: │ │
│ │ [G3,G5,G7] │ │ [G2,G4,G6] │ │
│ │ currently: │ │ currently: │ │
│ │ G1 │ │ G8 │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Global run queue: [G9, G10, G11, ...] │
│ │
│ Work stealing: P1 steals from P2 when idle │
└─────────────────────────────────────────────┘- Goroutine stack: starts at 2KB, grows on demand up to 1GB (stack copying/segmented stacks)
- Goroutine switch cost: ~100–200ns (no TLB flush, no kernel involvement)
- GOMAXPROCS: number of OS threads = number of CPU cores by default
Async/Await: Cooperative Coroutines
Async functions are state machines compiled by the language runtime. await is a yield point:
# This Python function:
async def fetch_user(user_id):
conn = await db.acquire() # yield point 1
row = await conn.fetchrow(...) # yield point 2
await conn.release() # yield point 3
return row
# Is approximately equivalent to this state machine:
class FetchUserStateMachine:
def __init__(self, user_id):
self.user_id = user_id
self.state = 0
def send(self, value):
if self.state == 0:
self.conn_future = db.acquire()
self.state = 1
return self.conn_future # suspend, return future to event loop
elif self.state == 1:
self.conn = value # resumed with connection
self.row_future = self.conn.fetchrow(...)
self.state = 2
return self.row_future # suspend again
elif self.state == 2:
self.row = value
# ... continueNo new OS thread is created. The coroutine is resumed by the event loop when the awaited I/O completes.
Event Loop: epoll/kqueue Under the Hood
The event loop uses OS-level I/O multiplexing to monitor thousands of sockets with a single syscall:
┌─────────────────────────────────────────────────────────────────┐
│ epoll (Linux) / kqueue (macOS/BSD) │
│ │
│ epoll_create(): create epoll interest list │
│ epoll_ctl(fd, EPOLL_CTL_ADD, events): register socket │
│ epoll_wait(timeout): block until any registered fd is ready │
│ → returns list of ready file descriptors in O(ready_fds) │
│ NOT O(total_fds) — this is the key advantage over select │
└─────────────────────────────────────────────────────────────────┘
Node.js libuv event loop phases:
┌──────────────────────────────────────────┐
│ ┌─────────────────────────────────┐ │
│ │ timers │ │ setTimeout, setInterval callbacks
│ └─────────────────┬───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ pending callbacks │ │ I/O callbacks deferred to next loop
│ └─────────────────┬───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ poll (epoll_wait) │ │ Wait for I/O events, execute callbacks
│ └─────────────────┬───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ check │ │ setImmediate callbacks
│ └─────────────────┬───────────────┘ │
│ │ │
│ ┌─────────────────▼───────────────┐ │
│ │ close callbacks │ │ socket.on('close', ...)
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────┘
Between each phase: process ALL microtasks (Promise.then, queueMicrotask)The GIL (CPython Global Interpreter Lock)
CPython's GIL is a mutex that allows only one thread to execute Python bytecode at a time:
Thread 1: ──[acquire GIL]──[execute]──[release GIL]──[wait]──────────────
Thread 2: ──[wait]─────────────────────────────────[acquire GIL][execute]─
Thread 3: ──[wait]──────────────────────────────────────────────[wait]────
Result: Python threads do NOT parallelize CPU-bound code.
Two Python threads on a quad-core CPU = 1 core effectively used.
Exception: I/O-bound work DOES release the GIL:
Thread calls recv() → releases GIL → OS performs I/O → thread reacquires GIL
Multiple threads CAN overlap on I/O, just not CPU.Escape hatches:
multiprocessingmodule: separate processes, each with their own GIL- C extensions (NumPy, etc.) can release GIL during computation
- PyPy, GraalPy: alternative runtimes without GIL (PEP 703 for no-GIL CPython in progress)
When Threads Win
- CPU-bound work: image processing, compression, cryptography. Use
multiprocessingin Python (bypasses GIL). Use OS threads in Go and Node.js worker threads. - Blocking syscalls you can't avoid: some legacy libraries only offer blocking APIs. Wrap in a thread pool.
- Simple synchronization requirements: a fixed pool of 4 workers processing a queue is simpler with threads than async.
- Shared mutable state: locks are simpler to reason about than async coordination (though both are hard to get right).
When Async Wins
- I/O-bound, high concurrency: HTTP servers, microservices, API gateways.
- Many idle connections: WebSocket servers, long-polling, chat applications.
- Low memory budget: embedded systems, resource-constrained environments.
- Low tail latency: no context switch overhead means more predictable p99.
Naive Approach — Thread per Connection
import threading
import socket
import time
def handle_client(conn, addr):
"""Each client gets its own OS thread."""
data = conn.recv(1024)
time.sleep(0.01) # simulate I/O work
conn.send(b"HTTP/1.1 200 OK\r\n\r\nHello")
conn.close()
def run_threaded_server(port=8080):
sock = socket.socket()
sock.bind(('', port))
sock.listen(128)
while True:
conn, addr = sock.accept()
# NEW THREAD PER CONNECTION — does not scale
t = threading.Thread(target=handle_client, args=(conn, addr))
t.daemon = True
t.start()At 10,000 concurrent clients: 10,000 threads × 8MB stack = 80GB RAM. Crash.
Optimized Approach
Python — asyncio (event loop, single thread)
import asyncio
import time
async def handle_client(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
"""Single thread, async I/O — handles thousands of connections."""
data = await reader.read(1024) # yields to event loop (non-blocking)
await asyncio.sleep(0.01) # yields to event loop (not thread sleep!)
writer.write(b"HTTP/1.1 200 OK\r\n\r\nHello")
await writer.drain()
writer.close()
async def main():
server = await asyncio.start_server(handle_client, '0.0.0.0', 8080)
async with server:
await server.serve_forever()
# 10,000 concurrent connections:
# Memory: 10,000 × ~2KB coroutine state = 20MB
# CPU: near zero (blocked in epoll_wait 99% of the time)
asyncio.run(main())asyncio with thread pool for CPU work:
import asyncio
from concurrent.futures import ProcessPoolExecutor
executor = ProcessPoolExecutor(max_workers=4) # one per CPU core
def cpu_intensive(data: bytes) -> bytes:
"""CPU-bound work: runs in a process, not blocking the event loop."""
import hashlib
return hashlib.sha256(data).digest()
async def handle_request(data: bytes) -> bytes:
loop = asyncio.get_event_loop()
# run_in_executor: submits to process pool, awaits result without blocking loop
result = await loop.run_in_executor(executor, cpu_intensive, data)
return resultGo — Goroutines + Channels
package main
import (
"fmt"
"net"
"sync"
"time"
)
func handleConn(conn net.Conn, wg *sync.WaitGroup) {
defer wg.Done()
defer conn.Close()
// Goroutine: 2KB stack, cheap to create
buf := make([]byte, 1024)
conn.Read(buf)
time.Sleep(10 * time.Millisecond) // simulate I/O
conn.Write([]byte("HTTP/1.1 200 OK\r\n\r\nHello"))
}
func main() {
ln, _ := net.Listen("tcp", ":8080")
var wg sync.WaitGroup
for {
conn, _ := ln.Accept()
wg.Add(1)
go handleConn(conn, &wg) // one goroutine per connection — cheap!
}
// 10,000 connections: 10,000 × 2KB = 20MB
// Go's runtime schedules them on GOMAXPROCS OS threads
}Go is unique: goroutines give you the "one goroutine per connection" simplicity of thread-per-connection, with async-level memory efficiency. The runtime does the multiplexing.
Worker pool with goroutines:
func main() {
jobs := make(chan Job, 1000)
results := make(chan Result, 1000)
// Start fixed worker pool (bounded goroutines)
for w := 0; w < 10; w++ {
go func() {
for job := range jobs {
results <- process(job)
}
}()
}
// Submit jobs
for _, job := range allJobs {
jobs <- job
}
close(jobs)
}Node.js — async/await + EventEmitter
const net = require('net');
// Single-threaded event loop handles all connections
const server = net.createServer((socket) => {
socket.on('data', async (data) => {
// Non-blocking: returns to event loop while waiting
await new Promise(resolve => setTimeout(resolve, 10)); // simulate I/O
socket.write('HTTP/1.1 200 OK\r\n\r\nHello');
socket.end();
});
});
server.listen(8080);
// Worker threads for CPU-bound work (Node.js 10.5+)
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');
if (isMainThread) {
// Main event loop thread
async function runCpuWork(data) {
return new Promise((resolve, reject) => {
const worker = new Worker(__filename, { workerData: data });
worker.on('message', resolve);
worker.on('error', reject);
});
}
} else {
// Worker thread — can do blocking CPU work safely
const result = heavyComputation(workerData);
parentPort.postMessage(result);
}Never block the Node.js event loop:
// BAD: synchronous computation blocks all connections
app.get('/hash', (req, res) => {
const hash = require('crypto')
.createHash('sha256')
.update(req.body.data.repeat(100000)) // blocks for 500ms!
.digest('hex');
res.json({ hash });
});
// GOOD: offload to worker thread
const { Worker } = require('worker_threads');
app.get('/hash', async (req, res) => {
const hash = await runInWorker(req.body.data); // event loop free
res.json({ hash });
});Complexity Analysis
| Model | Memory per connection | Context switch | Max concurrent (practical) |
|---|---|---|---|
| OS Thread | 1–8MB stack | 1–10μs + TLB flush | ~10,000 |
| Goroutine | 2KB–grow | ~100–200ns | ~1,000,000 |
| async/await | ~2KB coroutine state | ~100ns | ~100,000 |
| Event loop (callbacks) | ~1KB | ~50ns | ~100,000 |
Time complexity for N concurrent I/O tasks:
- Thread-per-connection: O(N) OS threads, O(N × switch_cost) scheduling overhead
- Goroutine: O(N/GOMAXPROCS) switches per second — scales with CPU count
- Async: O(1) OS threads (approximately), O(N) task state, O(ready_events) per loop iteration
Benchmark (p50, p99, CPU, Memory)
Setup: 10,000 concurrent clients, each holding an HTTP connection with 10ms simulated I/O latency.
┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model │ p50 │ p99 │ Memory │ CPU (1req)│
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python threads (GIL) │ 14ms │ 95ms │ 10GB+ │ 2ms │
│ Python asyncio │ 11ms │ 18ms │ 25MB │ 0.3ms │
│ Go goroutines │ 11ms │ 15ms │ 22MB │ 0.2ms │
│ Node.js async │ 11ms │ 17ms │ 80MB │ 0.5ms │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘
For CPU-bound work (hash computation) on 10,000 requests:
┌─────────────────────────┬────────┬────────┬──────────┬───────────┐
│ Model │ p50 │ p99 │ Memory │ CPU cores │
├─────────────────────────┼────────┼────────┼──────────┼───────────┤
│ Python asyncio (single) │ 120ms │ 200ms │ 25MB │ 1 (GIL) │
│ Python multiprocessing │ 35ms │ 55ms │ 400MB │ 4 │
│ Go goroutines │ 28ms │ 45ms │ 22MB │ 4 │
│ Node.js worker_threads │ 32ms │ 50ms │ 200MB │ 4 │
└─────────────────────────┴────────┴────────┴──────────┴───────────┘Observability
Thread pool metrics
from concurrent.futures import ThreadPoolExecutor
from prometheus_client import Gauge, Counter
thread_pool_active = Gauge('thread_pool_active_threads', 'Active threads')
thread_pool_queue = Gauge('thread_pool_queued_tasks', 'Queued tasks')
class InstrumentedExecutor(ThreadPoolExecutor):
def submit(self, fn, *args, **kwargs):
thread_pool_queue.inc()
future = super().submit(fn, *args, **kwargs)
future.add_done_callback(lambda _: thread_pool_queue.dec())
return futureEvent loop lag (Node.js)
// Event loop lag: time between when a callback was scheduled and when it ran
let lastTick = Date.now();
setInterval(() => {
const now = Date.now();
const lag = now - lastTick - 100; // expected 100ms interval
prometheus.gauge('event_loop_lag_ms').set(lag);
lastTick = now;
}, 100);
// ALERT: event loop lag > 50ms means the loop is being blockedGoroutine leak detection (Go)
import (
"net/http"
_ "net/http/pprof" // registers /debug/pprof/goroutine endpoint
"runtime"
)
// Expose goroutine count as metric
func goroutineCount() int {
return runtime.NumGoroutine()
}
// Alert: goroutine count grows without bound → leak
// Normal: stable count proportional to active connectionsFailure Modes
1. Blocking the event loop (Node.js / Python asyncio):
// BAD: This blocks Node.js for 5 seconds — ALL other requests stall
app.get('/slow', (req, res) => {
const start = Date.now();
while (Date.now() - start < 5000) {} // CPU spin — NEVER do this
res.send('done');
});
// How to detect: event loop lag metric spikes
// How to fix: Worker threads, process.nextTick decomposition, or restructure algorithm2. Goroutine leak (Go):
// BAD: goroutine blocked forever on channel nobody will send to
func leaky() {
ch := make(chan int)
go func() {
val := <-ch // blocks forever if nobody sends
fmt.Println(val)
}()
// Function returns, ch goes out of scope, goroutine is stuck
}
// GOOD: use context for cancellation
func safe(ctx context.Context) {
ch := make(chan int)
go func() {
select {
case val := <-ch:
fmt.Println(val)
case <-ctx.Done():
return // goroutine exits cleanly
}
}()
}3. Thread starvation:
When all threads in a pool are waiting for I/O, and no threads are available for new requests:
# Starvation scenario:
executor = ThreadPoolExecutor(max_workers=10)
async def handler():
# Submits synchronous DB call to thread pool
result = await loop.run_in_executor(executor, blocking_db_call)
# If blocking_db_call takes 30s and there are 10 concurrent requests:
# all 10 executor threads are occupied → 11th request waits foreverFix: Separate thread pools for I/O-bound and CPU-bound work. Set timeouts.
4. Priority inversion:
A high-priority goroutine waiting on a mutex held by a low-priority goroutine. The low-priority goroutine can't run because the high-priority one is consuming CPU time.
Go 1.14+ uses asynchronous preemption to avoid full priority inversion.
5. Python GIL + threads CPU-bound misuse:
# TRAP: looks parallel, isn't
import threading
def compute(n):
for i in range(n):
i ** 2 # pure Python — GIL prevents true parallelism
threads = [threading.Thread(target=compute, args=(10_000_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# This takes ~4× LONGER than single-threaded due to GIL contention!
# Use multiprocessing.Pool instead.When NOT to Use Async
- CPU-bound Python code: async doesn't help (still GIL-bound). Use
multiprocessing. - Simple scripts with 1–5 operations: async adds complexity without benefit.
- Blocking-only libraries: some Python libraries (e.g., certain DB drivers, boto3) are synchronous only. Running them in
asynciowithoutrun_in_executorwill block the event loop. - Debugging: async stack traces are harder to read. Callbacks and coroutines fragment the call stack.
- Teams unfamiliar with async: async bugs (forgotten
await, accidental blocking) are subtle and production-unsafe without team experience.
Decision Matrix
I/O-bound CPU-bound
┌──────────────┬──────────────────┐
Python │ asyncio ✓ │ multiprocessing ✓│
│ (threads OK │ (NOT threads — │
│ for legacy) │ GIL kills you) │
├──────────────┼──────────────────┤
Go │ goroutines ✓ │ goroutines ✓ │
│ │ (GOMAXPROCS=ncpu) │
├──────────────┼──────────────────┤
Node.js │ async/await ✓│ worker_threads ✓ │
│ │ (NOT main thread) │
└──────────────┴──────────────────┘Lab
See ../../benchmarks/01-thread-vs-async-vs-event-loop/README.md for a complete benchmark comparing:
- Python threading vs asyncio for 1,000 concurrent I/O tasks
- Go goroutines for the same workload
- Node.js async for the same workload
The benchmark measures p50, p99, memory usage, and CPU utilization.
Key Takeaways
- OS threads: 1–8MB each, 1–10μs context switch. Good for CPU work and legacy blocking code.
- Goroutines: 2KB each, ~100ns switch, M:N scheduling. Go's sweet spot — simple code, async performance.
- async/await: ~2KB state machine, no context switch for scheduling. Best for I/O-bound, high-concurrency Python and Node.js.
- Python GIL: threads cannot parallelize CPU work in CPython. Use
multiprocessingfor CPU. Useasynciofor I/O. - Go's goroutines give you both: one goroutine per connection (like thread-per-connection simplicity) at async memory costs.
- Never block the event loop: any synchronous work > 1ms in Node.js or Python asyncio stalls all other connections.
- Goroutine leaks grow linearly; detect via
runtime.NumGoroutine()metric. Always passcontext.Contextfor cancellation. - The decision tree: I/O-bound → async or goroutines. CPU-bound → OS threads (Go/Node) or processes (Python).
in this section