TCP Deep Dive

Problem

TCP's reliability guarantees come at a cost: connection setup (1.5 RTT), connection teardown (2 RTT), head-of-line blocking, and retransmission delays. Understanding exactly what TCP does on the wire explains why every connection-management optimization (pooling, keepalive, HTTP/2 multiplexing) works.

Why It Matters (Latency, Throughput, Cost)

TCP connection lifecycle costs:
  Handshake (SYN/SYN-ACK/ACK):  1.5 RTT = 15ms (assuming 10ms RTT)
  TLS 1.3 on top:                1 RTT additional = 10ms
  First data exchange:           1 RTT = 10ms
  Total to first byte (TTFB):    3.5 RTT = 35ms

With connection pool (reuse):
  Connection setup:              0 RTT (connection already ESTABLISHED)
  Time to first byte:            1 RTT = 10ms
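The lifecycle arithmetic above can be sketched in Python; the 10ms RTT and the per-step RTT counts (1.5 for the handshake, 1 for TLS 1.3, 1 for the data exchange) are the assumptions from the tables:

```python
# Sketch of the cost model above. The 10ms RTT is an assumption,
# as are the per-step RTT counts (1.5 handshake, 1 TLS 1.3, 1 data).
RTT = 0.010  # seconds

def ttfb_fresh(rtt: float = RTT) -> float:
    """TTFB for a brand-new connection: TCP handshake + TLS 1.3 + request."""
    return 1.5 * rtt + 1.0 * rtt + 1.0 * rtt  # handshake + TLS + data

def ttfb_pooled(rtt: float = RTT) -> float:
    """TTFB over a pooled, already-ESTABLISHED connection."""
    return 1.0 * rtt  # only the data exchange remains

print(f"fresh:  {ttfb_fresh() * 1000:.0f} ms")   # 35 ms
print(f"pooled: {ttfb_pooled() * 1000:.0f} ms")  # 10 ms
```

The 3.5x gap widens further on long-haul links, where each avoided RTT is worth more.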

TCP State Machine

                    CLOSED
                      β”‚
              SYN sentβ”‚ (client)
                      β–Ό
                  SYN_SENT ──────────────────────────────┐
                      β”‚  SYN received + SYN-ACK sent     β”‚
                      β”‚  (server: LISTEN β†’ SYN_RECEIVED) β”‚
                  ACK received                           β”‚
                      β–Ό                                  β”‚
                 ESTABLISHED β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  (data flows)
                      β”‚
              FIN sentβ”‚ (active close)
                      β–Ό
                FIN_WAIT_1
                      β”‚ FIN-ACK received
                      β–Ό
                FIN_WAIT_2
                      β”‚ FIN received from remote
                      β–Ό
                 TIME_WAIT  ← stays here for 2Γ—MSL (60–120 seconds)
                      β”‚
                   CLOSED

TIME_WAIT: The connection stays in TIME_WAIT for 2 × MSL (Maximum Segment Lifetime, typically 30–60s; Linux holds TIME_WAIT for a fixed 60s) to ensure delayed packets from the old connection don't corrupt a new connection on the same port pair. This is why rapidly cycling connections (no pool) can exhaust ephemeral ports (connect() fails with EADDRNOTAVAIL).
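The exhaustion arithmetic is worth making concrete. A sketch, assuming Linux's default ephemeral port range (32768–60999) and the fixed 60s TIME_WAIT:

```python
# Sketch: how fast can you open-and-close connections to one destination
# before TIME_WAIT eats every ephemeral port? Assumes Linux defaults:
# net.ipv4.ip_local_port_range = 32768 60999, TIME_WAIT held for 60s.
ephemeral_ports = 60999 - 32768 + 1   # 28232 usable source ports
time_wait_seconds = 60                # each port unusable this long after close

max_rate = ephemeral_ports / time_wait_seconds
print(f"~{max_rate:.0f} new connections/sec per (dst IP, dst port)")
```

Roughly 470 connections/sec per destination without a pool; beyond that, connect() starts failing until TIME_WAIT entries expire.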

TCP Flow Control and Congestion Control

Flow control (receive window): Receiver advertises how much buffer space it has. Sender cannot send more than the window allows.

Bandwidth-delay product (BDP):
  BDP = bandwidth Γ— RTT

For a 1Gbps link with 100ms RTT (e.g., cross-continent):
  BDP = 1,000,000,000 bits/s × 0.100s = 100,000,000 bits = 12.5MB

TCP window must be at least 12.5MB to fully utilize this link.
Default TCP receive buffer: 4MB → only ~32% link utilization!
Tune: net.core.rmem_max and net.ipv4.tcp_rmem
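A small helper makes the unit conversion explicit (bandwidth in bits/s, window in bytes); the 1Gbps / 100ms figures are illustrative:

```python
# Sketch: BDP with explicit units. Bandwidth is bits/s, result is bytes.
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_s / 8  # bits in flight -> bytes

bdp = bdp_bytes(1e9, 0.100)           # 1 Gbps, 100 ms RTT
window = 4_000_000                    # 4 MB receive buffer
print(f"BDP: {bdp / 1e6:.1f} MB, utilization: {min(1.0, window / bdp):.0%}")
```

The divide-by-8 is the usual source of error here: link speeds are quoted in bits, socket buffers in bytes.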

Slow start: New TCP connections start with a small congestion window (cwnd = ~10 MSS = ~14KB). cwnd doubles each RTT until it reaches ssthresh or packet loss is detected. This is why a fresh connection to a CDN is slower than a warm one.
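To see what slow start costs, here is a sketch counting round trips to deliver 1MB from a cold connection, assuming a 1460-byte MSS, an initial window of 10 segments, and ideal doubling:

```python
# Sketch: slow start from a cold connection. Assumes MSS = 1460 bytes,
# initial cwnd = 10 MSS, perfect doubling (no loss, no ssthresh cap).
MSS = 1460
cwnd = 10 * MSS                 # ~14.6 KB initial congestion window

transferred, rtts = 0, 0
while transferred < 1_000_000:  # deliver 1 MB
    transferred += cwnd         # one cwnd of data per round trip
    cwnd *= 2                   # slow start: cwnd doubles each RTT
    rtts += 1
print(f"{rtts} RTTs to deliver 1 MB cold")  # 7 RTTs = 70ms at 10ms RTT
```

A warm connection with a grown cwnd would push the same megabyte in far fewer round trips, which is the throughput half of the argument for pooling.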

Nagle's Algorithm

Nagle's algorithm coalesces small writes to reduce packet count:

  • While unACKed data is in flight, buffer small writes until an ACK arrives OR a full MSS accumulates

Impact on backend: Nagle interacts badly with delayed ACKs (~40ms on Linux). A database client that sends a small query as two writes and then calls recv() can stall up to ~40ms: the second write sits buffered until the peer's delayed ACK arrives. Fix: the TCP_NODELAY socket option.

import socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle

Most database drivers set TCP_NODELAY by default. Verify yours does.
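To verify, query the option back with getsockopt. A minimal sketch on a socket we configure ourselves (for a real driver, you would inspect whatever raw socket object it exposes, if any):

```python
import socket

# Sketch: confirm Nagle is disabled on a socket we control.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle disabled:", bool(nodelay))
sock.close()
```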

Key Takeaways

  1. TCP connection = 1.5 RTT setup + 2 RTT teardown. Pool connections to avoid this.
  2. TIME_WAIT (2Γ—MSL) is intentional β€” it prevents packet reuse bugs. Don't try to eliminate it; increase your ephemeral port range instead.
  3. Slow start means fresh connections have low throughput initially. Warm connections perform better β€” another argument for connection pooling.
  4. BDP = bandwidth Γ— RTT. Your TCP window must be β‰₯ BDP to saturate a link.
  5. Nagle's algorithm, interacting with delayed ACKs, can stall small writes by up to ~40ms. Set TCP_NODELAY for database and RPC connections.
📚 Related Topics

  • ./05-congestion-control.md — CUBIC, BBR algorithms
  • ../../07-core-backend-engineering/02-connection-pooling.md — TCP cost eliminated by pooling