Case study · 2026
Django → Go: migrating Repos Energy's identity and order services
Two services · Six months · p99 1.25s → 80ms on identity-auth · Zero downtime during cutover
01 Context
Repos Energy runs fuel and EV charging logistics across distributed physical infrastructure. By the time I joined in January 2026, the Django monolith handling identity, authentication, and order placement had grown to a point where the two services were the primary source of tail latency. Not because Django is slow — it isn't, for most workloads — but because the concurrency model Gunicorn uses (pre-forked workers, one request per worker) meant that any upstream slowness in the database or the auth layer created a queue.
At p99, the identity service was taking 1.25 seconds. The order placement service was slower under load because auth was on its critical path.
The decision to migrate to Go had already been made at the CTO level. My job was to own the execution of both services while the rest of the engineering team continued shipping on the monolith.
02 Diagnosis
The first two weeks were profiling. Not estimating — profiling.
I profiled the production Django service under representative load, then traced what was actually happening per request. The two findings that mattered:
First: worker saturation, the pre-fork equivalent of running out of goroutines. Under concurrent load, Gunicorn's worker pool was exhausted well before the database was anywhere near its limit. The bottleneck wasn't I/O; it was the process model.
Second: the database connection pool was sized for the monolith, not for the individual services. Every request to the identity service was competing for connections with other service handlers in the same process. Peak simultaneous auth requests during order bursts were regularly hitting the pool ceiling.
These aren't Django problems. They're problems with how the service had grown inside a process boundary designed for something smaller.
03 Decision
Two Go services: identity-login-service and order-placement-service. Independent binaries, independent connection pools, separate deployment units.
The decision that took the most time wasn't the rewrite itself; it was the cutover strategy. Repos couldn't afford downtime on auth. We settled on a traffic-splitting proxy in front of both services: the Django handler and the new Go handler ran simultaneously, with traffic shifted incrementally. Django remained the source of truth for sessions until the Go service had demonstrated parity at a sufficient traffic percentage.
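A minimal sketch of the split, assuming hypothetical upstream addresses and a simple percentage-based split (the case study doesn't specify the proxy's exact mechanics). The routing decision is factored into a pure function so the ratio can be tested deterministically:

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// chooseBackend routes roughly goPercent of requests to the Go service;
// the rest fall through to Django, which remains the session source of truth.
// roll is a value in [0, 1) passed in so the decision is testable.
func chooseBackend(goPercent int, roll float64) string {
	if roll*100 < float64(goPercent) {
		return "go"
	}
	return "django"
}

// newSplitHandler wires the decision into stdlib reverse proxies.
// The upstream addresses are invented, not the real topology.
func newSplitHandler(goPercent int) http.Handler {
	djangoURL, _ := url.Parse("http://django-monolith:8000")
	goURL, _ := url.Parse("http://identity-login-service:8080")
	django := httputil.NewSingleHostReverseProxy(djangoURL)
	goSvc := httputil.NewSingleHostReverseProxy(goURL)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if chooseBackend(goPercent, rand.Float64()) == "go" {
			goSvc.ServeHTTP(w, r)
			return
		}
		django.ServeHTTP(w, r)
	})
}

func main() {
	_ = newSplitHandler(5) // start at 5%, shift upward as parity holds
	fmt.Println(chooseBackend(5, 0.01), chooseBackend(5, 0.50)) // go django
}
```

Keeping the split in one process-local function also made rollback trivial: set the percentage to zero and Django takes everything again.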
I wrote the Go services using the standard library plus pgx for Postgres. No framework. The reasoning was that a framework would add abstractions I couldn't profile through — and after the Django diagnosis, I was specifically unwilling to accept opacity at the framework layer.
For async work that had previously lived inside Django signals, we moved to a message queue. This made the async boundary explicit instead of implicit, which simplified the profiling story considerably.
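A sketch of what "explicit instead of implicit" means here. The event fields are invented, and a buffered channel stands in for the message queue (the case study doesn't name the broker); the point is that the handler publishes and returns, and the side effects run out of band where they can be observed:

```go
package main

import "fmt"

// OrderPlaced is the kind of event that replaced an implicit Django signal.
// Field names are illustrative.
type OrderPlaced struct {
	OrderID int64
	UserID  int64
}

// A buffered channel stands in for the real message queue in this sketch.
var queue = make(chan OrderPlaced, 128)

// placeOrder does the synchronous work, then publishes the event and returns.
// Nothing downstream runs on the request's critical path.
func placeOrder(orderID, userID int64) {
	// ... validate, write the order row ...
	queue <- OrderPlaced{OrderID: orderID, UserID: userID}
}

// consumeOne is the consumer side: the side effects that used to hide in
// signal handlers (email, audit log) now live behind an explicit boundary.
func consumeOne() OrderPlaced {
	ev := <-queue
	// ... perform side effects ...
	return ev
}

func main() {
	placeOrder(1001, 42)
	fmt.Println(consumeOne()) // {1001 42}
}
```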
04 Result
After the cutover completed, identity-auth p99 dropped from 1.25 seconds to 80 milliseconds, with zero downtime during the shift.
The number of in-flight requests under load — in the Go service, the goroutine count — was the number I watched most closely. It was the one that validated the diagnosis: if the real bottleneck had been somewhere other than the process model, swapping the concurrency model wouldn't have moved the latency. It moved.
Order placement p99 improved as a downstream consequence of auth latency dropping. I'm less confident attributing all of that improvement to the identity migration because other changes shipped in the same window.
05 Retrospective
What I got wrong: I underestimated how much institutional knowledge lived in the Django service's middleware chain. Two pieces of auth logic had been added as middleware patches over the previous years and weren't documented anywhere except in the middleware order in settings.py. I found one during integration testing and one in production — the second one at 11 PM on the second day of the cutover.
What I'd do differently: before writing any Go, spend a week writing a spec of every behavior the Django service exhibited — not from the codebase, but by reading its tests and then testing against production. Treat the existing service as a black box with a behavior contract, not as a reference implementation.
What surprised me: Go's profiling tooling (pprof, trace) is genuinely better than anything I'd used before for finding the difference between "my code is slow" and "the runtime is doing something I didn't expect." The root cause of a goroutine leak we hit during the migration was visible in a 30-second pprof capture in a way that would have taken me an hour to find by reading code.
The Django monolith continues running the other services. The migration was always a strangler fig, not a rewrite.