Desync Bugs Always Catch My Eye
Reading about why FastCGI is a better protocol for reverse proxies today pulled me back down this rabbit hole. The argument is that HTTP/1.1 has no explicit message boundaries: each message describes where it ends, and there is more than one way to say so (Content-Length, chunked transfer encoding), so proxies and backends can disagree about where one request stops and the next begins. FastCGI (and HTTP/2) solved this with explicit framing. HTTP/1.1 didn’t, and we’re still living with the consequences.
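To make the disagreement concrete, here’s a minimal Python sketch, not taken from any real proxy. It assumes a request that carries both Content-Length and Transfer-Encoding: chunked, and two toy parsers that each trust a different header. The host, the paths, and the names `split_headers`, `leftover_by_content_length`, and `leftover_by_chunked` are all made up for illustration.

```python
def split_headers(raw):
    """Return (lowercased header dict, bytes after the blank line)."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the request line
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers, rest


def leftover_by_content_length(raw):
    """Toy parser that trusts Content-Length; returns bytes left on the wire."""
    headers, rest = split_headers(raw)
    length = int(headers[b"content-length"])
    return rest[length:]


def leftover_by_chunked(raw):
    """Toy parser that trusts chunked encoding (no trailers, no extensions)."""
    _, rest = split_headers(raw)
    while True:
        size_line, _, rest = rest.partition(b"\r\n")
        size = int(size_line, 16)  # chunk sizes are hexadecimal
        if size == 0:
            return rest[2:]  # skip the CRLF that ends the last chunk
        rest = rest[size + 2:]  # skip chunk data plus its trailing CRLF


# Hypothetical request: the body is a zero-length chunk followed by what
# looks like a second request.
SMUGGLED = b"GET /admin HTTP/1.1\r\nHost: example.test\r\n\r\n"
BODY = b"0\r\n\r\n" + SMUGGLED
RAW = (
    b"POST / HTTP/1.1\r\n"
    b"Host: example.test\r\n"
    b"Content-Length: " + str(len(BODY)).encode() + b"\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
) + BODY

print(leftover_by_content_length(RAW))  # nothing left: one request
print(leftover_by_chunked(RAW))         # a whole second request is left over
```

One parser sees a single request with nothing after it; the other sees a complete second request waiting on the connection. A desync only needs one hop in the chain to pick the other interpretation.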
That got me thinking about the other flavors of this problem I’ve hit over the years. Desync bugs are interesting precisely because they’re not one pattern; they’re a category. The mechanism varies (protocol ambiguity, timeout handling, connection reuse), but the result is the same: you asked for A and got B, and everything looked fine.
I first ran into this back in 2008 with MogileFS and a race condition. A slow database caused an fread to time out waiting for a response. The client treated the timeout as “not found” and moved on, but the late response still landed in that connection’s receive buffer. The next request on the same connection read that stale data instead of its own response. We were asking for item A and getting item B. The fix was simple once diagnosed: close the connection on any timeout rather than reuse it.
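Here’s a rough Python reconstruction of that failure and the fix. It is not the MogileFS code. It assumes a made-up line protocol (`GET key` in, one line back), and a hypothetical `Client` class. The script plays the server by writing into the other end of a `socket.socketpair()`, and it queues replies up front so the demo stays single-threaded.

```python
import socket


def read_line(sock):
    """Read one newline-terminated reply from sock."""
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf[:-1].decode()


class Client:
    """Hypothetical client: one request, one reply, one reused connection."""

    def __init__(self, connect, close_on_timeout):
        self.connect = connect
        self.close_on_timeout = close_on_timeout
        self.sock = connect()

    def get(self, key):
        self.sock.sendall(f"GET {key}\n".encode())
        try:
            return read_line(self.sock)
        except socket.timeout:
            if self.close_on_timeout:
                # The reply may still show up later, so this connection
                # can't be trusted anymore: drop it and start fresh.
                self.sock.close()
                self.sock = self.connect()
            return None  # the caller sees "not found"


def demo(close_on_timeout):
    server_ends = []

    def connect():
        client_end, server_end = socket.socketpair()
        client_end.settimeout(0.05)
        server_ends.append(server_end)
        return client_end

    client = Client(connect, close_on_timeout)
    first = client.get("A")  # nobody answers in time, so this is None
    try:
        server_ends[0].sendall(b"value-of-A\n")  # the slow reply finally lands
    except OSError:
        pass  # the client already closed that connection
    # Queue B's reply on whichever connection the client is using now.
    server_ends[-1].sendall(b"value-of-B\n")
    second = client.get("B")
    return first, second


print(demo(close_on_timeout=False))  # B's request reads A's late reply
print(demo(close_on_timeout=True))   # B's request reads its own reply
```

With the connection reused after the timeout, the request for B comes back with A’s value and nothing raises. With the connection closed on timeout, the late reply has nowhere to go.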
At ReverbNation I ran into it again, twice. The MogileFS Ruby client had a similar connection safety issue, and we ended up streaming (and caching in the CDN) the wrong song. Separately, Dalli (the memcached client) had its own desync problem. I can’t find a link to that one, but there’s a recent issue along the same lines, #956, where response data could bleed across requests after timeouts.
What makes these bugs so insidious:
- They’re timing-dependent. Single-threaded tests pass. Low-traffic staging passes. The bug lives in the gap between requests.
- The symptom is far from the cause. You see garbled cache reads or wrong file data. The real problem is connection reuse logic written without thread-safety in mind.
- They keep appearing. Connection pooling is hard to get right, and every client library that implements it is one concurrency mistake away from this class of bug.
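One defensive shape for a pool, sketched in Python under my own assumptions: each connection has exactly one borrower at a time, and any exception during an exchange means the connection gets replaced rather than returned. `Pool`, `checkout`, and the `connect` factory are hypothetical names, not any particular library’s API, and the connection object only needs a `close()` method.

```python
import queue
from contextlib import contextmanager


class Pool:
    """Hypothetical pool: exclusive checkout, discard on any failure."""

    def __init__(self, connect, size):
        self._connect = connect
        self._idle = queue.Queue()  # thread-safe hand-off between borrowers
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def checkout(self):
        conn = self._idle.get()  # blocks until a connection is free
        try:
            yield conn
        except BaseException:
            # Something failed mid-exchange, so we no longer know what is
            # sitting in this connection's buffers. Replace it.
            conn.close()
            self._idle.put(self._connect())
            raise
        else:
            self._idle.put(conn)  # clean exchange: safe to reuse
```

Usage would look like `with pool.checkout() as conn: ...`, with the whole request and response handled inside the block. The design choice is to treat timeouts, interrupts, and parse errors all the same way: if the exchange didn’t finish cleanly, the connection doesn’t go back.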
The HTTP desync bugs found in Discord’s infrastructure show the same pattern at the HTTP layer: responses routed to the wrong client because something upstream got out of sync. Different protocol, same fragile spot: pairing each response with the request that asked for it.
I don’t expect these to stop appearing. Any time you have pooled connections and concurrent access, the conditions are there. The next one is probably in a library I’m using right now.
The other interesting part about these bugs is that the code to send a request and receive a response seems trivial, so this kind of issue can hide in a library for a long time before it’s unearthed.
The framing problem shows up outside of HTTP too. This post on phantom patches describes GNU patch being unable to distinguish actual diff content from diff-shaped text embedded in a commit message — same underlying idea, completely different domain. When the format doesn’t have unambiguous boundaries, parsers disagree, and something gets applied that shouldn’t.
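For contrast, here’s what unambiguous boundaries can look like, as a small Python sketch. It uses a made-up format (a 4-byte big-endian length in front of every payload), not the actual FastCGI or HTTP/2 record layout, and `encode_frame` and `decode_frames` are hypothetical names. The point is only that the parser never looks inside the payload to find where it ends.

```python
import struct


def encode_frame(payload):
    """Prefix the payload with its length as a 4-byte big-endian integer."""
    return struct.pack(">I", len(payload)) + payload


def decode_frames(stream):
    """Split a byte stream back into payloads using only the length prefixes."""
    frames, offset = [], 0
    while offset < len(stream):
        if offset + 4 > len(stream):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        if offset + length > len(stream):
            raise ValueError("truncated payload")
        frames.append(stream[offset:offset + length])
        offset += length
    return frames


# The first message contains diff-shaped text; the framing doesn't care.
MESSAGES = [b"fix parser\n--- a/x\n+++ b/x\n", b"second message"]
STREAM = b"".join(encode_frame(m) for m in MESSAGES)
print(decode_frames(STREAM) == MESSAGES)
```

Because the boundary comes from the length field and not from the content, a payload that happens to look like a header, a delimiter, or a diff can’t move it.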