Number 7
May 31, 2023

Temporal paradox

A few years ago, I was on a team building a web application. It had the usual architecture for that time: multiple machines, each running one web server, behind a hierarchy of load-balancing servers. As the number of clients increased, the server load and the latency also grew; whenever the latency became too high, we would install a new server and add it to the load-balancing configuration to divide the total load among all servers.

One day we noticed that the average latency was too high, so, as always, we added a server; however, the latency didn’t go down. This was surprising, so we added more servers, but instead of going down, the latency went up.

[Illustration: two men digging a hole. “I don’t understand this. The more I remove from this hole, the bigger it becomes.”]

A latency histogram revealed two peaks: a short, wide one at around 150 milliseconds (ms) and a tall, thin one at 3000 ms. Moreover, as we increased the number of servers, the 3000-ms peak grew taller.

When we saw this, we tried reducing the number of servers to see if that would lower the peak, and it did. Somehow, the more servers we had, the more requests took 3000 ms. We had no idea why, but at least we could cull some servers to bring down this second peak until we found the cause of the problem.

Back then, web servers used process pools to serve many simultaneous requests. To serve 50 requests at the same time, 50 processes must exist, since each process can serve only one request at a time; if a 51st request comes in, it must wait for a process to finish its current request.

Web traffic varies a lot throughout the day, so those process pools are dynamic: when there are many requests, they contain many processes; when there are few requests, they save resources by holding few processes.

It’s like a restaurant in a beach town: in the winter, with few customers, they have few waiters, but when the good weather attracts customers, they hire more. This is not instantaneous, of course: if they get more customers than usual, they will wait a few days to see if the trend continues before hiring new waitstaff.

Just like the restaurant, the process pool also waits a bit to avoid adding processes needlessly: three seconds, or 3000 ms, to be exact.

I bet that the person reviewing the configuration said “Ooooh!” when they saw this.

Since we spread the requests among many servers, sometimes a server could go several minutes between two requests, so its process pool would think there was no traffic and reduce its size to the minimum. The configuration file didn’t define a minimum number of processes, so the pool would remove them all. When a request arrived, the empty pool would wait three seconds before creating a process, and that request would have 3000 ms of latency.
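Here is a toy sketch of that behavior (the class, its names, and its parameters are made up for this illustration, not the actual server software, but the logic is equivalent):

```python
class ToyProcessPool:
    """Illustration only: a pool that shrinks to `min_processes` when idle
    and waits `spawn_delay_ms` before spawning a worker for a new request."""

    def __init__(self, min_processes=0, spawn_delay_ms=3000):
        self.min_processes = min_processes
        self.spawn_delay_ms = spawn_delay_ms
        self.workers = min_processes

    def idle_period(self):
        # No traffic for a while: the pool shrinks back to its minimum size.
        self.workers = self.min_processes

    def handle_request(self, service_ms=150):
        # If no worker is available, the request also pays the spawn delay.
        if self.workers == 0:
            self.workers = 1
            return self.spawn_delay_ms + service_ms
        return service_ms


# Our configuration: no minimum, so a quiet server ends up with an empty pool.
pool = ToyProcessPool(min_processes=0)
pool.idle_period()
print(pool.handle_request())  # 3150 ms -- the second peak in the histogram

# Keeping even one warm process would avoid the delay entirely.
pool = ToyProcessPool(min_processes=1)
pool.idle_period()
print(pool.handle_request())  # 150 ms
```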

That slow request had an outsized effect on the average latency, which was the metric we always watched. When we saw it go up and added more servers, each server received fewer requests, ended up with an empty pool, took 3000 ms to serve the next request, and the average latency grew even more.
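Some back-of-the-envelope arithmetic shows how strong the effect is (the 150 ms and 3000 ms values come from the histogram; the fractions of slow requests are hypothetical):

```python
from statistics import mean

def average_latency(slow_fraction, fast_ms=150, slow_ms=3000, n=1000):
    """Average latency when `slow_fraction` of requests hit an empty pool."""
    slow = int(n * slow_fraction)
    return mean([slow_ms] * slow + [fast_ms] * (n - slow))

for fraction in (0.01, 0.05, 0.10):
    print(f"{fraction:.0%} slow -> average {average_latency(fraction):.1f} ms")
# 1% -> 178.5 ms, 5% -> 292.5 ms, 10% -> 435.0 ms
```

A handful of 3000-ms requests is enough to drag the average far away from what most users actually experience.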

We corrected the problem by setting a minimum size for the process pools and by reducing the number of servers so that each one would receive a reasonable number of requests.

We also had to correct our thinking. Average latency is not a good metric because atypical values affect it significantly. After this happened, we learned to use the median, which better represents a user’s typical experience, along with the 90th and 99th percentiles, which capture the atypical values.
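On the same kind of distribution, those metrics tell the story much more clearly (again a hypothetical sample, here with 3% of requests hitting an empty pool):

```python
from statistics import mean, median, quantiles

# Hypothetical sample: 97% of requests at 150 ms, 3% at 3000 ms.
sample = [150] * 970 + [3000] * 30

cuts = quantiles(sample, n=100)            # cut points p1 .. p99
print(f"mean   {mean(sample):6.1f} ms")    # 235.5 -- dragged up by outliers
print(f"median {median(sample):6.1f} ms")  # 150.0 -- the typical experience
print(f"p90    {cuts[89]:6.1f} ms")        # 150.0
print(f"p99    {cuts[98]:6.1f} ms")        # 3000.0 -- the atypical values
```

The mean suggests every user waits almost a quarter of a second; the median shows what most users actually see, and the 99th percentile exposes the unlucky ones.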

As we acquire experience, we build a mental database of heuristics and rules that help us work faster and more efficiently. However, we should never forget that many of those rules are shortcuts, not absolute truths: we can’t stay on autopilot all the time, and we must verify that our thinking is correct.

The illustration for this Code Sheet is based on an engraving from an 1867 edition of “Les Misérables.”