nginx, gunicorn and repeated requests

I ran into an interesting issue today that I could not find anywhere on Google, so wanted to document it.

In my move from C++ app development to web development, I suddenly found myself hugely distanced from the operations side of things. Part of this was unfamiliarity with Linux systems, but part was the more clearly delineated roles between developers, DevOps and operations. More things are automated and handled by DevOps than I previously had to worry about – which leaves me more time to code – but it does mean I am less comfortable in this area than I used to be. However, I still find that one of my favorite things to do is investigating the root cause of particularly tricky or hard-to-reproduce bugs, and this has led me to some interesting edge cases in our technical setup.

It is common in web development to have multiple layers of routers and load balancers stacked on top of each other, each performing a slightly different role: e.g. a security / payload-manipulation layer on top, with a routing/caching layer underneath that passes the request on to the app itself. To take advantage of the hardware the software runs on, it is then common for the app itself to have many workers (in some combination of processes, threads and greenlets) to service requests.

The app I was working on today used nginx for routing, and gunicorn for actually serving the code. I’d solved an issue a while ago involving an nginx setting called proxy_next_upstream, which by default passes a request that has timed out on to the next node in the pool. For non-idempotent requests this can obviously cause huge issues, e.g. an object being copied multiple times instead of just once. The solution was to change nginx to only retry a request on a different node when the current node is down (i.e. the connection itself fails, which nginx surfaces as a 502) and otherwise return an error once the timeout occurs. This is done by setting proxy_next_upstream error; (compared to its default, proxy_next_upstream error timeout;).
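A minimal sketch of what that nginx change looks like – the upstream name, addresses and timeout value here are illustrative, not from the actual config:

```nginx
upstream app_pool {
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}

server {
    location / {
        proxy_pass http://app_pool;
        proxy_read_timeout 60s;       # nginx's default read timeout
        # Retry on the next node only for connection errors, not timeouts.
        # The default is: proxy_next_upstream error timeout;
        proxy_next_upstream error;
    }
}
```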

Recently I noticed that the exact same issue I’d found previously seemed to be back, even though all the configs were correct. I found two odd things: 1) the request was being retried every 30 seconds, even though the nginx timeout was set to the default of 60 seconds, and 2) there were NO further logs from the request that timed out once the 30 seconds had elapsed, whereas previously the app continued processing the request as normal.

Eventually I found that Gunicorn has a default worker timeout of 30 seconds – and that when it kills a worker, two things happen. First, it completely stops the worker from doing any more work (which is probably good if there’s a deadlock or something similar happening), and second, nginx receives a 502 status code (Bad Gateway), not a 504 (Gateway Timeout). This means that nginx thinks the request has not been processed at all – and passes it on to the next address in the pool. The reason it hadn’t surfaced previously was that the app had been using Waitress – it wasn’t until the move to Gunicorn that the problem appeared.
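The behaviour is easy to reproduce with any handler that outlives the worker timeout. This is a sketch, not the app from the post – make_slow_app and the 31-second delay are made up for illustration:

```python
import time


def make_slow_app(delay):
    """Return a WSGI app that sleeps `delay` seconds before responding."""
    def app(environ, start_response):
        time.sleep(delay)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"done\n"]
    return app


# Served via gunicorn's default sync worker (timeout 30), this handler is
# killed mid-request at the 30 second mark: the connection to nginx is
# dropped before any response is written, and nginx treats it as a 502.
slow_app = make_slow_app(31)
```

Run it with something like `gunicorn module_name:slow_app` and watch the request die at 30 seconds rather than 31.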

Given the popularity of both nginx and gunicorn, I’m surprised this issue isn’t better documented. I guess it is rare for a request to take longer than 30 seconds, and when it does, the retry has simply gone unnoticed.

The solution, of course, is to adjust the timeouts so that Gunicorn’s is at least as long as nginx’s. I ended up giving ours a little more leeway – so that any downstream connections (with the same timeout value) that might be slowing us down will time out first, letting us do any cleanup, log a few things, and then get out, rather than having the worker killed at exactly 60 seconds.
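In a gunicorn.conf.py that comes down to a single setting – 75 seconds here is an illustrative value, not the one we actually chose; the only requirement is that it exceeds nginx's 60-second proxy_read_timeout:

```python
# gunicorn.conf.py (sketch)
# Keep the worker timeout longer than nginx's proxy_read_timeout (60 s by
# default), so nginx times out first and the worker survives long enough
# to clean up and log before finishing.
timeout = 75
```

The same value can be passed on the command line as `--timeout 75`.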

As always, it was incredibly satisfying figuring out and solving this issue – although I was quite chagrined to see how simple the original problem was, despite manifesting in some confusing and difficult-to-pin-down side effects.

Git and .pycs – not the best of friends

Coming from a C++ background, using Python for the first time was pretty amazing.  The ease with which you can link in other libraries and quickly write code is fantastic.  Learning and using Git for the first time on the other hand – that took some adjustment.

Git and I are on fairly good terms these days, and I’ve learned to love some of the features it offers compared to TFS (the last source control system I was using), especially when used in conjunction with Github.  However I hit a problem the other day that made _zero_ sense to me – and took a surprisingly long time to track down, due to lots of false positives in figuring out the issue.

I’d been switching back and forth between a few branches, and all of a sudden my unit tests were failing, and the app appeared to be testing things that shouldn’t exist in this branch.  Not only that, the app suddenly refused to run – I kept getting a pyramid.exceptions.ConfigurationExecutionError.  I found that if I added in a bunch of routes that existed in the _other_ branch, suddenly everything worked.

Experienced Python/Pyramid coders will probably know exactly what the problem was, but I was stumped.  It turns out that our Git repos are configured to ignore .pyc files – which is normally what you want and works fine.  However, when files exist in one branch and not in another, the leftover .pyc files can cause all sorts of issues.
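You can see the effect in isolation: if a .pyc sits next to where its source used to be, Python will happily import it even after the .py is gone – exactly what happens when a branch switch deletes a module but leaves its bytecode behind. A sketch, with "ghost" as a made-up module name:

```python
import os
import py_compile
import sys

sys.path.insert(0, os.getcwd())  # make sure the demo dir is importable

# Create a module, compile it to a .pyc in the legacy location (right next
# to the source, the way Python 2 always did), then delete the source.
with open("ghost.py", "w") as f:
    f.write("BRANCH = 'the-other-branch'\n")
py_compile.compile("ghost.py", cfile="ghost.pyc")
os.remove("ghost.py")

import ghost  # still succeeds: Python falls back to the stale bytecode
```

After the import, ghost.BRANCH reports a value from a file that no longer exists – which is exactly the kind of "testing things that shouldn't exist in this branch" confusion described above.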

There are two main ways of solving this:

  1. Create a clean_pycs alias in your .bashrc file, that you can run to clean your pycs when you need to – i.e. alias clean_pycs='find . -name "*.pyc" -exec rm {} \;'
  2. Use a Git hook to automatically clean up the files every time you change branches – more info here.
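For the second option, the hook itself is tiny – this is a sketch using the same find invocation as the alias above; save it as .git/hooks/post-checkout and make it executable:

```shell
#!/bin/sh
# post-checkout hook: remove stray .pyc files whenever branches change,
# so bytecode from the previous branch can't shadow deleted modules.
# Install with: cp this file to .git/hooks/post-checkout && chmod +x it.
find . -name "*.pyc" -exec rm {} \;
exit 0
```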

There are other ways to do it, mentioned in the follow-up to the blog post above, but these are the two easiest.  Which one you use is up to you – I’ve spoken to a few other coders at work, and because we work on a lot of different projects they tend to use the former, rather than having to add the hook to so many Git configs.

Hopefully this blog post helps somebody – I was incredibly thankful when I finally found the blog post linked above!