nginx, gunicorn and repeated requests

I ran into an interesting issue today that I could not find anywhere on Google, so I wanted to document it.

In my move from C++ app development to web development, I suddenly found I was hugely distanced from the operations side of things. Part of this was unfamiliarity with Linux systems, but part was the more delineated roles between developers, DevOps and operations. More things are automated and handled by DevOps than I previously had to worry about – which leaves me more time to code – but it does mean I am less comfortable on that side than I used to be. However, I still find that one of my favorite things to do is investigate the root cause of particularly tricky or hard-to-reproduce bugs, and this has led me to some interesting edge cases in our technical setup.

It is common in web development to have multiple layers of routers and load balancers stacked on top of each other, each performing a slightly different role – e.g. a security / payload-manipulation layer on top, then a routing/caching layer underneath that passes the request to the app itself. To take advantage of the hardware the software is running on, it is then common for the app itself to have many workers (in a combination of different processes, threads and greenlets) to service the requests.

The app I was working on today used nginx for routing and gunicorn for actually serving the code. I’d solved an issue a while ago involving nginx’s proxy_next_upstream setting, which by default passes a request that has timed out on to the next node in the pool. For non-idempotent requests this can obviously cause huge issues, e.g. an object being copied multiple times instead of just once. The solution was to change nginx to only retry a request on a different node when the current node is down (i.e. returning a 502 to nginx), and otherwise to return an error once the timeout occurred. This is done by changing the setting to proxy_next_upstream error; (compared to its default, proxy_next_upstream error timeout;).
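
For reference, here is a minimal sketch of the relevant nginx config – the upstream name, addresses and timeout value are illustrative, not our real ones:

    upstream app_servers {
        server 10.0.0.1:8000;
        server 10.0.0.2:8000;
    }

    server {
        location / {
            proxy_pass http://app_servers;
            # Only retry on another node when the connection itself fails;
            # a timed-out request now returns an error to the client instead
            # of being replayed elsewhere (the default is "error timeout").
            proxy_next_upstream error;
            proxy_read_timeout 60s;
        }
    }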

Recently I noticed the exact same issue I’d found previously seemed to be back, although all the configs were correct. I found two odd things: 1) the request was being retried every 30 seconds, even though the nginx timeout was set to the default of 60 seconds, and 2) there were NO further logs from the request that timed out once the 30 seconds had elapsed, whereas previously the app continued processing the request as normal.

Eventually I found that Gunicorn has a default worker timeout of 30 seconds, and that when it kills a worker it does two things. First, it completely stops the worker from doing any more work (which is probably good if there’s a deadlock or something similar happening), and second, it returns a 502 status code to nginx (Bad Gateway), not a 504 (Gateway Timeout). This means nginx thinks the request has not been processed at all, and passes it on to the next address in the pool. The reason this hadn’t surfaced before was that the app had been using Waitress – it wasn’t until the move to Gunicorn that the problem appeared.

Given the popularity of both nginx and gunicorn, I’m surprised this issue isn’t better documented. I guess it is rare for a request to take longer than 30 seconds, and when one does, the fact that it was retried simply goes unnoticed.

The solution, of course, is to adjust the timeouts so that Gunicorn’s is at least as long as nginx’s. I ended up giving ours a little more leeway, so that any downstream connections (with the same timeout value) that might be slowing us down will time out first, letting us do any cleanup, log a few things and then get out, rather than having the worker killed at exactly 60 seconds.
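
In gunicorn this is a one-line change. A sketch of the sort of config file we ended up with – the exact numbers here are illustrative rather than a recommendation, and the module path is made up:

    # gunicorn.conf.py – start with: gunicorn -c gunicorn.conf.py myapp:app
    workers = 4

    # nginx's proxy_read_timeout is 60s, so give the worker some slack beyond
    # that: downstream calls (with the same 60s timeout) get to fail first,
    # and we can clean up and log before returning, instead of being killed.
    timeout = 90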

As always, it was incredibly satisfying to figure out and solve this issue – although I was quite chagrined to see how relatively simple the original problem was, given that it manifested in such confusing and difficult-to-pin-down side effects.

Micro Python: Python for microcontrollers

A good friend of mine, Damien George, has been busy this year creating “Micro Python” – an implementation of Python 3.3 for microcontrollers. He has a Kickstarter going here – this is an amazingly cool project, please check it out! I’ve gone for the ‘starter kit’ – can’t wait to play with this and see what it can do.

Git and .pycs – not the best of friends

Coming from a C++ background, I found using Python for the first time pretty amazing. The ease with which you can link in other libraries and quickly write code is fantastic. Learning and using Git for the first time, on the other hand – that took some adjustment.

Git and I are on fairly good terms these days, and I’ve learned to love some of the features it offers compared to TFS (the last source control system I was using), especially when used in conjunction with GitHub. However, I hit a problem the other day that made _zero_ sense to me – and it took a surprisingly long time to track down, thanks to lots of false positives along the way.

I’d been switching back and forth between a few branches, and all of a sudden my unit tests were failing – the app appeared to be testing things that shouldn’t have existed in that branch. Not only that, the app suddenly refused to run – I kept getting a pyramid.exceptions.ConfigurationExecutionError. I found that if I added in a bunch of routes that existed in the _other_ branch, suddenly everything worked.

Experienced Python/Pyramid coders will probably know exactly what the problem was, but I was stumped. It turns out that our Git is configured to ignore .pyc files – which is normally what you want and works fine. However, because the .pyc files are ignored, they get left behind when you switch branches, and Python (2.x, at least) will happily import an orphaned .pyc even when its .py source no longer exists. So when files exist in one branch and not in another, the leftover .pycs can cause all sorts of issues – stale code from the other branch keeps getting loaded.
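
To see why, here is a quick demonstration you can run yourself – this assumes Python 2.x (which is what we are on), and the module name is made up:

    $ echo "MESSAGE = 'hello from the old branch'" > leftover.py
    $ python -c "import leftover"     # importing writes leftover.pyc next to the source
    $ rm leftover.py                  # simulate switching to a branch without this module
    $ python -c "import leftover; print(leftover.MESSAGE)"
    hello from the old branch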

There are two main ways of solving this:

  1. Create a clean_pycs alias in your .bashrc file, that you can run to clean your pycs when you need to – i.e. alias clean_pycs='find . -name "*.pyc" -exec rm {} \;'
  2. Use a Git hook to automatically clean up the files every time you change branches – more info here, and a minimal sketch follows this list.
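
As a sketch of option 2, assuming your repository uses the default hooks directory and you remember to make the file executable (chmod +x .git/hooks/post-checkout):

    #!/bin/sh
    # .git/hooks/post-checkout – remove stale .pyc files every time a branch
    # is checked out, so orphaned bytecode from other branches can't linger.
    find . -name "*.pyc" -delete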

There are other ways to do it, mentioned in the follow-up to the blog post above, but these are the two easiest. Which one you use is up to you – I’ve spoken to a few other coders at work, and because we work on a lot of different projects they tend to use the former, rather than having to add the hook to every repository.

Hopefully this blog post helps somebody – I was incredibly thankful when I finally found the blog post linked above!

Learning Python

About a year ago my wife and I packed up all our things and moved to Silicon Valley to work for SurveyMonkey. As well as changing job, city and country, I switched from being a Windows application developer to being a backend web developer. I spent over 8 years writing high-performance trading applications in C++ and SQL on Windows, and traded it all in for Python, JavaScript, open source libraries, virtual Linux environments and OS X. Surprisingly, though, I did get to keep using SQL Server 2008!

I have noticed that people in Silicon Valley tend to be a lot more collaborative than in Melbourne. I suspect it may be because of the heavier use of open source software, and also the rise of distributed source control like Git. Back in Melbourne, most technology companies seemed to use either Java, C++ or C# – when I left, Python and Ruby were becoming more prominent, particularly due to a burgeoning startup community, but they were still relatively rare compared to here. These languages don’t seem to lend themselves as well to open source collaboration – though I will admit my experience was extremely filtered, as a developer working in a Windows dev shop that rarely used a third-party library.

I’ve found the many tech blogs about Python and Pyramid extremely helpful when trying to solve problems, and in the spirit of this collaborative effort I would like to give back a little with what I’ve found. I also find that writing up technical discussions and solutions is very helpful for refining and consolidating my own thoughts, so even if this isn’t widely read, it’s a very useful exercise.

To kick off, I thought I’d mention something that seems to have caught surprisingly few people out on the web, but had me completely stumped a few weeks back at work.  It’ll follow in my next post.  Thanks for reading!

Steve Jobs’s cancer went unreported for 9 months

Why? Because he was busy trying new age garbage to try and treat it. Guess how that worked out for him.

It turns out that doctors found the tumor in October of 2003 during a routine scan. While a biopsy revealed that the cancer was a rare but treatable form, Jobs opted to try and treat the cancer with a special diet. Jobs is a long-time Buddhist and vegetarian, and though research has shown diet to be effective in reversing coronary disease and even slowing prostate cancer, surgery is so effective for this type of cancer that most patients live 10 years or more after treatment. Jobs tried treating the cancer with the special diet for nine months, until follow-up scans showed the tumor was growing. He then had the surgery in July of 2004, when most of us found out about it, and after a relatively short recovery was back at Apple.

This is exactly why I hate woo-woo stuff with a passion. Not only can it offer people false hope, it can actually hurt or kill them. Rational thinking is underrated.

Sand won’t save you this time.

No, it certainly won’t.

Holy crap that is some scary stuff.  Ever felt like burning a bucket of sand?  Just get yourself some chlorine trifluoride.  Asbestos contaminating your house?  This will burn it right up.

There’s a report from the early 1950s of a one-ton spill of the stuff. It burned its way through a foot of concrete floor and chewed up another meter of sand and gravel beneath, completing a day that I’m sure no one involved ever forgot. That process, I should add, would necessarily have been accompanied by copious amounts of horribly toxic and corrosive by-products: it’s bad enough when your reagent ignites wet sand, but the clouds of hot hydrofluoric acid are your special door prize if you’re foolhardy enough to hang around and watch the fireworks.