Deploying a production application can be quite the chore. On the road to &! 2.0, our processes have changed significantly. In the beginning stages of &! 1.0, I hate to say it, but deploys were a completely manual process. We logged in to the server over SSH, pulled from Git, and restarted processes all by hand. Less than ideal, to say the least.
Managing those processes was just as bad; we were using forever and a simple SysVInit script (those things in /etc/init.d for you non-ops types) to run it. When the process would crash, forever would restart it and we'd be happy. Everything seemed great, but then one day we accidentally pushed broken code live. What did forever do? Kept trying to help us, by restarting the process. The process that crashes instantly. Several CPU usage warning emails from our hosting provider later, we realized what had happened and fixed the broken code. That's when we realized that blindly restarting the app when it crashes wasn't a great idea.
Since our servers all run Ubuntu, we already had Upstart in place so swapping out the old not-so-great init.d scripts for the new, much nicer, Upstart scripts was pretty simple and life was good again. With these we had a simple way to run the app under a different user (running as root is bad, please don't do it), load environment variables, and even respawn crashed processes (with limits! no more CPU usage warnings!).
But alas, manually deploying code was still a problem. In came fabric. For &! 1.0 we used a very simple fabric script that essentially did our manual deploy process for us. It performed all the same steps, in the same way, but the person deploying only had to run one command instead of several. That was good enough for quite some time. Until.. one day.. we needed to rollback to an old version of the app. But how?
That instance required us to dig through commit logs to find the rollback point, and manually checkout the old version and restart processes. This, as you can guess, took some time. Time that the app was down. We knew that this was bad, but how could we solve it? Inspiration struck, and we modified the fabric script. So then when it deployed a new version of code, it first made a copy of the existing code and archived it with a timestamp. Then it would update the current code and restart the process. This meant that in order to rollback, all we would have to do is move the archive in place of the current code and restart the process. We patted ourselves on the back and merrily went back to work.
Until, one day, we realized the app had once again stopped working. The cause? We overlooked how fast the drive on our server could fill up with us storing a full copy of the code every single time we deploy. A quick little modification to the script so that it kept only the last 10 archives and some manual file deletion, and we were back on track.
Time went on, the deploy process continued to work, but much like every developer out there, we had dreams of making the process even more simple. Why did I have to push code, and then deploy it? Why couldn't a push to a specific branch deploy the code for me? Thus was born our next deploy process.. A small server to listen for Github web hooks. Someone pushes code, the server would see what branch it was pushed to, if it was the special branch "deploy" the server would run our fabric scripts for us. Success! Developers could now deploy to production without asking, just by pushing to the right branch! What could go wrong?
As I'm sure you guessed, the answer is a lot. People can accidentally push to the wrong branch, deploying their code unintentionally. Dependencies can change and fail to install in the fabric script. The fabric script could crash, and we would have no idea why. We had logs, of course, but the developers didn't have access to them. All they knew was they pushed code and it wasn't live. So we'd poke around in the logs, find the problem, fix it, and go about our business grumbling to ourselves. This was also not going to work.
After much deliberation, we went back to running a separate command to deploy to the live server. That way the git branches could be horribly broken, people could make mistakes, and we wouldn't end up bringing down the whole app.
To help prevent broken code, we also changed our process for contribution. Instead of pushing code to master, developers are now asked to work in their own branch. When their code is complete, and tests pass, they then submit a pull request to have their code merged with master. This means that a second pair of eyes is on everything that goes in to master, and feedback can be given and heard before deploying code.
To help enforce peer review, I wrote a very simple bot that monitors our pull requests and their comments in Github. Pull requests now require two votes (a +1 in the comments) before the "merge" button in the pull request will turn green. Until that happens, the button is gray. Although easy to override, when the button is gray a warning is displayed if it is pressed. This was a nice, neat, unobtrusive way of encouraging everyone to wait until their code has been reviewed before merging it into master and being deployed.
While still not perfect, our methods have definitely matured. Every day we learn something new, and we strive to keep our methods working as cleanly and smoothly as possible. Regular discussions take place, and new ideas are always entertained. Some day, maybe we'll find the perfect way to deploy to production, but until we do we're having a lot of fun learning.
Add your email to join the And Bang 2.0 private beta invite list: