This post is born out of frustration, and as such might read like it is a rant. It is. Please do not take the following as the ultimate truth. I might be wrong.
My hope is that it will resonate with you and help you discover the pain points in your deployment procedures.
One of the many tenets of agile software development is that things that are hard and time-consuming should be done often. The more time it takes, the more often you should do it. The scarier it is, the more often you should do it. This seems like a paradox, but in reality it turns out nicely. If it takes a long time, you automate it. If the risk of error is high, you automate it (and add safety checks to the automation!). This automation makes things less scary and less time-consuming, saving time (not to mention reducing stress!) in the long run.
The same goes for provisioning and deployment. If done incorrectly, it can—under the wrong circumstances—be an extremely scary and time-consuming process.
Let me start by telling you about one of my experiences with deployments. Through the combined smarts of my colleagues and myself, things always turned out to work, although there were quite a few close calls. Looking back, it still scares the living daylights out of me. The product we supplied the software for was a turn-key 'connect it to your network' appliance, commissioned by a start-up. Customers would be all around the world.
We developed a script which arranged for a clean install, and successive updates were performed using Ubuntu's unattended-upgrades. This we released into production. Everything was working fine.
That is, until we had to add a dependency. It turns out unattended-upgrades does not play nicely with newly added dependencies; there are a lot of "interesting" caveats.
Another "interesting" aspect was that of error recovery. A 'factory reset' was planned, but—and I say this with 20/20 hindsight—we did not think error recovery through well enough. Occasionally we would run into issues which would break all future upgrades, and broken upgrades would require a total recall. We all knew that, but—again, hindsight—we did not really prevent it at the design level, instead relying on an inhuman amount of testing to make sure that future upgrades would keep working.
Preparing for a release was, for me, an extremely stressful event, even though the release itself was merely e-mailing a few packages to the customer. Before we had continuous integration set up, it was an even bigger nightmare, because we also spent anywhere from half a day up to a week making sure that the new packages would install at all.
One of the things I learned on this project was the following: Make the effort needed for recovering from a botched upgrade explicit, and take steps to reduce that cost!
If you're just maintaining one system, have root access, and all you serve is static HTML, error recovery is cheap. If you're maintaining thousands of systems all over the world, where only a few engineers have root access (and need physical access to the machines), and the data stored on the machines is highly critical, error recovery should be at the top of your mind while designing the system.
Back to brainstorming and rambling about what makes a good deployment procedure.
Good deployments are
Automated. Automated deployments are fallible, manual deployments more so.
Staged. Changes should progress from "Develop" to "Test", then "Accept" and finally "Production".
Frequent. Every good change should move to production in as little time as possible.
Tested. Ideally, bad code never makes it out of the editor. You should have a strategy in place for when it makes it into version control.
Reversible. For when bad code reaches production.
Pre-built. The GNU C compiler should not be necessary on the server.
Reproducible. Everybody should be able to do it with the same results.
Idempotent. Deploying the same code twice is the same as deploying it once.
Fool-proof. It should be near-impossible to break the deployment process. We want to focus on delivering features, not on keeping everything running.
Separated from the data. Because restoring code is easy. Restoring data is not always easy.
Now, this is not just a list of things one wants to have. The more people working on one project, the more important it becomes to actually have it.
One should always strive to reduce the probability of error, and the severity thereof. By automating deployments as far as possible, the chance of human error can be reduced as much as possible.
Also, while automating, you start asking "What can go wrong?" and dealing with the answers in an appropriate manner. There is always a cost-benefit balance to consider.
A developer's playground is their own development environment. They can change whatever they want, and nobody should be the wiser. It should be completely isolated from every other developer's. But as soon as they are confident of their changes, it is time to share them, for instance by committing them into version control and pushing them out into the world.
From there, the changes should be deployed to a shared testing environment as soon as possible. Failures at this stage should result in a notification, and the breakage should be fixed.
The changes should eventually progress to a 'customer acceptance' environment, where the customer-representative can decide if the product is ready for the production environment, where users will see and experience all the shiny new features in their full glory.
Another form of staged deployments also exists. Gradually upgrade the servers in batches, and stop the roll-out when new errors occur. Of course, if you have just one server, it's an all-or-nothing approach, but for larger server-farms, like those of Google, GitHub and Facebook, a gradual roll-out is a viable solution.
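As a sketch of that idea, here is what a batched roll-out loop could look like. `upgrade_server` and `health_check` are stand-ins for whatever your own infrastructure provides; the point is the halting logic, not the names.

```shell
#!/bin/sh
# Hypothetical batched roll-out: upgrade a few servers at a time, and halt
# the moment something stops looking healthy. upgrade_server and
# health_check are placeholders for your own tooling.
roll_out() {
    batch_size="$1"; shift
    count=0
    for server in "$@"; do
        if ! upgrade_server "$server"; then
            echo "halting roll-out: upgrade failed on $server"
            return 1
        fi
        count=$((count + 1))
        # After each full batch, check for new errors before continuing.
        if [ $((count % batch_size)) -eq 0 ] && ! health_check; then
            echo "halting roll-out: health check failed after $count servers"
            return 1
        fi
    done
    echo "roll-out complete: $count servers upgraded"
}
```

With one server this degrades to the all-or-nothing case; with a farm, a bad release only ever reaches the first batch.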
So, after 4 years of development, and 3 billion changed lines of code by 100 developers, let us make our next deployment. Something went wrong. The developer who wrote that code has already switched jobs 9 times in between. How are we going to fix it?
Contrast this to the following:
So we just pushed this 3-line change into production, and something broke. Let's just revert the change.
Of course, things are never that simple. Sometimes the buggy code just isn't exercised often, so the bug is only discovered a few months later. But still, the general principle is sound: the faster things are in production, the easier it is to discover what causes a bug. Or at least, the commit which caused it.
The other reason for frequent deployments is value. Code written is an investment. The return on that investment can only start after deployment.
Everyone is fallible; everybody makes mistakes. The sooner a mistake is discovered, the sooner it can be repaired. Discovering bugs requires exercising (or sometimes just reading) code. Because discovering sooner is better, testing should be done often. Manual testing tends to be done rarely, so better to make testing automated. Reading the propaganda published by people passionate about test-driven development (or behaviour-driven development) should be enough to convince you to start with automated testing.
But, there is another point I would like to make. Not only the code should be tested, but you should also test the automated deployment. Ideally, this means that the "Test" and "Accept" environments are deployed using the same procedure as the "Production" environment. Because if the deployment procedure is different, the procedure for deploying production is not tested with the code.
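One lightweight way to enforce that is a single deploy entry point that takes the environment as its only parameter, so that only configuration differs between stages. A minimal sketch; the environment names and config layout are illustrative:

```shell
# Sketch: one deploy procedure for every stage; only the configuration
# file differs. Environment names and paths are made up.
deploy() {
    env="$1"
    case "$env" in
        test|accept|production) ;;
        *) echo "unknown environment: $env" >&2; return 1 ;;
    esac
    echo "deploying using config/$env.conf"
    # ... the exact same steps would follow for every environment ...
}
```

Because the "Test" and "Accept" deploys run the very same function, every production deploy has, in effect, already been rehearsed twice.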
Sometimes buggy code reaches production. Sometimes you have perfect code which matches all the requirements given by all the stakeholders (yeah right, like that's gonna happen!) but it turns out that sales dropped drastically because the suggested color-change for the 'buy' button reduced purchasing click-through. In any case, the deployment needs to be reversed before more damage is done.
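In the simplest case, reversing that three-line change is a single `git revert` followed by a redeploy through the normal pipeline. The helper name below is mine, not a standard tool:

```shell
# Sketch: undo the most recent change by creating a revert commit; the
# regular deploy pipeline then ships it like any other change.
revert_last_change() {
    repo="$1"
    git -C "$repo" revert --no-edit HEAD >/dev/null
}
```

Note that this reverses code, not data; if the bad release also migrated the database, you need a reversal plan for that too.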
The same goes for the static files of Django projects. You should not run the manage.py collectstatic command in production. You should run it in your build environment, from setup.py, or your Makefile, or whatever build process you choose.
Another example is for when you use Cython. Make sure the modules are built before deployment.
"It works on my machine.", or "It worked yesterday, but nothing has changed since.". Few things are worse than requiring magic incantations, while hoping that the phase of the moon is correct, the planets are aligned, and the tides are turning, in order to make a successful deployment. And on failure, the deployment of course says "Something went wrong", not "You are missing dependency X" or "Given folder does not exist".
As such, the deployment requires tribal knowledge. Facts that are mostly only known by the tribe elders, and are shared on-demand, which is almost never, because they are always the ones doing the deployment. Some of that knowledge will of course be codified into scripts. But nobody is going to tell you the correct arguments for calling the script. Or the fact that you need to do a manual step before (or after) running the script.
To combat this, you should strive to make the deployment fully automatic. Preferably, make sure all the dependencies are installed automatically. Ideally you should be able to point your provisioning scripts at a bare Ubuntu (or Gentoo, Debian, Windows, Mac OS X) install, let them perform their codified magic, and end up with a running install. Writing fully portable provisioning is hard, so I'm fine with "Start with a clean Ubuntu 14.04", but not with "Start with a clean Ubuntu 14.04 on which <list of 20 dependencies> are already installed.".
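A small step in that direction is a preflight check that names exactly what is missing instead of failing generically. A sketch, with made-up helper names:

```shell
# Sketch: fail loudly with the actual cause ("missing dependency X",
# "given folder does not exist") instead of "something went wrong".
require_command() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing dependency: $1" >&2
        return 1
    }
}
require_dir() {
    [ -d "$1" ] || {
        echo "missing directory: $1" >&2
        return 1
    }
}
```

Run checks like these at the top of the deployment script, before anything destructive happens, so a failure is both early and self-explanatory.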
A quick litmus test is to let the new guy (or gal) on the team do the next deployment while you watch over their shoulder. Of course, let them start by reading the manual, and then let them deploy. If anything goes wrong, you failed. No, it is not the fault of the new guy. It is the fault of the team.
"Idempotent" is a word which I mostly know from mathematics. It means: performing the action twice (or even many times over) is the same as performing it once. If the code changed, the new code should be deployed. If the code did not change, nothing interesting should happen.
Of course, what actually happens depends on the chosen procedures. It might be that an identical Docker image is built, and then a switch-over is performed to the new image. Or it is determined that the code is unchanged, and nothing happens.
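One way to get that behaviour without any particular tooling is to compare a checksum of the release artifact against what is already deployed. The helper below is a sketch with invented file names and layout:

```shell
# Sketch of an idempotent deploy step: running it twice with the same
# artifact only does the work once. File names and layout are made up.
deploy_if_changed() {
    artifact="$1"   # the built release (tarball, wheel, package, ...)
    marker="$2"     # records the checksum of the currently deployed release
    new_sum=$(sha256sum "$artifact" | cut -d' ' -f1)
    old_sum=$(cat "$marker" 2>/dev/null || echo none)
    if [ "$new_sum" = "$old_sum" ]; then
        echo "unchanged: nothing to do"   # the idempotent case
        return 0
    fi
    # ... the actual switch-over to the new release would happen here ...
    echo "$new_sum" > "$marker"
    echo "deployed $new_sum"
}
```

The second invocation with an identical artifact is a no-op, which is exactly the property you want when a deploy script gets re-run after a partial failure.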
No change in the deployed code should break the deployment process itself. Nothing should break deployments. Period. Every minute spent worrying about deployments being successful or not causes an increase in the stress-levels in your team, as well as a decrease in features being delivered.
Scaling out is easier if the data is separated properly from the codebase. In larger setups, I would recommend that there are separate machines for the database, and separate machines for the code.
As yet I do not have enough experience to recommend a fully worked-out solution. However, I feel confident enough to label a few of the components of such a solution.
Virtual machines, Docker, LXD, Linux containers, Heroku dynos, virtualenv, chroot, etc. Anything that allows you to separate parts of the install, and reduce the coupling between dependencies.
Debian packages, Python wheels/eggs, RPMs, etc. Anything that allows you to build a ready-for-use package for deployment.
Autotools (autoconf/automake), Makefiles, gulp tasks, etc. Anything that allows you to define how a codebase is to be transformed into something executable.
Puppet, Ansible, Chef, etc. Anything that allows you to define the state a machine should be in.
As of now, I do not really have enough experience to recommend which of these components to pick. As such, my opinion is not yet fully formed and trustworthy. However, here are some statements I currently believe to be true:
Each service you run should be as separate as possible from each other and from the database. After all, how would you know if you got the dependencies correct if they're not properly separated?
You should use a container for each separate service. A container is either a separate virtual machine, a Docker instance, a Heroku dyno, a virtualenv, or a chroot. This is to help with the previous point: separation.
Upgrading a service should be a matter of replacing its container with a new version. Without having to transfer data from the old container to the new! Without having to upgrade other parts of the system! Replacing a container should be very simple, and by only ever replacing whole containers, upgrades stay simple too.
You should be able to gracefully handle services with conflicting dependencies. Because nothing is more annoying than not being able to change dependencies.
Data should not be stored in the same container as the handler of the data. After all, if you just replace a container, you don't want the data gone, do you? As far as I'm concerned, why not mark the whole container as read-only? Not sure if this is possible, but it should be!
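To make the combination concrete: with Docker, the data can live in a named volume while the container itself is started read-only (docker run does have a --read-only flag). The helper below only prints the commands it would run, so the sketch stays a dry run; the service and image names are invented.

```shell
# Sketch: emit the commands for replacing a service container while its
# data survives in a named volume. Echoing instead of executing keeps
# this inspectable; names are illustrative.
swap_container() {
    service="$1"; image="$2"
    # Data lives in the volume "<service>-data"; the container itself is
    # disposable and can even be read-only.
    echo "docker run -d --read-only -v ${service}-data:/data --name ${service}-next $image"
    echo "docker stop $service"
    echo "docker rm $service"
    echo "docker rename ${service}-next $service"
}
```

Because the volume is never touched, replacing the container is exactly the cheap, reversible operation described above.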
One last thing I have to mention in this rant, as it bothers me (and has bothered me for quite some time now).
Your codebase should not need to know how it is going to be deployed. Your codebase should not have a debian directory in it. It should not have information on building RPMs in it. Of course that information must be in version control; just keep it in a separate repository. The debian repository should not be a sub-repository of the code repository. Maybe the other way around.
Let me repeat myself, as clearly as possible. The code you are working on should not be responsible for knowing how that code is to be deployed. Switching from Debian package based deployment to RPMs should not need a single commit to the repository containing your code. Switching from Puppet to Ansible should likewise require no changes.