As I watch the flight attendant go through the pre-flight safety speech, I cannot help but wonder how many people are paying attention and, more importantly, whether people would actually find their nearest exits in a “real” emergency. That’s not just a problem plaguing airline passengers. I routinely observe managers, developers and engineers ignore smart practices and safety procedures and head blindly into tasks without proper planning, ill-informed, or worse yet… motivated by fear. It’s no wonder they, along with their code and systems, end up in a prison of their own creation: the kind of legacy scenario we retell like ghost stories. Nonetheless, people continue to ignore the warnings. Knowing where the exits are will help you avoid getting trapped in your burning jail cell.
Go to a developer-oriented gathering and you’ll hear this: “I have no interest in learning <xyz>”, where <xyz> represents some kind of operational task or knowledge. Why should they? System administration is not really essential to software engineering, and conversely, ops teams are similarly uninterested in writing code. Otherwise they would be doing each other’s jobs already. That doesn’t mean there aren’t lessons to be learned from one another. In fact, the emergence of devops reflects just that recognition. It’s time for operations to adopt and apply the same discipline and knowledge that their brethren in the software camp have gradually refined over the years. It’s time for agile operations.
I never get why people panic at any point during a deployment. When you’re doing a deployment you’re in only one of two situations: 1) you’ve never deployed the app before, or 2) you have. If you have never deployed the app, there’s no audience to complain if things go wrong, so take your time and figure things out; there is no need to panic. If you have deployed the app before, you damn sure better have a rollback rip cord in place. Fact is, the vast majority of deployments are done without even the semblance of a rollback plan. The best most teams have is to restore the database from dump files and build an earlier source control tag. That’s not a plan, that’s inviting disaster.
Deploying sophisticated software onto the internet is not easy. It involves lots of fine, intricate steps being executed in exactly the right sequence, and if any one of a thousand steps is not done in precisely the right order, the whole thing will fail. Knowing this, what do we do? We put a bunch of people on standby just in case anything goes wrong, and make sure that everyone is on a conference call and in a too-massive-to-communicate-effectively chat room at 2:00 am so that they can be Johnny-on-the-spot if something goes wrong. Here’s a tip: if you are so nervous that something will go wrong that you need to have dozens of sleep-drunk people around ready to fix something they might have broken (or harass someone else who may have broken it), then you have too many damn humans involved in a process that needs to be fully automated.
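To make “fully automated” a little more concrete, here is a minimal sketch of a deployment runner that executes steps in a fixed order and, if one fails, undoes the completed steps in reverse. Everything in it is an assumption for illustration: the `run_deployment` function, the `Step` shape and the step names are hypothetical, and a real pipeline would call your actual build, migration and traffic-switching tooling.

```python
from typing import Callable, List, Tuple

# Hypothetical step shape: (name, action that applies the step, action that undoes it).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_deployment(steps: List[Step]) -> bool:
    """Run steps in order; on the first failure, undo completed steps in reverse.

    Returns True if every step succeeded, False if a rollback was performed.
    """
    completed: List[Step] = []
    for step in steps:
        name, apply, _undo = step
        try:
            apply()
        except Exception as exc:
            print(f"step {name!r} failed: {exc}; rolling back")
            # The failed step is not in `completed`, so only finished work is undone.
            for done_name, _apply, done_undo in reversed(completed):
                print(f"undoing {done_name!r}")
                done_undo()
            return False
        completed.append(step)
    return True


if __name__ == "__main__":
    # Illustrative steps only; the lambdas stand in for real tooling.
    def failing_smoke_test() -> None:
        raise RuntimeError("health check did not pass")

    demo_steps: List[Step] = [
        ("upload release", lambda: print("uploaded"), lambda: print("release removed")),
        ("switch traffic", lambda: print("switched"), lambda: print("traffic restored")),
        ("smoke test", failing_smoke_test, lambda: None),
    ]
    run_deployment(demo_steps)
```

The point of the design is that the order of operations and the way back out both live in code rather than in the heads of people on a 2:00 am call.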
True operations mavens know that downtime is inevitable. It’s going to happen, despite your best efforts. A blip, a stumble, some cable will get cut. Increasing the “nines” carries quite the price tag, and may not be the best way to maximize ROI. Disaster recovery plans need to be balanced, so that the focus isn’t solely on preventing catastrophes. Equally important is rapid recovery for business continuance, because that is the true goal of uptime: to serve pages, apps and data, to provide for the customers, and to continue the revenue stream. This is no longer an insurmountable task, given the resources and knowledge at hand.
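A rough bit of arithmetic shows why each extra “nine” gets expensive and why recovery speed counts as much as prevention. The sketch below assumes a 365-day year and uses the standard steady-state approximation availability = MTBF / (MTBF + MTTR); the function names and the sample figures are illustrative assumptions, not measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Yearly downtime budget implied by an availability target given in percent."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to recover."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Each additional nine cuts the yearly downtime budget by a factor of ten.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")

    # Illustrative numbers: a failure every 720 hours, 4 hours to recover.
    print(f"baseline:          {availability(720, 4):.5f}")
    print(f"halve recovery:    {availability(720, 2):.5f}")
    print(f"double prevention: {availability(1440, 4):.5f}")
```

Under this model, halving the time to recover buys the same availability as doubling the time between failures, so a recovery plan deserves the same attention as prevention.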