Limiting the Deployment Blast Radius
Here’s what could go wrong in complex application environments and how you can contain your risk using tools and best practices.
Complex application environments need regular deployments to keep things running smoothly. But deployments, especially microservice updates, can be risky because a change to one service can affect others in ways you didn’t anticipate.
In this article, we’ll explain what can go wrong with deployments and what you can do to limit the blast radius.
A blast radius is an area around an explosion where damage can occur. In the context of deployments, the blast radius is the area of the potential impact that a deployment might have.
For example, if you deploy a new feature to your website, the blast radius might be the website itself. But if you’re deploying a new database schema, the blast radius might be the database and all the applications that use it. The problem with deployments is that they often have an infinite blast radius.
While we always expect some blast radius, an infinite blast radius means that anything could go wrong and cause problems. That’s bad.
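To make the idea concrete, here is a minimal sketch that treats the blast radius as everything reachable from the changed component in a map of dependents. The component names and the `DEPENDENTS` map are hypothetical, made up for illustration; a real map would come from your own service inventory.

```python
from collections import deque

# Hypothetical map: each component -> the components that depend on it directly.
DEPENDENTS = {
    "orders-db": ["orders-api", "reporting"],
    "orders-api": ["web", "mobile-gateway"],
    "reporting": [],
    "web": [],
    "mobile-gateway": [],
}


def blast_radius(changed, dependents):
    """Return the changed component plus everything that depends on it,
    directly or indirectly (breadth-first traversal)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# A website change stays contained; a schema change reaches everything above it.
print(sorted(blast_radius("web", DEPENDENTS)))
print(sorted(blast_radius("orders-db", DEPENDENTS)))
```

If you can't draw this map for your system, that is a sign the blast radius is closer to infinite than you would like.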
Hastily developed and scheduled deployments are a leading cause of an infinite blast radius. When you rush a deployment, you’re more likely to make mistakes. These mistakes can include forgetting to update the documentation, accidentally breaking something in production, or not giving other interested parties, like dependent service owners, a chance to reflect on and respond to the deployment.
Have you ever woken up to frantic calls that your app is not working only to discover that another team had an unplanned deployment and didn’t inform you that it was going ahead? I have.
Telling people that a deployment is going ahead without giving them adequate time to prepare is another source of trouble. Not communicating at all is a recipe for disaster.
One of the most important things you can do to limit a deployment’s blast radius is to test it thoroughly before pushing it to production. That means testing in a staging environment configured as close to production as possible. It also means doing things like unit testing and end-to-end testing.
By thoroughly testing your code before deploying it, you can catch any potential issues and fix them before they cause problems in production.
If your developers had to guess what the requirements were, they most likely didn’t have a clear test plan either. Unclear requirements can lead to code that works in development but breaks in production. In addition, it can lead to code that doesn’t play well with other systems. It can also lead to features that don’t meet users’ needs.
To avoid this, make sure you have a clear and complete set of requirements before starting development. Precise requirements will help ensure that your developers understand what they need to build and that they can test it properly before deploying it.
Configuration changes are one of the most common reasons deployments go wrong. For example, you might change the database settings and forget to update the application. Or you might change the way you serve your website and break the links to all of your other websites. Configuration changes cause so much trouble because they can affect so many different parts of your system.
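One way to catch a forgotten setting before it reaches production is to check the configuration against what the application expects. The sketch below assumes the configuration is available as a plain dictionary; the setting names in `REQUIRED_SETTINGS` are hypothetical.

```python
# Hypothetical settings the application reads at startup, with expected types.
REQUIRED_SETTINGS = {"DB_HOST": str, "DB_PORT": int, "DB_NAME": str}


def find_config_problems(config, required=REQUIRED_SETTINGS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected_type in required.items():
        if key not in config:
            problems.append(f"missing setting: {key}")
        elif not isinstance(config[key], expected_type):
            actual = type(config[key]).__name__
            problems.append(f"{key} should be {expected_type.__name__}, got {actual}")
    return problems


# Example: the port was entered as text and the database name was forgotten.
new_config = {"DB_HOST": "db.internal", "DB_PORT": "5432"}
for problem in find_config_problems(new_config):
    print(problem)
```

Running a check like this as a gate before deployment turns a production outage into a failed pre-deployment step.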
Human beings get tired, make mistakes, and forget things. When deploying a complex system, there’s a lot of room for error. Even the most experienced engineers can make mistakes.
In the chaotic world of production support, things sometimes get changed directly in production, and those changes never trickle down to the testing and development environments. This environment inequality can lead to problems when you deploy. Your application might not work in the target environment, or you might not have all the necessary files and configurations. Then there are the dreaded things in production that were never in testing, and no one knows why they’re there.
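Differences between environments are easier to deal with when you can see them. As a rough sketch, assuming each environment's settings can be exported as a dictionary (the setting names below are invented), you could compare the two like this:

```python
ABSENT = "<absent>"  # marker for a setting that exists in only one environment


def environment_drift(production, staging):
    """Return {setting: (production_value, staging_value)} for every
    setting that differs between the two environments."""
    drift = {}
    for key in sorted(production.keys() | staging.keys()):
        prod_value = production.get(key, ABSENT)
        staging_value = staging.get(key, ABSENT)
        if prod_value != staging_value:
            drift[key] = (prod_value, staging_value)
    return drift


# Hypothetical settings: one value differs, one exists only in production.
production = {"TIMEOUT_SECONDS": 30, "CACHE_ENABLED": True, "LEGACY_FLAG": "on"}
staging = {"TIMEOUT_SECONDS": 10, "CACHE_ENABLED": True}

for key, (prod_value, staging_value) in environment_drift(production, staging).items():
    print(f"{key}: production={prod_value!r} staging={staging_value!r}")
```

The settings that exist only in production are the ones worth asking about first.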
Finally, the software itself can go wrong. Software is complex, and it’s often hard to predict how it will behave in different environments. You might test your software in a development environment, where it works fine, and then see unintended consequences when you deploy it to production. The risk is higher if your code is “spaghetti”: tightly coupled, hard-to-maintain code usually means trouble with deployments.
A lot can go wrong! But what if there were ways you could limit the blast radius?
There are a few ways to limit the blast radius.
Set your team up for success by establishing a regular cadence and process for deployments. Consistency will minimize the potential for human error.
When everyone knows when to expect deployments or what to expect when there’s a particular case, your deployments will go much smoother. You should also plan what to do if something goes wrong. Can you roll back? Are there backups?
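The rollback question is easier to answer when the rollback is part of the deployment itself. Here is a minimal sketch of that idea. The `activate` and `is_healthy` callables are hypothetical stand-ins for whatever your platform provides, and the version numbers are made up.

```python
def deploy_with_rollback(new_version, previous_version, activate, is_healthy):
    """Activate new_version; fall back to previous_version if the health
    check fails. Returns the version that is left running.

    `activate` and `is_healthy` are hypothetical callables supplied by the
    caller; they are not part of any real deployment tool.
    """
    activate(new_version)
    if is_healthy():
        return new_version
    # The health check failed, so restore the last known good version.
    activate(previous_version)
    return previous_version


# Usage with stand-ins that only record what happened.
activated = []
running = deploy_with_rollback(
    new_version="2.4.0",
    previous_version="2.3.1",
    activate=activated.append,
    is_healthy=lambda: False,  # simulate a failed health check
)
print(running)    # 2.3.1
print(activated)  # ['2.4.0', '2.3.1']
```

This only covers code. Data changes, such as a schema migration, need their own answer, which is where backups come in.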
Part of planning the deployment is making sure that all responsible and affected parties are informed adequately and have time to review, reflect, and respond. Communicating includes sending an email to all users informing them of the upcoming deployment and telling them what to expect. In addition, it means communicating with the people who will do the deployment. Give them adequate time to prepare, review, practice, and clear their calendars. It’s better to err on the side of providing too much information rather than too little.
The most important way to prepare is to understand your risks. First, you need to know what could go wrong and how it would affect your system. Only then can you take steps to prevent it from happening.
Another way to limit the blast radius is to automate as much of the deployment process as possible. If a step can be automated, automate it. Automated deployments run the same way every time, which helps ensure consistency and accuracy. You can be sure that every step is covered, that it’s done correctly, and that nothing is forgotten.
There are many different tools available to help you automate your deployments. Choose the one that best fits your needs and then automate as much of the process as possible.
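Whatever tool you pick, the core idea is the same: the steps are written down once and run in the same order every time, and the run stops as soon as one step fails. The sketch below illustrates that idea in plain Python. The step names and the failing migration are invented, and a real pipeline would call your actual test, backup, and release commands.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (completed_step_names, error_message_or_None).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            return completed, f"{name} failed: {error}"
        completed.append(name)
    return completed, None


def failing_migration():
    # Hypothetical step that simulates a problem during deployment.
    raise RuntimeError("schema version mismatch")


steps = [
    ("run tests", lambda: None),
    ("back up database", lambda: None),
    ("apply migration", failing_migration),
    ("switch traffic", lambda: None),
]

completed, error = run_pipeline(steps)
print(completed)  # ['run tests', 'back up database']
print(error)      # apply migration failed: schema version mismatch
```

Because traffic is only switched after every earlier step succeeds, a failed migration never reaches your users.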
Finally, use an internal developer portal. An internal developer portal organizes much of the information you may need to help limit the blast radius.
For instance, it can help you understand the downstream services that depend on your service, identify the owners of those services, find related documentation, and visualize key metrics about those services all in one place. This enables you to know who to communicate with ahead of a deployment and how to get in touch, provides context on how those downstream services work, and offers a place to monitor those services during testing (assuming your internal developer portal is infrastructure-aware at the environment level).
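To illustrate the "who do I need to talk to" part, here is a small sketch that groups downstream services by owning team. The `CATALOG` dictionary is a hypothetical stand-in for the ownership data a portal would hold, and it does not reflect any particular product's data model or API.

```python
from collections import defaultdict

# Hypothetical ownership records, as a portal's service catalog might hold them.
CATALOG = {
    "orders-api": {"owner": "team-orders"},
    "web": {"owner": "team-frontend"},
    "mobile-gateway": {"owner": "team-frontend"},
}


def owners_to_notify(downstream_services, catalog):
    """Group downstream services by owning team so each team gets one notice.

    Services missing from the catalog are grouped under "unknown owner" so
    they are not silently skipped.
    """
    by_owner = defaultdict(list)
    for service in sorted(downstream_services):
        entry = catalog.get(service)
        owner = entry["owner"] if entry else "unknown owner"
        by_owner[owner].append(service)
    return dict(by_owner)


downstream = {"web", "orders-api", "mobile-gateway", "reporting"}
for owner, services in owners_to_notify(downstream, CATALOG).items():
    print(f"{owner}: {', '.join(services)}")
```

A service with no known owner is itself a finding worth resolving before the deployment goes ahead.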
An internal developer portal can also help you understand your risks by surfacing version control information: what was changed, when it was changed, and by whom. This way, you can identify any changes that might have introduced risk and take steps to mitigate that risk.
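If your code lives in Git, you can pull the same "what, when, and by whom" information yourself. The sketch below assumes you know a tag or commit for the last deployment (the `last_deployed_ref` argument is a hypothetical name) and lists every commit made since then using `git log`.

```python
import subprocess

FIELD_SEP = "\x1f"  # unit separator, unlikely to appear in commit subjects
# git placeholders: %H hash, %an author name, %aI author date (ISO 8601),
# %s subject, %x1f a literal 0x1f byte between the fields.
LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


def parse_log(output):
    """Turn `git log` output in LOG_FORMAT into a list of dictionaries."""
    changes = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit, author, date, subject = line.split(FIELD_SEP, 3)
        changes.append(
            {"commit": commit, "author": author, "date": date, "subject": subject}
        )
    return changes


def changes_since(last_deployed_ref, repo_path="."):
    """List commits made after last_deployed_ref, newest first."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log",
         f"--pretty=format:{LOG_FORMAT}", f"{last_deployed_ref}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)
```

Calling `changes_since` with the tag of your last release, run from inside the repository, gives you a review list to walk through before you deploy.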
One such tool is Configure8. It helps you understand the blast radius of your deployments and limit the potential fallout, so your deployments run more smoothly. It enables your team to answer the following questions:
Learn more about how your engineering team can benefit from Configure8 and how it can help you limit the blast radius of your deployments. You can check it out here.
Steven Lohrenz is an IT professional with 25-plus years of experience as a programmer, software engineer, technical team lead, and software and integrations architect. They blog at StevenLohrenz.com about things that interest them.