How to automate everything
When I worked at Facebook, one of my projects was making cross-region failover for source control more effective and automated.
Before my work, when we were doing a cross-region failover, we brought the whole team into a room for the day. We also brought food and coffee and ran through the commands as a group for extra safety, to make sure we didn't cause an incident. Together we worked through a list of 20 different commands, along with expectations and checks about what each one would do. The whole process was slow and error-prone because of how complex the commands were and the number of context-dependent substitutions in them (like the names of the regions).
In one of the most defining moments of my career, one of my colleagues ran a command in the wrong terminal window during a failover and caused an incident. The incident review that followed was really tough for me, and I was scared of participating in another failover. It was the nudge I needed to automate the process. Here is how I proceeded:
- (1) Turn the runbook into a script printing text
- (2) Replace the steps with commands gradually
- (3) Add a lot of assertions
- (4) Add a dry-run mode
- (5) (Optional) Create a model of the system
- (6) Convince others and go live
Step 1: Turn the runbook into a script printing text
This step is key to getting started. Take all the steps in your runbook, commands or not, and turn them into a script. A minimal example looks like:
def step(txt):
    print(txt)
    print('Press enter when done')
    input()
step('Create an empty file for the report')
step('Check disk space on your computer')
step('Add the date to the report file')
step('Log in to the primary server')
step('Check the disk space')
This shifts the source of truth from a document to code. So when someone knows how to automate one of those steps, they can just replace the step call with some code that performs it.
Step 2: Replace the steps with commands gradually
At this point you have a script that contains all the steps of your manual procedure. You can start making the script more automated by replacing the calls to step with actual code performing those steps. What's great is that you don't have to know how to automate everything to make an impact; even automating a single step is progress.
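For instance, continuing the minimal example above, the disk-space check could be automated while every other step stays manual. This is just a sketch: the 10 GB threshold is an illustrative assumption, not something from the original runbook.

import shutil

def step(txt):
    print(txt)
    print('Press enter when done')
    input()

step('Create an empty file for the report')

# This step used to be manual; the script now performs the check itself.
# The 10 GB threshold is an illustrative assumption.
free_bytes = shutil.disk_usage('/').free
assert free_bytes > 10 * 1024**3, f'Low disk space: {free_bytes} bytes free'
print('Disk space OK')

step('Add the date to the report file')
step('Log in to the primary server')
step('Check the disk space')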
Step 3: Add a lot of assertions
Shifting from a manual workflow to an automated one causes angst. You may be worried that you are going to break the system, that it was much safer when you followed the runbook because it could "always be stopped" or was "not running too fast". Fear drives this way of thinking, and one of the ways to tame it is to add assertions about the system. Think about every step and what could be asserted before and after it to make sure that it ran correctly, and add those checks to your code.
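As a sketch, here is what asserting before and after the report-file step from the example above could look like. The exact checks are assumptions made for illustration:

import os
import subprocess

# Before the step: make sure we start from a clean state.
assert not os.path.exists('/tmp/report.csv'), 'Report file already exists'

subprocess.check_output('touch /tmp/report.csv', shell=True)

# After the step: verify the command actually did what we expected.
assert os.path.exists('/tmp/report.csv'), 'Report file was not created'

Each assertion doubles as documentation of what the system is supposed to look like at that point in the procedure, which is exactly what the runbook's "expectations and checks" used to capture informally.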
Step 4: Add a dry-run mode, make it the default
The script you wrote may be performing a really major function. In the case I mentioned above, the script was failing over source control from one region to the next. It had a major impact on several systems, and it would have been a blunder to run it unintentionally. Therefore I made the script "dry-run" by default. When running in dry-run mode, your script should not perform any writes: it should not change the state of the underlying system. This way, if someone runs the script by mistake, it will not perform the operation.
Building a dry-run mode requires you to think, for each step, about which ones are writes and which ones are reads. As a shortcut, you can wrap the code that runs commands so that it is aware of the dry-run mode:
from dataclasses import dataclass
import os
import subprocess

import fire


def step(txt):
    print(txt)
    print('Press enter when done')
    input()


@dataclass(frozen=True)
class CommandRunner:
    dry_run: bool = True

    def __call__(self, command):
        if self.dry_run:
            print(f"Would run {command}")
        else:
            print(f"Running: {command}")
            subprocess.check_output(command, shell=True)


class SourceControlFailover(object):
    def failover(self, dry_run=True):
        run = CommandRunner(dry_run)
        print('Create an empty file for the report')
        run('touch /tmp/report.csv')
        if not dry_run:
            assert os.path.exists("/tmp/report.csv")
        step('Check disk space on your computer')
        step('Add the date to the report file')
        step('Log in to the primary server')
        step('Check the disk space')


if __name__ == '__main__':
    fire.Fire(SourceControlFailover)
You can run the script with:
python script.py failover
And when you are ready for the commands to run for real:
python script.py failover --dry-run=False
Here I used the Fire Python library, which makes it easy to build command-line tools.
Step 5: (Optional) Create a model of the system
Let's say that you automated a few steps but you are still scared of running the script. You should start testing it against a test environment (also called a QA or pre-prod environment). Another strategy you can reach for is building a model of how the system under test behaves, based on what you have observed. For example, you can create a model of a server and put in code the assumptions you have about the system. This idea is very similar to the fake Gmail service I used in my post Mail merge in 100 lines of clojure.
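As a sketch, a model of a server can be as small as a class that tracks the state your commands are expected to change. The FakeServer name and the commands it understands are assumptions made for illustration:

from dataclasses import dataclass, field

@dataclass
class FakeServer:
    # Encodes our assumptions about the real server's behavior.
    role: str = 'primary'
    files: set = field(default_factory=set)

    def run(self, command):
        # Only model the commands the failover script actually uses;
        # anything else is a sign the model (or the script) is wrong.
        if command.startswith('touch '):
            self.files.add(command.split(' ', 1)[1])
        elif command == 'demote':
            self.role = 'replica'
        else:
            raise ValueError(f'Unmodeled command: {command}')

# Drive the script's commands against the model instead of the real
# server, then assert on the resulting state.
server = FakeServer()
server.run('touch /tmp/report.csv')
server.run('demote')
assert server.role == 'replica'
assert '/tmp/report.csv' in server.files

Running the script against the model lets you exercise the full sequence of steps, including the scary ones, without touching any real system.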
Step 6: Convince others and go live
The hardest part of automating existing processes is convincing others that it will work and will be worth it. Running commands manually provides a false sense of safety, and giving control to automation feels extremely scary to some people. At Facebook, when I ran the script for cross-region failover for the first time, it felt scary: it felt like I could cause an incident, like I could be making a terrible mistake.
Years later, with dozens of such scripts written, I have yet to encounter a case where it was not worth automating. I hope that the approach above can convince you to automate some of the tasks that you are performing at work, the scarier the better. Happy automating!