These rather promising plans were rudely interrupted when the on-call engineer informed me that our alerting systems were doing their best Christmas tree impression, and it didn’t seem to be stopping. Hilarity ensued.
When the dust settled and the systems were purring along again, it was time to look back and draw a few lessons.
This doesn’t sound all that bad. It’s how and when things break that gets rough.
The recent AWS outage is a perfect example. The service going down didn’t reboot any instances, but that wasn’t the problem. The problem was the cascading failures.
The AWS outage meant that Heroku failed. That meant our app fell over. Moreover, it meant that many of the services we depend on fell over. Some of them were responsible for logging, monitoring, or exception handling. While those were important, they weren’t critical to continued operation.
No, the real problem was that our Redis service died. The Redis service was used to connect our tightly secured backend boxes to our frontend. When AWS and Heroku came back, our Redis service didn’t. This caused an interesting array of internal errors and we learned quite a bit from it.
Chiefly, though, we always approach our systems as if something can break. Because it can, and it will.
After the outage, when I went to re-deploy the secure backend systems, it quickly became obvious that something was wrong. It didn’t take long before I discovered some things had changed.
GitHub had disabled their old v2 API and forced everyone over to v3. The octokit version we were using was a year old and only worked with the v2 API. Our deployment scripts, in turn, only worked with the v2 API. Some quick and dirty debugging was in order.
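In our case the fix was essentially a dependency upgrade, but the broader lesson is to pin client libraries to versions you know match the API you're targeting. A minimal sketch using Bundler (the version constraint here is illustrative, not the exact one we used):

```ruby
# Gemfile — hypothetical constraint, shown for illustration only
source 'https://rubygems.org'

# Pin octokit to a release line known to speak the current API,
# so a casual `bundle update` can't silently pull in a breaking rewrite.
gem 'octokit', '~> 3.0'
```

Pinning won't stop a provider from retiring an API out from under you, but it turns the breakage into a deliberate upgrade you schedule, rather than a surprise at deploy time.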
An hour of sweat, tears, and cursing later I had something that worked. But the only thing that’s certain is change. So I’ll breathe easy only until the next unexpected change pops up.
You’re going to need them.