Fundamentals

Failure

An event in which a component becomes unavailable. Typically this does not result in loss, and only minor actions, such as restarting a server, are required to continue business.

A good metaphor is a car getting a flat tire: you replace the tire and continue driving.

One way to make systems resistant to Failure is to deploy components redundantly: if one component fails, the other keeps working and the platform remains available. This is also called Design for Failure.
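
The redundancy idea can be illustrated with a minimal sketch. The names used here (Component, ComponentUnavailable, call_with_failover, "server-a", "server-b") are hypothetical and chosen only for illustration; a real platform would more likely rely on a load balancer with health checks than on client-side retries.

```python
class ComponentUnavailable(Exception):
    """Raised when a component cannot serve a request (a Failure)."""


class Component:
    """Hypothetical stand-in for one redundantly deployed server."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ComponentUnavailable(self.name)
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each redundant replica in order and return the first success."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ComponentUnavailable:
            continue  # this replica has failed; try the next one
    raise ComponentUnavailable("all replicas failed")


# One of the two replicas is down, yet the request is still served.
replicas = [Component("server-a", healthy=False), Component("server-b")]
result = call_with_failover(replicas, "GET /status")
print(result)
```

With a single replica, the same Failure would make the platform unavailable; with two, the request is served by the component that is still working.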

Deploying redundant components across multiple Availability Zones helps make a platform robust against an entire datacenter going down.
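
As a rough sketch of why spreading across zones helps, the following distributes instances over zones in round-robin order and then checks what remains when one zone is lost. The zone names, instance names, and both functions are assumptions made for illustration and do not correspond to any cloud provider's API.

```python
from collections import defaultdict


def place_round_robin(instances, zones):
    """Assign instances to zones in turn so that no single zone holds them all."""
    placement = defaultdict(list)
    for index, instance in enumerate(instances):
        placement[zones[index % len(zones)]].append(instance)
    return dict(placement)


def survivors(placement, failed_zone):
    """Return the instances that are still running when one zone goes down."""
    return [
        instance
        for zone, members in placement.items()
        if zone != failed_zone
        for instance in members
    ]


# Hypothetical instance and zone names.
placement = place_round_robin(
    ["web-1", "web-2", "web-3", "web-4"], ["zone-a", "zone-b"]
)
remaining = survivors(placement, "zone-a")
print(placement)
print(remaining)
```

If all four instances were placed in a single zone, losing that zone would leave nothing running; spread over two zones, half of the instances survive.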