How to handle incidents, downtime and outages

Abstract:

100% uptime is impossible. Modern architectures are designed around failure, but what does that mean for the human side of incident management? This talk will consider how to prepare for outages, how to structure the response, and how those experiences and techniques differ between small and large companies.

Key topics will include:

On-call - rotations, scheduling, systems and policies
Preparing for downtime - teams, systems and product architecture
Documentation
Checklists and playbooks
How we actually handle incidents
Postmortems

This is a more formal talk based on a discussion session I ran with a couple of engineers from Yelp.

Speaker: Speaker 30