MANAGING YOUR HEROES
The People Aspect of Monitoring (a.k.a. Dealing with Outages and Failures)
Alex Solomon
[email protected]

WHO AM I?
Alex Solomon
• Founder / CEO of PagerDuty
• Intersect Inc.
• Amazon.com

DEFINITIONS
Service Level Agreement (SLA)
Mean Time To Resolution (MTTR)
Mean Time To Response
Mean Time Between Failures (MTBF)

OUTAGES
Can we prevent them?

PREVENTING OUTAGES
Single Points of Failure (SPOFs)
Redundant systems
Complex, monolithic systems
Service-oriented architecture

Netflix distributed SOA system

PREVENTING OUTAGES
Change (not much you can do about this one)

OUTAGES