Publish date: Apr 11, 2023
Duration: 17 min
Difficulty
Case details
Distributed systems are prone to unexpected failures that can quickly lead to customer-impacting events. Diagnosing and resolving problems rapidly and accurately, while also identifying the actual impact on customers, is often challenging. This talk provides a Site Reliability Engineering (SRE) perspective on operational patterns and techniques used to streamline the day-to-day operations of mission-critical systems. We present details of an AIOps bot that displays cluster health and infrastructure status and retrieves details on impactful events such as operations and changes to the cluster during