I’m excited to share a new whitepaper that describes the Power BI team’s approach to maintaining a reliable, performant, and scalable service for our customers.
It covers monitoring service health, mitigating incidents, managing releases, and acting on necessary improvements. We wrote this document to share knowledge with our customers, who often ask about our site reliability engineering practices. The intention is to offer transparency into how the Power BI team minimizes service disruption through safe deployment, continuous monitoring, and rapid incident response. The techniques described here also provide a blueprint for teams hosting service-based solutions to build foundational live site processes that are efficient and effective at scale.
As service owners, we need to make sure our customers can rely on Power BI for mission-critical work. That trust shows in the product's rapid growth: six straight years of triple-digit paid growth since its launch. Power BI is now used by 97% of Fortune 500 companies.
The improvements shown in the table below are the direct result of engineering, tools, and culture changes made by the Power BI team over the past few years.
|Metric|Before|After|Improvement|
|---|---|---|---|
|Time to Notify (TTN) Customers of Incidents – P75|110 min|14 min|87%|
|Time to Acknowledge (TTA) When Incidents Occur – P75|11 min|0.76 min|93%|
|Time to Mitigate (TTM) Issue – P50|49.3 min|2.8 min|94%|
|% Alerts Automated (Enrichment)|7%|88%|1,157%|
|% Alerts Mitigated w/o human intervention|0%|82%|New Capability|
|% Incidents Escalated to SMEs (Subject Matter Expert)|6.7%|0.34%|95%|
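
The last column of the table appears to be the relative change between the two values in each row, measured against the earlier value. The sketch below reproduces it under that assumption; the function name `percent_change` and the short row labels are illustrative and do not come from the whitepaper.

```python
def percent_change(before: float, after: float) -> float:
    """Size of the change from `before` to `after`, as a percentage of `before`."""
    return abs(after - before) / before * 100


# (before, after) pairs copied from the table; labels are shortened for illustration.
rows = {
    "TTN P75 (min)": (110, 14),
    "TTA P75 (min)": (11, 0.76),
    "TTM P50 (min)": (49.3, 2.8),
    "% alerts automated": (7, 88),
    "% incidents escalated": (6.7, 0.34),
}

# Round to whole percentages, as the table does.
improvements = {name: round(percent_change(b, a)) for name, (b, a) in rows.items()}

for name, pct in improvements.items():
    print(f"{name}: {pct}%")
```

The row for alerts mitigated without human intervention is left out because its starting value is 0%, so a relative change is undefined; the table labels it "New Capability" instead.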