Datacenter outages are less frequent and severe, but human error remains one of the most persistent challenges, with between two-thirds and four-fifths of major wobbles involving some element of meatbag-related cause.
According to the latest Annual Outage Analysis report from Uptime Institute, the overall picture is one of improving reliability, but with the sting in the tail that when failures do occur they can be significant and costly.
“Outages overall have slowed down,” said Andy Lawrence, Uptime executive director of research. “Datacenter operators are facing a growing number of external risks beyond their control, including power grid constraints, extreme weather, network provider failures and third-party software issues. And despite a more volatile risk landscape, improvements are occurring.”
Some 53 percent of operators reported an outage in the past three years, but this compares with 60 percent in 2022, 69 percent in 2021, and 78 percent in 2020. Just 9 percent of reported incidents during 2024 were classified as serious or severe, which is the lowest level yet recorded.
But preventing human error remains one of the major stumbling blocks in datacenter operations. Uptime says it views human error as a contributing factor rather than a root cause in outages, though it directly or indirectly plays a part in most of them.
Code changes, for example, played a part in several recent Microsoft incidents, such as problems with Azure cloud services in January and a Microsoft 365 outage in March.
Nearly 40 percent of organizations have suffered a major outage caused by human error over the past three years, the report says. Staff failing to follow procedures was a feature in 58 percent of those cases, with faulty processes or procedures to blame in 45 percent.
The problem is also on the increase, with the proportion of human error-related outages caused by failing to follow procedures up by 10 percentage points from last year. The reason for this may be the rapid growth seen by the datacenter industry recently and the resulting staff shortages in many regions, Uptime suggests.
To combat this, a greater focus on staff training and real-time operational support may reduce risks more effectively than improving documentation and processes, although these are still important.
Backing this up, 80 percent of operators told Uptime they believe that better management and processes might have prevented their organization’s most recent downtime disaster.
Power-related issues remain the leading cause of major outages. These account for more than half of all cases, while more than one in four respondents to the 2025 Uptime resiliency survey reported that a serious or severe IT outage was caused by a power glitch within the past three years.
The most frequent factor in these is UPS failure – something that recently led to a six-hour blackout at Google Cloud services in the US east zone.
Other elements in the power chain can also cause issues, such as intermittent faults in the supply and mismanaged or misconfigured failover to generators.
Grid instability is also listed as a growing concern by Uptime. Rising demand, aging infrastructure, extreme weather, and the variability of renewable energy sources may increase the frequency of power disruptions – making robust on-site systems even more essential. Datacenters near London’s Heathrow airport managed to remain operational despite a power outage that closed the airport and disrupted a large number of flights in March.
Overall, investments in resiliency and the diligence of operators tell a success story, having led to a reduction in the severity and frequency of outages relative to the overall growth in online services.
However, Uptime warns the rising complexity of these environments, driven by AI, automation, and integration between IT and OT systems, is increasing exposure to operational errors and cybersecurity threats. ®