
The Evolution of SRE at Google
An example of this phenomenon occurred at Google in 2021. To maximize efficiency, we also monitor the amount of quota used by each software service. If a service consistently uses fewer resources than its quota, we automatically reduce the quota. In STPA terminology, this quota adjuster has the control action of reducing a service’s quota. From a safety perspective, we ask when this control action is unsafe. For example, if the quota adjuster lowered a service’s quota below the service’s actual needs, that would be unsafe: the service would be starved of resources. This is what STPA calls an unsafe control action (UCA).
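To make this control action concrete, here is a minimal sketch of such a quota-adjustment loop in Python. The function name, constant, and the 20% headroom figure are illustrative assumptions, not Google’s actual implementation.

    # Minimal sketch of the quota-adjuster control loop described above.
    # All names and numbers here are illustrative, not the real system.
    HEADROOM = 1.2  # keep 20% headroom above the observed peak usage

    def propose_new_quota(current_quota, recent_usage):
        """Control action: shrink the quota when usage stays well below it."""
        peak = max(recent_usage)
        if all(sample < current_quota for sample in recent_usage):
            # Reduce the quota toward what the service appears to need.
            return min(current_quota, peak * HEADROOM)
        return current_quota

    # Example: a service holding 1,000 units of quota but using only ~400.
    print(propose_new_quota(1000, [380, 410, 395, 402]))  # -> 492.0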
STPA analyzes every interaction in a system to comprehensively determine how those interactions must be controlled to keep the system safe. Unsafe control actions can push the system into one or more hazardous states. There are only four possible types of UCA:
- The required control action is not provided.
- An incorrect or inadequate control action is provided.
- A control action is provided at the wrong time or in the wrong sequence.
- A control action is stopped too soon or applied for too long.
This particular unsafe control action (reducing the allocated quota below what the service requires) is an example of the second type of UCA.
Simply identifying such an unsafe control action is only partially useful on its own. If it is unsafe for the quota adjuster to reduce the allocated quota when the service needs it, then the system must prevent that behavior; that is, the quota adjuster must not reduce the allocated quota below what the service currently needs. This is a safety requirement. Safety requirements are useful for shaping future designs, detailing test plans, and helping people understand the system. Let’s be honest: even mature software systems can behave in undocumented, unclear, and surprising ways.
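One way to act on such a safety requirement is to encode it as an explicit guard on the control action. The sketch below is only an illustration: the names and the 10% safety margin are invented for exposition, not taken from the real quota adjuster.

    # Hypothetical guard enforcing the safety requirement: never reduce a
    # quota below what the service currently needs (plus a margin).
    SAFETY_MARGIN = 1.1  # assumed 10% margin above the estimated need

    class UnsafeQuotaReduction(Exception):
        """Raised when a proposed quota would starve the service."""

    def apply_quota_reduction(proposed_quota, estimated_need):
        """Reject any reduction that violates the safety requirement."""
        floor = estimated_need * SAFETY_MARGIN
        if proposed_quota < floor:
            raise UnsafeQuotaReduction(
                f"proposed quota {proposed_quota} is below the floor {floor}")
        return proposed_quota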
Nonetheless, what we really want is to predict all of the specific scenarios that could lead to hazardous states. Fortunately, STPA offers a simple yet comprehensive way to structure the analysis and find every scenario that could cause the quota adjuster to violate this safety requirement.
For the quota adjuster, we can look at four classes of scenarios:
- Scenarios in which the quota adjuster itself behaves incorrectly.
- Scenarios in which the quota adjuster receives incorrect feedback (or no feedback at all).
- Scenarios in which the quota system never receives a control action from the quota adjuster (even though the adjuster attempts to deliver it).
- Scenarios in which the quota system behaves incorrectly.
While analyzing the quota adjuster, one specific scenario quickly came to mind. The adjuster receives feedback from the quota system about each service’s current resource usage. As implemented, computing current resource usage is complex, involving several different data collectors and some tricky aggregation logic. What if something goes wrong in this complex calculation and the resulting value is too low? The quota adjuster would react exactly as designed and reliably shrink the service’s quota toward the incorrectly low usage level.
This is exactly the disaster we want to prevent.
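One plausible mitigation for this class of scenario is to validate the feedback before acting on it. The sketch below, with assumed names and thresholds, refuses to act on an aggregated usage figure that is incomplete or that drops implausibly fast between readings.

    # Hypothetical sanity check on the feedback path: if the aggregated
    # usage looks wrong, the adjuster takes no action this cycle.
    MAX_PLAUSIBLE_DROP = 0.5  # assumed: usage should not halve in one step

    def validated_usage(collector_readings, previous_usage):
        """Return the aggregated usage only if the feedback looks sane."""
        if any(reading is None for reading in collector_readings):
            return None  # incomplete feedback: do not act on it
        total = sum(collector_readings)
        if previous_usage > 0 and total < previous_usage * MAX_PLAUSIBLE_DROP:
            return None  # implausible drop: likely an aggregation bug
        return total

    # Example: one collector failed to report, so no adjustment is made.
    print(validated_usage([120.0, None, 85.0], previous_usage=400.0))  # None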
Until that point, a great deal of attention had been paid to making the quota adjustment algorithm correctly and reliably produce the right output, namely the control action that adjusts a service’s quota. The feedback path, including the service’s current resource usage, was far less well understood.
This highlights one of the key advantages of STPA: by looking at the system hierarchy and modeling the system in terms of control feedback loops, we find problems on the control path and on the feedback path. As we run STPA on more and more systems, we find that the feedback path is often not as well understood as the control path, yet it is equally important to system safety.
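As a toy illustration of that framing, the quota example can be written down as a control loop with an explicit control path and feedback path. The structure and names below are purely expository.

    # Toy model of the STPA framing: a controller issues control actions
    # over a control path and receives observations over a feedback path.
    from dataclasses import dataclass, field

    @dataclass
    class ControlLoop:
        controller: str
        controlled_process: str
        control_actions: list = field(default_factory=list)
        feedback: list = field(default_factory=list)

    quota_loop = ControlLoop(
        controller="quota adjuster",
        controlled_process="quota system",
        control_actions=["reduce service quota"],
        feedback=["aggregated current resource usage"],
    )

    # STPA asks the same questions of both halves of the loop: the feedback
    # list deserves as much scrutiny as the control actions.
    print(quota_loop)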
As we dug into the adjuster’s feedback paths, we saw many opportunities to improve them. None of these changes looked like traditional reliability solutions, and none of them came down to managing the adjuster with different SLOs and error budgets. Instead, the solutions arose in other parts of the system and involved redesigning parts of the stack that had previously seemed unrelated, again a strength of STPA’s systems theory approach.
In the 2021 incident, erroneous feedback about the resources used by a critical service in Google’s infrastructure was delivered to the adjuster. The quota adjuster calculated a new quota that allocated far fewer resources than the service actually used. As a precaution, the quota cut was not applied immediately but was delayed for several weeks, to give someone time to intervene in case the new quota was wrong.
Of course, major incidents are never simple: the next problem was that, even though the delay had been added as a safety feature, feedback about the pending change was never sent to anyone. The system sat in this hazardous state for weeks, but because we were not looking for it, we missed the opportunity to prevent the damage that followed. A few weeks later, the quota reduction caused a severe outage. Using STPA, we can foresee similar issues across many different systems at Google.
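The missing feedback described here could have been supplied by notifying someone whenever a large cut is queued behind the safety delay, so that the delay window is actually usable for intervention. The sketch below assumes a hypothetical notify channel and illustrative thresholds.

    # Hypothetical scheduling of a delayed quota cut that also emits
    # feedback about the pending change, so a human can intervene.
    import datetime

    SAFETY_DELAY = datetime.timedelta(weeks=2)   # assumed delay window
    LARGE_CUT_FRACTION = 0.25                    # assumed: cuts >25% get a notification

    def schedule_quota_cut(service, old_quota, new_quota, notify):
        """Queue the cut behind the delay and announce it if it is large."""
        apply_after = datetime.datetime.now(datetime.timezone.utc) + SAFETY_DELAY
        if new_quota < old_quota * (1 - LARGE_CUT_FRACTION):
            notify(f"Pending quota cut for {service}: {old_quota} -> {new_quota}, "
                   f"applies after {apply_after:%Y-%m-%d}")
        return {"service": service, "new_quota": new_quota,
                "apply_after": apply_after}

    # Example: route the notification to print; in practice it might page
    # or file a ticket for the service owners.
    schedule_quota_cut("critical-service", 1000, 300, notify=print)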
As Leveson writes in her book Engineering a Safer World: “In [STAMP], understanding why an accident occurred requires determining why the controls were ineffective. Preventing future accidents requires a shift from a focus on preventing failures to the broader goal of designing and implementing controls that enforce the necessary constraints.” This shift in perspective, from trying to prove that nothing will go wrong to actively managing known and potential hazards, is a key tenet of our approach to system safety.