
Quick takes on the recent OpenAI public incident write-up
OpenAI recently published a public write-up about an incident they had on December 11th, and there are a lot of great details in it. Here are some of my off-the-cuff observations:
Saturation
With thousands of nodes performing these operations simultaneously, the Kubernetes API servers became overwhelmed, taking down the Kubernetes control plane in most of our large clusters.
The term saturation describes a situation where a system has reached the limit of its processing capacity; this is sometimes also called overload or resource exhaustion. In the OpenAI incident, the Kubernetes API servers became saturated because they received too much traffic. Once that happened, the API servers stopped functioning properly, and as a result their DNS-based service discovery mechanism eventually failed.
Saturation is an extremely common failure mode in incidents, and OpenAI has given us another example. You can also read some of my previous posts covering public incidents that involved saturation: Yunyao, Rogers, and Slack.
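To see why saturation is such a cliff, here's a tiny sketch of my own (not anything from OpenAI's write-up, and with made-up numbers): a server that can drain a fixed number of requests per second looks completely healthy right up until the arrival rate crosses its capacity, and then the backlog grows without bound.

```python
# A minimal, illustrative simulation of saturation. The capacity and arrival
# rates below are invented for the sake of the example.

def simulate(arrival_rate, service_rate, seconds=60):
    """Discrete-time simulation: requests arrive and are drained each second."""
    backlog = 0
    for _ in range(seconds):
        backlog += arrival_rate                 # new requests this second
        backlog -= min(backlog, service_rate)   # server drains up to its capacity
    return backlog

SERVICE_RATE = 1_000  # hypothetical API-server capacity, requests/second

for arrival in (500, 900, 1_000, 1_100, 2_000):
    backlog = simulate(arrival, SERVICE_RATE)
    status = "healthy" if backlog == 0 else f"saturated, backlog={backlog}"
    print(f"arrival={arrival:>5}/s -> {status}")
```

Below capacity, the backlog stays at zero no matter how long you run it; just past capacity, the backlog (and therefore latency) climbs forever. That's why a saturated system degrades so suddenly rather than gracefully.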
All tests passed
The change was tested in a staging cluster, where no issues were observed. The impact was specific to clusters exceeding a certain size, and our DNS caching on each node delayed visible failures long enough for the rollout to continue.
One of the reasons saturation-related incidents are so difficult to prevent is that the software can be functionally correct, in the sense that it passes all of the functional tests, and the failure mode only rears its ugly head when the system is exposed to conditions that only occur in production. Even canarying on production traffic won't prevent problems that only show up under full load.
Prior to the deployment, our main reliability concern was the resource consumption of the new telemetry service. Before deployment, we evaluated resource utilization metrics (CPU/memory) across all clusters to ensure that the deployment wouldn't disrupt running services. While resource requests were tuned on a per-cluster basis, no precautions were taken to assess Kubernetes API server load. The rollout process monitored service health but lacked adequate cluster health monitoring protocols.
Note that the engineers did check for changes in resource utilization on the clusters where the new telemetry configuration was deployed. The problem was an interaction: the new configuration increased the load on the API servers, which brings us to the next point.
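Here's a back-of-the-envelope sketch of why this kind of problem sails through testing. The numbers are entirely invented, but the shape is the point: a per-node load that's trivial in a small test cluster becomes overwhelming in aggregate at production scale.

```python
# Illustrative only: hypothetical per-node telemetry load and a hypothetical
# control-plane capacity, neither of which comes from OpenAI's write-up.

PER_NODE_API_QPS = 2         # requests/second each node's telemetry agent sends (assumed)
API_SERVER_CAPACITY = 2_000  # requests/second the control plane can absorb (assumed)

for nodes in (50, 500, 1_000, 5_000):
    aggregate_qps = nodes * PER_NODE_API_QPS
    verdict = "OK" if aggregate_qps <= API_SERVER_CAPACITY else "control plane overwhelmed"
    print(f"{nodes:>5} nodes -> {aggregate_qps:>6} req/s (capacity {API_SERVER_CAPACITY}) -> {verdict}")
```

A 50-node staging cluster barely registers, while a several-thousand-node cluster blows past the capacity line, even though every individual node is behaving exactly as tested.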
Complex, unexpected interactions
It’s a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways.
When we look at how a system failed, we tend to look for problems with individual components. But in complex systems, it's the complex, unexpected interactions that give us a better understanding of how the failure happened. You don't just want to look at the boxes, you want to look at the arrows as well.
In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
“So we deployed a new telemetry service and, yada yada yada, our services could no longer call each other.”
In this case, the surprising interaction was between the failure of the Kubernetes API and the resulting failure of the services running on top of Kubernetes. In general, if you have a service running on top of Kubernetes and the Kubernetes API becomes unhealthy, your service should stay up and running; you just won't be able to make changes to the current deployment (e.g., deploy new code or change pods). In this case, however, a failure in the Kubernetes API (the control plane) ultimately caused the behavior of the running services (the data plane) to fail.
The coupling between the two? It was DNS.
DNS
In short, the root cause was a new telemetry service configuration that unexpectedly generated massive Kubernetes API load across large clusters, overwhelming the control plane and breaking DNS-based service discovery.
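Here's a minimal sketch of that coupling, with made-up class and service names rather than anything from Kubernetes or OpenAI's systems: the data-plane call only works as long as the service name resolves, and the thing doing the resolving ultimately gets its data from the control plane.

```python
# Illustrative stand-ins only: in a real cluster, the cluster DNS builds its
# records from data it watches in the Kubernetes API server.

class ClusterDNS:
    """Toy stand-in for the cluster DNS, fed by the control plane."""
    def __init__(self, control_plane_healthy=True):
        self.control_plane_healthy = control_plane_healthy
        self.records = {"payments.internal": "10.0.0.7"}  # hypothetical service record

    def resolve(self, name):
        if not self.control_plane_healthy:
            # No fresh data from the (saturated) API server: lookups fail.
            raise LookupError(f"cannot resolve {name}: control plane unavailable")
        return self.records[name]


def call_service(dns, name):
    """A data-plane request: resolve the peer by name, then 'call' it."""
    try:
        addr = dns.resolve(name)
        return f"request to {name} ({addr}) succeeded"
    except LookupError as err:
        return f"request to {name} failed: {err}"


print(call_service(ClusterDNS(control_plane_healthy=True), "payments.internal"))
print(call_service(ClusterDNS(control_plane_healthy=False), "payments.internal"))
```

The running workloads never call the Kubernetes API directly, yet they still fall over when it does, because name resolution is the hidden arrow connecting the two.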
The effects of changes are spread out over time
DNS caching increased the delay between the change being made and services starting to fail.
The nature of DNS caching is one of the reasons why DNS-related incidents can be so difficult to deal with.
Diagnosing which change is responsible becomes more difficult when the effects of the change are spread out over time. This is especially true when the critical functionality that stopped working (service discovery, in this case) is different from the thing that was changed (the telemetry service deployment).
DNS caching made the issue less visible until the rollouts had begun fleet-wide.
In this case, the impact was spread out over time because of the nature of DNS caching. But we often deliberately spread changes out over time: if the change we're rolling out turns out to be a breaking one, we want to limit the blast radius. That works great if we detect the problem during the rollout. However, it also makes the problem harder to detect, because the error signal is smaller (by design!). And if we only detect the problem after the rollout is complete, it can be difficult to connect cause and effect, because the effects are smeared out over time.
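Here's a toy simulation of how caching smears the failure out in time (the TTL and node count are invented): each node keeps working off its cached answer until that entry expires, so failures trickle in instead of all showing up at the moment of the breaking change.

```python
# Illustrative only: each node has some amount of TTL remaining on its cached
# DNS record at the moment DNS breaks, and only fails once that runs out.

import random

random.seed(0)

TTL_SECONDS = 300   # hypothetical cache TTL
NODES = 20

# Seconds of TTL remaining on each node's cached record when DNS breaks.
remaining_ttl = [random.uniform(0, TTL_SECONDS) for _ in range(NODES)]

for minute in range(6):
    t = minute * 60
    failed = sum(1 for ttl in remaining_ttl if ttl <= t)
    print(f"t=+{minute} min after the breaking change: {failed}/{NODES} nodes failing")
```

At the moment of the change nothing looks wrong, a few nodes fail a minute later, and only after the full TTL has elapsed is the whole fleet affected, which is exactly the window in which a rollout can keep marching forward.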
The failure mode makes remediation more difficult
In order to make the fix, we needed to access the Kubernetes control plane, which we could not do due to the increased load on the Kubernetes API servers.
Sometimes a failure mode that takes out the systems that production depends on also takes out the systems that operators depend on to do their work. I think James Mickens said it best when he wrote:
I have no tools because I’ve destroyed my tools with my tools.
Facebook ran into a similar problem during its major outage in 2021:
As our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
These kinds of problems often require operators to improvise solutions in the moment. The OpenAI engineers used multiple strategies to get the system healthy again.
We identified the issue within a few minutes and immediately began working multiple workstreams to explore different ways to bring our clusters back online quickly:
- Scaling down cluster size: reduced the aggregate Kubernetes API load.
- Blocking network access to the Kubernetes admin APIs: prevented new expensive requests, giving the API servers time to recover.
- Scaling up the Kubernetes API servers: increased the resources available to handle pending requests, allowing us to apply the fix.
By pursuing all three in parallel, we eventually regained enough control to remove the offending service.
Their interventions were successful, but it's easy to imagine a scenario where one of those interventions accidentally made things worse. As Richard Cook reminds us, all practitioner actions are gambles. Incidents always involve uncertainty in the moment, which is easy to miss when we look back on the past with full knowledge of how events actually unfolded.
A change intended to improve reliability
As part of our efforts to improve reliability across the organization, we have been working to improve our cluster-wide observability tools to provide enhanced visibility into system status. At 3:12 PM PST, we deployed a new telemetry service to collect detailed Kubernetes control plane metrics.
This is a good example of a subsystem whose primary purpose is to improve reliability behaving in unexpected ways. It's also another data point for my conjecture about why reliable systems fail.