Post-Mortem: GCP/Cloudflare Outage and the Importance of Humility
I say it to clients all the time - “You work in IT… Failure is not just an option, it’s a fact of life.” Anyone who has worked in IT knows what I’m talking about. Things just…happen…and we can’t control it. Disaster Recovery is a huge undertaking, but it’s also a critical part of your IT strategy regardless of what field you work in. And yes - every company is an IT company now. But I digress…
On Thursday, June 12, 2025, it seemed like the internet went down and took a lot of services with it. This outage highlights that, no matter how much “High Availability” you try to bake into your IT infrastructure, outages are far from impossible. This is why prioritizing Disaster Recovery over High Availability is important in your IT strategy. Services like Spotify, Google, Cloudflare, etc. all boast their Highly Available setups, but they value their Disaster Recovery plans more. Outages cost money - period.
One thing I like to do is learn from these incidents. It makes me a better IT consultant when I can look at what a large provider like Google or Cloudflare did to respond to their outages and see how I can improve my thought process for a Disaster Recovery plan implementation. As I reviewed Cloudflare’s blog post on the outage, one thing really stood out to me - humility.
But first…
What Happened?
Since I noticed that Google Cloud was experiencing an outage, I looked at their status page for an Incident Report. I found one for June 12 named “Multiple GCP products are experiencing Service issues.” Seems about right! So what happened? I’m going to let Google explain…
On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash.
Let’s pause here. I’ve added a little emphasis to a key statement - “The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.” The way Google tends to roll out features is with feature flags: ship the binary with the new code path dark, then gradually enable the flag region by region to control the blast radius of any bug that slipped through testing. In this case, the new feature added additional quota policy checks to Service Control - the management and control plane responsible for ensuring that API requests are properly authorized and have the proper policy and “other appropriate checks” required to meet the API management endpoints’ requirements. Service Control is a regional service with its own regional databases. Data gets replicated “almost instantly globally” so that Service Control can manage quota policies for GCP services and GCP’s customers.
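Google’s report doesn’t include any code, but the pattern they’re describing - a new code path gated behind a feature flag, with real error handling for malformed data - is worth sketching out. Here’s a rough illustration in Go. Everything in it (the Policy struct, the flag, checkQuotaPolicy) is hypothetical; it shows the shape of the safeguard, not Google’s actual implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Policy is a hypothetical stand-in for a quota policy record pulled from
// the regional datastore. Quota is a pointer so it can be "blank" (nil),
// like the unintended blank fields in the incident.
type Policy struct {
	Name  string
	Quota *int
}

// quotaPolicyChecksEnabled stands in for a real feature-flag service. In a
// region-by-region rollout this gets flipped on gradually, and flipped off
// again (the "red button") if the new path misbehaves.
var quotaPolicyChecksEnabled = false

// checkQuotaPolicy is the hypothetical new code path. Because it validates
// its input and returns an error instead of dereferencing a nil pointer, a
// malformed policy degrades one request instead of crashing the whole binary.
func checkQuotaPolicy(p *Policy) error {
	if !quotaPolicyChecksEnabled {
		return nil // feature is dark: skip the new checks entirely
	}
	if p == nil || p.Quota == nil {
		return errors.New("policy has blank quota fields; rejecting the check")
	}
	if *p.Quota < 0 {
		return fmt.Errorf("invalid quota %d for policy %q", *p.Quota, p.Name)
	}
	return nil
}

func main() {
	quotaPolicyChecksEnabled = true // imagine the regional rollout reaching us

	bad := &Policy{Name: "example-policy"} // Quota left nil, like the blank fields
	if err := checkQuotaPolicy(bad); err != nil {
		// Reject the request (or fail open, per policy) - the process keeps serving.
		fmt.Println("quota check failed:", err)
	}
}
```

Either safeguard on its own would have softened this incident: the flag limits the blast radius while the rollout is still in flight (and doubles as the red button), and the validation turns a bad policy row into a rejected request instead of a crashed binary.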
As it turns out, this new feature in Service Control was what I like to call a timebomb - it works great, no issues at all… until it doesn’t. That “until it doesn’t” moment arrived on June 12, 2025…
On June 12, 2025 at ~10:45am PDT, a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds. This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment.
And… kaboom.
There it is. The Service Control binary entered a crash loop due to the lack of error handling. Worse yet - their data replication system had shipped this new policy globally within seconds, meaning Service Control binaries in every region were crash-looping almost simultaneously. But… they built a “red button” into this service, right? Well - sort of…
Within 2 minutes, our Site Reliability Engineering team was triaging the incident. Within 10 minutes, the root cause was identified and the red-button (to disable the serving path) was being put in place. The red-button was ready to roll out ~25 minutes from the start of the incident. Within 40 minutes of the incident, the red-button rollout was completed, and we started seeing recovery across regions, starting with the smaller ones first.
Turns out - the “red button” took about 40 minutes to fully roll out. That’s not terrible considering Google’s scale, but it’s not great either. And by then, a lot of damage had already been done in their larger regions…
…as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load. At that point, Service Control and API serving was fully recovered across all regions. Corresponding Google and Google Cloud products started recovering with some taking longer depending upon their architecture.
Service Control effectively DoS’d itself here by spinning up tasks without a good backoff policy, which overloaded the database it relies on and further compounded the problem. The good news is that these databases (and services) are regional, so Google was able to temporarily route traffic to multi-regional databases to get back online while they throttled Service Control’s task creation. It worked - Google’s engineers saved the day. After some additional recovery time, their customers were eventually able to recover and all was right with the world.
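Google specifically called out the missing “randomized exponential backoff.” The idea is simple: when thousands of tasks restart and retry at the same time, they stampede the very dependency that’s trying to recover, so each retry should wait an exponentially growing, randomly jittered amount of time. A minimal sketch of the pattern - not Google’s code, and the numbers are invented:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithJitter retries op with exponential backoff plus "full jitter":
// each wait is a random duration between 0 and the current cap, so a fleet
// of restarting tasks spreads its retries out instead of stampeding the
// shared datastore the moment it comes back.
func retryWithJitter(op func() error, maxAttempts int, base, maxBackoff time.Duration) error {
	backoff := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
		}
		// Sleep a random duration in [0, backoff), then grow the cap.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return nil
}

func main() {
	attempts := 0
	// A stand-in for "load policies from the regional datastore".
	loadPolicies := func() error {
		attempts++
		if attempts < 4 {
			return errors.New("datastore overloaded")
		}
		return nil
	}
	if err := retryWithJitter(loadPolicies, 6, 100*time.Millisecond, 5*time.Second); err != nil {
		fmt.Println("failed:", err)
		return
	}
	fmt.Println("recovered after", attempts, "attempts")
}
```

The jitter matters as much as the exponent - without randomization, every restarting task still retries at roughly the same instants and the herd just arrives in waves.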
Wait - I thought we were talking about Cloudflare?
I want to look at Cloudflare’s post-mortem writeup now.
On June 12, 2025, Cloudflare suffered a significant service outage that affected a large set of our critical services, including Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile and Challenges, AutoRAG, Zaraz, and parts of the Cloudflare Dashboard.
Cloudflare runs their products as a series of services, and those services consume other Cloudflare services. Notice the first one in that list - Workers KV. Well…
The cause of this outage was due to a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and relied upon for configuration, authentication and asset delivery across the affected services. Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted availability of our KV service.
Remember Google’s ticking timebomb? Google is the “third-party cloud provider” here. But Cloudflare goes on to say something that shows a lot of humility:
this was a failure on our part, and while the proximate cause (or trigger) for this outage was a third-party vendor failure, we are ultimately responsible for our chosen dependencies and how we choose to architect around them.
Cloudflare could 100% have thrown Google under the proverbial bus here - and would have had every right to. Instead, they chose to own the failure - “we are ultimately responsible.” And this is an important point - Cloudflare is responsible for how Cloudflare architects their services, including their dependencies. They made a choice to use GCP as a dependency, and they are responsible for that decision. They made a choice to have many of their other services depend on Workers KV, and they are responsible for that decision. And they are owning it here. Did they cause this outage? No. Absolutely not. But their choices caused their outage, and they’re choosing to own that. Humility goes a long way in the IT industry, and owning this as “a failure on our part” shows that Cloudflare absolutely understands that their own architecture is what caused their outage. This is really cool, and something that needs to be called out.
Their writeup goes on to describe how many of their services rely on the Workers KV service, with details on how the Workers KV outage impacted each of them. It’s a long list. They then go into a bit more detail on how Workers KV is architected, and why this GCP outage caused them to have problems with the Workers KV service:
Workers KV is built as what we call a “coreless” service which means there should be no single point of failure as the service runs independently in each of our locations worldwide. However, Workers KV today relies on a central data store to provide a source of truth for data. A failure of that store caused a complete outage for cold reads and writes to the KV namespaces used by services across Cloudflare.
See that - Workers KV relies on a central data store. Again - another timebomb. And they knew it:
Workers KV is in the process of being transitioned to significantly more resilient infrastructure for its central store: regrettably, we had a gap in coverage which was exposed during this incident. Workers KV removed a storage provider as we worked to re-architect KV’s backend, including migrating it to Cloudflare R2, to prevent data consistency issues (caused by the original data syncing architecture), and to improve support for data residency requirements.
Again - humility. Cloudflare is owning up to this: “we had a gap in coverage which was exposed during this incident.” Nowhere are they blaming their third-party provider here - they are owning it. And they’re putting it on full display for everyone: we messed up. I wish more companies had this sort of transparency when they mess up, and I wish more places were willing to “take the L” in the interest of accountability. Don’t get me wrong - Google did a decent job in their writeup, but it’s nowhere near as transparent as this.
The writeup then goes into the incident timeline and impact, followed by the ongoing remediation work, which “…encompasses several workstreams, including efforts to avoid singular dependencies on storage infrastructure we do not own, improving the ability for us to recover critical services (including Access, Gateway and WARP).” One thing they recognized as a result of this incident is how many critical services have either direct or transitive dependencies on the Workers KV service, and one of their remediation tasks addresses this:
- Short-term blast radius remediations for individual products that were impacted by this incident so that each product becomes resilient to any loss of service caused by any single point of failure, including third party dependencies.
Another thing they pointed out is that, like Google, they could have DoS’d themselves - and they are working to proactively prevent this:
- Implementing tooling that allows us to progressively re-enable namespaces during storage infrastructure incidents. This will allow us to ensure that key dependencies, including Access and WARP, are able to come up without risking a denial-of-service against our own infrastructure as caches are repopulated.
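Cloudflare doesn’t describe what that tooling will look like, but the general shape of “progressively re-enable so you don’t DoS yourself” is easy to picture: bring namespaces back in small batches and let caches warm before admitting the next wave. A hypothetical sketch - the batch size, the pause, and the namespace names are all made up:

```go
package main

import (
	"fmt"
	"time"
)

// reenableInWaves turns namespaces back on in small batches, pausing between
// waves so the origin storage sees a trickle of cache-fill traffic instead of
// every cold cache repopulating at once.
func reenableInWaves(namespaces []string, batchSize int, pause time.Duration, enable func(string)) {
	for start := 0; start < len(namespaces); start += batchSize {
		end := start + batchSize
		if end > len(namespaces) {
			end = len(namespaces)
		}
		for _, ns := range namespaces[start:end] {
			enable(ns) // flip the namespace back to "serving" and let its cache warm
		}
		if end < len(namespaces) {
			time.Sleep(pause) // give the backing store room before the next wave
		}
	}
}

func main() {
	// Critical dependencies first, everything else after - the ordering is a policy choice.
	namespaces := []string{"access-config", "warp-config", "images-meta", "stream-meta", "turnstile-keys"}
	reenableInWaves(namespaces, 2, 500*time.Millisecond, func(ns string) {
		fmt.Println("re-enabled namespace:", ns)
	})
}
```

In a real system the “enable” step would be a config push or flag flip per namespace, and the pause would probably be driven by health metrics rather than a fixed timer - but the staging is the point.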
They then conclude their post-mortem analysis by once again owning this outage:
This was a serious outage, and we understand that organizations and institutions that are large and small depend on us to protect and/or run their websites, applications, zero trust and network infrastructure. Again we are deeply sorry for the impact and are working diligently to improve our service resiliency.
What I learned from Cloudflare
There’s a lot to unpack from this incident, and a quick blog post summarizing what happened isn’t going to do it justice. But there are a few takeaways I got from reading these two post-mortem writeups:
- Own your mistakes. Take your medicine. “Take the L” as the younger generation likes to put it. Don’t try to bury your mistakes. Cloudflare was a victim of Google’s mistake here, and rather than tossing the blame back on Google they essentially chose to say “Our outage is our fault” and highlight how they could have done better. In fact, only one time in their post-mortem do they mention that their “third-party cloud provider” had an outage - “Part of this infrastructure is backed by a third-party cloud provider, which experienced an outage today and directly impacted availability of our KV service.” - and it comes right before they say “…this was a failure on our part….” I cannot express this enough - your dependencies’ vulnerabilities are YOUR vulnerabilities, and YOU are responsible for them. Cloudflare embodied this here - “…we are ultimately responsible for our chosen dependencies….” They also never mentioned the “third-party cloud service provider” outage in their incident response timeline - just what they did to fix it.
- Your infrastructure should be resilient. This one should be obvious, but consider this - is your application’s storage service distributed? What about authentication - does your authentication service rely on a single source of truth? How is your data replicated? Look for the failure points in your application, then look at how you’re prepared to mitigate the risk of each single point of failure actually failing. High Availability means not only globally distributed infrastructure, it means redundant services. Google Cloud went down, but had Cloudflare already been using their own R2 storage service alongside Google Cloud Storage, they may not have had an outage at all. They may not have even been operating in a degraded state! Do you have a “Plan C” for when your cloud provider drops off the map completely due to an outage? Cloudflare acknowledged this in their post-mortem a couple of times across the “what happened” section and the “follow-up steps” section. (A rough sketch of what that kind of storage fallback could look like follows this list.)
- Understand the impacts of your service dependency tree. One cool thing Cloudflare did in their writeup is walk through how their services rely on Workers KV, including the services that were impacted because Workers KV is a transitive dependency. They didn’t just list the services that experienced outages, they explained why they experienced outages. Is the average customer going to care? Probably not, but it shows that Cloudflare took the time to outline why a service like Browser Rendering would be down - it relies on Browser Isolation, which relies on Gateway, which experienced issues due to a partial reliance on Workers KV. Why did the Dashboard fail? Because it depends on services which depend on Workers KV. (Another fun fact here - the management API was never down, just the dashboard.) You get the hint - they described the failure chain, which always terminated at Workers KV being down. (See the dependency-graph sketch after this list.)
- Never rely on a single service provider. Did I just recommend multi-cloud? Yes, in the same way I just recommended your office have multiple ISPs. It might feel like a waste until your single service provider is down and you’re scrambling. I like to tell people to always have a “Plan C” - Plan B isn’t good enough for me; Plan B doesn’t look far enough into the future. Have a backup plan for your backup plan. I remember being on a DevOps team where the deployment strategy for the application was written 100% in PowerShell. They used Azure DevOps to deploy the application, and all Azure DevOps was doing was running those PowerShell scripts. I remember questioning this strategy and was met with “because if Azure DevOps goes down, we can still deploy the application. It’ll take longer and it’s way less convenient, but we can deploy it.” Sure enough, one night during a release Azure DevOps experienced an outage - and we were still able to deploy the application. NASA did this with the Apollo program - they had a backup plan to get the lunar module off the surface of the moon involving basically jumper cables and a pair of bolt cutters (no, really, I’m not making that up). Have a backup plan for your backup plan. If you plan for an emergency, it’s no longer an emergency.
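To make the “redundant services” point from the second takeaway concrete, here’s a rough sketch of a read path that tries a primary object store and falls back to a secondary store from a different provider. The ObjectStore interface and the fake backends are mine, purely for illustration - this is not how Workers KV, R2, or Google Cloud Storage actually work under the hood:

```go
package main

import (
	"errors"
	"fmt"
)

// ObjectStore is a hypothetical abstraction over any object storage backend
// (you could imagine implementations backed by R2, GCS, S3, and so on).
type ObjectStore interface {
	Get(key string) ([]byte, error)
	Name() string
}

// failoverStore reads from a primary store and, if that fails, falls back to
// a secondary store run by a different provider. The single point of failure
// becomes "both providers are down at the same time".
type failoverStore struct {
	primary, secondary ObjectStore
}

func (f failoverStore) Get(key string) ([]byte, error) {
	val, err := f.primary.Get(key)
	if err == nil {
		return val, nil
	}
	fmt.Printf("primary (%s) failed for %q: %v; trying secondary (%s)\n",
		f.primary.Name(), key, err, f.secondary.Name())
	return f.secondary.Get(key)
}

// fakeStore simulates a backend that may be mid-outage.
type fakeStore struct {
	name string
	down bool
	data map[string][]byte
}

func (s fakeStore) Name() string { return s.name }
func (s fakeStore) Get(key string) ([]byte, error) {
	if s.down {
		return nil, errors.New("provider outage")
	}
	if v, ok := s.data[key]; ok {
		return v, nil
	}
	return nil, fmt.Errorf("key %q not found", key)
}

func main() {
	cfg := map[string][]byte{"service-config": []byte(`{"enabled":true}`)}
	store := failoverStore{
		primary:   fakeStore{name: "provider-a", down: true, data: cfg}, // mid-outage
		secondary: fakeStore{name: "provider-b", data: cfg},
	}
	if val, err := store.Get("service-config"); err == nil {
		fmt.Println("got config:", string(val))
	}
}
```

The read path is the easy half; keeping two providers’ copies of the data consistent on writes is the hard part, which is presumably why Cloudflare is re-architecting the backend rather than just bolting a second store onto the existing one.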
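And for the third takeaway - understanding your dependency tree - even a tiny bit of tooling goes a long way. If you maintain a simple map of “service → what it depends on,” you can compute everything that is transitively exposed to a single failing component. The service names below are borrowed from Cloudflare’s writeup, but the edges are illustrative, not their real architecture:

```go
package main

import (
	"fmt"
	"sort"
)

// deps maps each service to the services it depends on directly.
// These edges are illustrative, not Cloudflare's actual dependency graph.
var deps = map[string][]string{
	"Access":            {"Workers KV"},
	"Gateway":           {"Workers KV"},
	"Browser Isolation": {"Gateway"},
	"Browser Rendering": {"Browser Isolation"},
	"Turnstile":         {"Workers KV"},
	"Dashboard":         {"Access", "Turnstile"},
}

// impactedBy returns every service that directly or transitively depends on
// the failed component - in other words, the blast radius of that one failure.
func impactedBy(failed string) []string {
	impacted := map[string]bool{}
	changed := true
	for changed { // keep sweeping until no new services get marked
		changed = false
		for svc, dependsOn := range deps {
			if impacted[svc] {
				continue
			}
			for _, d := range dependsOn {
				if d == failed || impacted[d] {
					impacted[svc] = true
					changed = true
					break
				}
			}
		}
	}
	out := make([]string, 0, len(impacted))
	for svc := range impacted {
		out = append(out, svc)
	}
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println("Impacted by a Workers KV outage:", impactedBy("Workers KV"))
}
```

Run that against your own service map before an incident and you’ll know your blast radius ahead of time, instead of discovering it on the status page.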
Conclusion
We work in IT. Outages and failures are to be expected. In my opinion, what makes a good IT organization is its Disaster Recovery plan, its ability to learn from outages, and its ability to adapt to these failure scenarios. Google and Cloudflare both showed transparency in this incident - Google came out and said their code didn’t meet their standards, and Cloudflare owned the failure by saying multiple times that their architectural decisions left them vulnerable. Owning your failures isn’t a bad thing - it shows that you’re adapting, learning, and improving. Your post-mortems should be exhaustive - what did you learn? Or is a post-mortem just another box you check off? Failure is always an acceptable option as long as it is a learning moment. In this case, I think Google and Cloudflare both learned some valuable lessons about their practices - and I learned a lesson about how to guide my clients through their HA/DR and post-mortem processes in a much more meaningful way.