Google Cloud Platform went dark some weeks ago in one of the most widespread outages to ever hit a major public cloud, but the lack of outcry illustrates one of the constant knocks on the platform.
Users in all regions lost connection to Google Compute Engine for 18 minutes shortly after 7 p.m. PT on Monday, April 11. The Google cloud outage was tied to a networking failure and resulted in a black eye for a vendor trying to shed an image that it can’t compete for enterprise customers.
Networking appears to be the Achilles’ heel for Google, as problems with that layer have been a common theme in most of its cloud outages, said Lydia Leong, vice president and distinguished analyst at Gartner. What’s different this time is that it didn’t just affect one availability zone, but all regions.
“What’s important is customers expect multiple availability zones as reasonable protection from failure,” Leong said.
Amazon has suffered regional outages but has avoided its entire platform going down. Microsoft Azure has seen several global outages, including a major one in late 2014, but hasn’t had a repeat scenario over the past year.
This was the first time in memory a major public cloud vendor had an outage affect every region, said Jason Read, founder of CloudHarmony (now owned by Gartner), which has monitored cloud uptime since 2010.
Based on the postmortem Google released, it appears a number of safeguards were in place, but perhaps they should have been tested more thoroughly beforehand to ensure they would actually prevent this type of failure, Read said.
“It sounds like, theoretically, they had measures in place to prevent this type of thing from happening, but those measures failed,” he said.
Google declined to comment beyond the postmortem.
Google and Microsoft both worked at massive scale before starting their public clouds, but they’ve had to learn there is a difference between running a data center for your own needs and building one used by others, Leong said.
“You need a different level of redundancy, a different level of attention to detail, and that takes time to work through,” she said.
Because Google’s cloud has a relatively small market share and a relatively small number of production applications, the outage probably isn’t a major concern for the company, Leong said. It also may have gone unnoticed by Google customers, unless they were doing data transfers during those 18 minutes, because many are doing batch computing that doesn’t require a lot of interactive traffic with the broader world.
“Frankly, this is the type of thing that industry observers notice, but it’s not the type of thing customers notice because you don’t see a lot of customers with a big public impact,” Leong said. By comparison, “when Amazon goes down, the world notices,” she said.
Measures have already been taken to prevent a reoccurrence, review existing systems and add new safeguards, according to a message on the cloud status website from Benjamin Treynor Sloss, a Google executive. All impacted customers will receive Google Compute Engine and VPN service credits of 10% and 25% of their monthly charges, respectively. Google’s service-level agreement calls for at least 99.95% monthly uptime for Compute Engine.
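For a sense of scale, the arithmetic behind a 99.95% monthly uptime target can be sketched in a few lines of Python. This is a rough illustration only: it assumes a 30-day month and counts the full 18 minutes as downtime, whereas Google’s SLA defines downtime and credit eligibility on its own terms. The function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(uptime_pct: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a monthly uptime percentage leaves room for."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return total_minutes * (1 - uptime_pct / 100)


def achieved_uptime_pct(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Monthly uptime percentage after a given number of downtime minutes."""
    total_minutes = days_in_month * MINUTES_PER_DAY
    return 100 * (1 - downtime_minutes / total_minutes)


# Assumption: a 30-day month in which the 18-minute outage is the only downtime.
budget = allowed_downtime_minutes(99.95)
uptime = achieved_uptime_pct(18)
print(f"Downtime allowed at 99.95%: {budget:.1f} minutes")
print(f"Uptime after an 18-minute outage: {uptime:.4f}%")
```

Under those assumptions, a 99.95% target leaves roughly 21.6 minutes of downtime in a 30-day month, so a single 18-minute outage consumes most of that allowance on its own.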
Networking failure takes down Google’s cloud
The incident began with dropped connections when inbound Compute Engine traffic was not routed correctly because a configuration change involving an unused IP block didn’t propagate as it should have. Service also dropped for VPNs and L3 network load balancers. When the management software attempted to revert to the previous configuration as a failsafe, it triggered a previously unknown bug that removed all IP blocks from the configuration and pushed a new, incomplete configuration.
A second bug prevented a canary step from correcting the push process, so more IP blocks began dropping. Eventually, more than 95% of inbound traffic was lost, which resulted in the 18-minute Google cloud outage. It was finally corrected when engineers reverted the most recent configuration change.
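The general idea of a canary-gated configuration push can be sketched as follows. This is a minimal illustration, not Google’s system: the `validate` and `rollout` functions, the site names, the 50% threshold and the shape of the configuration are all hypothetical. The sketch checks a candidate configuration before pushing it, applies it to a single canary site first, and restores the previous configuration if the canary check fails.

```python
from typing import Callable, Dict, List

Config = Dict[str, List[str]]  # hypothetical shape, e.g. {"ip_blocks": [...]}


def validate(candidate: Config, current: Config, min_kept_fraction: float = 0.5) -> bool:
    """Reject a candidate that is empty or drops too many existing IP blocks."""
    new_blocks = set(candidate.get("ip_blocks", []))
    old_blocks = set(current.get("ip_blocks", []))
    if not new_blocks:
        return False  # an empty configuration should never be pushed
    if not old_blocks:
        return True
    kept = len(new_blocks & old_blocks)
    return kept / len(old_blocks) >= min_kept_fraction


def rollout(
    candidate: Config,
    current: Config,
    sites: List[str],
    canary_ok: Callable[[str, Config], bool],
) -> Dict[str, Config]:
    """Return the configuration each site ends up with after the push attempt."""
    state = {site: current for site in sites}
    if not sites or not validate(candidate, current):
        return state  # refuse to push; every site keeps the current config
    canary, rest = sites[0], sites[1:]
    state[canary] = candidate
    if not canary_ok(canary, candidate):
        state[canary] = current  # roll the canary back and stop the push
        return state
    for site in rest:
        state[site] = candidate
    return state


# Hypothetical data: removing one unused block versus pushing an empty config.
current = {"ip_blocks": ["10.0.0.0/8", "192.0.2.0/24"]}
good = {"ip_blocks": ["10.0.0.0/8"]}
bad = {"ip_blocks": []}

print(rollout(bad, current, ["site-a", "site-b"], lambda s, c: True))
print(rollout(good, current, ["site-a", "site-b"], lambda s, c: False))
print(rollout(good, current, ["site-a", "site-b"], lambda s, c: True))
```

In this sketch the empty configuration is refused outright, and a failed canary leaves every site on the previous configuration. The point is only to show where such gates sit in a push process; as the postmortem makes clear, the gates themselves can contain bugs.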
The outage didn’t affect Google App Engine, Google Cloud Storage, internal connections between Compute Engine services and VMs, outbound Internet traffic, or HTTP and HTTPS load balancers.
SearchCloudComputing reached out to a dozen Google cloud customers to see how the outage may have affected them. Several high-profile users who rely heavily on its resources declined to comment or did not respond, while some smaller users said the outage had minimal impact because of how they use Google’s cloud.
Vendasta Technologies, which builds sales and marketing software for media companies, didn’t even notice the Google cloud outage. Vendasta has built-in retry mechanisms and most system usage for the company based in Saskatoon, Sask., happens during normal business hours, said Dale Hopkins, chief architect. In addition, most of Vendasta’s front-end traffic is served through App Engine.
In the five years Vendasta has been using Google’s cloud products, on only one occasion did an outage reach the point where the company had to call customers about it. That high uptime means the company doesn’t spend a lot of time worrying about outages and isn’t too concerned about this latest incident.
“If it’s down, it sucks and it’s a hard thing to explain to customers, but it happens so infrequently that we don’t consider it to be one of our top priorities,” Hopkins said.
For less risk-tolerant enterprises, reticence in trusting the cloud would be more understandable, but most operations teams aren’t able to achieve the level of uptime Google promises inside their own data center, Hopkins said.
Vendasta uses multiple clouds for specific services because they’re cheaper or better, but it hasn’t considered using another cloud platform for redundancy. Doing so would add cost and require additional skill sets, and it would limit the company’s ability to take advantage of platform-specific optimizations.
All public cloud platforms fail, and it appears Google has learned a lesson on network configuration change testing, said Dave Bartoletti, principal analyst at Forrester Research, in Cambridge, Mass. But this was particularly unfortunate timing, on the heels of last month’s coming-out party for the new enterprise-focused management team at Google Cloud.
“GCP is just now beginning to win over enterprise customers, and while these big firms will certainly love the low-cost approach at the heart of GCP, reliability will matter more in the long run,” Bartoletti said.