Great 2020 Gmail Outage: A Story of Two Blackouts, and Lessons Learned
Last week was not a happy one for Google. The company suffered two major failures, and the root cause of each can fairly be described as "oops".
On December 14, for 47 minutes, many Google cloud services appeared to be down. Strictly speaking, they were not - but no one could authenticate to them, so they were effectively inaccessible. Then, for a combined six hours and 41 minutes on Monday and Tuesday, by Google's count, Gmail bounced emails sent to some gmail.com addresses, reporting that those addresses did not exist.
The company has now released detailed reports of what went wrong, and they hold lessons for every IT shop. Kudos to Google for its transparency in describing the events in detail, embarrassing bits included.
What happened here
Authentication failed
The 47 minutes Google's engineers would probably like to forget began back in October. As part of a migration of the User ID Service - which maintains a unique identifier for every account and handles OAuth authentication credentials - to a new quota management system, the service was registered with the new system. So far, so good. However, parts of the old quota system were left in place, and they incorrectly reported usage of the User ID Service as zero. Nothing happened at the time, because a grace period on enforcement of quota restrictions was in effect.
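To make that failure mode concrete, here is a minimal sketch - with invented names and structure, not Google's actual systems - of how "parts of the old system reported usage as zero" can happen: the new quota system knows about the migrated service, but a leftover legacy reporting path does not, and anything still reading the legacy path sees zero.

```python
# A sketch of the reported-usage bug (invented structure, not Google's systems):
# the new quota system has a record for the migrated service, but a leftover
# legacy reporting path does not, so any reader of the legacy path sees zero.
new_system_usage = {"user-id-service": 120_000}   # registered during the migration
legacy_system_usage: dict[str, int] = {}          # the migrated service was never added here

def reported_usage(service: str) -> int:
    # The bug in miniature: enforcement still consults the legacy path,
    # and a missing entry silently becomes "zero usage".
    return legacy_system_usage.get(service, 0)

print(reported_usage("user-id-service"))  # 0 -- enforcement will shrink the quota
```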
On 14 December, the grace period came to an end.
Suddenly, the reported usage of the User ID Service was zero. The service stores account data in a distributed database (it uses the Paxos protocol to coordinate updates) and rejects authentication requests when it detects that the data is stale. Seeing zero usage, the quota management system reduced the storage available to that database, which blocked writes. Within minutes, most read operations were returning stale data and generating authentication errors. And, to make life more interesting for the engineers trying to troubleshoot, some of their internal tools were affected as well.
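The connection between a storage quota and failed logins may not be obvious, so here is a minimal sketch - hypothetical names and an assumed staleness threshold, not Google's code - of how a stale-data guard like the one described above turns blocked writes into authentication errors.

```python
# Once writes are blocked (e.g. by an exhausted storage quota), last_updated stops
# advancing and every record eventually looks too old to trust.
import time
from dataclasses import dataclass

MAX_STALENESS_SECONDS = 300  # assumed threshold, for illustration only

@dataclass
class AccountRecord:
    account_id: str
    credentials_hash: str
    last_updated: float  # epoch seconds of the last successful write

class AuthenticationError(Exception):
    pass

def authenticate(account: AccountRecord, presented_hash: str) -> bool:
    # Refuse to authenticate against data that may be out of date.
    if time.time() - account.last_updated > MAX_STALENESS_SECONDS:
        raise AuthenticationError("account data is stale; refusing to authenticate")
    return presented_hash == account.credentials_hash

# Example: a record last written an hour ago is rejected outright,
# even though the credentials themselves are correct.
stale = AccountRecord("alice", "abc123", last_updated=time.time() - 3600)
try:
    authenticate(stale, "abc123")
except AuthenticationError as e:
    print(e)
```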
Google has a safety check that should detect unexpected quota changes, but it did not cover the zero-usage edge case. The lesson: however improbable they seem, keep those edge cases in mind.
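As an illustration only - this is not Google's actual check - here is how a sanity check on quota changes can miss exactly that edge case: it flags large relative swings but treats a drop to zero as "no data" rather than as a red flag.

```python
# Two versions of a hypothetical quota-change sanity check.

def quota_change_is_suspicious_v1(old_usage: float, new_usage: float) -> bool:
    # Flags large relative swings, but a drop to exactly zero slips through
    # because zero is treated as "nothing to check" rather than "suspicious".
    if new_usage == 0:
        return False
    return abs(new_usage - old_usage) / max(old_usage, 1.0) > 0.5

def quota_change_is_suspicious_v2(old_usage: float, new_usage: float) -> bool:
    # The edge case handled: a service that previously reported real usage
    # and suddenly reports zero is flagged, not ignored.
    if old_usage > 0 and new_usage == 0:
        return True
    return abs(new_usage - old_usage) / max(old_usage, 1.0) > 0.5

print(quota_change_is_suspicious_v1(10_000, 0))  # False: the failure mode
print(quota_change_is_suspicious_v2(10_000, 0))  # True: the edge case caught
```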
To get things running again, Google took several steps. First, it disabled the quota enforcement in one datacenter; when the situation there rapidly improved, it disabled enforcement everywhere five minutes later. Within six minutes, most services had returned to normal. Some saw lingering impact; you can see the full list here.
But now the real work begins. In addition to fixing the root cause, Google is implementing several changes, including:
- Reviewing its quota management automation to prevent rapid application of global changes (a staged-rollout sketch follows this list)
- Improving monitoring and alerting to catch incorrect configurations sooner
- Improving the reliability of the tools and processes used to post external communications during an outage that affects internal tools
- Evaluating and implementing improved resilience to write failures in the User ID Service database
- Improving the resilience of GCP services to more strictly limit the impact to the data plane during User ID Service failures
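The first item - preventing rapid global changes - usually means staged rollouts with a soak period and an automatic halt. A hedged sketch of that idea, with invented helpers and thresholds rather than Google's tooling:

```python
# Roll a quota/config change out in stages instead of applying it globally at once.
import time

STAGES = [0.01, 0.10, 0.50, 1.0]   # fraction of locations per stage (assumed values)
SOAK_SECONDS = 1                    # shortened for illustration; much longer in practice
ERROR_BUDGET = 0.01                 # assumed halt threshold

def error_rate_after(change: dict, fraction: float) -> float:
    # Placeholder for real monitoring: the observed error rate after applying
    # `change` to `fraction` of locations.
    return 0.001

def rollout(change: dict) -> bool:
    for fraction in STAGES:
        print(f"applying change to {fraction:.0%} of locations")
        time.sleep(SOAK_SECONDS)  # let monitoring catch regressions before expanding
        if error_rate_after(change, fraction) > ERROR_BUDGET:
            print("regression detected; halting and rolling back")
            return False
    print("change fully rolled out")
    return True

rollout({"service": "user-id-service", "quota_gb": 500})
```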
Gmail bounces
Gmail's failure came in two waves. On Monday, Google engineering began receiving internal user reports of delivery errors and traced them to a recent code change in an underlying configuration system, which caused an invalid domain name (rather than gmail.com) to be supplied to the inbound SMTP service. When Gmail's accounts service checked addresses at that domain, it could not find a valid user, so it returned SMTP error 550 - a permanent error that, for many automated mailing systems, causes the recipient to be removed from their lists. The code change was rolled back, which corrected the situation.
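Why is a 550 so damaging? Hard 5xx failures are conventionally treated as permanent ("this mailbox does not exist"), while 4xx failures are temporary and retried. A minimal sketch of typical list-manager behavior - hypothetical logic, not any specific product:

```python
# How a mailing system commonly reacts to SMTP result codes: 4xx is retried,
# 5xx (such as 550 "user unknown") is treated as permanent and unsubscribed.

def handle_delivery_result(address: str, smtp_code: int, subscribers: set[str]) -> str:
    if 400 <= smtp_code < 500:
        return f"{address}: temporary failure ({smtp_code}), will retry later"
    if 500 <= smtp_code < 600:
        # A hard bounce is usually treated as permanent, so many automated
        # systems remove the address from their lists immediately.
        subscribers.discard(address)
        return f"{address}: permanent failure ({smtp_code}), removed from list"
    return f"{address}: delivered"

subscribers = {"alice@gmail.com", "bob@gmail.com"}
print(handle_delivery_result("alice@gmail.com", 550, subscribers))
print(subscribers)  # alice is gone, even though her mailbox actually exists
```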
On Tuesday, the configuration system was updated again (Google does not say whether this was the same change re-applied or a different buggy one), and the bouncing resumed. The change was rolled back, and Google has committed to the following:
- Update existing configuration difference tests to detect unexpected changes to the SMTP service configuration before changes are deployed (see the sketch after this list)
- Improve internal service logging to allow more accurate and faster diagnosis of similar types of errors.
- Apply additional restrictions on configuration changes that may affect production resources globally.
- Improve static analysis tooling for configuration diffs to more precisely project differences in production behavior
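The first commitment, configuration difference tests, amounts to checking a proposed config diff against rules about protected fields before it ships. A hedged sketch, with invented keys and an invented allow-list rather than Gmail's real configuration:

```python
# Check a configuration diff against rules for protected fields before deployment.
PROTECTED_KEYS = {"smtp_inbound_domain"}           # assumed field name, for illustration
ALLOWED_DOMAINS = {"gmail.com", "googlemail.com"}  # assumed allow-list

def config_diff(old: dict, new: dict) -> dict:
    # Map each changed key to its (before, after) pair.
    return {k: (old.get(k), new.get(k))
            for k in old.keys() | new.keys()
            if old.get(k) != new.get(k)}

def check_diff(old: dict, new: dict) -> list[str]:
    problems = []
    for key, (before, after) in config_diff(old, new).items():
        if key in PROTECTED_KEYS and after not in ALLOWED_DOMAINS:
            problems.append(f"{key}: {before!r} -> {after!r} is not an allowed value")
    return problems

old_cfg = {"smtp_inbound_domain": "gmail.com", "max_message_mb": 25}
new_cfg = {"smtp_inbound_domain": "gmai1.example", "max_message_mb": 25}
print(check_diff(old_cfg, new_cfg))  # flags the bad domain before it ships
```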
If you want to read Google's full report on the Gmail outage, you can find it here. The authentication failure report is here.