Incident Post Mortem: June 1, 2020

By Michael de Hoog

On June 1st, Coinbase experienced an outage that impacted,, and our mobile applications. Trading through the API, which accounts for the majority of trading volume, remained functional throughout this time. We quickly discovered the root cause and remediated the issue. This post provides more detail about what occurred.

Traffic levels from 16:00 to 16:30 PDT.

Around 16:05 PDT, the price of BTC reached USD $10,000. In connection with the rising price, we experienced a 5x traffic spike over 4 minutes. Our autoscaling was unable to keep pace with this dramatic increase in traffic.

This traffic spike affected a number of our internal services, increasing latency between services. This led to process saturation of the web servers responsible for our API: the number of incoming requests exceeded the number of listening processes, so requests were either queued until they timed out or failed immediately. Our request error rate spiked to 50%, causing customers to experience errors when interacting with and our mobile apps.
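The saturation described above can be illustrated with a toy discrete-time model. This is only a sketch under stated assumptions, not Coinbase's actual server setup: the function name `simulate` and every parameter value (worker count, service time, queue limit, queue timeout) are hypothetical and chosen purely to show how a fixed pool of processes behaves once arrivals exceed capacity.

```python
from collections import deque

def simulate(arrivals_per_tick, workers, service_ticks,
             queue_limit, queue_timeout, ticks):
    """Toy model of a fixed pool of request-handling processes (hypothetical)."""
    busy = []        # remaining ticks for each in-flight request
    queue = deque()  # arrival tick of each waiting request
    served = timed_out = rejected = 0
    for now in range(ticks):
        # Advance in-flight requests by one tick and free finished workers.
        busy = [remaining - 1 for remaining in busy]
        served += sum(1 for remaining in busy if remaining == 0)
        busy = [remaining for remaining in busy if remaining > 0]
        # New arrivals join the queue, or fail immediately when it is full.
        for _ in range(arrivals_per_tick):
            if len(queue) < queue_limit:
                queue.append(now)
            else:
                rejected += 1
        # Requests that have waited too long time out.
        while queue and now - queue[0] > queue_timeout:
            queue.popleft()
            timed_out += 1
        # Idle workers pick up queued requests.
        while queue and len(busy) < workers:
            queue.popleft()
            busy.append(service_ticks)
    total = arrivals_per_tick * ticks
    return {"served": served, "timed_out": timed_out, "rejected": rejected,
            "error_rate": (timed_out + rejected) / total}

# Assumed numbers: 10 workers, 4 ticks per request, so capacity is 2.5 requests per tick.
normal = simulate(2, workers=10, service_ticks=4,
                  queue_limit=20, queue_timeout=5, ticks=200)
spike = simulate(10, workers=10, service_ticks=4,
                 queue_limit=20, queue_timeout=5, ticks=200)
print(normal["error_rate"], spike["error_rate"])
```

In this model, raising `service_ticks` (slower downstream services) reduces capacity in the same way that raising `arrivals_per_tick` increases demand, which is why higher inter-service latency and a traffic spike compound each other.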

The health check is also served by these saturated processes, which caused some instances to be marked as unhealthy and taken out of the load balancer, further exacerbating this issue.
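A rough way to see why removing saturated instances makes things worse: the same traffic is spread over fewer machines. The sketch below is a simplified, hypothetical model (the `cascade` function and the capacity numbers are assumptions for illustration). It ignores instance recovery and replacement, which a real load balancer and autoscaler would provide.

```python
def cascade(total_load, capacities):
    """Return the healthy instance count after each round of health checks.

    Assumes load is split evenly across healthy instances, and that an
    instance fails its health check once its share exceeds its capacity.
    """
    healthy = list(capacities)
    rounds = [len(healthy)]
    while healthy:
        per_instance = total_load / len(healthy)
        survivors = [cap for cap in healthy if cap >= per_instance]
        if len(survivors) == len(healthy):
            break  # every remaining instance can carry its share
        healthy = survivors
        rounds.append(len(healthy))
    return rounds

# Ten hypothetical instances with slightly different capacities.
caps = [100, 105, 110, 115, 120, 125, 130, 135, 140, 145]
print(cascade(1000, caps))  # load fits: no instance is removed
print(cascade(1100, caps))  # a 10% overload removes instances round after round
```

In this toy model a modest overload is enough to start a chain reaction, because each removal raises the per-instance load on the survivors.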

Healthy instance count (peaks show deploys, dips show instances marked as unhealthy).

In an effort to mitigate the saturation, we redeployed the API at 16:20 PDT to increase the number of machines serving traffic. Once this deploy completed, the previous deploy’s instances were taken out of rotation, leading to another 2-minute outage as instances saturated and were marked unhealthy. Our autoscaling handled this automatically.

Looking ahead

In response to these events, we’re working on a number of improvements. We have since fixed the health endpoint to ensure that saturated instances don’t get taken out of rotation. We’re working on reducing the impact of price-related traffic spikes through pre-scaling and caching. Longer term, we’re planning to improve our deployment process to mitigate some of the autoscaling issues we experienced.
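The post does not say how the health endpoint was fixed. One common approach, shown here purely as an assumption, is to answer health checks from a dedicated listener that does not share the request-handling process pool, so a saturated pool cannot starve the check. The handler name, the `/health` path, and the `start_health_server` helper are hypothetical; the sketch uses only Python's standard library.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers liveness checks independently of the main request workers."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet

def start_health_server(host="127.0.0.1", port=0):
    """Start the health listener on its own daemon thread (port 0 = any free port)."""
    server = ThreadingHTTPServer((host, port), HealthHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

With this design the check reports whether the instance is alive rather than whether it has spare capacity. That keeps busy instances in rotation, but it also means load has to be monitored through a separate signal.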

We are committed to making Coinbase the easiest, most trusted place to buy, sell, and manage your cryptocurrency. If you’re interested in working on challenging availability problems and building the future of the cryptoeconomy, come join us!

Incident Post Mortem: June 1, 2020 was originally published in The Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.
