May 22–24 Downtime Post Mortem
During the week of May 22, Coinbase didn’t meet customer expectations. The site was down for extended periods and slow during others. This is a technical post-mortem of the events of that week. We are investing significant resources to make our systems scale with the next 10x growth in traffic.
On May 22, Coinbase started experiencing sustained high user traffic as a result of large Bitcoin and Ethereum price fluctuations. Over the next week, our low points in traffic far exceeded previous peaks in the preceding months. As a result, we bumped up against a series of capacity limits which led to long stretches of performance degradation and periods of full-site downtime.
On the first day of the incident (May 22), unoptimized database queries and a handful of missing indices on our primary MongoDB instances compounded, leading to a surge in database queues and degraded performance across coinbase.com. We were able to restore full service and clear our database queues by building key indices and optimizing expensive application queries.
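To illustrate why a missing index turns into queueing under load, here is a minimal sketch in plain Python. It is a toy model, not MongoDB itself: real MongoDB indexes are B-trees, while this uses a hash map, and the `user_id` and `amount` fields are hypothetical. The point it shows is only the difference in how many documents must be examined to answer the same query.

```python
from collections import defaultdict


def scan(docs, field, value):
    """Collection scan: every document is examined, matching or not."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc.get(field) == value:
            matches.append(doc)
    return matches, examined


def build_index(docs, field):
    """Map each field value to the positions of the documents holding it."""
    index = defaultdict(list)
    for position, doc in enumerate(docs):
        index[doc.get(field)].append(position)
    return index


def indexed_lookup(docs, index, value):
    """Index lookup: only the matching documents are touched."""
    positions = index.get(value, [])
    return [docs[p] for p in positions], len(positions)


# Hypothetical data set: 100,000 documents spread over 1,000 users.
docs = [{"user_id": i % 1000, "amount": i} for i in range(100_000)]

scan_matches, scanned = scan(docs, "user_id", 42)
index = build_index(docs, "user_id")
index_matches, touched = indexed_lookup(docs, index, 42)

print(scanned, touched)  # 100000 100
```

Both paths return the same 100 documents, but the scan does 1,000 times more work per query. When that extra work is multiplied by a traffic spike, requests arrive faster than they complete and queues build up.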
On the second day of the incident (May 23), the MongoDB query planner for our clusters began to exhibit poor query selection under high load, resulting in query times several orders of magnitude slower than identical queries run on the correct index. This poor query selection led to short, intermittent downtime periods during which database queues would rise rapidly until the poorly-planned query completed.
To resolve these query planning issues, we spent the next 24 hours (May 23 to May 24) performing major MongoDB version upgrades, bringing all of our primary clusters up to version 3.0. We had been preparing for these upgrades for weeks, and in consultation with MongoDB experts we decided to move forward under heavy load in the hope of mitigating the query planning issues. The upgrades to 3.0 proceeded smoothly despite the load and led to drastic improvements in query planning and performance across the board, resolving our MongoDB issues for the duration of the incident.
In the weeks since the incident, we’ve worked hard to bring our MongoDB clusters up to date, and as of June 12 have successfully upgraded all of our clusters to version 3.2 with the new WiredTiger storage engine. In total, over a period of three weeks, we migrated 10 MongoDB clusters from a mix of versions 2.4 and 2.6, and we have seen dramatically increased query performance and consistency.
As traffic continued to reach new heights on the third day of the incident (May 24), we hit the capacity limit of our primary Redis ElastiCache cluster. Because Redis executes commands on a single thread, we became CPU bound on one core, which we fully saturated with over 20,000 requests per second.
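As a rough back-of-the-envelope check on what that saturation means, the sketch below computes the average CPU time available per request when a single core serves a given request rate. The function name is hypothetical, and the calculation assumes requests are processed strictly one at a time on that core.

```python
def per_request_budget_us(requests_per_second, cores=1):
    """Average CPU time available per request, in microseconds."""
    microseconds_per_second = 1_000_000
    return cores * microseconds_per_second / requests_per_second


budget = per_request_budget_us(20_000)
print(budget)  # 50.0
```

At 20,000 requests per second, one core has on average only 50 microseconds per request, so even a small number of expensive commands can consume the whole budget and stall everything queued behind them.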
Over the next day, we were able to restore service through a combination of improving expensive queries, sharding existing queries across multiple logically separated new instances, and reducing the total volume of queries within the application by 40%. By May 25 at 19:00 PST we were able to restore and maintain full service across our infrastructure.
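Splitting load across logically separated instances can be sketched as a small routing function that picks an instance from the key's namespace. This is an illustrative assumption, not a description of the actual setup: the namespaces, instance names, and the `namespace:rest` key convention are all hypothetical.

```python
# Hypothetical mapping from key namespace to a dedicated Redis instance.
INSTANCE_BY_NAMESPACE = {
    "session": "redis-sessions",
    "ratelimit": "redis-rate-limits",
    "price": "redis-prices",
}

# Keys with an unknown namespace fall back to a general-purpose instance.
DEFAULT_INSTANCE = "redis-general"


def instance_for(key):
    """Route a key such as 'session:abc123' by its namespace prefix."""
    namespace = key.split(":", 1)[0]
    return INSTANCE_BY_NAMESPACE.get(namespace, DEFAULT_INSTANCE)


print(instance_for("session:abc123"))  # redis-sessions
print(instance_for("price:BTC-USD"))   # redis-prices
print(instance_for("misc:anything"))   # redis-general
```

Because each instance runs on its own core, separating workloads this way gives each one its own CPU budget, and a hot workload can no longer starve unrelated ones.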
We are committed to improving site performance so that we can serve our customers during even the most volatile periods in cryptocurrency.
We are rapidly expanding our team. If you are interested in solving challenges like these, we’re hiring!