November 29th-December 1st Post Mortem

Over the past year, Coinbase has invested significant effort to improve the scalability of our systems during periods of peak traffic. Despite these efforts, Coinbase experienced slow response times and intermittent downtime due to increased user traffic on November 29th, November 30th, and December 1st. These issues were largely the result of two undersized and over-utilized database clusters. Over the course of these three days and a period of scheduled maintenance, Coinbase engineers were able to remediate these issues by reducing total query volume and scaling the affected databases.

Relative traffic before and after the incident.

The objective of this post is to provide a technical overview of the challenges we faced last week and the actions we took to restore services to normal as quickly as possible.

November 29th

On November 29th, at 05:03 PST, Coinbase experienced a large increase in user traffic due to large price movements in Bitcoin and Ethereum. As a result, one of our primary MongoDB clusters experienced CPU saturation, leading to elevated response times across all API endpoints. These elevated response times saturated our application queues, which caused intermittent 502s for customers on both the Coinbase website and mobile apps.

To mitigate the database saturation, we worked to cache specific queries and reduce the load on the affected database. At 06:24 PST we began the process of vertically scaling the affected MongoDB cluster to new instance types with double the capacity. By 08:50 PST, site performance had been restored to normal.
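As an illustration of the query caching described above, here is a minimal sketch of a time-based (TTL) read-through cache. It is not the code we deployed; the class name `TTLCache`, the `get_or_load` method, and the loader function in the usage note are all hypothetical, and a production setup would more likely use a shared cache than per-process memory.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling loader() only on a miss or expiry."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: the database is not touched
        value = loader()  # miss or expired: run the expensive query once
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

A caller would wrap a hot read path, for example `cache.get_or_load(("price", "BTC-USD"), lambda: run_price_query("BTC-USD"))`, where `run_price_query` stands in for whatever query is being shielded. The trade-off is staleness: readers may see data up to `ttl_seconds` old in exchange for far fewer database reads.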

Traffic levels continued to climb, and at 10:49 PST one of our primary Redis servers began to experience CPU saturation as a result of a series of expensive queries. To alleviate the issue, we developed fixes to reduce the pressure on this Redis cluster. By 13:25 PST the fixes, combined with slightly lower traffic, had reduced CPU pressure enough for traffic to be served at normal response times. By 15:00 PST services were fully restored and operating normally.

At 20:10 PST we announced planned maintenance for November 30th at 23:00 PST. The purpose of this maintenance was to increase capacity by splitting both of the affected databases (MongoDB and Redis) into three new and separate clusters.

November 30th

On November 30th, starting at 05:30 PST, Coinbase again began to experience elevated response times and intermittent 502s as a result of CPU saturation on our primary Redis cluster, induced by high traffic. Over the course of the downtime, we attempted to improve the performance of the cluster by analyzing Redis commands and pruning unnecessary query volume.

By 08:25 PST site performance was back to normal as a result of these Redis fixes and reduced overall site traffic.

At 23:00 PST, we began our scheduled maintenance. During this maintenance, we were able to evenly split out collections from the aforementioned MongoDB cluster across several new replica sets. Additionally, we improved Redis performance by dividing up the misbehaving cluster’s keys across several new clusters.
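One common way to divide a single Redis cluster's keys across several clusters is to route each key by its prefix. The sketch below assumes that approach purely for illustration; this post does not describe the actual partitioning scheme, and the class name, prefixes, and cluster names are all hypothetical.

```python
from typing import Dict, List, Tuple


class KeyRouter:
    """Maps a Redis key to a cluster name by longest matching prefix (hypothetical helper)."""

    def __init__(self, prefix_to_cluster: Dict[str, str], default_cluster: str) -> None:
        # Check longer prefixes first so the most specific route wins.
        self._routes: List[Tuple[str, str]] = sorted(
            prefix_to_cluster.items(), key=lambda item: len(item[0]), reverse=True
        )
        self._default = default_cluster

    def cluster_for(self, key: str) -> str:
        """Return the name of the cluster that should hold this key."""
        for prefix, cluster in self._routes:
            if key.startswith(prefix):
                return cluster
        return self._default  # keys with no matching prefix stay on the default cluster


# Hypothetical prefixes and cluster names, for illustration only.
router = KeyRouter(
    {"session:": "redis-sessions", "rate:": "redis-ratelimit"},
    default_cluster="redis-general",
)
```

With a router like this, application code asks `router.cluster_for(key)` before each command and sends it to the matching connection, so each new cluster only carries the CPU and network load of its own key family.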

December 1st

On December 1st, at 04:00 PST Coinbase began to experience elevated response times as a result of network saturation on one of the newly upgraded Redis clusters. During the November 30th planned maintenance, one of the newly split-out clusters had been inadvertently created with an undersized network interface (“Up to 10 Gbps” instead of “10 Gbps”).

Over the course of the morning (04:00–13:10 PST) Coinbase worked to identify the reason for the poorly performing database. During this period, Coinbase experienced intermittent periods of downtime, each lasting between 3 and 7 minutes. At 13:07 PST we decided to accept 30 to 60 minutes of full downtime, starting at 13:10 PST, to upgrade the poorly performing database and resolve the elevated response times.

By 13:40 PST the new Redis cluster with higher network capacity became operational. At 14:10 PST Coinbase became fully operational in all countries. Later that day, we resolved the network saturation by identifying a query pattern which had been returning very large response bodies.

Redis NetworkBytesOut before and after tracking down the bad query pattern.
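A simple way to surface a query pattern with very large response bodies is to sample response sizes per command pattern and rank the patterns by total bytes sent. The sketch below assumes such samples are already available as (pattern, response_bytes) pairs; the function name and the sample data are hypothetical, and it does not reflect the tooling we actually used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rank_patterns_by_bytes(
    samples: Iterable[Tuple[str, int]]
) -> List[Tuple[str, int, int]]:
    """Aggregate sampled responses and return (pattern, total_bytes, avg_bytes),
    sorted with the heaviest pattern first."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for pattern, response_bytes in samples:
        totals[pattern] += response_bytes
        counts[pattern] += 1
    rows = [(p, totals[p], totals[p] // counts[p]) for p in totals]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical samples: (command pattern, response size in bytes).
samples = [
    ("GET user:*", 200),
    ("GET user:*", 300),
    ("HGETALL feed:*", 5_000_000),
    ("HGETALL feed:*", 3_000_000),
]
ranked = rank_patterns_by_bytes(samples)
```

Ranking by total bytes rather than by call count matters here: a pattern that is called rarely can still dominate outbound network traffic if each response is several megabytes.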

We are hiring senior backend engineers in San Francisco, London and New York. If working on this sort of challenge excites you, please see our careers page.