The Infrastructure Team at Coinbase has the goal of enabling any engineer in the company to quickly and securely deploy complex infrastructure. This effort started with our secure deployment pipeline Codeflow, was extended by our codification tooling GeoEngineer, and stress tested by our 24 hour whole-infrastructure rolling exercise Scorched Earth.1
Our latest project to empower engineers was to codify all of our infrastructure with the goal of allowing any engineer in our organization to propose changes in a secure, consistent manner.
Choose the Right Tool
Prior to this project, the majority of our infrastructure was codified, although using different tools and different standards. As our clouds evolved in complexity, we ran into the shortcomings of our existing tools and experimented with a variety of ways to scale our effort:
- CloudFormation — Core infrastructure at Coinbase (e.g. VPCs, route tables, VPN configuration) as well as IAM policies were codified using CloudFormation. While this originally worked for our first application of the CloudFormation templates, CloudFormation’s weak ability to plan or review changes led us to fall back to manual changes and let these templates fall out of sync.
- Terraform — Next, we started to work with Terraform for newer services, however due to shortcomings with the tool, we never committed fully and were left with a mix codified and manual configured resources, (along with a lot of state files across our AWS accounts).
- GeoEngineer — To work around the aforementioned shortcomings in Terraform, we began developing GeoEngineer. As a result, many of our core apps deployed by Codeflow had already been codified using GeoEngineer.
After evaluating the above options as well as Ansible and SaltStack, we decided to double down on our GeoEngineer building upon Terraform’s power.
Picking a Metric
Leading into this project we validated GeoEngineer with well defined 3-tier-applications. This helped eliminate obvious bugs but we still had the large task of codifying existing and bespoke cloud resources. Our largest deployments often contained new, complex services like replicated Elasticache clusters, multiple encrypted S3 buckets for temporary storage and more.
Given that GeoEngineer has insight into all of the resources, codified or not, that exist in your infrastructure, we were able to use
geo status to produce the exact percentage of codified versus uncodified resources in our infrastructure. Selecting and reinforcing this metric as our north star during the project made it simple to keep track of our progress towards 100%.
Along with returning the resource codification percentage,
geo status returns a list of uncodified resources which we were then able to reference as a primary source during our effort. This output made collaboration easy, since it was never difficult to determine which resources to tackle next.
As a secondary metric, we utilized CloudTrail data to visualize the level of automation as Human (using dashboard) vs Computer (geo apply) actions. As a company with a deep focus on security, reducing the number of humans capable of taking actions, as well as reducing the total number of actions that those humans take, helps to reduce our exposure to both mistakes and malicious intent.
At Coinbase, we use AWS tags to categorize, document, and define ownership of our infrastructure. Resource codification was the perfect opportunity to rethink our tag standards and ensure that all of our resources are properly classified.
Using GeoEngineer, we are able to implement resource level validations to require that any resources codified in our infrastructure upholds our tag standards. An example of standardized tags we use to classify our resources are
Org which correlate respectively to the name of the specific resource, the project it is associated with, and the project’s organization within Coinbase.
Additionally, GeoEngineer can be expanded to automatically add default tags based on the context of resources.
Beyond classification, we use tags for determining the importance of a resource and directing notifications to the right channel. For example, our
pd_channel tag ensures that if reliability with a resource can’t be auto remediated that we’re sending PagerDuty alerts to the right team.
As our codified effort illuminated not just some, but all of the resources within our infrastructure, we began to find dated configurations and even long-since deprecated resources. These ‘cruft’ resources are unused or unidentified resources which have been created by hand through the console or lost over time. Cruft resources are bad because not only because they’re an extra expense, but because unknown resources can lead to security holes or improper utilization.
Tackling cruft in your infrastructure can be a tedious process. One mislabeled SQS topic could actually be the SQS topic and inadvertently removing or modifying such a topic might take down an entire segment of the organization. Our technique to tackle this cruft began by using the output of
geo status to list every single uncodified resource on a shared spreadsheet, and then carefully identifying and taking action on each resource.
We utilized three primary resources to ensure we were identifying and deleting the correct resources. First, CloudWatch’s metrics dashboard was invaluable to clarify utilization patterns for specific resources and show exactly when a resource stopped or started being used. Second, CloudTrail provided key insight into which specific user created a resource or interacted with it recently. Third, searching VPC flow logs allowed us to validate resources which hadn’t recieved any meaningful traffic.
In the end, we deleted over 270 resources across all of our environments, not including all of the resources we were able to rename and properly categorize through the effort. This was a long and tedious yet very rewarding process.
Keep it Codified
Our experience tells us that infrastructure codification follows the rule of entropy. No matter how hard you try, someone will want to log into the console and create or make changes to live resources. In order to protect our efforts, we want to set up the right tools and culture to keep our standards from slipping.
One of the forcing functions we utilize to prevent slippage is creating a bot called Mars to frequently run
geo status and generate a percentage metric of uncodified resources that we can use to automatically generate notifications when we stray from our target.
Another strategy we use is to parse the output of
geo plan to send notifications if there is a divergence between our code and the actual state of our AWS environments. This output alerts us to potential security issues like if someone were to modify an IAM policy without updating the policy file template.
Ensuring that we’ve accounted for all of our resources is just the first step on the road to completely automating our infrastructure. One of the ideas we’re working towards next is building a bot to
geo apply pull requests based on consensus to further reduce any individual’s access which following our mission to empower the breadth of engineering at Coinbase.
There’s a lot more to come - if you’re interested in being part of what we’re doing, we’re hiring!