Developers are increasingly contributing to and consuming upstream content. However, as every community effort has proven, risks must be considered and mitigations put in place early to protect the entire ecosystem, as well as your project, product or service.
The following scenarios are considerations that products and services must think about:
Your service depends on an image published to Docker Hub, Google Container Registry, Quay, GitHub or any other public location. Your service is quite active, instancing hundreds to thousands of images per hour across multiple locations.
- How do you protect yourself from a CDN or DNS outage?
- What if the registry service has an outage?
- What if the registry changes their terms of service, or is acquired by another entity?
Upstream Patching – Change in Behavior
A security vulnerability is found in the wabbit-networks/net-monitor software, used in the Acme Rockets Service for network diagnostics. The upstream change is immediately deployed. A few hours later, it’s discovered that the patched image changed the required access levels, blocking the ability to monitor specific network packets. The Acme Rockets team can fix the issue, but not before several hours of outages across multiple regions have impacted thousands of customers. Customers lost critical information, leading to missed SLA commitments, refunds and a loss of confidence in the service.
Latest Docker Hub Changes
Docker recently announced changes to their terms of service. While many were upset, given the reality of storing and serving over 15 petabytes of content worldwide, and paying the storage and CDN fees, a change was only a matter of time. Whether you mitigate the concern by simply authenticating your pull commands or by paying for your usage, should you also consider balancing the risk of consuming public content, from any location?
As a community, we must consider the impact we have through automation. Just because we can build on every commit, do we need to keep every built image? What is the life expectancy of the content? While we certainly want to maintain a set of content for long retention periods, is it all the content? While GitHub recently announced the GitHub Container Registry, it’s just a matter of time before the same questions are asked. Sure, Microsoft could fund the costs, but how much storage, space and power is consumed by all this content? Microsoft has a commitment to being carbon negative. How does storing every build of every developer in the world look? Remember, we’re not talking text files of git-repos. We’re talking about gigabytes of container images.
Just because you can, doesn’t mean you should…
In both cases above, there was no malicious actor. Stuff happens, but what are you doing to minimize the impact of stuff happening to you?
There are two solutions to consider: a minimum bar, and a desired goal.
Import Published Content
In this case, the public content is imported into a controlled environment.
- On a periodic basis, teams pull the content they depend upon.
- Before promoting that content, they security scan it and run some level of testing to assure the content adheres to their expectations.
- Upon successful validation, promote the content to an internal registry.
- If the content is an update to a pre-existing version, using the same name:version reference, be sure to keep a history of the previous versions in case you need to roll back.
- When you do roll back, be sure to add a new test suite and consider what else might go wrong, building a more robust testbed.
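The import steps above can be sketched as a small promotion gate. This is only an illustration under stated assumptions, not a real registry client: InternalRegistry, import_content and the check functions are hypothetical names, and the "registry" is an in-memory dictionary standing in for whatever internal registry you run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Check = Callable[[str], bool]  # takes an image digest, returns True if it passes


@dataclass
class InternalRegistry:
    """In-memory stand-in for an internal registry (hypothetical)."""

    # name:version reference -> digests promoted under it, oldest first
    history: Dict[str, List[str]] = field(default_factory=dict)

    def current(self, reference: str) -> str:
        """Return the digest the reference currently resolves to."""
        return self.history[reference][-1]

    def promote(self, reference: str, digest: str) -> None:
        """Point the reference at a new digest, keeping the earlier ones."""
        self.history.setdefault(reference, []).append(digest)

    def roll_back(self, reference: str) -> str:
        """Drop the newest digest and return the one now in effect."""
        digests = self.history[reference]
        if len(digests) < 2:
            raise ValueError(f"no earlier version of {reference} to roll back to")
        digests.pop()
        return digests[-1]


def import_content(registry: InternalRegistry, reference: str,
                   digest: str, checks: List[Check]) -> bool:
    """Promote the digest only if every scan and test passes."""
    if all(check(digest) for check in checks):
        registry.promote(reference, digest)
        return True
    return False


if __name__ == "__main__":
    registry = InternalRegistry()
    # Placeholder checks; a real pipeline would call a scanner and a test suite.
    scan = lambda digest: not digest.endswith("vulnerable")
    test = lambda digest: True

    import_content(registry, "net-monitor:v1", "sha256:aaa", [scan, test])
    import_content(registry, "net-monitor:v1", "sha256:bbb", [scan, test])
    print(registry.current("net-monitor:v1"))    # sha256:bbb
    print(registry.roll_back("net-monitor:v1"))  # sha256:aaa
```

Keeping every promoted digest under the same reference is what makes the rollback a one-line operation; content that fails a check never reaches the reference your production systems pull from.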
Building from Source – Break Glass Scenario
As it sounds, this is more complex. Rather than directly consume the upstream “binary”, you rebuild from source all the dependencies.
Before going into the costly steps, let’s cover some of the benefits:
- Do you know how the upstream content was built? What was actually included?
- When, not if, a vulnerability is detected, how long can you wait to ship a fix to your customers? Will the upstream project move fast enough to produce a build you require?
- This is a break-glass scenario where you can fix and ship prior to upstream efforts. You may determine that the fix for your specific usage of the software is much simpler than the fix for the overall upstream project.
- When, not if, you must patch a production asset, your production workflow doesn’t change. You apply and test the patch. Promote it to the same registry your production systems are configured to pull from, and you never miss a beat.
Building from source involves:
- Maintain a fork of the upstream project, but reserve changes for break-glass scenarios. This is not a fork in the sense of making changes that will go back upstream.
- Build from source, producing the content with your company’s signatures and validations.
- Scan and test the built content before promoting it to internal registries.
- As upstream changes are published, test the changes before promoting.
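To make the "your company’s signatures" step concrete, here is a minimal sketch of signing built content and verifying it before promotion. It is deliberately simplified: it uses a shared-secret HMAC from the Python standard library as a stand-in, whereas real image signing normally relies on asymmetric keys and dedicated tooling. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def digest_of(content: bytes) -> str:
    """Content digest in the sha256:<hex> form used for image digests."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def sign(content: bytes, key: bytes) -> str:
    """Sign the content digest with a company-held key (HMAC stand-in)."""
    return hmac.new(key, digest_of(content).encode(), hashlib.sha256).hexdigest()


def verify(content: bytes, signature: str, key: bytes) -> bool:
    """Check the signature in constant time before trusting the content."""
    return hmac.compare_digest(sign(content, key), signature)


if __name__ == "__main__":
    key = b"example-key-do-not-use"  # hypothetical; keep real keys in a vault
    built = b"bytes of the image you built from your fork"
    signature = sign(built, key)

    print(verify(built, signature, key))                 # True
    print(verify(built + b" tampered", signature, key))  # False
```

The point is the ordering: content is signed as it leaves your build, and the promotion step refuses anything whose signature does not verify.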
Practicing What We Preach
Azure services follow this build from source practice. You may notice the images in your AKS cluster, pulled to your managed VMs, are not the same upstream public images. These images are built from source and maintained by the upstream team in Azure. This ensures we can service our Azure customers as fast as possible. Many of the images are hosted on the Microsoft Container Registry, enabling anonymous access to customer nodes, while other internal images are hosted on the geo-replicated capabilities of the Azure Container Registry.
What’s Right for You
There are multiple choices to buffer from “stuff happening”. But stuff will happen, and when it does, will it be too late to implement a buffering solution?
Implementing Buffering with Azure Container Registry Tasks
This section is a placeholder: I delayed getting this post out because I wanted to detail the steps involved.
A quick overview to outline what will be coming:
ACR Tasks provides a cloud optimized, pay per use build solution. In addition to providing multi-container builds of Windows, Linux and ARM through QEMU, it can build and test artifacts with an execution graph of sequential and concurrent build steps.
But wait, there’s more. ACR Tasks also monitors base image updates, enabling you to trigger builds when upstream changes occur.
Through this process you can:
- Define an ACR Task to build from one or more of the images you depend upon
- As part of the ACR Task definition, scan and test the image
- Only if the image passes tests, promote the image to your internal registry
- ACR Tasks can be scheduled or triggered by git commits, enabling multiple entry points.
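As a narrower sketch of just the promote step (this is not an ACR Task definition), the Azure CLI’s az acr import command copies an image from another registry into an Azure Container Registry. The helper below only assembles that command line; the registry name and the upstream image location are hypothetical placeholders, and actually running the command assumes the Azure CLI is installed and you have signed in.

```python
import shlex
from typing import List


def acr_import_command(registry: str, source: str, image: str) -> List[str]:
    """Build the argument list for `az acr import`."""
    return [
        "az", "acr", "import",
        "--name", registry,   # target Azure Container Registry
        "--source", source,   # fully qualified upstream image
        "--image", image,     # repository:tag to create in the target registry
    ]


if __name__ == "__main__":
    command = acr_import_command(
        registry="acmerockets",                             # hypothetical registry name
        source="docker.io/wabbitnetworks/net-monitor:v1",  # hypothetical upstream location
        image="net-monitor:v1",
    )
    print(shlex.join(command))
    # To run it for real (requires the Azure CLI and a prior `az login`):
    # import subprocess; subprocess.run(command, check=True)
```

Building the command as a list rather than a string keeps it safe to hand to subprocess without shell quoting, and makes it easy to call only after your scan and test steps have passed.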
Work in progress
Updates to Come
This is an evolving post. The biggest takeaway is a question: how will you protect yourself when, not if, stuff happens?