Initiators of Supply Chain Incidents

When describing mitigations to supply chain security incidents, it helps to categorize how different incidents may be initiated. Based on the category, different mitigation strategies are more likely to be effective. This article distills incident initiators into three major categories that frame the most effective ways to protect against the inevitable. A fourth category demonstrates how any of the three may surface, causing an impact to your environment.

I’ll also explore dealing with the inevitable. As with many high-risk adventures, “it’s not a matter of if, rather when” stuff happens. Motorcycle riders know this phrase well. What preventative measures were taken to minimize the chance an incident occurs, and what solutions were implemented to respond when a failure does occur? Motorcycle riders learn and practice strategies to minimize the possibility of an accident. Many wear a helmet and protective clothing, and carry some sort of emergency contact and medical information to minimize the impact when an accident happens.

This article frames the categories for subsequent articles that cover the preventative and reactive measures you may take, based on the category in which an incident was initiated.

Initiator Categories

To frame the mitigation types, incidents can be summarized into three major categories:

  1. Human error: As humans, we are subject to simply making mistakes. The symlink node/yarn example highlights how a well-intended change broke dependencies and still made it through build, test, and release pipelines. The more recent 2023 FAA system outage is another great example of well-intended actions, where a simple human error had widespread impact.
  2. Human emotions: Humans may respond to situations through subtle but impactful changes. Through a series of human interactions, one programmer broke the internet by deleting a tiny piece of “left-pad” code. The developer wasn’t intentionally trying to break anything by removing their code; they were responding to a trademark dispute, grew frustrated with the process, and simply removed the conflicting code. The fragile versioning and gating systems developers employ allowed the change to flow through, ultimately breaking a sprawling list of dependencies.
  3. Bad actors: Bad actors, or cyber criminals, are those with intent to do harm. They range from individuals to well-funded nation state actors. Unlike human errors, bad actors are doing everything they can to work around any and all known gates.

Propagating Impactful Changes

Today, most software is initiated by good, emotional, and bad humans. While we could argue AI will also have an impact through the auto-generation of source code, humans are still feeding the machines, so I’ll stick with human intent. When you explore how the result may have unintended impacts through upgrade incompatibility, the source of an incident may be deployed more easily than you might initially think.

Similar to the left-pad scenario, where a breaking change traveled through the ecosystem, a change may be made to a dependency (a package, service, application, or operating system) that renders another layer broken when the change is rolled out. Consumers may be following best practices to stay current, yet their stacks may not yet be compatible. This category is also impacted by the three previous categories, where a bad actor could time-bomb a change that renders a package incompatible, disabling a critical component of a system. A payment gateway, a telemetry pipeline, or a valve control system may be rendered inoperable, triggering a cascading failure. As modern software continues to take dependencies on other packages, stacks, and operating systems, this scenario will continue to grow.
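To make this concrete, here’s a minimal sketch of a contract test that pins down the dependency behavior your code relies on, so an incompatible upgrade fails in testing rather than in production. The leftpad package and left_pad function are hypothetical stand-ins for any third-party dependency.

```python
# Hypothetical contract test: encode the dependency behavior our code relies on,
# so an incompatible upgrade fails in CI rather than in production.
import unittest

try:
    from leftpad import left_pad  # hypothetical third-party dependency
except ImportError:
    # Fallback stub so the sketch runs without the (hypothetical) package installed.
    def left_pad(value: str, length: int, fill: str = " ") -> str:
        return value.rjust(length, fill)


class LeftPadContractTest(unittest.TestCase):
    """A breaking upgrade of the dependency should fail these assertions."""

    def test_pads_to_requested_length(self):
        self.assertEqual(left_pad("42", 5, "0"), "00042")

    def test_does_not_truncate_longer_values(self):
        self.assertEqual(left_pad("123456", 3), "123456")


if __name__ == "__main__":
    unittest.main()
```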

Preventative Measures

The categorization helps us understand the types of checks that may be implemented to mitigate various incidents before they have impact. Human error, human emotions, and upgrade incompatibility can likely be captured through unit and functional testing. These categories involve no intent to do harm, so good test coverage can typically gate the impactful change.

Based on the categories, and “intent”, a set of preventative and mitigation strategies can be implemented.

Self Management

These are the types of unit and functional tests that may be written by the development team. The intent is “do no harm”, so the developer is motivated to catch as many human errors as possible. Developers are motivated to minimize the mistakes they unintentionally make, and they would most likely run the tests before committing their code, making it visible to others. This self-reflection on a person’s own code makes self management a reasonable strategy for the human error category. In this case the testing infrastructure doesn’t require separation of duties.
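As a minimal sketch of self management in practice, a pre-commit hook can run the team’s own test suite before a commit is created. The test location and runner below are assumptions; adapt them to your project.

```python
#!/usr/bin/env python3
# Minimal sketch of a .git/hooks/pre-commit hook (paths and runner are assumptions):
# run the project's own tests before the commit is created, so a developer
# catches their own human errors before the change becomes visible to others.
import subprocess
import sys

# Run the test suite; "python -m unittest discover" is one common choice.
result = subprocess.run(
    [sys.executable, "-m", "unittest", "discover", "-s", "tests"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print(result.stdout)
    print(result.stderr, file=sys.stderr)
    print("Tests failed; commit aborted.", file=sys.stderr)
    sys.exit(1)  # non-zero exit blocks the commit

sys.exit(0)
```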

As we transition to human emotions, a developer would likely ignore the pre-commit tests, as their emotions are driving their decisions.

In these two categories, peers within the development team can author and review the various test suites.

Delegated Management and Isolation

While self management may be where a small team starts during incubation, as the project matures or moves into any sort of production scenario, transitioning to delegated management will establish a more secure baseline.

As intent shifts from “do no harm” to the “emotional” and “bad actor” categories, a delegated oversight role is defined. A set of peers performing code review may catch accidental mistakes; however, bad actors who infiltrate a development team can form their own set of peers to commit an impactful change. In delegated management, a higher-level security role defines and/or approves a set of test cases and configurations. Developers have access to the code repository to review and commit code, while a higher-level security group controls the test suite and the configuration of the repository and build environment. When a development group requests a new code repository, members of the security group configure the repository with a set of rules, from how many code reviews must be completed before a merge, to the automated builds and test suites that must pass before a build can be published or promoted.
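As a purely illustrative sketch of delegated management, the policy below is the kind of baseline a security group might define and check repositories against. The policy fields and settings names are hypothetical, not a specific platform’s API.

```python
# Illustrative sketch of delegated management: the security group owns the policy,
# development teams own the code. All field and setting names are hypothetical.
from dataclasses import dataclass


@dataclass
class RepoPolicy:
    required_reviews: int        # code reviews required before a merge
    require_passing_build: bool  # automated build must succeed before publish/promote
    require_test_suite: bool     # the delegated test suite must pass


SECURITY_BASELINE = RepoPolicy(required_reviews=2,
                               require_passing_build=True,
                               require_test_suite=True)


def check_repo(settings: dict, policy: RepoPolicy) -> list[str]:
    """Return a list of policy violations for one repository's current settings."""
    violations = []
    if settings.get("required_reviews", 0) < policy.required_reviews:
        violations.append("fewer required reviews than the security baseline")
    if policy.require_passing_build and not settings.get("build_gate", False):
        violations.append("merges are not gated on a passing build")
    if policy.require_test_suite and not settings.get("test_gate", False):
        violations.append("the delegated test suite is not enforced")
    return violations


# Example settings as they might be fetched from your platform of choice.
example_settings = {"required_reviews": 1, "build_gate": True, "test_gate": False}
for violation in check_repo(example_settings, SECURITY_BASELINE):
    print("Policy violation:", violation)
```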

Delegation and isolation continue into the build environment. Are two different builds of the same source equal? While the ingredients may be the same, what prevents additional ingredients from being introduced, or the build process from being altered? Food supply incidents are often the result of a process failure: a cleaning process was incomplete, introducing a contaminant, or a broken part added a foreign object to the completed product. A product is more than the sum of its parts.
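One way to ask the “are two builds equal?” question is to hash the artifacts produced by two independent builds of the same source and compare the digests. A minimal sketch, assuming the artifact paths shown:

```python
# Sketch: compare two independent builds of the same source by hashing their artifacts.
# Matching digests suggest a reproducible build; differing digests mean something
# (an ingredient, or the process itself) changed between builds. Paths are assumptions.
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


build_a = Path("build-a/app.tar.gz")  # artifact from build environment A (assumed path)
build_b = Path("build-b/app.tar.gz")  # artifact from build environment B (assumed path)

if sha256_of(build_a) == sha256_of(build_b):
    print("Builds match: the same source produced bit-identical artifacts.")
else:
    print("Builds differ: investigate what extra ingredients crept into the process.")
```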

To assure delegated management and isolation, the build environment must be secured through private networks, limiting external ingress and egress, and restricted to auditable configuration changes. The build environment must be completely reset to a known and tested state, which is best done with ephemeral build environments. The output of the build environment can be sealed by signing hashes of the build artifacts before they are promoted from the secured build environment. Sealing and signing a build doesn’t prove the quality of the code; rather, it provides provenance to the source, enabling forensics when something happens.
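Here’s a minimal sketch of sealing a build by signing a hash of the artifact before promotion. It assumes the Python cryptography package and an in-process key for illustration; a real pipeline would more likely use a managed key or a dedicated signing tool.

```python
# Minimal sketch of sealing a build artifact before promotion (assumes the
# "cryptography" package; a real pipeline would more likely use a managed key
# or a dedicated signing service rather than generating a key in-process).
import hashlib
from pathlib import Path

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

artifact = Path("dist/app.tar.gz")  # assumed artifact path
digest = hashlib.sha256(artifact.read_bytes()).digest()

# Sign the digest inside the secured build environment.
signing_key = Ed25519PrivateKey.generate()
signature = signing_key.sign(digest)

# Later, wherever the artifact is consumed, verify provenance with the public key.
public_key = signing_key.public_key()
try:
    public_key.verify(signature, hashlib.sha256(artifact.read_bytes()).digest())
    print("Signature valid: artifact matches what left the secured build environment.")
except InvalidSignature:
    print("Signature invalid: the artifact was altered after it was sealed.")
```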

The basic premise here is a delegation between roles, limiting the impact any one (or a few) human(s) can make across the development-to-distribution supply chain. Even with good intent, accidents do happen. Being able to do root cause analysis enables humans to continually improve their build and test suites.

Post Event Response

Well-intended developers will continue to make human mistakes, including missing something in the preventative test suites. Comparatively, bad actors will go to great lengths to bypass any and all gates put in place. For every test and security check, a bad actor’s funding influences how much effort they will spend to intentionally work around it. Bad actors might write evasive code that slips past test suites, or well-funded bad actors may infiltrate development shops with highly placed employees empowered to alter the test suites. While every effort should be made to prevent human mistakes and bad actors, a multi-layered defense strategy doesn’t assume incidents won’t happen. For comedic relief on what to do when something happens, check out Ron White: Put the Damn Helmet On!

What To Do Next

Whether accidental human factors or a well-funded nation-state bad actor caused a vulnerability to be deployed, a secured supply chain must have an efficient and fast method to communicate that a response is required. It’s not a matter of if, rather when. And when something happens, what will the impact be, and how fast can you evaluate and remediate it?

For now, evaluate your testing infrastructure. Are your developers testing for their own human mistakes? Are the test suites capable of catching an emotional developer committing impactful code when they have permission to bypass the gates? Are your git repos and build environments standardized, making it easy for development teams to securely build and test?

In subsequent articles, I’ll discuss strategies to minimize the blast radius of the vulnerability, and strategies to quickly communicate and respond. We can all contribute to making a supply chain more secure, so we don’t need to tell mom to “Put the Damn Helmet On!”

Steve Lasker