
Supply Chain Attacks & Package Managers - a Solution?

In this When Will We Learn post, Drew DeVault talks about supply chain attacks against language package managers (npm, PyPI, cargo, etc.) - and compares them to official Linux distribution repositories (deb, rpm, etc.).

The conclusion drawn was:

The correct way to ship packages is with your distribution’s package manager. These have a separate review step, completely side-stepping typo-squatting, establishing a long-term relationship of trust between the vendor and the distribution packagers, and providing a dispassionate third-party to act as an intermediary between users and vendors. Furthermore, they offer stable distributions which can be relied upon for an extended period of time, provide cohesive whole-system integration testing, and unified patch distribution and CVE notifications for your entire system.

I think I agree with this, essentially. We do need to change the way we do dependencies when developing - and having someone else review packages would help reduce supply chain attacks.

I wanted to try and figure out if this solution - use official Linux distribution packages instead of language ones - would work in practice, what that might look like, and how that might scale.

Debian vs PyPI

There are lots of Linux distributions and lots of language package managers [1]:

| Repository | Language | Package Count | Avg. Growth/day | 3rd Party Vetting? |
|---|---|---|---|---|
| npm | JavaScript | 1,965,443 | 962 | No |
| Maven | Java | 473,973 | 155 | No |
| PyPI | Python | 375,716 | 207 | No |
| RubyGems | Ruby | 171,620 | 17 | No |
| Crates | Rust | 83,189 | 75 | No |
| Debian | Debian Linux | 96,728 | ?[2] | Yes |
| Arch AUR | Arch Linux AUR | 74,694 | 26 | No |
| Arch | Arch Linux | 13,006 | ? | Yes |

I’m not going to try to compare them all - I’m going to pick two: Debian - one of the largest open source collaborations in the world - and the Python package repo, PyPI - a medium-sized language package repo.

So, the Debian stable repo currently contains 96,728 packages - packages curated and maintained by dedicated 3rd party package maintainers. The PyPI repo contains 375,716 packages - uploaded by anyone; usually whoever wrote the Python module, but really, anyone. The Python repo has roughly four times the number of packages that Debian has.

What would pypi.debian.org look like?

From the point of view of a user, there are two major differences between the Linux system package managers & the language ones:

System Wide

The first is that the system ones are (mostly) intended to install packages globally, at the system/OS level - while the language ones now mostly install into a per-project folder/local virtual environment [3]. This means that you can have an independent set of packages installed for each project that you’re working on. This avoids the version-clash/DLL-hell/dependency-hell type problems that happen when you have one global set of packages, and is currently considered “best practice”, mostly.
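For example, the standard per-project workflow with Python’s built-in venv module looks like this:

➜ python3 -m venv .venv          # create an isolated environment inside the project folder
➜ source .venv/bin/activate      # use it for this shell session
➜ pip install requests           # installs into .venv, not system-wide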

So, a Debian Python repo wouldn’t turn everything into .deb packages and use apt - not if you wanted anyone to use it. To get any traction with users, it would have to work the same way as upstream PyPI and work with the same tools & workflow - pip, poetry, etc. You’d just configure your tools to talk to pypi.debian.org, instead of pypi.org.

The difference would be that the packages are hosted by Debian and vetted/maintained by Debian package maintainers, like Debian stable deb packages are.
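If it existed, switching over could be a one-line configuration change. A minimal sketch - assuming the hypothetical pypi.debian.org served the same “simple” index API that pip already understands:

➜ pip config set global.index-url https://pypi.debian.org/simple/   # all future installs use the mirror
➜ pip install requests                                              # now resolved via the vetted Debian mirror

Or per-invocation, using pip’s --index-url flag: pip install --index-url https://pypi.debian.org/simple/ requests.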

Package Freshness

The second major difference is package freshness. Language package managers like PyPI have the very latest version of everything, all the time. Developers publish new packages whenever they release a new version, often completely automatically. The versions of packages in the Debian stable repo are fixed at release time, and only get urgent security fixes after that - hence the name “stable”. There’s also Debian Testing, which has more up-to-date packages and will become the next stable repo when the next version of Debian is released. In general, language repos are always up-to-date and Debian repos are always behind.

How many maintainers would you need?

According to this list, the Debian project currently has 240 maintainers. Given that Debian has 96,728 packages / 240 maintainers, that’s 403 packages each.

Anyway, if we just extrapolate those numbers to a Debian version of PyPI, that would mean that you’d need… 375,716 packages / 403 each ≈ 932 maintainers to run it. That’s quite a lot. The entire Debian project membership is currently 1,022 people.

So, if you wanted to maintain a Debian version of the python package repo, with roughly the same amount of package vetting as Debian stable, you’d need a volunteer effort about the same size as the whole Debian project, all over again.

You’d also need to add one new maintainer roughly every two days, to keep up with new package growth.
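Here’s the back-of-the-envelope arithmetic behind those numbers, as a quick shell check:

➜ echo $((96728 / 240))            # packages per Debian maintainer
403
➜ echo $((375716 / 403))           # maintainers needed to cover all of PyPI
932
➜ echo "scale=1; 403 / 207" | bc   # days for PyPI growth (207 new packages/day) to fill up one new maintainer
1.9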

You can obviously argue these numbers, but whatever the Debian project is doing, they seem to have been doing it fairly successfully since 1993; whatever it is, it looks at least somewhat sustainable.

If you think about it, all these packages from all these different repositories are, roughly, the output of the open source ecosystem. If you want to get someone else, other than the developers, to review all this stuff, you are either going to need your existing volunteer developers to up their volunteer workload and review each other’s stuff - or you are going to need to get a load more volunteers from somewhere.

Anyway, that seems like a big ask - a lot of people. Maybe we could optimize this somehow - work smarter, not harder?

20% of the packages, 80% of the value?

Maybe we don’t need everything, just the popular stuff? How many packages account for 80% of downloads?

According to PyPI Stats, PyPI had a total of 14,756,299,061 package downloads last month. Fourteen billion package downloads per month - that’s quite a lot! The most downloaded package was boto3, with 325,102,697 downloads. So that package accounted for 2.2% of all downloads.
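That percentage is easy to check:

➜ awk 'BEGIN { printf "%.2f%%\n", 325102697 / 14756299061 * 100 }'
2.20%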

How about the top 20 packages?

| Package | Downloads | % of Total |
|---|---|---|
| Total | 3,106,681,096 | 21.05% |
| boto3 | 325,102,697 | 2.20% |
| urllib3 | 210,456,675 | 1.43% |
| botocore | 207,095,211 | 1.40% |
| requests | 200,489,161 | 1.36% |
| idna | 172,283,921 | 1.17% |
| setuptools | 168,960,136 | 1.15% |
| s3transfer | 168,397,166 | 1.14% |
| typing-extensions | 161,630,822 | 1.10% |
| six | 152,703,179 | 1.03% |
| certifi | 147,959,264 | 1.00% |
| python-dateutil | 146,990,800 | 1.00% |
| pyyaml | 138,941,619 | 0.94% |
| charset-normalizer | 135,959,075 | 0.92% |
| awscli | 121,743,694 | 0.83% |
| click | 114,611,382 | 0.78% |
| wheel | 112,656,886 | 0.76% |
| numpy | 110,481,070 | 0.75% |
| cryptography | 107,687,178 | 0.73% |
| rsa | 101,669,487 | 0.69% |
| pyparsing | 100,861,673 | 0.68% |

So, the 20 most downloaded packages account for 21% of all downloads. How far down do we have to go to account for 80%?

Well, the top 5000 packages by download count are available here, which you can total up like this:

➜ curl -s https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json \
    | jq .rows \
    | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv' \
    | cut -d',' -f1 \
    | awk '{total = total + $1}END{print total}' \
    | numfmt --grouping

13,225,311,500

Ok, the top 5000 packages account for 13,225,311,500 downloads a month, so… 13,225,311,500 / 14,756,299,061 × 100 = 89.62% of total downloads are accounted for by the top 5000 packages. In fact, the first 805 packages account for 80% of the downloads.
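You can also find that cutoff directly, with a simpler variant of the same pipeline - assuming the dataset still exposes a download_count field on each row, already sorted by downloads:

➜ curl -s https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json \
    | jq -r '.rows[].download_count' \
    | awk -v total=14756299061 '{ cum += $1; n++ } cum / total >= 0.8 { print n; exit }'
805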

Perhaps unsurprisingly, if you plot that on a graph, it produces a perfect inverted power law curve, with a very long tail:

Figure 1. This tail continues to very slowly approach 100%, until you get down to the packages that have never been downloaded. The gray dotted lines show the 805th package accounting for 80% of the downloads.

So, you could put up a pypi.debian.org with only 805 packages on it and satisfy 80% of downloads - or with 5000 packages, and satisfy 89% of downloads. Using our formula from above, you would need… only two maintainers for the 805-package version, and only 13 maintainers for the 5000-package version. That sounds a lot more achievable!

These are almost certainly also the most actively updated packages, so you’d definitely need more maintainers than that - but even if you need 10 times that many, that’s still much more achievable.

But is that enough - and is it solving the right problem?

Which packages are the problem?

Thinking about where supply chain attacks happen - it’s usually not the big packages. The most downloaded Python package, boto3, is maintained by Amazon’s AWS team and has many, many eyeballs on it. It would be extremely hard to slip something malicious into boto3.

Figure 2. That arrow is pointing to the ideal target for a supply chain attack: xkcd #2347

I think this is probably the same for most of the popular packages - they have enough eyeballs on them already. The really juicy supply chain attacks happen when you find some package that lots of other packages depend on, but which is developed & maintained by just one person. left-pad is the obvious example of this, but there are lots of others.

In my experience, most software project dependencies follow a power law too - they depend on a few big packages, and a larger number of smaller ones. If your package repository only covers the big packages, people will either have to fall back to PyPI for the little ones (leaving a supply chain attack hole), or more likely just continue to use PyPI for everything – defeating the purpose entirely.

Does this mean that you have to support all the packages to be useful? Possibly? If you did support all packages, that would certainly make it a no-brainer to switch and adoption would be much easier. But “just support all of PyPI” doesn’t seem like an achievable goal to me - I think you’d need some way to get started smaller and work your way up.

Start with the Problem Packages

It seems to me that you could come up with a rough list of the problem packages - the ones that have few developers but lots of things depending on them - with only two pieces of data: for every package on PyPI, how many packages depend on it, and how many developers work on it. It looks like the information you’d need is either available in Google’s BigQuery public datasets - both the PyPI & GitHub data - or in the dependency data from libraries.io.
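As a sketch, suppose you exported that data to a CSV with package,dependents,developers columns (the file name and columns here are hypothetical - you’d get them out of BigQuery or libraries.io). Ranking by dependents-per-developer would surface the likely targets:

➜ awk -F, 'NR > 1 && $3 > 0 { printf "%d,%s\n", $2 / $3, $1 }' pypi-deps.csv \
    | sort -t, -k1,1 -rn \
    | head -20    # the 20 packages with the most dependents per developer

A package with thousands of dependents and a single developer would float straight to the top of that list.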

It seems to me that you could start with that list, maintain those packages in your vetted repo, and just provide a transparent proxy to PyPI for the rest. You could then expand the list of vetted packages over time, and anyone using your PyPI mirror would gradually become less vulnerable to supply chain attacks.

Do package maintainers actually do code & security reviews?

I’m sure this varies a lot by package & maintainer, but I think the answer is mostly no - at least for Debian. Debian maintainers are involved in fixing bugs in the packages they maintain - but mostly bugs that affect packaging them for Debian. I think they’re generally focussed on just the packaging part. That doesn’t mean that they couldn’t do code & security reviews, if that was the desired outcome.

I think people would pay for this?

As language package ecosystems grow, supply chain attacks seem to be on the rise, taking advantage of this new vector into the heart of organizational development teams.

Some of these organizations pay Red Hat a subscription for Red Hat Enterprise Linux, which includes the RPM package repos - which provide this kind of service for the Fedora/Red Hat package ecosystem. Some of these same organizations then get completely untrusted code directly from npm/PyPI/Maven and just run it. It seems likely that some of them would probably also pay for something like pypi.redhat.com.

If you were paying developers full-time to maintain these packages, then presumably you could maintain more packages, with fewer people, than volunteer maintainers can in their spare time. This would further reduce the total number of maintainers you’d need.


So, yeah, I think you could probably make it work, sustainably, without needing too many people. What do you think?


References & Footnotes


  1. Numbers mostly from http://www.modulecounts.com/
  2. You could probably figure this out using data from https://snapshot.debian.org/
  3. This is all configurable - apt can install packages for a single user, but doesn’t by default - but it can’t really install packages into a single folder. Similarly, pip & poetry can install packages globally, but that’s not the way most people use them.
