duncanlock.net

Supply Chain Attacks & Package Managers

In his "When Will We Learn" post, Drew DeVault talks about supply chain attacks against language package managers (npm, PyPI, cargo, etc.) - and compares them to official Linux distribution repositories (deb, rpm, etc.).

The conclusion drawn was:

The correct way to ship packages is with your distribution’s package manager. These have a separate review step, completely side-stepping typo-squatting, establishing a long-term relationship of trust between the vendor and the distribution packagers, and providing a dispassionate third-party to act as an intermediary between users and vendors. Furthermore, they offer stable distributions which can be relied upon for an extended period of time, provide cohesive whole-system integration testing, and unified patch distribution and CVE notifications for your entire system.

I think I agree with this, essentially. We do need to change the way we do dependencies when developing - and having someone else review packages would help reduce supply chain attacks.

I wanted to try and figure out if this solution - use official Linux distribution packages instead of language ones - would work in practice, what that might look like, and how that might scale.

Debian vs PyPI

There are lots of Linux distributions and lots of language package managers [1]:

Repository   Language         Package Count   Avg. Growth/day   3rd Party Vetting?
npm          JavaScript       1,965,443       962
Maven        Java             473,973         155
PyPI         Python           375,716         207
RubyGems     Ruby             171,620         17
Crates       Rust             83,189          75
Debian       Debian Linux     96,728          ? [2]
Arch AUR     Arch Linux AUR   74,694          26
Arch         Arch Linux       13,006          ?

I’m not going to try to compare them all - I’m going to pick two. I’m going to start with Debian - one of the largest open source collaborations in the world, and the Python package repo, PyPI - a medium-sized language package repo.

So, the Debian stable repo currently contains 96,728 packages - packages curated and maintained by dedicated 3rd party package maintainers. The PyPI repo contains 375,716 packages - uploaded by anyone, usually whoever wrote the Python module, but really anyone. PyPI has nearly four times as many packages as Debian.

What would pypi.debian.org look like?

From the point of view of a user, there are two major differences between the Linux system package managers & the language ones:

System Wide

The first is that the system ones are (mostly) intended to install packages globally, at the system/OS level - and the language ones now mostly install into a folder/local virtual environment [3]. This means that you can have an independent set of packages installed for each project that you’re working on. This avoids the version clashes/dll/dependency hell type stuff that happens if you have one global set of packages and is currently considered “best practice”, mostly.

So, a Debian Python repo wouldn’t turn everything into .deb packages and use apt - not if you wanted anyone to use it. To get any traction with users, it would have to work the same way as upstream PyPI and work with the same tools & workflow - pip, poetry, etc. You’d just configure your tools to talk to pypi.debian.org, instead of pypi.org.

The difference would be that the packages are hosted by Debian and vetted/maintained by Debian package maintainers, like Debian stable deb packages are.

Package Freshness

The second major difference is package freshness. Language package managers like PyPI have the very latest version of everything all the time. Developers publish new packages whenever they release a new version, often completely automatically. The versions of packages in the Debian stable repo are fixed at release time, and only get urgent security fixes after that - hence the name “stable”. There’s also Debian Testing, which has more up-to-date packages, and which will become the next stable repo when the next version of Debian is released. In general, language repos are always up-to-date and Debian repos are always behind.

How many maintainers would you need?

According to this list, the Debian project currently has 240 maintainers. Given that Debian has 96,728 packages / 240 maintainers, that’s 403 packages each.

Anyway, if we just extrapolate those numbers to a Debian version of PyPI, that would mean that you’d need 375,716 packages / 403 each = 932 maintainers to run it. That’s quite a lot. The entire Debian project membership is currently 1022 people.

So, if you wanted to maintain a Debian version of the python package repo, with roughly the same amount of package vetting as Debian stable, you’d need a volunteer effort about the same size as the whole Debian project, all over again.

You’d also need to add one new maintainer roughly every two days, to keep up with new package growth.
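The back-of-the-envelope arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in this post; the function names are mine, and rounding to the nearest whole maintainer is an assumption.

```python
# Figures quoted in this post.
DEBIAN_PACKAGES = 96_728
DEBIAN_MAINTAINERS = 240
PYPI_PACKAGES = 375_716
PYPI_GROWTH_PER_DAY = 207


def packages_per_maintainer(packages: int, maintainers: int) -> int:
    """Average number of packages each maintainer looks after, rounded."""
    return round(packages / maintainers)


def maintainers_needed(packages: int, per_maintainer: int) -> int:
    """Maintainers required at the given workload, rounded to the nearest."""
    return round(packages / per_maintainer)


def days_per_new_maintainer(per_maintainer: int, growth_per_day: int) -> float:
    """Days of upstream growth that fill one more maintainer's workload."""
    return per_maintainer / growth_per_day


workload = packages_per_maintainer(DEBIAN_PACKAGES, DEBIAN_MAINTAINERS)
print(workload)                                     # 403
print(maintainers_needed(PYPI_PACKAGES, workload))  # 932
print(round(days_per_new_maintainer(workload, PYPI_GROWTH_PER_DAY), 1))  # 1.9
```

Swapping in another repository's package count and growth rate from the table at the top gives the equivalent estimate for that ecosystem.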

You can obviously argue these numbers, but whatever the Debian project is doing, they seem to have been doing it fairly successfully since 1993; whatever it is, it looks at least somewhat sustainable.

If you think about it, all these packages from all these different repositories are, roughly, the output of the open source ecosystem. If you want to get someone else, other than the developers, to review all this stuff, you are either going to need your existing volunteer developers to up their volunteer workload and review each other’s stuff - or you are going to need to get a load more volunteers from somewhere.

Anyway, that seems like a big ask - a lot of people. Maybe we could optimize this somehow - work smarter, not harder?

20% of the packages, 80% of the value?

Maybe we don’t need everything, just the popular stuff? How many packages account for 80% of downloads?

According to PyPI Stats, PyPI had a total of 14,756,299,061 package downloads last month. Fourteen billion package downloads per month - that’s quite a lot! The most downloaded package was boto3, with 325,102,697 downloads. So that package accounted for 2.2% of all downloads.

How about the top 20 packages?

Package              Downloads       % of Total
Total                3,106,681,096   21.05%
boto3                325,102,697     2.20%
urllib3              210,456,675     1.43%
botocore             207,095,211     1.40%
requests             200,489,161     1.36%
idna                 172,283,921     1.17%
setuptools           168,960,136     1.15%
s3transfer           168,397,166     1.14%
typing-extensions    161,630,822     1.10%
six                  152,703,179     1.03%
certifi              147,959,264     1.00%
python-dateutil      146,990,800     1.00%
pyyaml               138,941,619     0.94%
charset-normalizer   135,959,075     0.92%
awscli               121,743,694     0.83%
click                114,611,382     0.78%
wheel                112,656,886     0.76%
numpy                110,481,070     0.75%
cryptography         107,687,178     0.73%
rsa                  101,669,487     0.69%
pyparsing            100,861,673     0.68%

So, the 20 most downloaded packages account for 21% of all downloads. How far down do we have to go to account for 80%?

Well, the top 5000 packages by download count are available here, which you can total up like this:

➜ curl -s https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.json | jq .rows | jq -r '(.[0] | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv' | cut -d',' -f1 | awk '{total = total + $1}END{print total}' | numfmt --grouping

13,225,311,500

Ok, the top 5000 packages account for 13,225,311,500 downloads a month, so…​ 13,225,311,500 / 14,756,299,061 * 100 = 89.62% of total downloads are accounted for by the top 5000 packages. In fact, the first 805 packages account for 80% of the downloads.
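Finding the cut-off point is just a running total over the ranked download counts. Here is a minimal sketch of that step; the function name is mine and the sample counts are made up for illustration, not real PyPI data.

```python
from itertools import accumulate


def packages_for_share(download_counts, total_downloads, target=0.80):
    """Return how many of the most-downloaded packages are needed to reach
    `target` as a fraction of `total_downloads`, or None if the listed
    packages never get there."""
    ranked = sorted(download_counts, reverse=True)
    for rank, running_total in enumerate(accumulate(ranked), start=1):
        if running_total / total_downloads >= target:
            return rank
    return None


# Made-up counts for illustration only.
sample = [500, 200, 150, 80, 40, 20, 10]
print(packages_for_share(sample, total_downloads=1000))  # 3
```

Note that `total_downloads` is passed in separately, because the top-5000 list doesn't cover every download on PyPI.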

Perhaps unsurprisingly, if you plot that on a graph, it produces a perfect inverted power law curve, with a very long tail:

Line graph
Figure 1. This tail continues to very slowly approach 100%, until you get down to the packages that have never been downloaded. The gray dotted lines show the 805th package accounting for 80% of the downloads.

So, you could put up a pypi.debian.org with only 805 packages on it and satisfy 80% of downloads - or with 5000 packages to satisfy 89% of downloads. Using our formula from above (403 packages per maintainer), you would need only two maintainers for the 805 packages and only 13 maintainers for the 5000 package version. That sounds a lot more achievable!

These are almost certainly also the most actively updated packages, so you’d definitely need more maintainers than that - but even if you need 10 times that many, that’s still much more achievable.

But is that enough - and is it solving the right problem?

Which packages are the problem?

Thinking about where supply chain attacks happen - it’s usually not the big packages. The most downloaded Python package, boto3, is maintained by Amazon’s AWS team and has many, many eyeballs on it. It would be extremely hard to slip something malicious into boto3.

Figure 2. xkcd #2347, "Dependency". That arrow is pointing to the ideal target for a supply chain attack.

I think this is probably the same for most of the popular packages - they have enough eyeballs on them already. The really juicy supply chain attacks are when you find some package that happens to be depended on by lots of other packages, but is developed & maintained by just one person. left-pad is the obvious example of this kind of package, but there are lots of others.

In my experience, most software project dependencies follow a power law too - they depend on a few big packages, and a larger number of smaller ones. If your package repository only covers the big packages, people will either have to fall back to PyPI for the little ones (leaving a supply chain attack hole), or more likely just continue to use PyPI for everything – defeating the purpose entirely.

Does this mean that you have to support all the packages to be useful? Possibly? If you did support all packages, that would certainly make it a no-brainer to switch and adoption would be much easier. But “just support all of PyPI” doesn’t seem like an achievable goal to me - I think you’d need some way to get started smaller and work your way up.

Start with the Problem Packages

It seems to me that you could come up with a rough list of the problem packages - the ones that have few developers but lots of things depending on them - with only two pieces of data. You just need a list of all the packages on PyPI with: how many things depend on each, and how many developers work on them. It looks like the information you’d need is either available in Google’s BigQuery public datasets, both for the PyPI & GitHub data, or in the dependency data from libraries.io.
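As a rough sketch of what that ranking could look like: score each package by dependents per maintainer and sort. The scoring formula is my assumption (one simple option among many), and the package names and numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageInfo:
    name: str
    dependents: int   # how many other packages depend on this one
    maintainers: int  # how many people actively work on it


def risk_score(pkg: PackageInfo) -> float:
    """Dependents per maintainer: higher means more reach, fewer eyeballs."""
    return pkg.dependents / max(pkg.maintainers, 1)


def rank_problem_packages(packages, limit=10):
    """Return up to `limit` packages with the highest risk score, worst first."""
    return sorted(packages, key=risk_score, reverse=True)[:limit]


# Hypothetical packages and numbers, for illustration only.
candidates = [
    PackageInfo("big-sdk", dependents=5000, maintainers=50),
    PackageInfo("tiny-helper", dependents=3000, maintainers=1),
    PackageInfo("niche-tool", dependents=4, maintainers=1),
]
for pkg in rank_problem_packages(candidates):
    print(pkg.name, risk_score(pkg))
```

Here the one-person package with thousands of dependents comes out on top, well ahead of the bigger but well-staffed one.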

It seems to me that you could start with that list and maintain those packages in your vetted repo, and then just provide a transparent proxy to PyPI for the rest. You could then add to your list of verified packages over time, and anyone using your PyPI mirror would get less vulnerable to supply-chain attacks over time.
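The routing decision at the heart of such a proxy is simple. This sketch only shows that decision, not a working proxy: pypi.debian.org is the hypothetical mirror from this post, and the function names and the vetted list are made up. The name normalization follows the rule described in PEP 503.

```python
import re

# pypi.debian.org is hypothetical; it is the imagined mirror from this post.
VETTED_INDEX = "https://pypi.debian.org/simple"
UPSTREAM_INDEX = "https://pypi.org/simple"


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def index_for(package: str, vetted: set[str]) -> str:
    """Serve vetted packages from the mirror, pass the rest through to PyPI."""
    if normalize(package) in vetted:
        return VETTED_INDEX
    return UPSTREAM_INDEX


# Made-up vetted list; names are stored in normalized form.
vetted_packages = {"requests", "typing-extensions"}
print(index_for("Typing_Extensions", vetted_packages))
print(index_for("some-unreviewed-package", vetted_packages))
```

Growing the vetted set over time moves more and more lookups onto the reviewed mirror without users having to change anything.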

Do package maintainers actually do code & security reviews?

I’m sure this varies a lot by package & maintainer, but I think the answer to this is mostly not, at least for Debian. They are involved in fixing bugs in the packages they maintain - but mostly bugs that affect packaging them up for Debian. I think they’re generally focussed on just the packaging part. That doesn’t mean that they couldn’t do code & security reviews, if that was the desired outcome.

I think people would pay for this?

As language package ecosystems grow, supply chain attacks seem to be on the rise, taking advantage of this new vector into the heart of organizational development teams.

Some of these organizations pay Red Hat a subscription for Red Hat Enterprise Linux, which includes the RPM package repos - which provide this kind of service for the Fedora/Red Hat package ecosystem. Some of these same organizations then get completely untrusted code directly from npm/PyPI/Maven and just run it. It seems likely that some of them would also pay for something like pypi.redhat.com.

If you were paying developers full-time to maintain these packages, then presumably you could maintain more packages, with fewer people, than volunteer maintainers can in their spare time. This would further reduce the total number of maintainers you’d need.


So, yeah, I think you could probably make it work, sustainably, without needing too many people. What do you think?


References & Footnotes


  1. Numbers mostly from http://www.modulecounts.com/
  2. You could probably figure this out using data from https://snapshot.debian.org/
  3. This is all configurable - apt can install packages for a single user, but doesn’t by default - but it can’t really install packages into a single folder. Similarly, pip & poetry can install packages globally, but that’s not the way most people use them.
