bv

layer sharing: how factored OCI images dedupe across tools

How bv splits a container image into one OCI layer per conda package, and how it keeps those layers identical across every tool that uses them.

the problem

A typical bioinformatics tool ships as a single container image holding a full conda environment: the tool itself, its direct dependencies, the transitive closure of those dependencies, and conda's own bookkeeping files. The whole stack lives at one OCI manifest digest. The registry stores it as one set of layer blobs. Docker pulls it as one set of blobs.

Install samtools, bwa, and bcftools, and all three pull htslib, openssl, libgcc-ng, zlib, ncurses, xz, and bzip2. The bytes are identical across tools, but each image carries its own copy. The user pays for each shared package three times.

This is the same shape of problem Graham Christensen described for Nix and layered Docker images: putting an entire build closure into one tarball wastes bandwidth and storage in proportion to the redundancy across closures. The fix is structural, not incremental: build images out of layers small enough that overlap is the rule, not the exception.

the basic idea

Build one OCI layer per conda package. Make each layer content-addressed by its bytes: same (name, version, build, sha256) in, same uncompressed tar out, same digest. Docker's pull-by-digest dedup then does the rest. When two images list the same layer digest, the second image only downloads what's new.
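
Registry-side sharing is then just Docker's normal pull-by-digest behavior: the client fetches only the layer digests it does not already hold. A minimal sketch of that mechanism (illustrative only, not bv code):

use std::collections::HashSet;

// Which layer blobs a client actually has to fetch for a new image, given the
// digests it already holds from previously pulled images.
fn blobs_to_download(cached: &HashSet<String>, manifest_layers: &[String]) -> Vec<String> {
    manifest_layers
        .iter()
        .filter(|digest| !cached.contains(digest.as_str()))
        .cloned()
        .collect()
}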

The 10th tool a user installs shares ~60–70% of its layer set with the first 9.

The whole pipeline lives in bv-builder.

why scale breaks the naive version

"One package per layer" works for small closures. At scale it doesn't. OCI registries impose soft limits (Docker registries reject images over ~120 layers; many runtimes warn at 64). And every layer adds a roundtrip when pulling: a 200-package image with 200 layers will rate-limit or stall.

So when there are too many packages, some have to merge into a shared layer. The catch: if the merging is arbitrary, dedup breaks. Suppose samtools and bwa both contain zlib. If zlib sits in a solo layer in samtools but in a long-tail layer in bwa, the two layer digests differ, and the user pays for zlib twice.

The merging policy must satisfy one rule: a package that's worth deduping must always land in the same layer structure across every tool that contains it. Then its digest is identical across tools, and dedup works.

popularity-based packing

The strategy lives in bv-builder/src/layering.rs:

pub enum PackingStrategy {
    // Every package gets its own layer; fine for small closures.
    OnePerPackage,
    // Popular packages get solo layers; the rest share one long-tail layer.
    PopularityBased { max_layers: usize },
}

Given a tool's resolved package list and a global popularity score per package (precomputed across the entire registry), the builder:

  1. Sorts all packages by (popularity desc, name asc). Name-asc is a deterministic tie-break.
  2. Reserves the top 2 manifest slots for meta (conda-meta JSON) and entrypoint (the bv-entrypoint.sh script).
  3. Gives each of the top max_layers - 2 packages its own solo layer.
  4. Concatenates the rest into a single long-tail layer at the bottom of the manifest.
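
Steps 3 and 4 reduce to slicing the sorted list:
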
let solo_count = max_layers.saturating_sub(2).min(sorted.len());
let (solo, tail) = sorted.split_at(solo_count);
// solo: one LayerGroup per package
// tail: one LayerGroup with all remaining packages
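
Put together, the ranking-and-slicing step looks roughly like this; Pkg and LayerGroup here are hypothetical stand-ins for the real types in layering.rs:

#[derive(Clone)]
struct Pkg {
    name: String,
    popularity: u64, // cross-tool count from popularity.json
}

enum LayerGroup {
    Meta,               // conda-meta JSON
    Entrypoint,         // bv-entrypoint.sh
    Solo(Pkg),          // one popular package per layer
    LongTail(Vec<Pkg>), // everything below the threshold, concatenated
}

fn pack(mut pkgs: Vec<Pkg>, max_layers: usize) -> Vec<LayerGroup> {
    // Global, deterministic key: popularity descending, then name ascending.
    pkgs.sort_by(|a, b| b.popularity.cmp(&a.popularity).then_with(|| a.name.cmp(&b.name)));

    // Two manifest slots are reserved for meta and entrypoint.
    let solo_count = max_layers.saturating_sub(2).min(pkgs.len());
    let (solo, tail) = pkgs.split_at(solo_count);

    let mut groups = vec![LayerGroup::Meta, LayerGroup::Entrypoint];
    groups.extend(solo.iter().cloned().map(LayerGroup::Solo));
    if !tail.is_empty() {
        groups.push(LayerGroup::LongTail(tail.to_vec()));
    }
    groups
}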

The popularity score is computed once across the registry by popularity.rs::compute_from_spec_dir: walk every tools/*/*.toml, count how many tools list each package name, dump to popularity.json. Each per-tool build reads that file and uses it to rank.
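
A sketch of that counting pass; the nested list input is a stand-in for whatever compute_from_spec_dir parses out of tools/*/*.toml, since the spec schema isn't reproduced here:

use std::collections::{BTreeMap, BTreeSet};

// For each package name, count how many tools list it.
fn count_popularity(tool_package_lists: &[Vec<String>]) -> BTreeMap<String, u64> {
    let mut counts: BTreeMap<String, u64> = BTreeMap::new();
    for packages in tool_package_lists {
        // Each tool contributes at most one count per package name.
        let unique: BTreeSet<&String> = packages.iter().collect();
        for name in unique {
            *counts.entry(name.clone()).or_insert(0) += 1;
        }
    }
    counts // the builder serializes this map to popularity.json
}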

why it works

Claim. If package P is among the top max_layers - 2 most popular packages registry-wide, then P lands in a solo layer in every tool that contains it.

The threshold is fixed across builds. The popularity ranking is global, not per-tool. The sort key is deterministic. So if openssl ranks at position 4 globally, then in any tool's package list that contains openssl, sorting that subset by the same global key keeps openssl at the top of the subset (its rank within a subset is at least as high as it is globally). Within the subset, openssl stays inside the top max_layers - 2 slice and gets a solo layer.
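
A toy illustration of that subset argument, with made-up rankings:

fn rank_in(order: &[&str], pkg: &str) -> usize {
    order.iter().position(|p| *p == pkg).unwrap()
}

fn main() {
    // Made-up global ranking, most popular first.
    let global = ["openssl", "zlib", "libgcc-ng", "htslib", "bwa-bin"];
    // One tool's closure: an arbitrary subset of the registry, in arbitrary order.
    let subset = ["htslib", "openssl", "bwa-bin"];

    // Sort the subset by the same global key the builder uses.
    let mut sorted: Vec<&str> = subset.to_vec();
    sorted.sort_by_key(|p| rank_in(&global, p));

    // Each package's rank inside the subset is at least as good as its global rank,
    // so anything in the global top-(max_layers - 2) stays in the subset's solo slice.
    for pkg in subset {
        assert!(rank_in(&sorted, pkg) <= rank_in(&global, pkg));
    }
}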

The cross-tool invariant has a test: shared_popular_packages_get_solo_layers_across_tools in layering.rs synthesizes 100 fake tools each containing seven shared packages plus a unique one, and asserts every shared package gets a solo layer in every tool.

The follow-up test, shared_package_has_same_solo_group_across_tools, asserts the second half of the property: same package + same (name, version, build, sha256) means the same input to create_reproducible_layer, the same compressed bytes, and therefore the same digest.

reproducible bytes

For dedup to fire, two builds of the same package must produce bit-identical compressed bytes. That is the job of bv-builder/src/build.rs::create_reproducible_layer.
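
In practice "reproducible" means normalizing everything the filesystem would otherwise leak into the tar: entry order, timestamps, ownership, and the compression settings. A sketch of that shape, assuming the tar and flate2 crates (an assumption about the stack; the real create_reproducible_layer may normalize more fields than this):

use flate2::{write::GzEncoder, Compression};

// Identical (path, bytes) input always yields identical compressed bytes,
// and therefore an identical layer digest.
fn reproducible_layer(mut files: Vec<(String, Vec<u8>)>) -> std::io::Result<Vec<u8>> {
    // Deterministic entry order: sort by path, never by filesystem iteration order.
    files.sort_by(|a, b| a.0.cmp(&b.0));

    let gz = GzEncoder::new(Vec::new(), Compression::new(6)); // fixed compression level
    let mut tar = tar::Builder::new(gz);

    for (path, data) in &files {
        let mut header = tar::Header::new_gnu();
        header.set_size(data.len() as u64);
        header.set_mode(0o644);
        header.set_mtime(0); // zeroed timestamp
        header.set_uid(0);   // zeroed ownership
        header.set_gid(0);
        header.set_cksum();
        tar.append_data(&mut header, path, data.as_slice())?;
    }

    tar.into_inner()?.finish()
}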

The create_reproducible_layer_is_deterministic test runs the pipeline twice on the same directory and asserts byte-identical output.

three tools, one closure

Three tools, all of which use openssl and zlib: samtools also uses htslib, bwa also uses bwa-bin, and bcftools also uses htslib and bcftools-bin. Set max_layers = 8.

Global popularity (counts in the full registry, illustrative numbers):

openssl       500
zlib          480
htslib         40
bwa-bin         1
bcftools-bin    1

For samtools (closure: openssl, zlib, htslib): solo_count = min(8 - 2, 3) = 3, so every package gets a solo layer. The manifest is meta, entrypoint, then solo layers for openssl, zlib, and htslib; there is no long-tail layer.

For bwa (closure: openssl, zlib, bwa-bin): the same shape. Meta, entrypoint, then solo layers for openssl, zlib, and bwa-bin.

For bcftools (closure: openssl, zlib, htslib, bcftools-bin): meta, entrypoint, then solo layers for openssl, zlib, htslib, and bcftools-bin; four packages still fit inside the six solo slots.

The dedup that follows from this:

layer                  appears in                 bytes downloaded
openssl (solo)         samtools, bwa, bcftools    once total
zlib (solo)            samtools, bwa, bcftools    once total
htslib (solo)          samtools, bcftools         once total
bwa-bin (solo)         bwa                        once
bcftools-bin (solo)    bcftools                   once

Same (name, version, build, sha256) in each tool's openssl layer means the same reproducible-tar input, which means the same digest, which means Docker pulls the bytes once.

long-tail layers

Now suppose weird-thing has a closure of 80 packages and max_layers = 64. solo_count = 64 - 2 = 62, so the 62 most popular packages (by the same global ranking) each get a solo layer, and the remaining 18 are concatenated into one long-tail layer. The solo layers still dedupe against every other tool that contains those packages; only the long-tail layer, whose contents depend on weird-thing's particular closure, is unlikely to match any other tool's tail.
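
The same slicing arithmetic, spelled out:

fn main() {
    let (max_layers, closure_size): (usize, usize) = (64, 80);
    let solo_count = max_layers.saturating_sub(2).min(closure_size); // 62 solo layers
    let tail_count = closure_size - solo_count;                      // 18 packages share one tail layer
    assert_eq!((solo_count, tail_count), (62, 18));
}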

Expected dedup at the registry level is therefore close to optimal as long as the popularity distribution is heavy-tailed, which package ecosystems empirically are.

vs Graham's Nix layering

Graham's Nix algorithm uses each derivation's referrer count in the global build graph, plus depth and closure size as tiebreakers, and produces a topological layer ordering. bv's algorithm is the same idea, flatter: there is no equivalent global build graph for conda, so bv approximates "how reusable is this layer" with a raw cross-tool count from popularity.json, ranks, and slices.

What this loses to Nix-style:

  1. No closure-size tiebreak. If two packages have identical scores, the alphabetically earlier one wins (deterministic but not optimal).
  2. No version-aware popularity. openssl is one bucket regardless of version. If openssl 3.2 and openssl 3.3 are both deployed across the registry, both inherit the same rank. This is intentional (see the comment in popularity.rs): it bounds layer-order churn during version bumps, at the cost of slightly less perfect dedup mid-bump.
  3. Long-tail is one undifferentiated bag. Nix-style layering can preserve some structure inside the tail; bv flattens it.

These tradeoffs are reasonable at the registry size bv targets: low thousands of tools, not Nixpkgs scale.

where to read it

concept                      file
popularity counting          bv-builder/src/popularity.rs
packing strategy             bv-builder/src/layering.rs
reproducible tar             bv-builder/src/build.rs (create_reproducible_layer)
cross-tool invariant test    bv-builder/src/layering.rs (shared_popular_packages_get_solo_layers_across_tools)
determinism test             bv-builder/src/build.rs (create_reproducible_layer_is_deterministic)