bv

examples: reproducible bioinformatics pipelines

Each example is a self-contained directory under examples/ in the registry repo, with a pinned bv.toml, a bv.lock, and a runnable script. Running bv sync restores the pinned tools byte-for-byte on any machine; the example's run.sh then drives the pipeline. The examples come in two flavors:

bv-authored pipelines

variant calling
Germline SNV and indel calling from short reads, following the GATK best-practices recipe through to an annotated VCF.
fastp, bwa-mem2, samtools, picard, gatk4, bcftools, vcftools, ensembl-vep, multiqc
examples/variant-calling →
bulk RNA-seq
Selective-alignment quantification with Salmon. Output drops straight into tximport / DESeq2.
fastp, salmon, samtools, multiqc
examples/rnaseq-bulk →
long-read assembly
Oxford Nanopore genome assembly + polish + QC. Flye, then medaka, then BUSCO and QUAST to score the result.
chopper, nanostat, flye, minimap2, medaka, busco, quast, samtools
examples/longread-assembly →
metagenomics
Shotgun metagenomics: read-based taxonomy (kraken2/bracken, metaphlan4), function (humann), assembly + binning (megahit, metabat2), and MAG QC (checkm2, gtdb-tk, bakta).
fastp, kraken2, bracken, metaphlan4, humann, megahit, metabat2, checkm2, gtdb-tk, bakta, multiqc
examples/metagenomics →
protein design loop
Backbone → ProteinMPNN → ColabFold → TM-align filter → Foldseek search for natural homologs. Uses bv's GPU-layer sharing across MPNN and ColabFold.
proteinmpnn, colabfold, foldseek, tmalign, usalign
examples/protein-design →
structure search
"What does this protein do?" Fold first with ColabFold, then search by structure with Foldseek and re-rank with US-align.
colabfold, foldseek, tmalign, usalign, mmseqs2
examples/structure-search →
phylogenetics
Genomes → orthogroups → per-OG MSA + trimming + ML gene trees → concatenated species tree with bootstrap support.
prodigal, orthofinder, mafft, trimal, iqtree2, fasttree, treetime
examples/phylo-pipeline →
ChIP-seq
Single-end ChIP-seq with input control: bowtie2 → dedup → MACS3 narrow peaks → HOMER motifs → deepTools coverage tracks.
fastp, bowtie2, samtools, picard, macs3, homer, deeptools, multiqc
examples/chipseq →
AMR surveillance
Bacterial AMR profiling from short reads: assembly with SPAdes, annotation with bakta, and AMR gene detection with three independent tools (abricate, AMRFinder+, RGI/CARD) for cross-validation.
fastp, spades, bakta, abricate, amrfinderplus, rgi
examples/amr-surveillance →

ports of published pipelines

Each port pins exact tool versions for one published paper so the pipeline is reproducible by digest. The science still belongs to the original authors - cite their papers, not bv.

foldseek (2024)
van Kempen et al., Nat Biotechnol
Reproduces the SCOPe40 structure-search benchmarks: fold queries with ColabFold, search with Foldseek and MMseqs2, ground-truth with US-align.
colabfold, foldseek, mmseqs2, tmalign, usalign
examples/papers/foldseek-2024 →
RFdiffusion (2023) - validation half
Watson et al., Nature
The half of the design loop that's fully reproducible without RFdiffusion's custom kernels: ProteinMPNN sequence design + ColabFold validation + structural-novelty Foldseek search.
proteinmpnn, colabfold, foldseek, tmalign, usalign
examples/papers/rfdiffusion-2023 →
CheckM2 (2023)
Chklovski et al., Nat Methods
Recovers MAGs from a metagenome and scores them with CheckM2's ML quality classifier. Goes through the full assemble + bin + score + dereplicate + GTDB place + annotate pipeline so the comparison with CheckM1 can be reproduced locally.
fastp, megahit, bwa-mem2, metabat2, checkm2, drep, gtdb-tk, bakta, samtools
examples/papers/checkm2-2023 →
Nextstrain (2018)
Hadfield et al., Bioinformatics
The phylogenetic build layer that augur orchestrates, replaced with direct bv exec calls so every step is digest-pinned. Aligns, infers an ML tree, dates it with treetime, and assigns clades.
mafft, iqtree2, treetime, nextclade, minimap2, mash, seqkit, samtools
examples/papers/nextstrain-2018 →

More candidates in SHORTLIST.md (Tara Oceans, GATK best-practices, ESM-Atlas, EMP 16S meta-analysis, HOMER ChIP-seq motifs). Pull requests welcome.
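
Before opening a pull request, it may help to confirm that a new example directory has the three files every example here ships with (bv.toml, bv.lock, run.sh). The following is a minimal Python sketch of such a check, not part of bv itself: the function name missing_example_files is hypothetical, and it only tests that the files exist, not that the lockfile is current or that run.sh is executable.

```python
from pathlib import Path

# File names come from the example layout described on this page.
# Treating exactly these three as required is an assumption of this sketch.
REQUIRED_FILES = ("bv.toml", "bv.lock", "run.sh")


def missing_example_files(example_dir):
    """Return the required files that are absent from example_dir, in order."""
    root = Path(example_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    # Usage: python check_example.py examples/variant-calling [more dirs...]
    for directory in sys.argv[1:]:
        missing = missing_example_files(directory)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{directory}: {status}")
```

An empty list means the layout is complete; running bv sync and run.sh is still the real test of whether the example reproduces.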