TECHNICAL SUPPLEMENT: PanMAN Software and Dataset Resources
Open-Source Implementation and Public Datasets Now Available
By Stephen L. Pendergast
Overview
The PanMAN (Pangenome Mutation-Annotated Network) compression technology described in the main article is not merely a theoretical advance—it is a fully implemented, open-source software system with publicly available datasets that researchers can use immediately. This technical supplement provides details on the software implementation, available datasets, and how researchers can begin working with PanMAN today.
Software Repository
GitHub Implementation
Repository: https://github.com/TurakhiaLab/panman
Language: C++
License: Open source
Development Status: Active
Maintainer: Turakhia Lab, UC San Diego
The GitHub repository contains:
- Complete PanMAN format specification
- C++ implementation of the core algorithms
- panmanUtils toolkit for common pangenomic analyses
- Documentation and usage examples
- Conversion utilities for interoperability with existing formats
panmanUtils Toolkit
The software includes utilities for:
- Format conversion: Import/export to VCF, FASTA, MAF, and other standard formats
- Tree manipulation: Extract, modify, and analyze phylogenetic trees
- Mutation analysis: Query mutation patterns and evolutionary histories
- Sequence reconstruction: Derive ancestral sequences at any phylogenetic node
- Alignment generation: Produce multiple sequence alignments from PanMANs
- Visualization: Generate graphical representations of evolutionary relationships
This interoperability is critical—researchers don't need to abandon existing tools and workflows to benefit from PanMAN compression.
Public Dataset Repository
Zenodo Archive
DOI: 10.5281/zenodo.17781629
URL: https://zenodo.org/records/17781629
Published: December 1, 2025 (Version v5)
Total Size: 45.5 GB
License: Creative Commons Attribution 4.0 International
The Zenodo repository provides compressed PanMAN files and, for comparison, an alignment file compressed with conventional tools.
Available Datasets
1. SARS-CoV-2 (Coronavirus) Datasets
Large-Scale Dataset (8 Million Genomes)
- File: sars_8M.panman
- Size: 382.7 MB
- Genomes: ~8 million SARS-CoV-2 sequences
- Genome size: ~30,000 base pairs each
- Source: Global pandemic surveillance data
- Coverage: Multiple variants including Alpha, Beta, Gamma, Delta, Omicron lineages
- Compression ratio: ~3,000× vs. uncompressed alignment
- MD5 checksum: 9b91da25c589d0ec385bcc57bc14c6ae
Comparison File:
- File: sars_8M.aln.gz.xz
- Size: 44.7 GB
- Format: Multiple sequence alignment, compressed with gzip and xz
- Compression ratio: ~22× vs. uncompressed
- Direct comparison: at 382.7 MB, the PanMAN file is about 117× smaller than this gzip- and xz-compressed alignment (44.7 GB), while also encoding additional evolutionary information
Metadata File:
- File: SARS-CoV-2_RefSeqIDs_8M.csv
- Size: 91.2 MB
- Contents: Reference sequence identifiers mapping to genome accessions in public databases
This dataset represents the largest pangenome ever assembled for SARS-CoV-2 and demonstrates PanMAN's capability to handle millions of genomes efficiently.
Medium-Scale Dataset (20,000 Genomes)
- File: sars_20000.panman
- Size: 2.4 MB
- Genomes: 20,000 SARS-CoV-2 sequences
- Purpose: Smaller dataset for testing and development
- MD5 checksum: 64deffe72f01893d76b1cd256668f307
2. HIV (Human Immunodeficiency Virus)
- File: HIV_20000.panman
- Size: 13.1 MB
- Genomes: 20,000 HIV sequences
- Genome size: ~9,700 base pairs
- Significance: HIV's high mutation rate and extensive recombination provide a challenging test case for PanMAN's ability to handle complex evolutionary histories
- MD5 checksum: 8b229e076f800a1dd0185f4fa4d04a3b
HIV presents unique challenges due to:
- Rapid evolution (~1% sequence divergence per year)
- Frequent recombination events
- High within-host diversity
- Multiple subtypes and circulating recombinant forms
PanMAN's network structure (connecting multiple trees with recombination edges) is particularly well-suited for HIV genomics.
3. RSV (Respiratory Syncytial Virus)
- File: rsv_4000.panman
- Size: 725.7 KB
- Genomes: 4,000 RSV sequences
- Genome size: ~15,000 base pairs
- Significance: Important respiratory pathogen, particularly in infants and elderly
- MD5 checksum: e9e6ef589c2ae9b96814823d0a20ed55
At under 1 MB for 4,000 genomes (roughly 181 bytes per genome), this file shows how compactly a set of closely related viral genomes can be stored.
4. Mycobacterium tuberculosis (TB)
- File: tb_400.panman
- Size: 5.4 MB
- Genomes: 400 M. tuberculosis sequences
- Genome size: ~4.4 million base pairs
- Significance: Large bacterial genome with clinical importance for tracking drug resistance
- MD5 checksum: 9685b7f03a8038b175655e1330ace496
M. tuberculosis has:
- Much larger genome than viruses (~4.4 Mb vs. ~10-30 Kb)
- Slower mutation rate (more conserved sequences)
- Important for studying antibiotic resistance evolution
5. Escherichia coli (E. coli)
- File: ecoli_1000.panman
- Size: 106.8 MB
- Genomes: 1,000 E. coli sequences
- Genome size: ~4.6-5.5 million base pairs
- Significance: Model organism with extensive genomic diversity across strains
- MD5 checksum: 8747663e62e6a9a331e4efbcf88043e6
E. coli demonstrates:
- Large bacterial genome
- Significant structural variation between strains
- Frequent horizontal gene transfer
- Variable genome size due to accessory genes
6. Klebsiella pneumoniae
- File: klebs_1000.panman
- Size: 210.6 MB
- Genomes: 1,000 Klebsiella sequences
- Genome size: ~5.3-5.8 million base pairs
- Significance: Opportunistic pathogen with increasing antibiotic resistance
- MD5 checksum: 4b7ea6bcf96106fb74d184d7b2b355b3
Klebsiella is notable for:
- Large genome with extensive accessory elements
- Multiple antibiotic resistance plasmids
- High rates of horizontal gene transfer
- Clinically important hospital-acquired infections
Compression Performance Summary
| Dataset | Genomes | Genome Size | PanMAN Size | Compression Ratio* | Per-Genome Storage |
|---|---|---|---|---|---|
| SARS-CoV-2 | 8,000,000 | ~30 Kb | 382.7 MB | ~3,000× | ~48 bytes |
| SARS-CoV-2 | 20,000 | ~30 Kb | 2.4 MB | ~250× | ~120 bytes |
| HIV | 20,000 | ~9.7 Kb | 13.1 MB | ~15× | ~655 bytes |
| RSV | 4,000 | ~15 Kb | 725.7 KB | ~83× | ~181 bytes |
| M. tuberculosis | 400 | ~4.4 Mb | 5.4 MB | ~325× | ~13.5 KB |
| E. coli | 1,000 | ~5 Mb | 106.8 MB | ~47× | ~107 KB |
| Klebsiella | 1,000 | ~5.5 Mb | 210.6 MB | ~26× | ~211 KB |
*Compression ratio calculated vs. uncompressed multiple sequence alignment format
Key observations:
- Compression improves dramatically with dataset size (note SARS-CoV-2 8M vs. 20K)
- More conserved organisms (sharing more recent common ancestry) compress better
- Even bacterial genomes 100-500× larger than viral genomes achieve substantial compression
- Per-genome storage decreases as dataset size increases due to shared ancestry exploitation
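The per-genome and ratio columns can be checked with a few lines of arithmetic. The sketch below is not part of PanMAN. It assumes decimal units (1 MB = 10^6 bytes) and one byte per base, and it uses the approximate genome lengths from the table. Under those assumptions, the naive estimate (genomes × genome length ÷ file size) lands close to the listed ratio for the six smaller datasets but gives only about 630× for the 8-million-genome file. Working instead from the comparison file (44.7 GB at the stated ~22× implies roughly 980 GB uncompressed) gives about 2,600×, which is nearer the listed ~3,000×. One plausible reason, stated here as an assumption, is that the alignment is wider than a single genome because of gap columns.

```python
# (label, genomes, approximate genome length in bp, PanMAN file size in bytes)
# Assumption: sizes are decimal (1 MB = 1e6 bytes); lengths are the table's rounded values.
DATASETS = [
    ("SARS-CoV-2 8M", 8_000_000, 30_000, 382.7e6),
    ("SARS-CoV-2 20K", 20_000, 30_000, 2.4e6),
    ("HIV", 20_000, 9_700, 13.1e6),
    ("RSV", 4_000, 15_000, 725.7e3),
    ("M. tuberculosis", 400, 4_400_000, 5.4e6),
    ("E. coli", 1_000, 5_000_000, 106.8e6),
    ("Klebsiella", 1_000, 5_500_000, 210.6e6),
]


def per_genome_bytes(genomes, file_bytes):
    """Average number of bytes the PanMAN file spends per genome."""
    return file_bytes / genomes


def naive_ratio(genomes, genome_bp, file_bytes):
    """Ratio against unaligned sequence stored at one byte per base."""
    return genomes * genome_bp / file_bytes


def ratio_from_comparison(compressed_bytes, stated_ratio, panman_bytes):
    """Ratio against the uncompressed size implied by a compressed comparison file."""
    return compressed_bytes * stated_ratio / panman_bytes


for label, genomes, genome_bp, file_bytes in DATASETS:
    print(
        f"{label:16s} "
        f"{per_genome_bytes(genomes, file_bytes):12.1f} bytes/genome   "
        f"naive ratio {naive_ratio(genomes, genome_bp, file_bytes):8.1f}x"
    )

# SARS-CoV-2 8M: 44.7 GB comparison file, stated as ~22x smaller than the raw alignment.
print(f"8M ratio via comparison file: {ratio_from_comparison(44.7e9, 22, 382.7e6):.0f}x")
print(f"PanMAN vs. compressed alignment: {44.7e9 / 382.7e6:.0f}x")
```

Because every input is a rounded figure from this supplement, the outputs are only good to about two significant digits.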
Technical Specifications
PanMAN File Format
PanMAN uses a binary format optimized for:
- Fast random access to specific sequences
- Efficient mutation queries
- Rapid phylogenetic tree traversal
- Minimal memory overhead for large datasets
The format stores:
- Root sequence: Single ancestral genome at the tree root
- Tree topology: Phylogenetic relationships between genomes
- Branch annotations: Mutations (SNPs, insertions, deletions) on each branch
- Network edges: Recombination and horizontal gene transfer events
- Metadata: Sample identifiers, collection dates, geographic locations (optional)
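To make the stored components concrete, here is a toy sketch of the core idea: one root sequence plus per-branch mutations is enough to rebuild every genome in the tree. It is an illustration only, not the PanMAN binary format or its API. The `Node` class and `reconstruct` function are hypothetical names, and the sketch handles substitutions on a plain tree, leaving out insertions, deletions, and network edges.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; not a PanMAN data structure."""
    name: str
    # Substitutions on the branch leading to this node: (0-based position, new base).
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def reconstruct(root_sequence, root):
    """Return {node name: sequence} by replaying branch mutations downward from the root."""
    sequences = {}
    stack = [(root, list(root_sequence))]
    while stack:
        node, parent_bases = stack.pop()
        bases = parent_bases.copy()          # each node starts from its parent's sequence
        for position, base in node.mutations:
            bases[position] = base           # apply this branch's substitutions
        sequences[node.name] = "".join(bases)
        for child in node.children:
            stack.append((child, bases))
    return sequences


# Toy tree: root -> (anc1 -> (A, B), C)
leaf_a = Node("A", [(2, "T")])
leaf_b = Node("B", [(5, "G")])
anc1 = Node("anc1", [(0, "C")], [leaf_a, leaf_b])
leaf_c = Node("C")
root = Node("root", [], [anc1, leaf_c])

sequences = reconstruct("ACGTACGT", root)
for name in ("root", "anc1", "A", "B", "C"):
    print(name, sequences[name])
```

In this toy case, five sequences of eight bases each are recovered from one stored sequence and three substitutions. The substitution on the branch to `anc1` is stored once and inherited by both `A` and `B`, which is the shared-ancestry saving in miniature.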
Compatibility and Interoperability
PanMAN is designed to integrate with existing genomics workflows:
Input formats accepted:
- FASTA (sequences)
- Newick (phylogenetic trees)
- VCF (variant calls)
- MAF (multiple alignment format)
Output formats generated:
- FASTA (reconstructed sequences)
- VCF (variant calls)
- Newick (phylogenetic trees)
- MAF (multiple alignments)
- Custom visualization formats
Compatible with:
- Standard phylogenetic inference tools (IQ-TREE, RAxML, FastTree)
- Variant calling pipelines (GATK, bcftools)
- Multiple alignment tools (MAFFT, MUSCLE, Clustal)
- Visualization software (FigTree, Nextstrain, Phandango)
System Requirements
For analysis of provided datasets:
- Modern x86-64 processor
- RAM: 4-16 GB (scales with dataset size)
- Storage: Minimal (datasets are highly compressed)
- OS: Linux, macOS, or Windows with WSL
For creating new PanMANs:
- Additional RAM for alignment and tree construction
- Computational requirements depend on:
- Number of genomes
- Genome size
- Phylogenetic inference method
- Alignment algorithm
The TWILIGHT tool (also from Turakhia Lab) is recommended for constructing alignments of millions of genomes, as it's specifically optimized for this scale.
Usage Statistics and Community Adoption
As of January 2026:
- Total repository views: 754
- Total downloads: 1,622
- Data volume transferred: 5.8 TB
- Active since: December 2025
These figures point to early interest from the research community, with downloads already under way before the Nature Genetics publication date (January 12, 2026).
Relevance to Prostate Cancer Research
While the current public datasets focus on pathogens (for which millions of genomes exist), the principles and software directly apply to cancer genomics:
Tumor Evolution Studies
Cancer genomes from a single patient share a recent common ancestor (the original transformed cell) and diverge through somatic mutation—exactly the evolutionary pattern PanMAN is designed to compress. Multiple biopsies or metastases from one patient could be efficiently stored and analyzed.
Multi-Region Sampling
Comprehensive tumor profiling increasingly involves sequencing 10-50 regions within a single tumor to map intratumor heterogeneity.[1] PanMAN's compression of related genomes would be ideal for storing these multi-region datasets.
Longitudinal Monitoring
Repeated liquid biopsy sampling over months or years of treatment generates time-series genomic data. PanMAN's efficient storage of evolutionary trajectories naturally represents this temporal data.
Population Studies
Large-scale germline sequencing studies (like the PRACTICAL consortium's analysis of more than 140,000 men, prostate cancer cases and controls combined)[2] could benefit from PanMAN's compression of related genomes, particularly when analyzing inherited haplotypes shared across families.
Clinical Trial Databases
Pharmaceutical companies and trial networks accumulating genomic data from thousands of patients in multiple trials could use PanMAN to:
- Reduce storage costs
- Accelerate comparative analyses
- Share data more efficiently between institutions
- Preserve complete evolutionary context for resistance studies
Getting Started: Practical Guide for Researchers
Step 1: Download Software and Datasets
# Clone the GitHub repository
git clone https://github.com/TurakhiaLab/panman.git
cd panman
# Compile the software (requires C++ compiler)
make
# Download a test dataset from Zenodo
wget https://zenodo.org/records/17781629/files/rsv_4000.panman
Step 2: Basic Operations
# Extract summary information
./panman info rsv_4000.panman
# Export sequences to FASTA format
./panman extract rsv_4000.panman --output rsv_sequences.fasta
# Generate phylogenetic tree
./panman tree rsv_4000.panman --output rsv_tree.newick
# Query mutations on specific branches
./panman mutations rsv_4000.panman --node NODE_ID
# Extract ancestral sequence
./panman ancestral rsv_4000.panman --node NODE_ID
Step 3: Create Your Own PanMAN
# Starting from aligned sequences and a phylogenetic tree
./panman create \
--alignment your_sequences.fasta \
--tree your_tree.newick \
--output your_data.panman
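Because this step starts from aligned sequences, a quick pre-flight check that every row of the FASTA alignment has the same length can catch a common input problem early. The sketch below is a generic helper in standard-library Python, not part of the PanMAN toolkit. The function names are hypothetical, and the check says nothing about whether the tree's tip names match the sequence names.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {record id: sequence}; the id is the first header word."""
    chunks = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            if current in chunks:
                raise ValueError(f"duplicate record id: {current!r}")
            chunks[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first header line")
        else:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def alignment_width(records):
    """Return the common row length, or raise ValueError if rows differ or none exist."""
    widths = {len(sequence) for sequence in records.values()}
    if len(widths) != 1:
        raise ValueError(f"expected one row length, found: {sorted(widths)}")
    return widths.pop()


if __name__ == "__main__":
    example = [">sample1 demo", "ACG-", "TT", ">sample2", "ACGTTT"]
    records = read_fasta(example)
    print(len(records), "rows, width", alignment_width(records))
```

To check a real file, pass an open file handle to `read_fasta`, for example `read_fasta(open("your_sequences.fasta"))`.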
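Exact build steps and command-line options may differ from the simplified commands shown here. The documentation and tutorials in the GitHub repository give the current options for the panmanUtils toolkit.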
Integration with TWILIGHT Alignment Tool
The Turakhia Lab also developed TWILIGHT, a fast multiple sequence alignment tool optimized for very large datasets (millions of sequences). TWILIGHT and PanMAN are designed to work together:
TWILIGHT → Alignment → PanMAN
- TWILIGHT constructs the multiple sequence alignment from raw sequences
- Phylogenetic relationships are inferred (using standard tools or TWILIGHT's built-in methods)
- PanMAN ingests the alignment and tree, creating the compressed representation
This pipeline enabled the construction of the 8 million SARS-CoV-2 genome pangenome—a scale that would have been computationally prohibitive with conventional tools.
Citation and Attribution
For the PanMAN Method:
Walia, S., Motwani, H., Tseng, Y.H., et al. "Compressive pangenomics using mutation-annotated networks." Nature Genetics (2026). https://doi.org/10.1038/s41588-025-02478-7
For the Software and Datasets:
Walia, S., Turakhia, Y., & Tseng, Y.H. (2025). Pangenome Mutation-Annotated Networks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17781629
Software Repository:
Turakhia Lab. "PanMAN: Pangenome Mutation-Annotated Networks." GitHub. https://github.com/TurakhiaLab/panman
Future Developments
According to the research team, ongoing development includes:
- Human genome optimization: Extending PanMAN to efficiently handle the larger size and structural complexity of human genomes (3 billion base pairs vs. 10-30 thousand for viruses)
- Cancer genome specialization: Adaptations for tumor genomics, including handling somatic mutations, copy number variations, and structural rearrangements
- GPU acceleration: Parallel processing implementations to speed up pangenome construction and queries
- Cloud integration: Native support for cloud storage and distributed computing platforms
- Visualization tools: Interactive browsers for exploring evolutionary relationships in compressed pangenomes
- Privacy-preserving features: Encrypted PanMANs that support queries on encrypted data without decryption
Technical Support and Community
GitHub Issues: Bug reports, feature requests, and technical questions
Documentation: Comprehensive guides in the repository wiki
Turakhia Lab Website: https://turakhia.ucsd.edu (contact information and updates)
Implications for Prostate Cancer Research Community
The availability of open-source, production-ready PanMAN software with extensive documentation and public datasets significantly accelerates the potential timeline for impact on prostate cancer research:
Near-term (2026-2027):
- Pilot projects adapting PanMAN for cancer genomics
- Integration with existing cancer genomics pipelines
- Proof-of-concept studies on prostate cancer datasets
Medium-term (2027-2029):
- Clinical trial networks adopting PanMAN for data management
- Large cancer genomics consortia testing at scale
- Publications demonstrating cancer genomics applications
Long-term (2029+):
- Routine use in clinical genomics laboratories
- Standard format for sharing cancer genomics data
- Integration with electronic health record systems
The fact that researchers can download and start using PanMAN today, rather than waiting for commercial implementation, substantially shortens the path from publication to practical application.
Conclusion
The public availability of PanMAN software and datasets represents more than just code and data—it provides research infrastructure that can be immediately deployed and tested. The open-source nature ensures that improvements contributed by the global research community will benefit everyone, potentially accelerating adoption and refinement for cancer genomics applications.
For the prostate cancer research community, this means the transformative potential described in the main article isn't a distant future scenario—it's an active, ongoing development with tools available today. Researchers interested in leveraging PanMAN for prostate cancer studies can begin experimenting immediately, adapting the methods demonstrated on pathogen genomes to cancer genomics applications.
The combination of dramatic compression performance (up to 3,000×), rich biological information preservation (phylogenies, mutation histories, ancestral sequences), and immediate availability positions PanMAN as potentially transformative infrastructure for the next generation of genomic medicine.
References
[1] Dentro, S. C., et al. "Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes." Cell 184.8 (2021): 2239-2254. https://doi.org/10.1016/j.cell.2021.03.009
[2] Schumacher, F. R., et al. "Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci." Nature Genetics 50.7 (2018): 928-936. https://doi.org/10.1038/s41588-018-0142-8
Additional Resources
GitHub Repository:
https://github.com/TurakhiaLab/panman
Zenodo Dataset Archive:
https://zenodo.org/records/17781629
DOI: 10.5281/zenodo.17781629
Nature Genetics Publication:
https://doi.org/10.1038/s41588-025-02478-7
Turakhia Lab:
https://turakhia.ucsd.edu
UC San Diego Announcement:
https://today.ucsd.edu/story/compressed-data-technique-enables-pangenomics-at-scale
About the Author: Stephen L. Pendergast is a Senior Engineer Scientist with over 20 years of experience in radar systems engineering, signal processing, and aerospace defense applications. He holds an MS in Electrical Engineering from MIT and a BS from the University of Maryland. As an 11-year prostate cancer patient, he contributes technical analysis to the Informed Prostate Cancer Support Group (IPCSG) newsletter.
Technical Supplement Version: 1.0
Date: January 2026
Prepared for: IPCSG Newsletter
This technical supplement accompanies "Data Compression Breakthrough Could Transform Prostate Cancer Genomics: What Patients Need to Know" and "The JPEG Moment for Genomic Medicine" sidebar in the IPCSG Newsletter, January 2026.