TECHNICAL SUPPLEMENT: PanMAN Software and Dataset Resources

Open-Source Implementation and Public Datasets Now Available

By Stephen L. Pendergast

Overview

The PanMAN (Pangenome Mutation-Annotated Network) compression technology described in the main article is more than a theoretical advance: it is a fully implemented, open-source software system with publicly available datasets that researchers can use today. This technical supplement describes the software implementation, the available datasets, and how researchers can begin working with PanMAN.

Software Repository

GitHub Implementation

Repository: https://github.com/TurakhiaLab/panman
Language: C++
License: Open source
Development Status: Active
Maintainer: Turakhia Lab, UC San Diego

The GitHub repository contains:

Complete PanMAN format specification
C++ implementation of the core algorithms
panmanUtils toolkit for common pangenomic analyses
Documentation and usage examples
Conversion utilities for interoperability with existing formats

panmanUtils Toolkit

The software includes utilities for:

Format conversion: Import/export to VCF, FASTA, MAF, and other standard formats
Tree manipulation: Extract, modify, and analyze phylogenetic trees
Mutation analysis: Query mutation patterns and evolutionary histories
Sequence reconstruction: Derive ancestral sequences at any phylogenetic node
Alignment generation: Produce multiple sequence alignments from PanMANs
Visualization: Generate graphical representations of evolutionary relationships

This interoperability is critical: researchers can benefit from PanMAN compression without abandoning their existing tools and workflows.

Public Dataset Repository

Zenodo Archive

DOI: 10.5281/zenodo.17781629
URL: https://zenodo.org/records/17781629
Published: December 1, 2025 (Version v5)
Total Size: 45.5 GB
License: Creative Commons Attribution 4.0 International

The Zenodo repository provides both compressed PanMAN files and, for comparison, a conventionally compressed alignment file.

Available Datasets

1. SARS-CoV-2 (Coronavirus) Datasets

Large-Scale Dataset (8 Million Genomes)

File: sars_8M.panman
Size: 382.7 MB
Genomes: ~8 million SARS-CoV-2 sequences
Genome size: ~30,000 base pairs each
Source: Global pandemic surveillance data
Coverage: Multiple variants including Alpha, Beta, Gamma, Delta, Omicron lineages
Compression ratio: ~3,000× vs. uncompressed alignment
MD5 checksum: 9b91da25c589d0ec385bcc57bc14c6ae

Comparison File:

File: sars_8M.aln.gz.xz
Size: 44.7 GB
Format: Multiple sequence alignment, compressed with gzip and xz
Compression ratio: ~22× vs. uncompressed
Direct comparison: The PanMAN file is about 117× smaller than this conventionally compressed alignment (382.7 MB vs. 44.7 GB) while also encoding additional evolutionary information
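
The 117× figure follows directly from the two file sizes listed above. A quick check, assuming decimal units (1 GB = 1,000 MB); size_ratio is a helper name introduced here only for illustration:

```shell
# Ratio of two file sizes given in the same unit, rounded to the nearest integer.
size_ratio() {
  # $1 = larger size, $2 = smaller size (both in MB here)
  awk -v big="$1" -v small="$2" 'BEGIN { printf "%.0f\n", big / small }'
}

# sars_8M.aln.gz.xz (44.7 GB = 44,700 MB) vs. sars_8M.panman (382.7 MB)
size_ratio 44700 382.7   # prints 117
```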

Metadata File:

File: SARS-CoV-2_RefSeqIDs_8M.csv
Size: 91.2 MB
Contents: Reference sequence identifiers mapping to genome accessions in public databases

This dataset represents the largest pangenome ever assembled for SARS-CoV-2 and demonstrates PanMAN's capability to handle millions of genomes efficiently.

Medium-Scale Dataset (20,000 Genomes)

File: sars_20000.panman
Size: 2.4 MB
Genomes: 20,000 SARS-CoV-2 sequences
Purpose: Smaller dataset for testing and development
MD5 checksum: 64deffe72f01893d76b1cd256668f307

2. HIV (Human Immunodeficiency Virus)

File: HIV_20000.panman
Size: 13.1 MB
Genomes: 20,000 HIV sequences
Genome size: ~9,700 base pairs
Significance: HIV's high mutation rate and extensive recombination provide a challenging test case for PanMAN's ability to handle complex evolutionary histories
MD5 checksum: 8b229e076f800a1dd0185f4fa4d04a3b

HIV presents unique challenges due to:

Rapid evolution (~1% sequence divergence per year)
Frequent recombination events
High within-host diversity
Multiple subtypes and circulating recombinant forms

PanMAN's network structure (connecting multiple trees with recombination edges) is particularly well-suited for HIV genomics.

3. RSV (Respiratory Syncytial Virus)

File: rsv_4000.panman
Size: 725.7 KB
Genomes: 4,000 RSV sequences
Genome size: ~15,000 base pairs
Significance: Important respiratory pathogen, particularly in infants and older adults
MD5 checksum: e9e6ef589c2ae9b96814823d0a20ed55

The small file size (under 1 MB for 4,000 genomes, or about 181 bytes per genome) shows how compact a PanMAN can be on this dataset.

4. Mycobacterium tuberculosis (TB)

File: tb_400.panman
Size: 5.4 MB
Genomes: 400 M. tuberculosis sequences
Genome size: ~4.4 million base pairs
Significance: Large bacterial genome with clinical importance for tracking drug resistance
MD5 checksum: 9685b7f03a8038b175655e1330ace496

M. tuberculosis is notable for:

A much larger genome than viruses (~4.4 Mb vs. ~10-30 Kb)
A slower mutation rate (more conserved sequences)
Its importance for studying the evolution of antibiotic resistance

5. Escherichia coli (E. coli)

File: ecoli_1000.panman
Size: 106.8 MB
Genomes: 1,000 E. coli sequences
Genome size: ~4.6-5.5 million base pairs
Significance: Model organism with extensive genomic diversity across strains
MD5 checksum: 8747663e62e6a9a331e4efbcf88043e6

E. coli demonstrates:

Large bacterial genome
Significant structural variation between strains
Frequent horizontal gene transfer
Variable genome size due to accessory genes

6. Klebsiella pneumoniae

File: klebs_1000.panman
Size: 210.6 MB
Genomes: 1,000 Klebsiella sequences
Genome size: ~5.3-5.8 million base pairs
Significance: Opportunistic pathogen with increasing antibiotic resistance
MD5 checksum: 4b7ea6bcf96106fb74d184d7b2b355b3

Klebsiella is notable for:

Large genome with extensive accessory elements
Multiple antibiotic resistance plasmids
High rates of horizontal gene transfer
Clinically important hospital-acquired infections

Compression Performance Summary

Dataset	Genomes	Genome Size	PanMAN Size	Compression Ratio*	Per-Genome Storage
SARS-CoV-2	8,000,000	~30 Kb	382.7 MB	~3,000×	~48 bytes
SARS-CoV-2	20,000	~30 Kb	2.4 MB	~250×	~120 bytes
HIV	20,000	~9.7 Kb	13.1 MB	~15×	~655 bytes
RSV	4,000	~15 Kb	725.7 KB	~83×	~181 bytes
M. tuberculosis	400	~4.4 Mb	5.4 MB	~325×	~13.5 KB
E. coli	1,000	~5 Mb	106.8 MB	~47×	~107 KB
Klebsiella	1,000	~5.5 Mb	210.6 MB	~26×	~211 KB

*Compression ratio calculated vs. uncompressed multiple sequence alignment format

Key observations:

Compression improves dramatically with dataset size: ~3,000× for 8 million SARS-CoV-2 genomes vs. ~250× for 20,000
More conserved organisms (sharing more recent common ancestry) compress better
Even bacterial genomes 100-500× larger than viral genomes achieve substantial compression (~26× to ~325×)
Per-genome storage falls as dataset size grows (~120 bytes at 20,000 SARS-CoV-2 genomes vs. ~48 bytes at 8 million) because more shared ancestry can be exploited
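
The per-genome storage column can be reproduced by dividing each file size by its genome count. A small sketch, assuming decimal units (1 MB = 1,000,000 bytes); per_genome_bytes is a helper defined here only for illustration:

```shell
# Per-genome storage in bytes = PanMAN file size / number of genomes.
per_genome_bytes() {
  # $1 = PanMAN file size in MB, $2 = number of genomes
  awk -v mb="$1" -v n="$2" 'BEGIN { printf "%.0f\n", mb * 1000000 / n }'
}

per_genome_bytes 382.7 8000000   # SARS-CoV-2, 8 million genomes: prints 48
per_genome_bytes 2.4 20000       # SARS-CoV-2, 20,000 genomes: prints 120
per_genome_bytes 13.1 20000      # HIV, 20,000 genomes: prints 655
```

The same function applied to the bacterial rows gives values in the thousands of bytes, which match the table once converted to KB.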

Technical Specifications

PanMAN File Format

PanMAN uses a binary format optimized for:

Fast random access to specific sequences
Efficient mutation queries
Rapid phylogenetic tree traversal
Minimal memory overhead for large datasets

The format stores:

Root sequence: Single ancestral genome at the tree root
Tree topology: Phylogenetic relationships between genomes
Branch annotations: Mutations (SNPs, insertions, deletions) on each branch
Network edges: Recombination and horizontal gene transfer events
Metadata: Sample identifiers, collection dates, geographic locations (optional)
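
To make the root-plus-mutations idea concrete, the sketch below reconstructs a leaf sequence by applying branch substitutions along a root-to-leaf path. It is a toy model on plain strings that handles only single-base substitutions; the apply_snps helper and the POSITION:BASE notation are invented for this illustration and are not PanMAN's file format or API:

```shell
# Toy illustration only: substitutions on a plain string, not the PanMAN format.
apply_snps() {
  # $1 = starting sequence; remaining arguments = substitutions written as
  # POSITION:BASE (1-based position; hypothetical notation for this sketch)
  sequence=$1
  shift
  for snp in "$@"; do
    sequence=$(awk -v s="$sequence" -v pos="${snp%%:*}" -v base="${snp##*:}" \
      'BEGIN { print substr(s, 1, pos - 1) base substr(s, pos + 1) }')
  done
  printf '%s\n' "$sequence"
}

root="ACGTACGT"
# Substitutions on the branch from the root to an internal node...
internal=$(apply_snps "$root" 2:T)
# ...and on the branch from that internal node to a leaf.
leaf=$(apply_snps "$internal" 5:G 8:A)
echo "$leaf"   # prints ATGTGCGA
```

Only the root sequence and three small edits are stored, yet both the internal (ancestral) sequence and the leaf sequence can be recovered, which is the intuition behind storing one root genome plus per-branch mutations.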

Compatibility and Interoperability

PanMAN is designed to integrate with existing genomics workflows:

Input formats accepted:

FASTA (sequences)
Newick (phylogenetic trees)
VCF (variant calls)
MAF (multiple alignment format)

Output formats generated:

FASTA (reconstructed sequences)
VCF (variant calls)
Newick (phylogenetic trees)
MAF (multiple alignments)
Custom visualization formats

Compatible with:

Standard phylogenetic inference tools (IQ-TREE, RAxML, FastTree)
Variant calling pipelines (GATK, bcftools)
Multiple alignment tools (MAFFT, MUSCLE, Clustal)
Visualization software (FigTree, Nextstrain, Phandango)

System Requirements

For analysis of provided datasets:

Modern x86-64 processor
RAM: 4-16 GB (scales with dataset size)
Storage: Minimal (datasets are highly compressed)
OS: Linux, macOS, or Windows with WSL

For creating new PanMANs:

Additional RAM for alignment and tree construction
Computational requirements depend on:
- Number of genomes
- Genome size
- Phylogenetic inference method
- Alignment algorithm

The TWILIGHT tool (also from Turakhia Lab) is recommended for constructing alignments of millions of genomes, as it's specifically optimized for this scale.

Usage Statistics and Community Adoption

As of January 2026:

Total repository views: 754
Total downloads: 1,622
Data volume transferred: 5.8 TB
Active since: December 2025

These figures point to early interest from the research community, beginning before the Nature Genetics publication date (January 12, 2026), and suggest that some researchers were already evaluating the technology.

Relevance to Prostate Cancer Research

While the current public datasets focus on pathogens (for which very large genome collections exist, up to millions of genomes for SARS-CoV-2), the principles and software apply directly to cancer genomics:

Tumor Evolution Studies

Cancer genomes from a single patient share a recent common ancestor (the original transformed cell) and diverge through somatic mutation—exactly the evolutionary pattern PanMAN is designed to compress. Multiple biopsies or metastases from one patient could be efficiently stored and analyzed.

Multi-Region Sampling

Comprehensive tumor profiling increasingly involves sequencing 10-50 regions within a single tumor to map intratumor heterogeneity.[1] PanMAN's compression of related genomes would be ideal for storing these multi-region datasets.

Longitudinal Monitoring

Repeated liquid biopsy sampling over months or years of treatment generates time-series genomic data. PanMAN's efficient storage of evolutionary trajectories naturally represents this temporal data.

Population Studies

Large-scale germline studies (like the PRACTICAL consortium's analyses of more than 140,000 men, including prostate cancer cases and controls)[2] could benefit from PanMAN's compression of related genomes, particularly when analyzing inherited haplotypes shared across families.

Clinical Trial Databases

Pharmaceutical companies and trial networks accumulating genomic data from thousands of patients in multiple trials could use PanMAN to:

Reduce storage costs
Accelerate comparative analyses
Share data more efficiently between institutions
Preserve complete evolutionary context for resistance studies

Getting Started: Practical Guide for Researchers

Step 1: Download Software and Datasets

# Clone the GitHub repository
git clone https://github.com/TurakhiaLab/panman.git
cd panman

# Compile the software (requires C++ compiler)
make

# Download a test dataset from Zenodo
wget https://zenodo.org/records/17781629/files/rsv_4000.panman
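
Each dataset listed earlier comes with an MD5 checksum, so a download can be checked before use. A minimal sketch, assuming GNU md5sum is installed (on macOS, md5 -q is the usual equivalent); verify_md5 is a helper name introduced here for illustration and is not part of PanMAN:

```shell
# Compare a file's MD5 digest with the expected value published with the dataset.
verify_md5() {
  # $1 = path to file, $2 = expected MD5 hex digest
  actual=$(md5sum "$1" | awk '{ print $1 }')
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (got $actual)" >&2
    return 1
  fi
}

# Usage, after downloading rsv_4000.panman (checksum from the RSV dataset entry):
# verify_md5 rsv_4000.panman e9e6ef589c2ae9b96814823d0a20ed55
```

A mismatch usually means an interrupted or corrupted download, in which case the file should be fetched again.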

Step 2: Basic Operations

# Extract summary information
./panman info rsv_4000.panman

# Export sequences to FASTA format
./panman extract rsv_4000.panman --output rsv_sequences.fasta

# Generate phylogenetic tree
./panman tree rsv_4000.panman --output rsv_tree.newick

# Query mutations on specific branches
./panman mutations rsv_4000.panman --node NODE_ID

# Extract ancestral sequence
./panman ancestral rsv_4000.panman --node NODE_ID

Step 3: Create Your Own PanMAN

# Starting from aligned sequences and a phylogenetic tree
./panman create \
  --alignment your_sequences.fasta \
  --tree your_tree.newick \
  --output your_data.panman

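The command and option names above are illustrative. The toolkit itself is named panmanUtils, and its exact commands, options, and build steps are documented, along with tutorials, in the GitHub repository.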

Integration with TWILIGHT Alignment Tool

The Turakhia Lab also developed TWILIGHT, a fast multiple sequence alignment tool optimized for very large datasets (millions of sequences). TWILIGHT and PanMAN are designed to work together:

TWILIGHT → Alignment → PanMAN

TWILIGHT constructs the multiple sequence alignment from raw sequences
Phylogenetic relationships are inferred (using standard tools or TWILIGHT's built-in methods)
PanMAN ingests the alignment and tree, creating the compressed representation

This pipeline enabled the construction of the 8 million SARS-CoV-2 genome pangenome—a scale that would have been computationally prohibitive with conventional tools.

Citation and Attribution

For the PanMAN Method:

Walia, S., Motwani, H., Tseng, Y.H., et al. "Compressive pangenomics using mutation-annotated networks." Nature Genetics (2026). https://doi.org/10.1038/s41588-025-02478-7

For the Software and Datasets:

Walia, S., Turakhia, Y., & Tseng, Y.H. (2025). Pangenome Mutation-Annotated Networks [Data set]. Zenodo. https://doi.org/10.5281/zenodo.17781629

Software Repository:

Turakhia Lab. "PanMAN: Pangenome Mutation-Annotated Networks." GitHub. https://github.com/TurakhiaLab/panman

Future Developments

According to the research team, ongoing development includes:

Human genome optimization: Extending PanMAN to efficiently handle the larger size and structural complexity of human genomes (3 billion base pairs vs. 10-30 thousand for viruses)
Cancer genome specialization: Adaptations for tumor genomics, including handling somatic mutations, copy number variations, and structural rearrangements
GPU acceleration: Parallel processing implementations to speed up pangenome construction and queries
Cloud integration: Native support for cloud storage and distributed computing platforms
Visualization tools: Interactive browsers for exploring evolutionary relationships in compressed pangenomes
Privacy-preserving features: Encrypted PanMANs that support queries on encrypted data without decryption

Technical Support and Community

GitHub Issues: Bug reports, feature requests, and technical questions
Documentation: Comprehensive guides in the repository wiki
Turakhia Lab Website: https://turakhia.ucsd.edu (contact information and updates)

Implications for Prostate Cancer Research Community

The availability of open-source, production-ready PanMAN software with extensive documentation and public datasets significantly accelerates the potential timeline for impact on prostate cancer research:

Near-term (2026-2027):

Pilot projects adapting PanMAN for cancer genomics
Integration with existing cancer genomics pipelines
Proof-of-concept studies on prostate cancer datasets

Medium-term (2027-2029):

Clinical trial networks adopting PanMAN for data management
Large cancer genomics consortia testing at scale
Publications demonstrating cancer genomics applications

Long-term (2029+):

Routine use in clinical genomics laboratories
Standard format for sharing cancer genomics data
Integration with electronic health record systems

The fact that researchers can download and start using PanMAN today, rather than waiting for commercial implementation, substantially shortens the path from publication to practical application.

Conclusion

The public availability of PanMAN software and datasets is more than a release of code and data: it provides research infrastructure that can be deployed and tested immediately. Because the software is open source, improvements contributed by the global research community are available to everyone, which could accelerate adoption and refinement for cancer genomics applications.

For the prostate cancer research community, this means the potential described in the main article is not a distant scenario but an active development with tools available today. Researchers interested in applying PanMAN to prostate cancer studies can begin experimenting immediately, adapting the methods demonstrated on pathogen genomes to cancer genomics.

The combination of dramatic compression performance (up to 3,000×), rich biological information preservation (phylogenies, mutation histories, ancestral sequences), and immediate availability positions PanMAN as potentially transformative infrastructure for the next generation of genomic medicine.

References

[1] Dentro, S. C., et al. "Characterizing genetic intra-tumor heterogeneity across 2,658 human cancer genomes." Cell 184.8 (2021): 2239-2254. https://doi.org/10.1016/j.cell.2021.03.009

[2] Schumacher, F. R., et al. "Association analyses of more than 140,000 men identify 63 new prostate cancer susceptibility loci." Nature Genetics 50.7 (2018): 928-936. https://doi.org/10.1038/s41588-018-0142-8

Additional Resources

GitHub Repository:
https://github.com/TurakhiaLab/panman

Zenodo Dataset Archive:
https://zenodo.org/records/17781629
DOI: 10.5281/zenodo.17781629

Nature Genetics Publication:
https://doi.org/10.1038/s41588-025-02478-7

Turakhia Lab:
https://turakhia.ucsd.edu

UC San Diego Announcement:
https://today.ucsd.edu/story/compressed-data-technique-enables-pangenomics-at-scale

About the Author: Stephen L. Pendergast is a Senior Engineer Scientist with over 20 years of experience in radar systems engineering, signal processing, and aerospace defense applications. He holds an MS in Electrical Engineering from MIT and a BS from University of Maryland. As an 11-year prostate cancer patient, he contributes technical analysis to the Informed Prostate Cancer Support Group (IPCSG) newsletter.

Technical Supplement Version: 1.0
Date: January 2026
Prepared for: IPCSG Newsletter

This technical supplement accompanies "Data Compression Breakthrough Could Transform Prostate Cancer Genomics: What Patients Need to Know" and "The JPEG Moment for Genomic Medicine" sidebar in the IPCSG Newsletter, January 2026.
