Compressed Data Technique Enables Pangenomics at Scale
BLUF (Bottom Line Up Front)
UC San Diego engineers have developed PanMAN (Pangenome Mutation-Annotated Network), a compressed data structure that reduces pangenome storage by up to 3,000-fold relative to whole-genome alignments while preserving evolutionary and mutational histories. The technique enables analysis of millions of genomes simultaneously, addressing a critical bottleneck in pangenomics that has limited large-scale studies of genetic variation, disease evolution, and drug resistance.
The Compression Revolution: How New Data Architecture Could Transform Genomic Medicine
A Storage Crisis Threatens the Genomics Era
The cost of sequencing a human genome has plummeted from roughly $100 million in 2001 to under $1,000 today, fulfilling the promises of the Human Genome Project era.[1] Yet this technological triumph has created an unexpected crisis: the flood of genetic data threatens to overwhelm our ability to store, share, and analyze it. A single human genome sequenced at high coverage can produce on the order of 200 gigabytes of raw data, and major biobanks now hold hundreds of thousands of genomes.[2] The UK Biobank alone has sequenced the whole genomes of roughly 500,000 participants, a dataset measured in petabytes.[3]
Enter a new solution from the University of California San Diego that tackles this storage bottleneck. Publishing in Nature Genetics on January 12, 2026, electrical and computer engineering professor Yatish Turakhia and his team introduced PanMAN (Pangenome Mutation-Annotated Network), a data structure that achieves far higher compression than existing pangenome formats while simultaneously encoding richer biological information.[4]
Beyond Simple Compression: Encoding Evolutionary History
What distinguishes PanMAN from conventional compression approaches is its biological intelligence. Rather than treating genomes as arbitrary strings of letters to be compressed through pattern recognition, PanMAN exploits the fundamental reality that related organisms share evolutionary ancestry. The system stores a single ancestral genome sequence and annotates only the mutations that distinguish descendants—substitutions, insertions, and deletions—on the branches where they arose.
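To make the idea concrete, here is a minimal Python sketch of a mutation-annotated tree: one stored root sequence, plus per-branch mutations that are replayed to rebuild any descendant. The names (`Node`, `apply_mutations`, `reconstruct`) and the tuple encoding of mutations are illustrative assumptions, not PanMAN's actual file format or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical encoding: (kind, position, payload), with positions given in
# the coordinates of the parent's sequence.
#   ("sub", pos, base)    replace the base at pos
#   ("ins", pos, bases)   insert bases before pos
#   ("del", pos, length)  delete `length` bases starting at pos
Mutation = Tuple[str, int, object]


@dataclass
class Node:
    name: str
    mutations: List[Mutation] = field(default_factory=list)  # branch above this node
    parent: Optional["Node"] = None


def apply_mutations(parent_seq: str, mutations: List[Mutation]) -> str:
    """Apply one branch's mutations to the parent's sequence.

    Assumes the mutations on a single branch do not overlap.
    """
    seq = list(parent_seq)
    # Work from the highest position down so an insertion or deletion never
    # shifts the coordinates of a mutation that is still to be applied.
    for kind, pos, payload in sorted(mutations, key=lambda m: m[1], reverse=True):
        if kind == "sub":
            seq[pos] = payload
        elif kind == "ins":
            seq[pos:pos] = list(payload)
        elif kind == "del":
            del seq[pos:pos + payload]
        else:
            raise ValueError(f"unknown mutation kind: {kind}")
    return "".join(seq)


def reconstruct(node: Node, root_seq: str) -> str:
    """Rebuild a node's sequence by replaying mutations from the root down."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    seq = root_seq
    for step in reversed(path):
        seq = apply_mutations(seq, step.mutations)
    return seq


# Toy tree: only ROOT_SEQ is stored in full; descendants are stored as edits.
ROOT_SEQ = "ACGTACGT"
root = Node("root")
sample_a = Node("A", [("sub", 1, "T")], root)
sample_b = Node("B", [("del", 4, 2)], sample_a)

print(reconstruct(sample_a, ROOT_SEQ))
print(reconstruct(sample_b, ROOT_SEQ))
```

In this toy tree, sample B inherits the substitution recorded on A's branch without storing it again, which is where the savings come from when millions of closely related genomes share most of their history.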
"The data structures used for pangenomics research are critical because they determine not only how efficiently genetic data is represented, but also what the data can represent," explains Sumit Walia, an electrical engineering PhD candidate at UC San Diego's Jacobs School of Engineering and co-first author of the study.[4]
This approach builds upon decades of phylogenetic research. The field of molecular phylogenetics has long recognized that evolutionary relationships contain compressible information—closely related species share most of their genetic sequence, differing only in mutations accumulated since their common ancestor.[5] PanMAN operationalizes this insight through a network of mutation-annotated trees (PanMATs) connected by edges that encode complex genetic events like recombination and horizontal gene transfer.
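As a rough illustration of how a network edge might encode a recombination event between trees, the sketch below describes a recombinant genome as a prefix from one donor node and a suffix from another. The `RecombinationEdge` class, the node names, and the single-breakpoint model are simplifying assumptions made for this example; the article does not describe PanMAN's edge format at this level of detail.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RecombinationEdge:
    """Hypothetical network edge: child = left donor prefix + right donor suffix."""
    child: str
    left_donor: str    # contributes bases [0, breakpoint)
    right_donor: str   # contributes bases [breakpoint, end)
    breakpoint: int


def resolve(edge: RecombinationEdge, sequences: Dict[str, str]) -> str:
    """Build the recombinant sequence from two donor sequences."""
    left = sequences[edge.left_donor]
    right = sequences[edge.right_donor]
    if len(left) != len(right):
        raise ValueError("this sketch assumes donors share one coordinate system")
    if not 0 <= edge.breakpoint <= len(left):
        raise ValueError("breakpoint outside the donor sequence")
    return left[:edge.breakpoint] + right[edge.breakpoint:]


# Two donor nodes that live in different trees (sequences already reconstructed).
donor_sequences = {"treeA/n3": "AAAAAAAA", "treeB/n7": "CCCCCCCC"}
edge = RecombinationEdge("treeC/root", "treeA/n3", "treeB/n7", 3)
recombinant = resolve(edge, donor_sequences)
print(recombinant)
```

The design point is that the edge costs two node references and one integer, rather than a second full copy of the recombinant genome.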
Unprecedented Scale: 8 Million Viral Genomes in 366 Megabytes
The practical impact becomes clear in the team's demonstration with SARS-CoV-2, the virus responsible for COVID-19. The pandemic generated an unprecedented genomic surveillance effort, with researchers worldwide sequencing millions of viral samples to track variants and evolution.[6] The GISAID database, the primary repository for SARS-CoV-2 sequences, contains over 16 million genomes as of 2024.[7]
Turakhia's team constructed the largest SARS-CoV-2 pangenome to date, incorporating more than 8 million separate viral genomes. Using PanMAN, this massive dataset occupies just 366 megabytes of storage, roughly 3,000 times less than the corresponding whole-genome alignment it encodes.[4] For context, this means the complete evolutionary record of 8 million SARS-CoV-2 genomes would fit on a single CD-ROM with room to spare.
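The figures quoted above imply some useful back-of-the-envelope numbers. The short calculation below uses only the article's values and assumes decimal megabytes (1 MB = 10^6 bytes); the variable names are illustrative.

```python
GENOMES = 8_000_000           # "more than 8 million" genomes, per the article
PANMAN_BYTES = 366 * 10**6    # 366 MB, assuming decimal megabytes
COMPRESSION_RATIO = 3_000     # "roughly 3,000 times", per the article

# Size of the whole-genome alignment implied by the quoted ratio.
alignment_bytes = PANMAN_BYTES * COMPRESSION_RATIO

# Average PanMAN storage attributable to each genome.
bytes_per_genome = PANMAN_BYTES / GENOMES

print(f"implied alignment size: {alignment_bytes / 10**12:.2f} TB")
print(f"PanMAN bytes per genome: {bytes_per_genome:.2f}")
```

Under these assumptions the alignment works out to about 1.1 terabytes, and each genome costs at most about 46 bytes on average, far less than the roughly 30,000 bases in a single SARS-CoV-2 genome.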
This compression ratio vastly exceeds that of existing pangenome formats. Graph-based pangenome representations, which have become popular in recent years, typically achieve 10- to 100-fold compression by eliminating redundancy in sequences.[8] However, these formats sacrifice phylogenetic information, representing only genetic variation rather than evolutionary and mutational histories.
The Architecture of Biological Information
PanMAN's architecture reflects sophisticated understanding of how biological information is structured. The system explicitly stores mutations, phylogenies, annotations, and the root ancestral sequence. From this foundation, researchers can derive additional layers of information: ancestral sequences at any point in evolutionary history, multiple whole-genome alignments, and complete catalogs of genetic variation.[4]
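The following sketch shows, under simplifying assumptions, how derived layers can be computed from just a root sequence, a tree, and branch mutations. It handles substitutions only, so all sequences stay the same length and are trivially aligned; `TREE`, `sequence_of`, and `variant_catalog` are hypothetical names for this example rather than part of any PanMAN tooling.

```python
from typing import Dict, List, Tuple

ROOT_SEQ = "ACGTAC"

# child -> (parent, substitutions on the branch as (position, new_base))
TREE: Dict[str, Tuple[str, List[Tuple[int, str]]]] = {
    "anc1": ("root", [(1, "T")]),   # unsampled internal ancestor
    "s1": ("anc1", [(4, "G")]),
    "s2": ("anc1", []),
    "s3": ("root", [(5, "A")]),
}


def sequence_of(node: str) -> str:
    """Derive any node's sequence, including unsampled ancestors."""
    if node == "root":
        return ROOT_SEQ
    parent, substitutions = TREE[node]
    seq = list(sequence_of(parent))
    for pos, base in substitutions:
        seq[pos] = base
    return "".join(seq)


def variant_catalog(samples: List[str]) -> Dict[int, Dict[str, List[str]]]:
    """Map each variable position to the samples carrying each allele."""
    sequences = {name: sequence_of(name) for name in samples}
    catalog: Dict[int, Dict[str, List[str]]] = {}
    for pos in range(len(ROOT_SEQ)):
        alleles: Dict[str, List[str]] = {}
        for name, seq in sequences.items():
            alleles.setdefault(seq[pos], []).append(name)
        if len(alleles) > 1:  # keep only positions where samples differ
            catalog[pos] = alleles
    return catalog


samples = ["s1", "s2", "s3"]
alignment = {name: sequence_of(name) for name in samples}  # derived alignment
ancestor = sequence_of("anc1")                             # derived ancestral sequence
catalog = variant_catalog(samples)                         # derived variation catalog
```

Nothing here is stored beyond the root, the tree, and the branch mutations; the alignment, the ancestral sequence, and the variant catalog are all computed on demand.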
This representational richness addresses limitations that have frustrated researchers working with current pangenome formats. The Variation Graph (VG) toolkit, developed by the UCSC Genomics Institute and widely adopted in human pangenomics, efficiently represents variation but cannot easily reconstruct evolutionary relationships.[9] The Genome Reference Consortium's human reference assembly, with its alternate-locus scaffolds, similarly captures some structural variation while omitting phylogenetic context.[10]
"Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis," said Turakhia.[4]
From Microbes to Human Populations
The research team's initial applications have focused on microbial genomes, where PanMAN has already demonstrated transformative potential. Bacterial pangenomics has become crucial for understanding antibiotic resistance, with researchers tracking how resistance genes spread through horizontal gene transfer.[11] PanMAN's ability to encode these complex genetic events in its network structure while maintaining compression makes it particularly well-suited for microbial studies.
Now the team is extending their approach to human genomes, supported by a Jacobs School Early Career Faculty Development Award. Turakhia is collaborating with Melissa Gymrek, a UC San Diego professor of computer science and engineering whose research focuses on human genetic variation and disease.[4]
The transition from microbial to human genomes presents new challenges. Human genomes are approximately 1,000 times larger than bacterial genomes and contain far more structural variation—large insertions, deletions, and rearrangements that complicate compression.[12] However, the shared ancestry among human populations also creates opportunities for compression, as any two humans share approximately 99.9% of their genetic sequence.[13]
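A quick calculation shows why the 99.9% figure matters for compression. The genome length below (about 3.1 billion base pairs for a haploid human genome) is an assumption added for this example and does not come from the article.

```python
HUMAN_GENOME_BP = 3_100_000_000   # assumption: ~3.1 billion base pairs (haploid)
DIFFERING_PER_THOUSAND = 1        # 99.9% shared means about 0.1% differs

# Integer arithmetic avoids floating-point rounding on large counts.
differing_bp = HUMAN_GENOME_BP * DIFFERING_PER_THOUSAND // 1000
shared_bp = HUMAN_GENOME_BP - differing_bp

print(f"positions that differ between two genomes: {differing_bp:,}")
print(f"positions shared: {shared_bp:,}")
```

Under that assumption, two human genomes differ at roughly three million positions out of about 3.1 billion, so a representation that records only differences has far less to store than one that repeats the shared sequence.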
"Extending compressive pangenomics to human genomes can fundamentally transform how we store, analyze, and share large-scale human genetic data," said Turakhia. "Besides enabling studies of human genetic diversity, disease, and evolution at unprecedented scale and speed, it can depict detailed evolutionary and mutational histories which shape diverse human populations, something that current representations do not capture."[4]
The TWILIGHT Connection
Constructing the massive SARS-CoV-2 pangenome required another innovation from Turakhia's laboratory: TWILIGHT, a computational tool designed for rapid multiple genome alignment at unprecedented scale. Traditional multiple sequence alignment algorithms scale poorly: the cost of exact dynamic-programming alignment grows exponentially with the number of sequences, and even heuristic aligners become impractical at the scale of millions of genomes.[14]
TWILIGHT addresses this scalability challenge through algorithmic innovations that exploit the phylogenetic relationships among sequences. By aligning genomes in phylogenetic order and propagating alignment information through evolutionary trees, TWILIGHT achieves dramatic speedups compared to conventional approaches.[4] Yu-Hsiang Tseng, the lead developer of TWILIGHT and a co-author of the Nature Genetics study, integrated the tool seamlessly with PanMAN to enable the construction of million-genome pangenomes.
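The general idea of aligning in phylogenetic order can be sketched as a post-order walk over a guide tree, where each internal node merges the groups built for its two children. This is a generic progressive-alignment schedule written for illustration only; it is not TWILIGHT's actual algorithm, and `merge_schedule` and the nested-tuple tree encoding are assumptions of this example.

```python
from typing import List, Tuple, Union

# A guide tree is either a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def merge_schedule(tree: Tree) -> List[Tuple[List[str], List[str]]]:
    """Return pairwise merges in post-order: children are merged before parents."""
    schedule: List[Tuple[List[str], List[str]]] = []

    def visit(node: Tree) -> List[str]:
        if isinstance(node, str):
            return [node]
        left, right = node
        left_names = visit(left)
        right_names = visit(right)
        # In a real aligner this step would align two sub-alignments (profiles).
        schedule.append((left_names, right_names))
        return left_names + right_names

    visit(tree)
    return schedule


guide_tree = (("s1", "s2"), ("s3", ("s4", "s5")))
for left_group, right_group in merge_schedule(guide_tree):
    print(left_group, "+", right_group)
```

For N sequences the schedule contains N - 1 pairwise merges, and the closest relatives are merged first, which is what keeps this style of alignment tractable where exact all-at-once alignment is not.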
The synergy between TWILIGHT and PanMAN illustrates a broader trend in computational genomics: the recognition that biological context—evolutionary relationships, functional constraints, structural organization—can inform algorithmic design in ways that pure computer science approaches cannot match.
Implications for Precision Medicine and Beyond
The potential applications of compressive pangenomics extend far beyond data storage. The National Institutes of Health's All of Us Research Program aims to build one of the world's largest precision medicine databases, collecting genomic and health data from one million Americans.[15] The program has already enrolled over 750,000 participants and faces mounting storage and analysis challenges as sequencing ramps up.[16]
PanMAN could enable the All of Us program and similar initiatives to store complete genomic data for millions of individuals while simultaneously preserving the evolutionary and mutational context necessary for understanding disease. Cancer genomics provides a particularly compelling use case. Tumors evolve through somatic mutation, creating phylogenetic trees of cancer cell lineages within individual patients.[17] PanMAN's ability to encode these evolutionary relationships while compressing the data could accelerate cancer research and personalized treatment.
The technique may also have implications for genomic privacy, an increasingly pressing concern as genetic databases grow.[18] Smaller, more efficient data structures could make it more practical to apply privacy-preserving computational techniques such as homomorphic encryption, which computes on encrypted data without decrypting it.[19]
The Broader Context of Genomic Data Science
PanMAN emerges at a critical juncture for genomics. The field has shifted from the single-reference paradigm that dominated early genomics toward pangenomic approaches that embrace natural variation.[20] The Human Pangenome Reference Consortium, an international collaboration, released its first draft human pangenome reference in 2023, representing 47 genetically diverse individuals rather than a single composite reference.[21] This shift recognizes that using a single reference genome introduces biases, particularly affecting populations underrepresented in genomic research.
However, pangenomic approaches create new computational challenges. Graph-based pangenome representations require specialized tools and expertise that have slowed their adoption.[22] PanMAN's compression capabilities could ease this transition by making pangenomes more manageable while its rich information content supports diverse analytical approaches.
The technique also resonates with broader trends in data science toward more semantically meaningful representations. Modern machine learning increasingly incorporates domain knowledge rather than relying purely on statistical pattern recognition.[23] PanMAN exemplifies this approach by encoding biological principles—shared ancestry, mutational processes, phylogenetic relationships—directly into data structure design.
Future Directions and Open Questions
Several questions remain as compressive pangenomics moves from proof-of-concept to widespread adoption. How well will the approach scale to complex eukaryotic genomes with extensive structural variation? Can the technique be adapted for real-time analysis in clinical settings? How will it integrate with existing genomic analysis pipelines and tools?
The Turakhia laboratory is actively pursuing these questions. Their collaboration with Gymrek specifically targets human genetic diversity and disease, domains where compressive pangenomics could have immediate clinical impact. The team is also exploring applications in agricultural genomics, where understanding crop genetic diversity is crucial for food security.[4]
The success of PanMAN may also inspire similar approaches in other data-intensive fields. Protein structure databases, medical imaging archives, and environmental monitoring systems all face analogous challenges: exponentially growing data volumes that threaten to overwhelm storage and analysis capabilities. The core insight—that domain-specific structure can enable dramatic compression while enriching rather than degrading information content—transcends genomics.
Conclusion
The development of PanMAN represents more than an incremental advance in data compression. It exemplifies how a deep understanding of domain-specific structure can yield algorithmic gains that conventional approaches could not reach. By recognizing that genomes are not random strings but records of evolutionary history, Turakhia and his colleagues have created a data structure that makes a previously impractical task routine: analyzing millions of genomes while preserving mutational and phylogenetic context.
As genomic medicine moves toward its promise of truly personalized healthcare, compressive pangenomics may prove to be enabling infrastructure as fundamental as the sequencing technologies that created the data deluge. The ability to store, share, and analyze genetic information at population scale while maintaining evolutionary context could accelerate discoveries across medicine, agriculture, and evolutionary biology. In an era when data storage seems like a mundane concern compared to the glamour of AI and precision medicine, PanMAN reminds us that foundational advances in how we represent information can be as transformative as any breakthrough in analysis or application.
Verified Sources with Formal Citations
[1] National Human Genome Research Institute. "The Cost of Sequencing a Human Genome." Genome.gov. Updated 2023. https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost
[2] Stephens, Z. D., et al. "Big Data: Astronomical or Genomical?" PLOS Biology 13.7 (2015): e1002195. https://doi.org/10.1371/journal.pbio.1002195
[3] UK Biobank. "Whole Genome Sequencing." https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/genetic-data/whole-genome-sequencing
[4] Walia, S., Motwani, H., Tseng, Y. H., et al. "Compressive pangenomics using mutation-annotated networks." Nature Genetics (2026). https://doi.org/10.1038/s41588-025-02478-7
[5] Felsenstein, J. "Inferring Phylogenies." Sinauer Associates, 2004.
[6] Hodcroft, E. B., et al. "Spread of a SARS-CoV-2 variant through Europe in the summer of 2020." Nature 595.7869 (2021): 707-712. https://doi.org/10.1038/s41586-021-03677-y
[7] GISAID. "Global Initiative on Sharing All Influenza Data." https://www.gisaid.org (Accessed 2024)
[8] Garrison, E., et al. "Variation graph toolkit improves read mapping by representing genetic variation in the reference." Nature Biotechnology 36.9 (2018): 875-879. https://doi.org/10.1038/nbt.4227
[9] Garrison, E., et al. "Building pangenome graphs." bioRxiv (2023). https://doi.org/10.1101/2023.04.05.535718
[10] Genome Reference Consortium. "Human Reference Genome." https://www.ncbi.nlm.nih.gov/grc/human
[11] von Wintersdorff, C. J., et al. "Dissemination of Antimicrobial Resistance in Microbial Ecosystems through Horizontal Gene Transfer." Frontiers in Microbiology 7 (2016): 173. https://doi.org/10.3389/fmicb.2016.00173
[12] Alkan, C., et al. "Genome structural variation discovery and genotyping." Nature Reviews Genetics 12.5 (2011): 363-376. https://doi.org/10.1038/nrg2958
[13] The 1000 Genomes Project Consortium. "A global reference for human genetic variation." Nature 526.7571 (2015): 68-74. https://doi.org/10.1038/nature15393
[14] Edgar, R. C. "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Research 32.5 (2004): 1792-1797. https://doi.org/10.1093/nar/gkh340
[15] All of Us Research Program. "About the Program." National Institutes of Health. https://allofus.nih.gov/about/about-all-us-research-program
[16] All of Us Research Program. "Enrollment Milestones." 2024. https://allofus.nih.gov/news-events/announcements/all-us-research-program-reaches-750000-participants
[17] Greaves, M., and C. C. Maley. "Clonal evolution in cancer." Nature 481.7381 (2012): 306-313. https://doi.org/10.1038/nature10762
[18] Naveed, M., et al. "Privacy in the Genomic Era." ACM Computing Surveys 48.1 (2015): 1-44. https://doi.org/10.1145/2767007
[19] Ayday, E., et al. "Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data." Proceedings of USENIX Security (2013). https://www.usenix.org/conference/usenixsecurity13/technical-sessions/paper/ayday
[20] Computational Pan-Genomics Consortium. "Computational pan-genomics: status, promises and challenges." Briefings in Bioinformatics 19.1 (2018): 118-135. https://doi.org/10.1093/bib/bbw089
[21] Liao, W. W., et al. "A draft human pangenome reference." Nature 617.7960 (2023): 312-324. https://doi.org/10.1038/s41586-023-05896-x
[22] Paten, B., et al. "Genome graphs and the evolution of genome inference." Genome Research 27.5 (2017): 665-676. https://doi.org/10.1101/gr.214155.116
[23] Jumper, J., et al. "Highly accurate protein structure prediction with AlphaFold." Nature 596.7873 (2021): 583-589. https://doi.org/10.1038/s41586-021-03819-2
