2025 Hyper Recent • CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

June 5th, 2025
Version: 1
Karlsruhe Institute of Technology
bioinformatics
biorxiv

Bit-Reproducible Phylogenetic Tree Inference under Varying Core-Counts via Reproducible Parallel Reduction Operators

Stelz, C. • Huebner, L. • Stamatakis, A.

Motivation: Phylogenetic trees describe the evolutionary history among biological species based on their genomic data. Maximum Likelihood (ML) based phylogenetic inference tools search for the tree and evolutionary model that best explain the observed genomic data. Because the likelihood scores of different genomic sites are calculated independently, these calculations are commonly parallelized, followed by a parallel summation over the per-site scores to obtain the overall likelihood score of the tree. However, basic arithmetic operations on IEEE 754 floating-point numbers, such as addition and multiplication, inherently introduce rounding errors. Because these operations are not associative, the order in which they are executed affects the exact resulting likelihood value. Moreover, parallel reduction algorithms in numerical codes re-associate operations as a function of the core count and cluster network topology, inducing different round-off errors. These low-level deviations can cause heuristic searches to diverge and induce high-level result discrepancies (e.g., yield topologically distinct phylogenies). This effect has also been observed in multiple scientific fields beyond phylogenetics.

Results: We observe that varying the degree of parallelism results in diverging phylogenetic tree searches (high-level results) for over 31 % of 10 130 empirical datasets. More importantly, 8 % of these diverging datasets yield trees that are statistically significantly worse than the best-known ML tree for the dataset (AU-test, p < 0.05). To alleviate this, we develop a variant of the widely used phylogenetic inference tool RAxML-NG that yields bit-reproducible results under varying core-counts, with a slowdown of only 0 to 12.7 % (median 0.8 %) on up to 768 cores.
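The order dependence described above can be reproduced on a single machine. The following is a minimal sketch, not code from RAxML-NG: the function `chunked_sum` and the four input values are hypothetical, chosen only to make the rounding visible. It mimics a naive parallel summation in which each of `num_chunks` workers sums a contiguous slice and the partial sums are then added in worker order.

```python
def chunked_sum(values, num_chunks):
    """Sum `values` like a naive parallel reduction would.

    Each of `num_chunks` hypothetical workers sums one contiguous slice
    from left to right; the partial sums are then added in worker order.
    Changing `num_chunks` changes how the additions are associated.
    """
    n = len(values)
    # Slice boundaries: worker i owns values[bounds[i]:bounds[i + 1]].
    bounds = [n * i // num_chunks for i in range(num_chunks + 1)]
    partials = []
    for lo, hi in zip(bounds, bounds[1:]):
        acc = 0.0
        for v in values[lo:hi]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Illustrative operands (an assumption, not data from the paper).
# In IEEE 754 double precision, 1e16 + 1.0 rounds back to 1e16.
values = [1.0, 1e16, -1e16, 1.0]

print(chunked_sum(values, 1))  # 1.0
print(chunked_sum(values, 2))  # 0.0
print(chunked_sum(values, 4))  # 1.0

# The same effect with everyday numbers: addition is not associative.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False
```

The mathematically exact sum of these four values is 2.0, so neither grouping is exact; the point is only that the result changes with the number of workers.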
We further introduce the ReproRed reduction algorithm, which yields bit-identical results under varying core-counts by maintaining a fixed operation order that is independent of the communication pattern. ReproRed is thus applicable to all associative reduction operations, in contrast to competitors, which are confined to summation. Our ReproRed reduction algorithm exchanges only the theoretical minimum number of messages, overlaps communication with computation, and utilizes fast base-cases for local reductions. ReproRed is able to all-reduce (via a subsequent broadcast) 4.1 · 10^6 operands across 48 to 768 cores in 19.7 to 48.61 µs, thereby exhibiting a slowdown of 13 to 93 % over a non-reproducible all-reduce algorithm. ReproRed outperforms the state-of-the-art reproducible all-reduction algorithm ReproBLAS (which offers summation only) beyond 10 000 elements per core. In summary, we re-assess non-reproducibility in parallel phylogenetic inference, present the first bit-reproducible parallel phylogenetic inference tool, and introduce a general algorithm and open-source code for conducting reproducible associative parallel reduction operations.
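The idea of a fixed operation order that does not depend on the number of cores can be illustrated with a small single-process simulation. This is a sketch under stated assumptions and not the ReproRed implementation: the function name `simulated_fixed_order_reduce`, the midpoint splitting rule, and the message counter are all hypothetical. The reduction tree is defined over global element indices only; the worker boundaries decide who would compute which subtree, but never how operands are grouped.

```python
from bisect import bisect_right
import operator


def simulated_fixed_order_reduce(values, op, num_workers):
    """Reduce `values` with `op` along a tree fixed by global indices.

    The tree always splits the index range [lo, hi) at its midpoint, so
    the grouping of operands is the same for every `num_workers`.
    Returns (result, messages), where `messages` counts tree nodes whose
    two children would live on different simulated workers.
    """
    n = len(values)
    if n == 0:
        raise ValueError("cannot reduce an empty sequence")
    # Worker w owns the global indices bounds[w] .. bounds[w + 1] - 1.
    bounds = [n * i // num_workers for i in range(num_workers + 1)]
    messages = 0

    def owner(index):
        # Last worker whose range starts at or before `index`.
        return bisect_right(bounds, index) - 1

    def reduce_range(lo, hi):
        nonlocal messages
        if hi - lo == 1:
            return values[lo]
        mid = (lo + hi) // 2  # depends only on lo and hi, not on workers
        left = reduce_range(lo, mid)
        right = reduce_range(mid, hi)
        if owner(lo) != owner(mid):
            messages += 1  # the right partial result would be sent over
        return op(left, right)

    return reduce_range(0, n), messages


# Illustrative operands (an assumption, not data from the paper).
values = [1.0, 1e16, -1e16, 1.0]
for workers in (1, 2, 3, 4):
    print(workers, simulated_fixed_order_reduce(values, operator.add, workers))
```

In this sketch every worker count returns the same floating-point result, because the same tree is evaluated each time. Note that reproducible does not mean more accurate: the fixed tree yields 0.0 for these operands, whereas the exact sum is 2.0. Because only `op` is applied along the tree, the same function also works for other reduction operators such as `operator.mul` or `max`.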
