Bit-Reproducible Phylogenetic Tree Inference under Varying Core-Counts via Reproducible Parallel Reduction Operators

Stelz, C.Open in Google Scholar•Huebner, L.Open in Google Scholar•Stamatakis, A.Open in Google Scholar

Motivation: Phylogenetic trees describe the evolutionary history among biological species based on their genomic data. Maximum Likelihood (ML) based phylogenetic inference tools search for the tree and evolutionary model that best explain the observed genomic data. Given the independence of likelihood score calculations between different genomic sites, parallel computation is commonly deployed. This is followed by a parallel summation over the per-site scores to obtain the overall likelihood score of the tree. However, basic arithmetic operations on IEEE 754 floating-point numbers, such as addition and multiplication, inherently introduce rounding errors. Consequently, the order by which floating-point operations are executed affects the exact resulting likelihood value since these operations are not associative. Moreover, parallel reduction algorithms in numerical codes re-associate operations as a function of the core count and cluster network topology, inducing different round-off errors. These low-level deviations can cause heuristic searches to diverge and induce high-level result discrepancies (e.g., yield topologically distinct phylogenies). This effect has also been observed in multiple scientific fields, beyond phylogenetics. Results: We observe that varying the degree of parallelism results in diverging phylogenetic tree searches (high level results) for over 31 % out of 10 130 empirical datasets. More importantly, 8 % of these diverging datasets yield trees that are statistically significantly worse than the best known ML tree for the dataset (AU-test, p < 0.05). To alleviate this, we develop a variant of the widely used phylogenetic inference tool RAxML-NG, which does yield bit-reproducible results under varying core-counts, with a slowdown of only 0 to 12.7 % (median 0.8 %) on up to 768 cores. We further introduce the ReproRed reduction algorithm, which yields bit-identical results under varying core-counts, by maintaining a fixed operation order that is independent of the communication pattern. ReproRed is thus applicable to all associative reduction operations - in contrast to competitors, which are confined to summation. Our ReproRed reduction algorithm only exchanges the theoretical minimum number of messages, overlaps communication with computation, and utilizes fast base-cases for local reductions. ReproRed is able to all-reduce (via a subsequent broadcast) 4.1 {middle dot} 106 operands across 48 to 768 cores in 19.7 to 48.61 {micro}s, thereby exhibiting a slowdown of 13 to 93 % over a non-reproducible all-reduce algorithm. ReproRed outperforms the state-of-the-art reproducible all-reduction algorithm ReproBLAS (offers summation only) beyond 10 000 elements per core. In summary, we re-assess non-reproducibility in parallel phylogenetic inference, present the first bit-reproducible parallel phylogenetic inference tool, as well as introduce a general algorithm and open-source code for conducting reproducible associative parallel reduction operations.

Bit-Reproducible Phylogenetic Tree Inference under Varying Core-Counts via Reproducible Parallel Reduction Operators

Similar Papers