2025 Hyper Recent • CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

January 22nd, 2025
Version: 1
Harvard Medical School
bioinformatics
biorxiv

Phyla: Towards a Foundation Model for Phylogenetic Inference

Shen, A. • Ektefaie, Y. • Jain, L. • Farhat, M. R. • Zitnik, M.

Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on learning individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for inter-sequence reasoning could enhance their generalizability and performance for phylogenetic inference and other tasks in computational biology. Here, we report the ongoing development of Phyla, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. Phyla employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate Phyla's capabilities, we applied it to reconstruct the tree of life, where Phyla accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. Phyla represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.
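
The abstract does not describe how a tree is obtained from the model's sequence representations. As a rough illustration of the general idea of reconstructing a tree from per-sequence embeddings, the sketch below assumes that each sequence has already been mapped to a vector, that vectors are compared by Euclidean distance, and that the tree is built with UPGMA (average-linkage clustering). All three are assumptions made for this example, not details confirmed by the abstract, and the names `pairwise_distances` and `upgma` are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of named embedding vectors."""
    return {
        frozenset((a, b)): math.dist(embeddings[a], embeddings[b])
        for a, b in combinations(embeddings, 2)
    }


def upgma(embeddings):
    """Return a nested-tuple tree built by average-linkage clustering.

    Assumption: embeddings is a dict mapping a sequence name to a vector.
    This is a generic distance-based method, not Phyla's own procedure.
    """
    dist = pairwise_distances(embeddings)
    sizes = {name: 1 for name in embeddings}  # number of leaves per cluster

    while len(sizes) > 1:
        # Merge the two clusters that are currently closest.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair, key=str)  # stable left/right order in the tuple
        merged = (a, b)

        # Distance from the merged cluster to each remaining cluster is the
        # leaf-count-weighted average of the two child distances.
        for other in sizes:
            if other in pair:
                continue
            d_a = dist[frozenset((a, other))]
            d_b = dist[frozenset((b, other))]
            total = sizes[a] + sizes[b]
            dist[frozenset((merged, other))] = (
                d_a * sizes[a] + d_b * sizes[b]
            ) / total

        # Drop every distance that still refers to a or b on its own.
        for key in [k for k in dist if a in k or b in k]:
            del dist[key]
        sizes[merged] = sizes.pop(a) + sizes.pop(b)

    return next(iter(sizes))


# Toy vectors standing in for learned sequence embeddings (made-up values).
toy = {"A": [0.0, 0.0], "B": [0.0, 1.0], "C": [5.0, 0.0], "D": [5.0, 1.5]}
tree = upgma(toy)
```

With these toy vectors, A groups with B and C groups with D before the two pairs are joined at the root. A learned model would replace the toy vectors with its own embeddings, and a different tree-building method, such as neighbor joining, could be substituted for UPGMA.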
