2025 Hyper Recent • CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

July 17th, 2025
Version: 2
Imperial College London
bioinformatics
biorxiv

PromoterAtlas: decoding regulatory sequences across Gammaproteobacteria using a transformer model

Coppens, L. • Ledesma-Amaro, R.

Recent advances in deep learning, particularly transformer architectures, have improved computational approaches for biological sequence analysis. Despite these advances, computational models for bacterial promoter prediction have remained limited by small datasets, species-specific training, and binary classification approaches rather than comprehensive annotation frameworks. We present PromoterAtlas, a 1.8M parameter transformer model trained on 9M regulatory sequences from 3,371 gammaproteobacterial species. The model demonstrates recognition of various regulatory elements across different species, including ribosomal binding sites, various types of bacterial promoters, transcription factor binding sites, and terminators. Using this model, we developed a whole-genome promoter annotation tool for Gammaproteobacteria, with various levels of validation that support the predictions of promoters associated with different sigma (σ) factors. Furthermore, we show that the model embeddings encode cross-species evolutionary relationships, clustering promoters by σ factor identity rather than species-specific sequence features. Finally, we show that model embeddings encode regulatory sequence information that enables effective prediction of transcription and translation levels. PromoterAtlas can contribute to our understanding of and ability to engineer bacterial regulatory sequences with potential applications in bacterial biology, synthetic biology, and comparative genomics.
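
The abstract's last claim, that model embeddings enable prediction of transcription and translation levels, is commonly tested with a linear probe: a small regression model fitted on frozen embeddings. The sketch below illustrates that general idea only; it is not the authors' method. The vectors are synthetic placeholders standing in for embeddings, and the dimension `DIM`, the helper names, and the hyperparameters are all assumptions made for illustration.

```python
import random


def fit_ridge(X, y, l2=1e-3, lr=0.05, epochs=500):
    """Fit y ~ X.w + b by full-batch gradient descent with an L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Average the gradient over samples, then add the ridge term for w.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Apply the fitted linear model to each row of X."""
    return [sum(wj * xj for wj, xj in zip(w, xi)) + b for xi in X]


def r_squared(y_true, y_pred):
    """Coefficient of determination between observed and predicted values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# ASSUMPTION: synthetic stand-ins, not real PromoterAtlas embeddings or
# measured expression levels. Each "embedding" is a random vector and each
# "expression level" is a noisy linear function of it.
rng = random.Random(0)
DIM = 8  # hypothetical embedding size, chosen only to keep the example small
true_w = [rng.uniform(-1, 1) for _ in range(DIM)]


def make_dataset(n):
    X = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]
    y = [sum(a * b for a, b in zip(true_w, x)) + rng.gauss(0, 0.1) for x in X]
    return X, y


X_train, y_train = make_dataset(200)
X_test, y_test = make_dataset(50)

w, b = fit_ridge(X_train, y_train)
score = r_squared(y_test, predict(X_test, w, b))
print(f"held-out R^2 on synthetic data: {score:.3f}")
```

In a real evaluation, `X_train` would hold one pooled embedding per regulatory sequence and `y_train` the measured transcription or translation level. Keeping the probe linear means a high held-out score reflects information already present in the embeddings rather than capacity added by the probe.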

Similar Papers

biorxiv
Thu Jul 17 2025
Identifying associations between maize leaf transcriptome and bacteriome during different diurnal periods
Bacterial communities play important roles in the plant phyllosphere. Both microbial communities and their hosts have circadian rhythms and are subject to diurnal environmental changes. However, the interaction between the host and microbiome is still poorly understood. Here, we exploit paired sequencing data of host transcriptome and microbiome derived maize genotypes in field conditions and unde...
dos Santos, R. A. C. • Hidalgo-Martinez, K. J. • Munoz Perez, J. M. • Laspisa, D. J. • ... • Wallace, J.
biorxiv
Thu Jul 17 2025
Mapping the Metalloproteome of Deinococcus indicus DR1 through Integrative Structure and Function Annotation
Deinococcus indicus DR1 is a rod-shaped bacterium isolated from the Dadri wetlands (Uttar Pradesh, India) that tolerates ionizing radiation and arsenic. The molecular basis of its wider heavy-metal resilience, particularly among the 1017 out of 4128 proteins still annotated as hypothetical, remains unclear. We performed a proteome-wide structural and functional survey to address this gap. All the ...
Ramesh, S. D. • Vasan, G. • Senthilkumar, S. • Thambiraja, M. • ... • Yennamalli, R. M.
biorxiv
Thu Jul 17 2025
Improving causal effect estimation in multi-ancestry multivariable Mendelian randomization with transfer learning
Multivariable Mendelian randomization (MVMR) is widely used to estimate the causal effects of exposures on disease outcomes. However, its applications have been largely limited to individuals of European ancestry, due to the larger sample sizes available in European genome-wide association studies (GWAS). Although methods that jointly analyze multiple ancestries have been proposed to improve power...
Yang, Y. • Zhu, X.
biorxiv
Thu Jul 17 2025
A periodic table of bacteria?: Mapping bacterial diversity in trait space
Bacterial diversity can be overwhelming. There is an ever-expanding number of bacterial taxa being discovered, but many of these taxa remain uncharacterized with unknown traits and environmental preferences. This diversity makes it challenging to interpret ecological patterns in microbiomes and understand why individual taxa, or assemblages, may vary across space and time. While we can use informa...
Hoffert, M. C. • Lladser, M. E. • Gorman, E. D. • Fierer, N.
biorxiv
Thu Jul 17 2025
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging with acquis...
Hu, H. • Gao, R. • Jiang, Z. • Cao, S. • ... • Wang, G.
biorxiv
Thu Jul 17 2025
MiroSCOPE: An AI-driven digital pathology platform for annotating functional tissue units
Cancer tissue analysis in digital pathology is typically conducted across different spatial scales, ranging from high-resolution cell-level modeling to lower-resolution tile-based assessments. However, these perspectives often overlook the structural organization of functional tissue units (FTUs), the small, repeating structures which are crucial to tissue function and key factors during pathologi...
Fenner, M. R. • Sevim, S. • Wu, G. • Beavers, D. • ... • Demir, E.
biorxiv
Thu Jul 17 2025
scDNAm-GPT: A Foundation Model for Capturing Long-Range CpG Dependencies in Single-Cell Whole-Genome Bisulfite Sequencing to Enhance Epigenetic Analysis
Accurately identifying development- and disease-associated DNA methylation features from single-cell DNA methylation data remains challenging due to the genome-wide scale and the sparse, stochastic nature of CpG coverage. We present scDNAm-GPT, a novel framework that integrates CpG token design, a Mamba backbone, and a cross-attention head to efficiently process ultra-long sequences while preservi...
Liang, C. • Ye, P. • Yan, H. • Zheng, P. • ... • Li, J.
biorxiv
Thu Jul 17 2025
mm2-ivh: simple and precise overlap detection in alpha satellite HORs with interval hashing
Summary: We propose a new algorithm, "interval hashing," which distinguishes identical k-mers arising from different repeat sequences, particularly in complex repeat arrays such as alpha satellite HORs. We implement this algorithm as a fork of minimap2, named mm2-ivh. In local assembly of alpha satellite HORs, mm2-ivh accurately reconstructs more haplotypes than assemblers using standard minimiz...
Suzuki, H. • Sugawa, M. • Sakamoto, Y. • Shiraishi, Y.
biorxiv
Thu Jul 17 2025
MGMG: Cell Morphology-Guided Molecule Generation for Drug Discovery
Designing novel molecules with desired bioactivity remains a fundamental challenge in drug discovery. Most molecular design methods follow target-based drug discovery paradigms that rely on well-defined drug targets, thereby limiting their applicability to diseases lacking known targets or reference compounds. Here we introduce Morphology-Guided Molecule Generation (MGMG), a phenotypic drug discov...
Tang, Q. • Ding, D. • Yuan, X. • Seabra, G. • ... • Li, Y.