2025 Hyper Recent •CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

July 17th, 2025
Version: 2
Guangzhou Medical University, Guangzhou National Laboratory
bioinformatics
biorxiv

scDNAm-GPT: A Foundation Model for Capturing Long-Range CpG Dependencies in Single-Cell Whole-Genome Bisulfite Sequencing to Enhance Epigenetic Analysis

Liang, C. • Ye, P. • Yan, H. • Zheng, P. • Sun, J. • Wang, Y. • Li, Y. • Ren, Y. • Jiang, Y. • Wei, R. et al.

Accurately identifying development- and disease-associated DNA methylation features from single-cell DNA methylation data remains challenging because of the genome-wide scale of the data and the sparse, stochastic nature of CpG coverage. We present scDNAm-GPT, a framework that integrates CpG token design, a Mamba backbone, and a cross-attention head to efficiently process ultra-long sequences while preserving both local CpG interactions and broader genomic context. Pretrained on over one million single cells from 28 human and mouse tissues, scDNAm-GPT effectively reconstructs sparse methylation landscapes, enhancing the resolution and accuracy of epigenetic analyses. It outperforms existing methods across key biomedical applications, including cell clustering, trajectory inference for precise mapping of differentiation pathways, identification of disease-relevant DNA methylation features, and robust, reference-free cell type deconvolution from cfDNA data. scDNAm-GPT learns regulatory features in a hierarchical manner, and its attention scores exhibit high biological interpretability by highlighting functionally relevant CpG regions. These advancements establish scDNAm-GPT as a scalable and generalizable solution for single-cell epigenomic analysis, paving the way for broader applications in single-cell DNA methylation profiling and uncovering novel insights into the epigenetic mechanisms underlying health and disease.
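The abstract names a "CpG token design" but does not describe it. As a rough illustration of the general idea only, the sketch below turns one cell's sparse CpG calls into an ordered token sequence. The scheme itself (flanking sequence context plus a methylated/unmethylated flag), the function name `tokenize_cell`, and the `flank` parameter are all assumptions made for this example and are not taken from the preprint.

```python
from typing import Dict, List


def tokenize_cell(reference: str, calls: Dict[int, int], flank: int = 2) -> List[str]:
    """Turn one cell's sparse CpG calls into an ordered token list.

    Hypothetical scheme (not from the preprint): each covered CpG becomes
    "<sequence context>:<state>", where state is M (methylated) or U.
    `calls` maps the 0-based position of the C in a CpG to 1 or 0.
    """
    # Pad with N so CpGs near the sequence ends still get a full-length context.
    padded = "N" * flank + reference.upper() + "N" * flank
    tokens = []
    for pos in sorted(calls):  # genomic order; uncovered CpGs are simply absent
        if reference[pos:pos + 2].upper() != "CG":
            raise ValueError(f"position {pos} is not a CpG site")
        context = padded[pos:pos + 2 + 2 * flank]
        state = "M" if calls[pos] else "U"
        tokens.append(f"{context}:{state}")
    return tokens


# Toy reference with CpGs at positions 3, 7 and 11; the CpG at 7 has no coverage.
reference = "TTACGGACGTTCGAT"
calls = {3: 1, 11: 0}
print(tokenize_cell(reference, calls))  # ['TACGGA:M', 'TTCGAT:U']
```

Because uncovered sites are skipped rather than filled in, the token sequence stays as sparse as the cell's coverage, which is the property the abstract highlights as the main difficulty of single-cell methylation data.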

Similar Papers

biorxiv
Fri Jul 18 2025
A Deep Learning-based Method for Drug Molecule Representation and Property Prediction
Accurately and robustly representing drug molecule features, predicting drug-target biomacromolecule interactions, and determining drug molecule physicochemical properties are crucial in drug development. However, due to issues such as insufficient generalization ability of single-modal representation, lack of multi-task prediction frameworks, and weak adaptability in cold-start scenarios, thes...
Zhang, Q. • Yu, X. • Wei, Y. • Wang, Z.-H. • Yu, D.-J.
biorxiv
Thu Jul 17 2025
Mapping the Metalloproteome of Deinococcus indicus DR1 through Integrative Structure and Function Annotation
Deinococcus indicus DR1 is a rod-shaped bacterium isolated from the Dadri wetlands (Uttar Pradesh, India) that tolerates ionizing radiation and arsenic. The molecular basis of its wider heavy-metal resilience, particularly among the 1017 out of 4128 proteins still annotated as hypothetical, remains unclear. We performed a proteome-wide structural and functional survey to address this gap. All the ...
Ramesh, S. D. • Vasan, G. • Senthilkumar, S. • Thambiraja, M. ... • Yennamalli, R. M.
biorxiv
Thu Jul 17 2025
A periodic table of bacteria?: Mapping bacterial diversity in trait space
Bacterial diversity can be overwhelming. There is an ever-expanding number of bacterial taxa being discovered, but many of these taxa remain uncharacterized with unknown traits and environmental preferences. This diversity makes it challenging to interpret ecological patterns in microbiomes and understand why individual taxa, or assemblages, may vary across space and time. While we can use informa...
Hoffert, M. C. • Lladser, M. E. • Gorman, E. D. • Fierer, N.
biorxiv
Thu Jul 17 2025
SVPG: A pangenome-based structural variant detection approach and rapid augmentation of pangenome graphs with new samples
Breakthrough advances in long-read sequencing technologies have opened unprecedented opportunities to study genetic variations through comprehensive pangenome analysis. However, the availability of structural variant (SV) calling tools that can effectively leverage pangenome information is limited. In addition, efficient construction of pangenome graphs becomes increasingly challenging with acquis...
Hu, H. • Gao, R. • Jiang, Z. • Cao, S. ... • Wang, G.
biorxiv
Thu Jul 17 2025
MiroSCOPE: An AI-driven digital pathology platform for annotating functional tissue units
Cancer tissue analysis in digital pathology is typically conducted across different spatial scales, ranging from high-resolution cell-level modeling to lower-resolution tile-based assessments. However, these perspectives often overlook the structural organization of functional tissue units (FTUs), the small, repeating structures which are crucial to tissue function and key factors during pathologi...
Fenner, M. R. • Sevim, S. • Wu, G. • Beavers, D. ... • Demir, E.
biorxiv
Thu Jul 17 2025
Improving causal effect estimation in multi-ancestry multivariable Mendelian randomization with transfer learning
Multivariable Mendelian randomization (MVMR) is widely used to estimate the causal effects of exposures on disease outcomes. However, its applications have been largely limited to individuals of European ancestry, due to the larger sample sizes available in European genome-wide association studies (GWAS). Although methods that jointly analyze multiple ancestries have been proposed to improve power...
Yang, Y. • Zhu, X.
biorxiv
Thu Jul 17 2025
PromoterAtlas: decoding regulatory sequences across Gammaproteobacteria using a transformer model
Recent advances in deep learning, particularly transformer architectures, have improved computational approaches for biological sequence analysis. Despite these advances, computational models for bacterial promoter prediction have remained limited by small datasets, species-specific training, and binary classification approaches rather than comprehensive annotation frameworks. We present PromoterA...
Coppens, L. • Ledesma-Amaro, R.
biorxiv
Thu Jul 17 2025
mm2-ivh: simple and precise overlap detection in alpha satellite HORs with interval hashing
Summary: We propose a new algorithm, "interval hashing," which distinguishes identical k-mers arising from different repeat sequences, particularly in complex repeat arrays such as alpha satellite HORs. We implement this algorithm as a fork of minimap2, named mm2-ivh. In local assembly of alpha satellite HORs, mm2-ivh accurately reconstructs more haplotypes than assemblers using standard minimiz...
Suzuki, H. • Sugawa, M. • Sakamoto, Y. • Shiraishi, Y.
biorxiv
Thu Jul 17 2025
Identifying associations between maize leaf transcriptome and bacteriome during different diurnal periods
Bacterial communities play important roles in the plant phyllosphere. Both microbial communities and their hosts have circadian rhythms and are subject to diurnal environmental changes. However, the interaction between the host and microbiome is still poorly understood. Here, we exploit paired sequencing data of host transcriptome and microbiome derived from maize genotypes in field conditions and unde...
dos Santos, R. A. C. • Hidalgo-Martinez, K. J. • Munoz Perez, J. M. • Laspisa, D. J. ... • Wallace, J.