2025 Hyper Recent •CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

July 4th, 2025
Version: 2
Kansas State University
bioinformatics
biorxiv

PepBERT: Lightweight language models for bioactive peptide representation

Du, Z. • Caragea, D. • Guo, X. • Li, Y.

Protein language models (pLMs) have been widely adopted for protein- and peptide-related downstream tasks and have demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Clusters (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model, PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters), were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and the value-added utilization of food processing by-products. The datasets, source code, pretrained models, and tutorials for using PepBERT are available at https://github.com/dzjxzyd/PepBERT.
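The underrepresentation figure quoted in the abstract (the share of sequences with fewer than 50 residues) is a simple length statistic over a sequence collection. The following is a minimal sketch of how such a fraction could be computed from FASTA-formatted text. It is not the authors' pipeline: the helper names `parse_fasta` and `short_fraction`, the `max_len` parameter, and the toy sequences are all hypothetical and exist only for illustration.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    Hypothetical helper for illustration; multi-line sequences are joined.
    """
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_fraction(text: str, max_len: int = 50) -> float:
    """Return the fraction of sequences with fewer than max_len residues."""
    lengths = [len(seq) for _, seq in parse_fasta(text)]
    if not lengths:
        return 0.0
    return sum(n < max_len for n in lengths) / len(lengths)


# Toy input (made-up sequences, not real UniRef entries).
TOY_FASTA = "\n".join([
    ">pep1", "GLFDIVKKVV",          # 10 residues
    ">pep2", "ACDEFGHIK", "LMNPQ",  # 14 residues, split over two lines
    ">prot1", "MK" + "A" * 58,      # 60 residues
    ">edge49", "G" * 49,            # 49 residues: counted as short
    ">edge50", "G" * 50,            # 50 residues: not counted as short
])

if __name__ == "__main__":
    print(f"Fraction shorter than 50 residues: {short_fraction(TOY_FASTA):.2f}")
```

On a real collection the same statistic would be computed by streaming the file rather than loading it into memory; the threshold of 50 residues follows the abstract, and the strict comparison mirrors its "fewer than 50" wording.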

Similar Papers

biorxiv
Sat Jul 05 2025
DiffMethylTools: a toolbox of the detection, annotation and visualization of differential DNA methylation
DNA methylation is a compulsory and fundamental epigenetic mechanism, and its significant changes (i.e., differential methylation) regulate gene expression, cell-type specification and disease progression without altering the underlying DNA sequence. Differential methylation biomarkers were widely used as inputs for various downstream investigations, and differential methylation could be detected ...
Derbel, H. • Kinnear, E. • Wong, J. • Liu, Q.
biorxiv
Sat Jul 05 2025
PULPO: Pipeline of understanding large-scale patterns of oncogenomic signatures
PULPO v1.0 is a novel, fully automated pipeline designed for the preprocessing and extraction of mutational signatures from raw Optical Genome Mapping (OGM) data. Built using Snakemake and executed within an isolated, Conda-managed environment, PULPO transforms complex cytogenetic alterations, captured at ultra-high resolution, into Catalogue of somatic mutations in Cancer (COSMIC)-based mutational s...
Portasany-Rodriguez, M. • Soria-Alcaide, G. • G.Sanchez, E. • Ivanova, M. ... • Garcia-Martinez, J.
biorxiv
Sat Jul 05 2025
Regulation Flow Analysis discovers molecular mechanisms of action from large knowledge databases
Drug development is a long and expensive process, with only a small fraction of potential drugs being finally approved. The challenge of drug development is rooted in our limited understanding of biological systems and the disease processes that drugs are trying to modulate. We propose a novel method, called Regulation Flow Analysis (RFA), which is based on the principles of biological regulation,...
Roca, C. P. • Sysoev, O. • Eyre, E. • Galan, S. ... • Mangion, J.
biorxiv
Sat Jul 05 2025
Modelling punctuated similarity
Inter-subject, pairwise similarity models provide a methodological resource for flexibly measuring complex, non-linear relationships between brain and behavior. Similarity models, however, can extend beyond brain behavior relationships and can be readily applied to any data where they may be useful. The work presented in this paper introduces a new way of modelling similarity, termed punctuated si...
Crockford, S. K.
biorxiv
Sat Jul 05 2025
Mechanistic modeling and machine learning identifies optimum radiotherapy schedules to prevent treatment-induced metastasis
Lung cancer patients often experience increased metastasis formation after radiotherapy. However, it is incompletely understood whether radiation affects the migratory behavior of tumor cells and how altered radiotherapy schedules might mitigate this risk. To address these questions, we performed live-cell microscopy experiments to profile changes in cell migration during radiation across 12 cance...
Graser, C. • Zhou, Z. • Schürch, M. • Moorhead, G. ... • Michor, F.
biorxiv
Sat Jul 05 2025
A moderated statistical test for detecting shifts in homeolog expression ratios in allopolyploids
Allopolyploids arise through hybridization between related species, carrying multiple sets of chromosomes from distinct progenitors, referred to as subgenomes. Within allopolyploids, duplicated genes across subgenomes, called homeologs, are thought to enhance environmental robustness by shifting their expression ratios depending on environmental and developmental changes. However, existing methods ...
Sun, J. • Sese, J. • Shimizu, K. K.
biorxiv
Sat Jul 05 2025
In Situ Inference of Copy Number Variations in Image-Based Spatial Transcriptomics
Copy number variations (CNVs) drive cancer progression. So far, spatial CNV inference has relied on whole transcriptome-based sequencing technologies. However, advances in image-based spatial transcriptomics (iST) now enable high-plex gene measurement in situ. Here, we introduce an approach that adapts CNV inference to iST data, enabling spatial mapping of malignant clones and the tumor microenvir...
Jensen, A. E. V. • Crowell, H. • Pascual Reguant, A. • Ruano, I. ... • Marco Salas, S.
biorxiv
Sat Jul 05 2025
OMIDIENT: Multiomics Integration for Cancer by Dirichlet Auto-Encoder Networks
To achieve a more comprehensive understanding of cancer, novel computational methods are required for the integrative analysis of data from different molecular layers, such as genomics, transcriptomics, and epigenomics. Here, we present a novel multi-omics integrative method that performs unsupervised representation learning, referred to as OMIDIENT: multiOMics Integration for cancer by DIrichlet ...
Safinianaini, N. • Valimaki, N. • Bresson, R. • Gorbonos, A. ... • Marttinen, P.
biorxiv
Sat Jul 05 2025
Fold-Conditioned De Novo Binder Design via AlphaFold2-Multimer Hallucination.
De novo protein binder design has been revolutionized by deep learning methods, yet controlling binder topology remains a challenge. We introduce a fold-conditioned AlphaFold2-Multimer hallucination framework - FoldCraft - guided by a contact map similarity loss, enabling precise generation of binders with user-defined structural folds. This single loss function enforces fold-specific geometry whi...
Rustamov, K. R. • Baev, A. Y.
biorxiv
Sat Jul 05 2025
GatorST: A Versatile Contrastive Meta-Learning Framework for Spatial Transcriptomic Data Analysis
Introduction: Recent advances in spatial transcriptomics (ST) technologies have revolutionized our understanding of cellular functions by providing gene expression profiles with rich spatial context. Effectively learning spatial representations is crucial for downstream analyses and requires robust integration of spatial information with transcriptomic data. While existing methods have shown promi...
Wang, S. • Liu, Y. • Zhang, Z. • Song, Q. • Bian, J.