2025 Hyper Recent •CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

March 6th, 2025
Version: 3
University of Wisconsin - Madison
bioinformatics
biorxiv

skDER & CiDDER: two scalable approaches for microbial genome dereplication

Salamzade, R. • Kottapalli, A. • Kalan, L. R.

An abundance of microbial genomes has been sequenced in the past two decades. For fundamental comparative genomic investigations, where the goal is to determine the major gain and loss events shaping the pangenome of a species, it is often unnecessary and computationally onerous to include all available genomes. In addition, over-representation of specific lineages due to sampling and sequencing bias can have undesired effects on evolutionary analyses. To assist users with genomic dereplication, that is, selecting a subset of representative genomes for downstream comparative genomic investigations, we developed skDER & CiDDER (https://github.com/raufs/skDER). skDER combines recent advances in efficiently estimating average nucleotide identity (ANI) between thousands of microbial genomes with two efficient algorithms for genomic dereplication. CiDDER implements an approach whereby protein clusters are determined across all genomes and genomes are iteratively selected as representatives until a user-defined saturation of the total protein space is achieved. To support ease of use, several auxiliary functionalities are implemented within the two programs, including arguments to: (i) test the number of representative genomes resulting from a variety of clustering parameters, (ii) automate downloading of genomes belonging to a bacterial species or genus by name, (iii) assign non-representative genomes to their closest representative genomes, and (iv) automatically filter predicted plasmids and phages prior to dereplication. We further assess the effects of filtering mobile genetic elements (MGEs) on ANI and alignment fraction (AF) estimates between pairs of genomes and find that MGEs tend to slightly deflate both metrics in one species.
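The saturation-based selection described for CiDDER can be illustrated with a minimal Python sketch. This is not the CiDDER implementation: the abstract does not state how the next representative is chosen, so the sketch assumes a greedy rule (pick the genome that adds the most not-yet-covered protein clusters). The function name `select_representatives`, the `saturation` parameter, and the toy genome and cluster identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def select_representatives(
    genome_clusters: Dict[str, Set[str]],
    saturation: float = 0.9,
) -> List[str]:
    """Pick genomes until `saturation` of all protein clusters is covered.

    ASSUMPTION: the greedy "largest gain first" rule is illustrative only;
    the abstract does not specify the selection criterion.
    """
    if not 0.0 < saturation <= 1.0:
        raise ValueError("saturation must be in (0, 1]")

    # Total protein space: every cluster seen in at least one genome.
    all_clusters: Set[str] = set().union(*genome_clusters.values())
    if not all_clusters:
        return []

    covered: Set[str] = set()
    representatives: List[str] = []
    remaining = dict(genome_clusters)

    while remaining and len(covered) / len(all_clusters) < saturation:
        # Genome contributing the most uncovered clusters; iterating over
        # sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda g: len(remaining[g] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining genome adds anything new
        covered |= gain
        representatives.append(best)

    return representatives


# Toy input: genome name -> set of protein cluster IDs (hypothetical data).
toy = {
    "genome_A": {"c1", "c2", "c3", "c4"},
    "genome_B": {"c1", "c2", "c3"},
    "genome_C": {"c4", "c5"},
    "genome_D": {"c6"},
}
print(select_representatives(toy, saturation=0.8))
print(select_representatives(toy, saturation=1.0))
```

In this toy input, genome_B is never selected because its clusters are a subset of genome_A's, which is the kind of redundancy dereplication is meant to remove. Raising `saturation` toward 1.0 trades a larger representative set for fuller coverage of the protein space.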
