2025 Hyper Recent • CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

June 30th, 2025
Version: 3
BGI Research, Hangzhou 310030, China; BGI Research, Shenzhen 518083, China
bioinformatics
biorxiv

FEDRANN: effective long-read overlap detection based on dimensionality reduction and approximate nearest neighbors

Zhang, J.-Y. • Miao, C. • Qiu, T. • Xia, X. • He, L. • He, J. • Yang, C. • Sun, Y. • Zeng, T. • Li, Y. et al.

Overlap detection is a key step in de novo genome assembly pipelines based on the Overlap-Layout-Consensus (OLC) paradigm. However, existing overlap detection methods rely on either heuristic seed-and-extension strategies or locality-sensitive hashing (LSH), both of which struggle with repetitive genomic regions and with the computational burden of large-scale datasets. Here, we present FEDRANN, a novel strategy for overlap graph construction that integrates feature extraction, dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. We find that a pipeline combining inverse document frequency (IDF) transformation, sparse random projection (SRP), and NNDescent accurately detects overlaps across diverse datasets. We developed an efficient open-source implementation of this pipeline, named Fedrann (https://github.com/jzhang-dev/fedrann). Through systematic benchmarking on real long-read sequencing data, we demonstrate that Fedrann produces overlap graphs comparable to or better than those generated by existing state-of-the-art tools, including MECAT2, minimap2, and wtdbg2, while maintaining competitive runtime. Despite being implemented primarily in Python, Fedrann achieves performance on par with tools written in compiled languages, owing to matrix-based representations and C-accelerated numerical libraries. Our results suggest that DR and ANN techniques offer a promising new direction for scalable and accurate overlap detection in long-read assembly and in broader sequence similarity search tasks.
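
The pipeline described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not Fedrann's implementation: the function names, the k-mer size, the projection dimension, the density, and the toy data are all assumptions made for illustration, and an exact brute-force cosine search stands in for NNDescent so that the example needs no third-party packages. Reads are turned into k-mer sets, weighted by a smoothed IDF, compressed with a sparse random projection whose nonzero entries are +1 or -1, and then compared in the reduced space.

```python
import math
import random
from collections import Counter


def kmer_set(read, k):
    """Feature extraction: the set of distinct k-mers in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def idf_weights(kmer_sets):
    """Smoothed IDF: k-mers seen in many reads (e.g. repeats) get low weight."""
    n_reads = len(kmer_sets)
    doc_freq = Counter(kmer for kmers in kmer_sets for kmer in kmers)
    return {kmer: math.log((1 + n_reads) / (1 + df)) + 1.0
            for kmer, df in doc_freq.items()}


def projection_row(kmer, dim, density, seed):
    """Sparse random projection row for one k-mer: a few +1/-1 entries."""
    rng = random.Random(f"{seed}:{kmer}")  # same k-mer always gets the same row
    return [(j, 1.0 if rng.random() < 0.5 else -1.0)
            for j in range(dim) if rng.random() < density]


def embed(reads, k=15, dim=256, density=0.1, seed=0):
    """Map each read to a unit-length vector of length `dim`.

    k, dim, density and seed are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    kmer_sets = [kmer_set(read, k) for read in reads]
    idf = idf_weights(kmer_sets)
    rows = {}  # cache of projection rows, one per distinct k-mer
    vectors = []
    for kmers in kmer_sets:
        vec = [0.0] * dim
        for kmer in sorted(kmers):  # sorted for reproducible float sums
            if kmer not in rows:
                rows[kmer] = projection_row(kmer, dim, density, seed)
            for j, sign in rows[kmer]:
                vec[j] += idf[kmer] * sign
        # Normalise so that a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def nearest_neighbors(vectors, n_neighbors):
    """Exact cosine top-n search; a brute-force stand-in for NNDescent."""
    result = []
    for i, v in enumerate(vectors):
        scored = sorted(
            ((sum(a * b for a, b in zip(v, w)), j)
             for j, w in enumerate(vectors) if j != i),
            reverse=True)
        result.append([j for _, j in scored[:n_neighbors]])
    return result


# Toy data (assumption): reads tiled along a random "genome" with 50% overlap.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(3000))
reads = [genome[start:start + 600] for start in range(0, 2401, 300)]
neighbors = nearest_neighbors(embed(reads), n_neighbors=2)
for i, nbrs in enumerate(neighbors):
    print(i, nbrs)
```

On this toy input, reads that are adjacent along the simulated genome share about half of their k-mers, so they are expected to rank as each other's nearest neighbors, while non-adjacent reads share essentially none. A realistic pipeline would replace the all-pairs search with an approximate index, since comparing every pair of reads scales quadratically.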

Similar Papers

biorxiv
Mon Jun 30 2025
Controllable Protein Design by Prefix-Tuning Protein Language Models
The design of novel proteins with tailored functionalities, particularly in drug discovery and vaccine development, presents a transformative approach to addressing pressing biomedical challenges. Inspired by the remarkable success of pre-trained language models in natural language processing (NLP), protein language models (ProtLMs) have emerged as powerful tools in advancing protein science. Whil...
Luo, J. • Liu, X. • Li, J. • Zhang, Y. ... • Chen, J.

biorxiv
Mon Jun 30 2025
STORIES: learning cell fate landscapes from spatial transcriptomics
In dynamic biological processes such as development, spatial transcriptomics is revolutionizing the study of the mechanisms underlying spatial organization within tissues. Inferring cell fate trajectories from spatial transcriptomics profiled at several time points has thus emerged as a critical goal, requiring novel computational methods. Wasserstein gradient flow learning is a promising framewor...
Huizing, G.-J. • Samaran, J. • Capocefalo, D. • Audit, A. ... • Cantini, L.

biorxiv
Mon Jun 30 2025
Representation Learning Methods for Single-Cell Microscopy are Confounded by Background Cells
Deep learning models are widely used to extract feature representations from microscopy images. While these models are used for single-cell analyses, such as studying single-cell heterogeneity, they typically operate on image crops centered on individual cells with background information present, such as other cells, and it remains unclear to what extent the conclusions of single-cell analyses may...
Gupta, A. • Moses, A. • Lu, A. X.

biorxiv
Mon Jun 30 2025
Molecular characterization of unique multi-domain harbouring fungal rhodopsin for establishing their novel opto-synthetic biological usages
Organisms employ light as an external stimulus for regulating cellular functions. The light-sensitive photoreceptors detect light at varying wavelengths, activating signaling cascades and triggering a range of physiological responses. Rhodopsin is a transmembrane heptahelical protein that functions as an ion channel, or a pump, and sensory receptor, respectively. It consists of a light-sensing chr...
Kumari, A. • Kumar, A. • Sharma, K. • Pati, S. R. ... • KATERIYA, S.

biorxiv
Mon Jun 30 2025
A Systematic Benchmark of High-Accuracy PacBio Long-Read RNA Sequencing for Transcript-Level Quantification
PacBio long-read RNA sequencing resolves transcripts with greater clarity than short-read technologies, yet its quantitative performance remains under-evaluated at scale. Here, we benchmark the high-throughput PacBio Kinnex platform against Illumina short-read RNA-seq using matched, deeply sequenced datasets across a time course of endothelial cell differentiation. Compared to Illumina, Kinnex ach...
Wissel, D. • Mehlferber, M. M. • Nguyen, K. M. • Pavelko, V. ... • Sheynkman, G. M.

biorxiv
Mon Jun 30 2025
scHDeepInsight: A Hierarchical Deep Learning Framework for Precise Immune Cell Annotation in Single-Cell RNA-seq Data
Immune cell classification from single-cell RNA sequencing (scRNA-seq) presents significant challenges due to complex hierarchical relationships among cell types. We introduce scHDeepInsight, a deep learning framework that extends our previous scDeepInsight model by integrating a biologically-informed classification architecture with an adaptive hierarchical focal loss. The framework leverages our...
JIA, S. • Lysenko, A. • Boroevich, K. A. • Sharma, A. • Tsunoda, T.

biorxiv
Mon Jun 30 2025
reconcILS: A gene tree-species tree reconciliation algorithm that allows for incomplete lineage sorting
Reconciliation algorithms provide an accounting of the evolutionary history of individual gene trees given a species tree. Many reconciliation algorithms consider only duplication and loss events (and sometimes horizontal transfer), ignoring effects of the coalescent process, including incomplete lineage sorting (ILS). Here, we present a new algorithm for carrying out reconciliation that accuratel...
Mishra, S. • Smith, M. L. • Hahn, M. W.

biorxiv
Mon Jun 30 2025
Identifying Optimal Machine Learning Approaches for Microbiome-Metabolomics Integration with Stable Feature Selection
Microbiome research has been limited by methodological inconsistencies. Taxonomy-based profiling presents challenges such as data sparsity, variable taxonomic resolution, and the reliance on DNA-based profiling, which provides limited functional insight. Multi-omics integration has emerged as a promising approach to link microbiome composition with function. However, the lack of standardized metho...
Palmer, S. N. • Mishra, A. A. • Gan, S. • Liu, D. ... • Zhan, X.

biorxiv
Mon Jun 30 2025
Genomic Touchstone: Benchmarking Genomic Language Models in the Context of the Central Dogma
The emergence of genomic language models (gLMs) has revolutionized the analysis of genomic sequences, enabling robust capture of biologically meaningful patterns from DNA sequences for an improved understanding of human genome-wide regulatory programs, variant pathogenicity and therapeutic discovery. Given that DNA serves as the foundational blueprint within the central dogma, the ultimate evaluat...
Wang, Y. • Cai, Z. • Zeng, Q. • Gao, Y. ... • Chen, H.