April 3rd, 2025
Version: 2
Botnar Research Centre, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, National Institute of Health Research Oxford Biomedical
molecular biology
biorxiv

mclUMI: Markov clustering of unique molecular identifiers enables dynamic removal of PCR duplicates

Molecular quantification in high-throughput sequencing experiments relies on accurate identification and removal of polymerase chain reaction (PCR) duplicates. Unique Molecular Identifiers (UMIs) have become a standard means of distinguishing molecular identities in sequencing protocols. However, PCR artefacts and sequencing errors in UMIs make it difficult to collapse UMIs effectively and count molecules accurately. Current computational strategies for UMI collapsing often have limited flexibility: they return fixed deduplicated counts that do not adapt to varying experimental conditions. To address these limitations, we developed mclUMI, a tool that uses the Markov clustering algorithm to identify original UMIs and eliminate PCR duplicates. Unlike conventional methods, mclUMI automates the detection of independent communities within UMI graphs by dynamically fine-tuning the inflation and expansion parameters, enabling context-dependent merging of UMIs based on their connectivity patterns. Through in silico experiments, we demonstrate that mclUMI produces deduplication outcomes that adapt to diverse experimental scenarios and that it performs best under high sequencing error rates. By integrating connectivity-driven clustering, mclUMI improves the accuracy of molecular counting in noisy sequencing environments and addresses the rigidity of current UMI deduplication frameworks.
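The idea of Markov clustering on a UMI graph can be illustrated with a minimal sketch. This is not mclUMI's implementation: the function names are hypothetical, the rule that two UMIs are connected when their Hamming distance is at most 1 is an assumption, and the expansion and inflation parameters are held fixed here rather than tuned dynamically as the abstract describes.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def build_adjacency(umis, max_dist=1):
    """Adjacency matrix with self-loops; edge rule is an assumption."""
    n = len(umis)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        if hamming(umis[i], umis[j]) <= max_dist:
            adj[i][j] = adj[j][i] = 1.0
    return adj


def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(umis, expansion=2, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster unique UMIs by alternating expansion and inflation steps."""
    m = normalize_columns(build_adjacency(umis))
    for _ in range(max_iter):
        # Expansion: raise the matrix to an integer power (random-walk spread).
        expanded = m
        for _ in range(expansion - 1):
            expanded = matmul(expanded, m)
        # Inflation: elementwise power, then renormalize each column.
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        delta = max(abs(x - y)
                    for r1, r2 in zip(inflated, m) for x, y in zip(r1, r2))
        m = inflated
        if delta < tol:
            break
    # Each row that keeps non-zero mass belongs to an attractor; the columns
    # it covers form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(umis[j] for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return clusters


if __name__ == "__main__":
    for cluster in markov_cluster(["AAAA", "AAAT", "AATA", "GGGG", "GGGC"]):
        print(sorted(cluster))
```

In this toy input, each resulting cluster would be counted as one original molecule. Raising `inflation` generally yields finer clusters, which is the kind of knob the abstract refers to when it describes context-dependent merging.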
