2025 Hyper Recent • CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

June 5th, 2025
Version: 1
East China Normal University
bioinformatics
biorxiv

Integrating Multimodal Data for a Comprehensive Knowledge Graph to Advance Infectious Disease Research

Fan, H. • Guo, L. • Li, F. • Yuan, Z. • Deng, Y. • Xiao, Y. • Li, H. • Li, S.

Infectious diseases remain a formidable threat to global public health, with escalating morbidity and mortality compounded by recurrent epidemics and the alarming rise of antimicrobial resistance (AMR). These challenges have intensified the demand for innovative therapeutic strategies that can accelerate drug development cycles and overcome traditional research bottlenecks. To address these needs, we present IDKG (Infectious Disease Knowledge Graph), a specialized large-scale biomedical knowledge network designed to overcome data fragmentation through multimodal data integration. IDKG systematically constructs comprehensive associations for 345 infectious diseases and 708 pathogens from heterogeneous biomedical sources. The graph comprises nearly 50,000 nodes (8 types, including Pathogen and Protein) and over 1.2 million edges (11 types, including treats and contains), establishing an interconnected framework that enables systematic interrogation of cross-disciplinary knowledge. This integrative approach breaks down conventional data silos while preserving biological context. We validated IDKG's potential by applying graph neural network-based approaches to drug repurposing prediction for human metapneumovirus (hMPV) infection, a common acute respiratory infection for which effective specific antiviral drugs are currently absent. The successful identification of established antiviral agents, such as ribavirin and emetine, by our M1 model demonstrated its predictive accuracy and biological relevance. IDKG unifies multimodal biomedical data into a single network to accelerate drug discovery and bolster outbreak response, establishing a data-driven, knowledge-based paradigm for infectious disease research.
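
To make the idea of a typed knowledge graph concrete, the following is a minimal Python sketch, not the authors' implementation. It stores typed nodes and typed edges and runs a simple two-hop lookup for drug candidates. The node types Pathogen and Protein and the edge type contains come from the abstract; the Drug node type, the targets relation, the class and function names, and the placeholder entities (protein_X, drug_A, and so on) are assumptions made purely for illustration. The lookup is a plain graph traversal, not the graph neural network approach used in the paper.

```python
from collections import defaultdict


class TypedGraph:
    """Tiny in-memory graph with typed nodes and typed (relation-labelled) edges."""

    def __init__(self):
        self.node_type = {}                 # node name -> node type
        self.out_edges = defaultdict(set)   # (source, relation) -> set of targets

    def add_node(self, name, node_type):
        self.node_type[name] = node_type

    def add_edge(self, source, relation, target):
        # Require both endpoints to exist so every edge connects typed nodes.
        if source not in self.node_type or target not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        # Return a copy so callers cannot mutate the internal edge sets.
        return set(self.out_edges.get((source, relation), set()))


def candidate_drugs(graph, pathogen):
    """Drugs with a (hypothetical) 'targets' edge to any protein the pathogen contains."""
    proteins = graph.neighbors(pathogen, "contains")
    hits = set()
    for name, node_type in graph.node_type.items():
        if node_type == "Drug" and graph.neighbors(name, "targets") & proteins:
            hits.add(name)
    return sorted(hits)


# Placeholder data only: these entities and links are illustrative, not real findings.
graph = TypedGraph()
graph.add_node("hMPV", "Pathogen")
graph.add_node("protein_X", "Protein")
graph.add_node("protein_Y", "Protein")
graph.add_node("drug_A", "Drug")   # 'Drug' node type is an assumption
graph.add_node("drug_B", "Drug")
graph.add_edge("hMPV", "contains", "protein_X")
graph.add_edge("drug_A", "targets", "protein_X")   # 'targets' relation is an assumption
graph.add_edge("drug_B", "targets", "protein_Y")

print(candidate_drugs(graph, "hMPV"))
```

In this toy graph only drug_A reaches a protein that hMPV contains, so it is the single candidate returned. A learned model such as a graph neural network would instead score candidate drug-disease links from the full graph rather than follow fixed paths.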

Similar Papers

biorxiv
Fri Jun 06 2025
SCNT: An R Package for Data Analysis and Visualization of Single-Cell and Spatial Transcriptomics
Background: The emergence of single-cell (SC) and spatial transcriptomics (ST) has revolutionized our understanding of gene expression dynamics in complex tissues. However, it also presents challenges for data analysis and visualization, particularly due to the complexity of ST data and the diversity of analysis platforms. The SCNT (Single-Cell, Single-Nucleus, and Spatial Transcriptomics Analysis...
Qing, J. • Wu, J. • Li, Y. • Wu, J.
biorxiv
Fri Jun 06 2025
OriGene: A Self-Evolving Virtual Disease Biologist Automating Therapeutic Target Discovery
Therapeutic target discovery remains a critical yet intuition-driven bottleneck in drug development, typically relying on disease biologists to laboriously integrate diverse biomedical data into testable hypotheses for experimental validation. Here, we present OriGene, a self-evolving multi-agent system that functions as a virtual disease biologist, systematically identifying original and mechanis...
Zhang, Z. • Qiu, Z. • Wu, Y. • Li, S. • ... • Zheng, S.
biorxiv
Fri Jun 06 2025
Amira: detection of AMR genes directly from long reads using gene-space de Bruijn graphs
Accurate detection of antimicrobial resistance (AMR) genes is essential for the surveillance, epidemiology and genotypic prediction of AMR. This is typically done by generating an assembly from the sequencing reads of a bacterial isolate and running AMR gene detection tools on the assembly. However, despite advances in long-read sequencing that have greatly improved the quality and completeness of...
Anderson, D. • Lima, L. • Le, T. • Judd, L. M. • ... • Iqbal, Z.
biorxiv
Fri Jun 06 2025
Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein unknome. This large knowledge shortfall is one of the final frontiers of biology. Machine-Learning (ML) approaches are enticing, with early successes demonstrating the ability to propagate functional knowledge from experimentally characterized proteins. An open question is the a...
de Crecy-Lagard, V. • Dias, R. • Sexson, N. • Friedberg, I. • ... • Swairjo, M.
biorxiv
Fri Jun 06 2025
An improved model for prediction of de novo designed proteins with diverse geometries
Nature uses structural variations on protein folds to fine-tune the geometries of proteins for diverse functions, yet deep learning-based de novo protein design methods generate highly regular, idealized protein fold geometries that fail to capture natural diversity. Here, using physics-based design methods, we generated and experimentally validated a dataset of 5,996 stable, de novo designed prot...
Orr, B. • Crilly, S. E. • Akpinaroglu, D. • Zhu, E. • ... • Kortemme, T.
biorxiv
Fri Jun 06 2025
Pangenome-aware DeepVariant
Population-scale genomics information provides valuable prior knowledge for various genomic analyses, especially variant calling. A notable example of such application is the human pangenome reference released by the Human Pangenome Reference Consortium, which has been shown to improve read mapping and structural variant genotyping. In this work, we introduce pangenome-aware DeepVariant, a variant...
Asri, M. • Chang, P.-C. • Mier, J. C. • Siren, J. • ... • Shafin, K.
biorxiv
Fri Jun 06 2025
sCIN: A Contrastive Learning Framework for Single-Cell Multi-omics Data Integration
The rapid advancement of single-cell omics technologies such as scRNA-seq and scATAC-seq has transformed our understanding of cellular heterogeneity and regulatory mechanisms. However, integrating these data types remains challenging due to distributional discrepancies and distinct feature spaces. To address this, we present a novel single-cell Contrastive INtegration framework (sCIN), that integr...
Ebrahimi, A. • Siahpirani, A. F. • Montazeri, H.
biorxiv
Fri Jun 06 2025
Global profiling of the proteome and acetylome in mice with abdominal aortic aneurysms
Objective: Abdominal Aortic Aneurysm (AAA) is a life-threatening vascular disease with a high risk of rupture. Current treatments rely on surgery, as effective drug therapies remain unavailable due to limited understanding of disease mechanisms and a lack of therapeutic targets. This study aims to identify potential targets for pharmacological intervention through global proteomic and acetylomic a...
Yang, J. • Zhang, L. • Yang, B. • Ding, T. • ... • Liu, J.
biorxiv
Thu Jun 05 2025
Machine learning driven acceleration of biopharmaceutical formulation development using Excipient Prediction Software (ExPreSo)
Formulation development of protein biopharmaceuticals has become increasingly challenging due to new modalities and higher desired drug substance concentrations. The constraint in drug substance supply and the need for many analytical methods means that only a small selection of excipients can be thoroughly tested in the lab. There are few in-silico tools developed to refine the candidate excipien...
Vidal-Henriquez, E. • Holder, T. • Lee, N. F. • Pompe, C. • Teese, M. G.
biorxiv
Thu Jun 05 2025
Learning Genetic Perturbation Effects with Variational Causal Inference
Advances in sequencing technologies have enhanced the understanding of gene regulation in cells. In particular, Perturb-seq has enabled high-resolution profiling of the transcriptomic response to genetic perturbations at the single-cell level. This understanding has implications in functional genomics and potentially for identifying therapeutic targets. Various computational models have been devel...
Liu, E. • Zhang, J. • Uhler, C.