2025 Hyper Recent •CC0 1.0 Universal

This work is dedicated to the public domain. No rights reserved.

July 2nd, 2025
Version: 1
New York Genome Center
genomics
biorxiv

Perplexity as a Metric for Isoform Diversity in the Human Transcriptome

Schertzer, M. D. • Park, S. H. • Su, J. • Sheynkman, G. M. • Knowles, D. A.

Long-read sequencing (LRS) has revealed a far greater diversity of RNA isoforms than earlier technologies, making it more pressing to determine which, and how many, isoforms per gene are biologically meaningful. To define the space of relevant isoforms from LRS, many existing analysis pipelines rely on arbitrary expression cutoffs, but a single threshold cannot accommodate the broad variability in isoform complexity across genes, cell types, and disease states captured by LRS. To address this, we propose perplexity, an interpretable measure derived from entropy that quantifies the effective number of isoforms per gene based on the full, unfiltered isoform ratio distribution. Calculating perplexity for 124 ENCODE4 PacBio LRS datasets spanning 55 human cell types, we show that it provides intuitive assessments of isoform diversity and captures uncertainty across genes with varying complexity. Perplexity can be calculated at multiple gene regulatory levels, from transcript to protein, to compare how isoform diversity is reduced across stages of gene expression. On average, genes have an ORF-level perplexity of 2.1, indicating the effective production of about two distinct protein isoforms per gene. We extended this analysis to evaluate expression variation across tissues and identified 4,593 ORFs across 3,102 genes with moderate to extreme tissue-specificity. We propose perplexity as a consistent, quantitative metric for interpreting isoform diversity across genes, cell types, and disease states. All results are compiled into a community resource to enable cross-study comparisons of novel isoforms.
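As a rough illustration of the metric described in the abstract, the sketch below computes perplexity as the exponential of the Shannon entropy of a gene's isoform ratio distribution, which is the standard definition of perplexity. The abstract does not give the authors' exact implementation, so the function names `perplexity` and `collapse_counts`, the example read counts, and the ORF labels are all hypothetical. The second helper shows one way the "transcript to protein" comparison could be set up, by summing counts of transcripts assumed to share an ORF before recomputing perplexity.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def perplexity(counts: Sequence[float]) -> float:
    """Effective number of isoforms: exp of the Shannon entropy (natural log).

    `counts` are non-negative read counts (or abundances) for the isoforms
    of one gene. Zero-count isoforms contribute nothing to the entropy.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one isoform must have a positive count")
    entropy = 0.0
    for c in counts:
        if c < 0:
            raise ValueError("counts must be non-negative")
        if c > 0:
            p = c / total  # isoform ratio within the gene
            entropy -= p * math.log(p)
    return math.exp(entropy)


def collapse_counts(counts: Sequence[float],
                    labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Sum counts of transcripts that share a label (e.g. the same ORF).

    The labels are an assumption for illustration; how transcripts map to
    ORFs in the actual study is not described in the abstract.
    """
    if len(counts) != len(labels):
        raise ValueError("counts and labels must have the same length")
    merged: Dict[Hashable, float] = defaultdict(float)
    for c, label in zip(counts, labels):
        merged[label] += c
    return dict(merged)


# Hypothetical gene with four transcripts encoding two ORFs.
transcript_counts = [40, 30, 20, 10]
orf_labels = ["ORF_A", "ORF_A", "ORF_B", "ORF_B"]

transcript_level = perplexity(transcript_counts)
orf_level = perplexity(list(collapse_counts(transcript_counts, orf_labels).values()))

print(f"transcript-level perplexity: {transcript_level:.2f}")
print(f"ORF-level perplexity: {orf_level:.2f}")
```

Under this definition, a gene with two equally expressed isoforms has a perplexity of 2, a gene with a single expressed isoform has a perplexity of 1, and a gene dominated by one isoform with a few rare ones stays close to 1 without any expression cutoff being applied. Merging transcripts into shared ORFs can only keep perplexity the same or lower it, which matches the abstract's description of diversity being reduced across stages of gene expression.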
