Hyper Recent

Genome-wide association studies (GWAS) are crucial to human genetics research, yet their stability and reproducibility are often questioned. This work describes, analyzes, and provides tools for overcoming reproducibility challenges in two highly popular components of GWAS: set-based (a) hypothesis testing and (b) effect size estimation. Specifically, we focus on how the set-based nature of (a) and (b) often fuels non-reproducible results because of rarely discussed differences in data processing pipelines. First, we describe the processing challenges in a statistical model misspecification framework. Second, we analytically calculate the differences in power and the amount of bias that can arise in (a) and (b), respectively, from small data processing choices. Third, we provide tools for quantifying and avoiding the data processing obstacles in GWAS. We validate our analytical calculations through a simulation study, and we demonstrate these challenges empirically through analysis of a whole-exome sequencing study of pancreatic cancer.
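
To make the abstract's central point concrete, the following is a minimal sketch of how one small processing choice can change a set-based test. It is not the authors' framework or tooling. It assumes a simple burden test (regressing the phenotype on the summed minor-allele count of the retained variants), a hypothetical simulation in which only the rarest variants are causal, and two illustrative minor allele frequency (MAF) cutoffs, 0.01 and 0.05. All function names, parameters, and effect sizes are invented for illustration.

```python
import numpy as np


def burden_z(genotypes, phenotype, keep):
    """t-statistic for the slope when the phenotype is regressed on the
    burden score (minor-allele count summed over the retained variants)."""
    burden = genotypes[:, keep].sum(axis=1)
    n = len(phenotype)
    if burden.std() == 0:
        # No retained variant carries any minor allele: nothing to test.
        return 0.0
    r = np.corrcoef(burden, phenotype)[0, 1]
    # Standard identity linking the correlation to the slope t-statistic.
    return float(r * np.sqrt((n - 2) / (1 - r**2)))


def simulate(seed=0, n=2000, m=30):
    """Simulate one variant set and test it under two MAF filters.

    Assumptions (hypothetical, for illustration only): variants with true
    MAF below 0.01 are causal with effect 0.8; all others are null.
    """
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.001, 0.05, size=m)
    genotypes = rng.binomial(2, maf, size=(n, m))
    beta = np.where(maf < 0.01, 0.8, 0.0)
    phenotype = genotypes @ beta + rng.normal(size=n)

    # The "processing choice": which variants enter the set, based on the
    # MAF estimated from the sample itself.
    sample_maf = genotypes.mean(axis=0) / 2
    results = {}
    for threshold in (0.01, 0.05):
        keep = sample_maf < threshold
        results[threshold] = (int(keep.sum()), burden_z(genotypes, phenotype, keep))
    return results


if __name__ == "__main__":
    for threshold, (n_variants, z) in simulate().items():
        print(f"MAF < {threshold}: {n_variants} variants kept, statistic = {z:.2f}")
```

Under these assumptions the looser filter adds null variants to the burden score, so the two pipelines can return noticeably different statistics for the same gene and the same samples. This illustrates the kind of pipeline-dependent discrepancy the abstract describes, but not its analytical results.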

Large Impact of Genetic Data Processing Steps on Stability and Reproducibility of Set-Based Analyses in Genome-Wide Association Studies

Similar Papers

biorxiv
Mon Jul 21 2025
Systematic optimization of Caenorhabditis elegans cryopreservation
Caenorhabditis elegans (C. elegans) is a non-parasitic roundworm widely utilized as a versatile model organism for studying fundamental biological processes. Despite the availability of multiple cryopreservation methods, variations in the selection of developmental stage, cryoprotectant composition, and storage conditions may sometimes cause inconsistencies and uncertainty among researchers. In th...
Agrawal, S.
•
Karharia, A.
•
Rajendra Babu, K.

biorxiv
Mon Jul 21 2025
CAKUT variants in PRPF8, DYRK2, and CEP78: implications for splicing and ciliogenesis
Introduction: Congenital anomalies of the kidney and urinary tract (CAKUT) are the leading cause of chronic kidney disease in children and young adults. Although over 50 monogenic causes have been identified, many remain unresolved. PRPF8 is a core spliceosome component, essential for pre-mRNA splicing, and further localizes to the distal mother centriole to promote ciliogenesis. Methods: We perfo...
Merz, L. M.
•
Shril, S.
•
Carrocci, T. J.
•
Rezi, C. K.
...•
Hildebrandt, F.

biorxiv
Mon Jul 21 2025
Computer prediction and genetic analysis identifies retinoic acid modulation as a driver of conserved longevity pathways in genetically-diverse Caenorhabditis nematodes
Aging is a pan-metazoan process with significant consequences for human health and society--discovery of new compounds that ameliorate the negative health impacts of aging promise to be of tremendous benefit across a number of age-based comorbidities. One method to prioritize a testable subset of the nearly infinite universe of potential compounds is to use computational prediction of their likely...
Banse, S. A.
•
Sedore, C. A.
•
Coleman-Hulbert, A.
•
Johnson, E.
...•
Phillips, P. C.

biorxiv
Mon Jul 21 2025
BICC1 Interacts with PKD1 and PKD2 to Drive Cystogenesis in ADPKD
Autosomal dominant polycystic kidney disease (ADPKD) is primarily an adult-onset disease caused by pathogenic variants in PKD1 or PKD2. Yet, disease expression is highly variable and includes very early-onset PKD presentations in utero or infancy. In animal models, the RNA-binding molecule Bicc1 has been shown to play a crucial role in the pathogenesis of PKD. To study the interaction between BICC1, P...
Tran, U.
•
Streets, A. J.
•
Smith, D.
•
Decker, E.
...•
Wessely, O.

biorxiv
Mon Jul 21 2025
What can Y-DNA analysis reveal about the surname Hay?
The family name Hay (plus associated spelling variants) is a prominent Anglo-Norman-in-origin surname that has been well-documented as a Scottish noble lineage since the 12th century CE. Their historical significance, linked to the rise of the Anglo-Norman era (1093-1286 CE) in Scotland, and the historical complexities of surname adoption post-Norman conquest of England, justifies the need for a c...
Stead, P.
•
Haddrill, P. R.
•
Macdonald, A. F.

biorxiv
Mon Jul 21 2025
Massively Parallel Polyribosome Profiling Reveals Translation Defects of Human Disease-Relevant UTR Mutations
The untranslated regions (UTRs) of mRNAs harbor regulatory elements influencing translation efficiency. Although 3.7% of disease-relevant human mutations occur in UTRs, their exact role in pathogenesis remains unclear. Through metagene analysis, we mapped pathogenic UTR mutations to regions near coding sequences, with a focus on the upstream open reading frame (uORF) initiation site. Subsequently,...
Li, W.-P.
•
Su, J.-Y.
•
Chang, Y.-C.
•
Wang, Y.-L.
...•
Lin, C.-L.

biorxiv
Mon Jul 21 2025
Genetic Modulation of Lifespan: Dynamic Effects, Sex Differences, and Body Weight Trade-offs
The dynamics of lifespan are shaped by DNA variants that exert effects at different ages. We have mapped genetic loci that modulate age-specific mortality using an actuarial approach. We started with an initial population of 6,438 pubescent siblings and ended with a survivorship of 559 mice that lived to at least 1100 days. Twenty-nine Vita loci dynamically modulate the mean lifespan of survivorsh...
Arends, D.
•
Ashbrook, D. G.
•
Roy, S.
•
Lu, L.
...•
Williams, R. W.

biorxiv
Mon Jul 21 2025
Applying gradient tree boosting to QTL mapping with Shapley additive explanations
Mapping quantitative trait loci (QTLs) is one of the major goals of quantitative genetics; however, identifying the interactions between QTLs (i.e., epistasis) remains challenging. Recently developed machine learning methods, such as deep learning and gradient boosting, are transforming the real world. These methods could advance QTL mapping methodologies because of their high capability for captu...
Ishibashi, T.
•
Onogi, A.

biorxiv
Mon Jul 21 2025
WISER: an innovative and efficient method for correcting population structure in omics-based prediction and selection
This work introduces WISER (whitening and successive least squares estimation refinement), an innovative and efficient method designed to enhance phenotype estimation by addressing population structure. WISER outperforms traditional methods such as least squares (LS) means and best linear unbiased prediction (BLUP) in phenotype estimation, offering a more accurate approach for omics-based selectio...
Jacquin, L.
•
Guerra, W.
•
Lewandowski, M.
•
Patocchi, A.
...•
Muranty, H.