The emergence of genomic language models (gLMs) has revolutionized the analysis of genomic sequences, enabling robust capture of biologically meaningful patterns from DNA for an improved understanding of human genome-wide regulatory programs, variant pathogenicity and therapeutic discovery. Given that DNA serves as the foundational blueprint within the central dogma, the ultimate test of a gLM is its ability to generalize across this entire biological cascade. However, existing evaluations lack such a holistic and consistent framework, leaving researchers uncertain about which model best translates sequence understanding into downstream biological prediction. Here we present Genomic Touchstone, a comprehensive benchmark designed to evaluate gLMs across 36 diverse tasks and 88 datasets structured along the central dogma's modalities of DNA, RNA, and protein, encompassing 5.34 billion base pairs of genomic sequence. We evaluate 34 representative models spanning transformers and convolutional neural networks, as well as emerging efficient architectures such as Hyena and Mamba. Our analysis yields four key insights. First, gLMs achieve comparable or superior performance on RNA and protein tasks relative to models pretrained directly on those molecules. Second, transformer-based models continue to lead in overall performance, yet efficient sequence models show promising task-specific capabilities and merit further exploration. Third, the scaling behavior of gLMs remains incompletely understood: while longer input sequences and more diverse pretraining data consistently improve performance, increases in model size do not always translate into better results. Fourth, pretraining strategies, including the choice of training objectives and the composition of pretraining corpora, exert substantial influence on downstream generalization across different genomic contexts. Genomic Touchstone establishes a unified evaluation framework tailored to human genomics.
By spanning multiple molecular modalities and biological tasks, it offers a valuable foundation for guiding future gLM design and understanding model generalizability in complex biological contexts.