Recent advances in single-cell multi-omics have provided unprecedented insights into gene regulation by jointly profiling transcriptomic (scRNA-seq) and chromatin accessibility (scATAC-seq) landscapes. However, the inherent heterogeneity and high dimensionality of these multimodal data present significant challenges for effective integration and downstream analysis. Foundation models have demonstrated strong representation learning capabilities for scRNA-seq or scATAC-seq data individually, but so far no such model has been developed specifically for the integrative analysis of the two modalities. Here, we introduce SCARF, a single-cell ATAC-seq and RNA-seq foundation model. SCARF is pre-trained on X-Omics, the largest curated collection of single-cell multi-omics data to date, comprising over 2.7 million cells across multiple tissues and species. The model uses a Mamba architecture to efficiently capture long-context relationships between genes and between accessible regions. It learns modality-specific features through self-supervised learning and shared features through contrastive learning. SCARF achieves state-of-the-art performance on multiple downstream tasks, including cell representation, cell matching, and cross-omics translation. Furthermore, SCARF enables few-shot cell type annotation and generalizes well to previously unseen datasets. These results highlight the power of foundation models for advancing integrative analysis of single-cell multi-omics data, with broad applications in tasks including cellular characterization, gene or genomic perturbation analysis, and regulatory network analysis.