Structural heterogeneity arising from the varying compositions and dynamic conformations of macromolecules in situ, within the cellular environment, presents a significant challenge to biological discovery by cryo-electron tomography (cryo-ET). In this paper, we present OPUS-TOMO, a unified deep learning framework for analyzing structural heterogeneity throughout the cryo-ET workflow. The method adopts a convolutional encoder-decoder architecture that maps real-space subtomograms onto a smooth, low-dimensional latent space, allowing it to capture the landscape of compositional and conformational variations of macromolecules in cryo-ET data. OPUS-TOMO also incorporates several techniques tailored to cryo-ET data, including a per-particle 3D contrast transfer function (CTF) model and an end-to-end pose correction network. Applications of OPUS-TOMO to several real cryo-ET datasets demonstrate its capacity to resolve structural heterogeneity at various stages of cryo-ET reconstruction. Advanced methodologies such as OPUS-TOMO could accelerate the tomography community's efforts to understand biological mechanisms in situ, at the molecular level and on a proteomic scale. The software is available at https://github.com/alncat/opusTOMO.