Diverse cell types within a tissue assemble into multicellular structures that shape the tissue's functions. These structural modules typically comprise specialized subunits, each performing a unique role. We present HRCHY-CytoCommunity, a graph neural network-based framework that leverages differentiable graph pooling and graph pruning to identify hierarchical multicellular structures in cell phenotype-annotated single-cell spatial maps. To make its results robust, HRCHY-CytoCommunity employs a hierarchical majority voting strategy. To extend its utility to cross-sample comparative analysis, an additional cell-type enrichment-based clustering step can be used to generate a unified set of nested multicellular structures across all tissue samples. Using single-cell-resolution spatial proteomics and transcriptomics data, we demonstrate that HRCHY-CytoCommunity outperforms the existing state-of-the-art method in discerning coarse-grained tissue compartments (TCs) and fine-grained tissue cellular neighborhoods (TCNs). To further validate the utility of multi-level nested multicellular structures, we apply HRCHY-CytoCommunity to breast cancer data with clinical information and show that large-scale TCs and small-scale TCNs enable hierarchical stratification of patients with different prognoses.