Humans can evaluate the emotional meaning of complex social scenes in real-life settings. Recent evidence from the human brain and AI models has pointed to visual processing as a core substrate for representing the emotional valence of natural images, but this conclusion may not generalize to complex social scenes. We conducted experiments using social scenes in which emotional valence is partially dissociated from visual characteristics, objects, and scene settings. Human behavior, neuroimaging, and visual AI models confirm that visual processing captures the basic emotional associations of objects and scene elements. However, when the valence of a social scene is incongruent with these basic properties, higher-level processing is required, both in human association cortex and in AI models. Our results show how and when valence processing demands advanced cognitive, neural, and computational processes that extend beyond the encoding of visual features.