Humans effortlessly distinguish speech from music, but the neural basis of this categorization remains debated. Here, we combined intracranial recordings in humans with behavioral categorization of speech and music excerpts taken from a naturalistic movie soundtrack to test whether the encoding of spectrotemporal modulation (STM) acoustic features is sufficient to categorize the two classes. We show that speech and music are characterized by primarily temporal and primarily spectral modulation patterns, respectively, and that non-primary auditory regions robustly track these acoustic features over time. Critically, cortical representations of STMs predicted perceptual judgments gathered in an independent sample. Finally, speech- and music-related STM features were tracked more strongly in left and right non-primary auditory regions, respectively. These findings support a domain-general framework for auditory categorization, in which perceptual categories emerge from efficient neural coding of low-level acoustic cues, highlighting the computational relevance of STMs in shaping auditory representations that support behavior.
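
As background for readers less familiar with STM features, the sketch below illustrates one common way a spectrotemporal modulation spectrum can be estimated from an audio excerpt: taking a two-dimensional Fourier transform of a log-mel spectrogram. This is a minimal, hypothetical example; the function `modulation_spectrum` and its parameters (e.g., `n_mels=64`, the assumed ~8-octave span of the frequency axis) are illustrative assumptions and do not reproduce the analysis pipeline used in the study.

```python
# Minimal sketch: estimating a spectrotemporal modulation (STM) spectrum
# from an audio excerpt. Assumes librosa is installed; names and parameter
# values are illustrative, not the study's pipeline.
import numpy as np
import librosa


def modulation_spectrum(y, sr, n_mels=64, hop_length=256):
    """Return 2D modulation power with its rate (Hz) and scale (cyc/oct) axes."""
    # Log-mel spectrogram: a frequency-by-time representation of the sound.
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                          hop_length=hop_length)
    log_spec = np.log(spec + 1e-10)

    # A 2D Fourier transform of the spectrogram yields the modulation spectrum:
    # temporal modulations (rates, in Hz) along the time axis and spectral
    # modulations (scales, in cycles/octave) along the frequency axis.
    mps = np.abs(np.fft.fftshift(np.fft.fft2(log_spec))) ** 2

    # Axis values: frames occur every hop_length / sr seconds; the spectral
    # axis assumes the mel bands span roughly 8 octaves (an approximation).
    rates = np.fft.fftshift(np.fft.fftfreq(log_spec.shape[1], d=hop_length / sr))
    scales = np.fft.fftshift(np.fft.fftfreq(n_mels, d=8.0 / n_mels))
    return mps, rates, scales


# Example usage: speech typically concentrates modulation energy at fast
# temporal rates (~2-10 Hz) and coarse spectral scales, whereas music shows
# relatively more energy at fine spectral scales (dense harmonic structure)
# and slower rates.
y, sr = librosa.load(librosa.example("trumpet"), duration=4.0)
power, rates, scales = modulation_spectrum(y, sr)
```

The 2D-FFT approach shown here is a common simplification of Gabor- or wavelet-based modulation filter banks; it is chosen only for brevity of illustration.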