Two frequently studied bursts of neural activity in the hippocampus are normal physiological ripples and abnormal interictal epileptiform discharges (IEDs). Although the two waveforms differ, IEDs are notoriously picked up as false positives by typical automated ripple detectors, which are prone to sharp-edge artifacts. This confound has made it difficult to study ripples and IEDs independently. We applied recent advances in computer vision to time-frequency feature representations to enable more comprehensive and objective dissociation of these phenomena. We retrospectively evaluated human intracranial recordings from 46 hippocampal depth electrode sites in 17 patients with focal epilepsy, the majority of whom had a seizure-onset zone/network involving the hippocampus. We implemented a common human ripple detection algorithm and projected broadband spectrograms of all detected ripple candidates into a low-dimensional space. We then clustered the projected candidates using k-means to infer pseudo-labels for probable ripples and probable IEDs. Independently, IEDs were manually annotated by human expert review for comparison. State-of-the-art vision transformer models were applied to individual spectrograms, treating ripple vs. IED dissociation as an image classification problem. We detected 31,847 ripple/IED candidates, of which a median of 3.9% per patient (range: 0-47.2%) were IEDs based on overlap with expert labels. Low-dimensional projection of spectrograms separated canonical IEDs from ripples better than projection of raw or ripple-filtered waveforms. Canonical ripple and IED candidates emerged at opposite poles, with a continuous landscape of intermediates in between. A binary vision transformer model trained on expert-labeled IED vs. non-IED candidate spectrograms with 5-fold cross-validation achieved a mean area under the curve (AUC) of 0.970 and a mean area under the precision-recall curve of 0.694, both significantly above chance.
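The pseudo-labeling step described above can be illustrated with a minimal sketch. It assumes each candidate's spectrogram has already been projected to a low-dimensional embedding; the projection method is not specified here, so it is left out. The function name `kmeans_two_clusters`, the choice of two clusters, the deterministic initialization, and the toy coordinates are all illustrative assumptions, not the study's implementation.

```python
import math

def kmeans_two_clusters(points, n_iter=50):
    """Lloyd's algorithm with k=2 (an assumption) on low-dimensional points.

    Returns one pseudo-label (0 or 1) per point. Initialization is
    deterministic: the first point and the point farthest from it.
    """
    points = [tuple(p) for p in points]
    first = points[0]
    far = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, far]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            0 if math.dist(p, centroids[0]) <= math.dist(p, centroids[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

# Hypothetical 2-D embedding coordinates for six candidates (toy values).
embedding = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
             (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
pseudo_labels = kmeans_two_clusters(embedding)
```

K-means itself does not say which cluster corresponds to probable ripples and which to probable IEDs; that mapping has to come from inspecting the clusters, for example by looking at representative spectrograms from each.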
To evaluate generalizability, we used leave-one-patient-out cross-validation: models trained on pseudo-labels and tested on expert-labeled data achieved near-expert performance (mean AUC 0.966 across patients, range 0.892-0.997). Transformer-derived attention maps revealed that the models were tuned to triangle-like, edge-artifact spatial features in the spectrograms. Model-derived probabilities of being an IED, computed for all candidates, showed a continuous transition between ripples and IEDs rather than binary clustering. The boundary between ripples and IEDs therefore appears best represented as a gradient rather than a binary division, because physiological ripple features overlap with the sharpened and/or high-frequency features of pathophysiological IEDs. Vision transformers nevertheless perform at virtually human-expert levels in dissociating these phenomena by leveraging the time-frequency spatial features that spectrograms of neural data provide. Such tools applied to spectrotemporal representations may augment comprehensive investigations in cognitive neurophysiology and the optimization of epileptiform signal biomarkers for closed-loop applications.
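The leave-one-patient-out design described above (train on pseudo-labels from all other patients, test on the held-out patient's expert labels) can be sketched as follows. This is an illustrative outline under stated assumptions: the record fields (`patient`, `feature`, `pseudo`, `expert`), the helper names, and the one-dimensional centroid scorer standing in for the vision transformer are all hypothetical, and the toy data are invented.

```python
from collections import defaultdict

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when the held-out patient lacks one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_centroid_scorer(features, labels):
    """Toy stand-in for a classifier: signed offset from the class midpoint.

    Assumes both pseudo-label classes appear in the training set.
    """
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    mean_pos = sum(pos) / len(pos)
    mean_neg = sum(neg) / len(neg)
    midpoint = (mean_pos + mean_neg) / 2
    sign = 1.0 if mean_pos >= mean_neg else -1.0
    return lambda x: sign * (x - midpoint)

def leave_one_patient_out(candidates, fit):
    """Train on other patients' pseudo-labels; score against held-out expert labels."""
    by_patient = defaultdict(list)
    for c in candidates:
        by_patient[c["patient"]].append(c)
    aucs = {}
    for held_out, test in by_patient.items():
        train = [c for c in candidates if c["patient"] != held_out]
        score = fit([c["feature"] for c in train], [c["pseudo"] for c in train])
        aucs[held_out] = roc_auc([c["expert"] for c in test],
                                 [score(c["feature"]) for c in test])
    return aucs

# Invented toy records: one scalar feature per candidate.
candidates = [
    {"patient": "P1", "feature": 0.1, "pseudo": 0, "expert": 0},
    {"patient": "P1", "feature": 0.9, "pseudo": 1, "expert": 1},
    {"patient": "P2", "feature": 0.2, "pseudo": 0, "expert": 0},
    {"patient": "P2", "feature": 0.8, "pseudo": 1, "expert": 1},
    {"patient": "P3", "feature": 0.3, "pseudo": 0, "expert": 0},
    {"patient": "P3", "feature": 0.4, "pseudo": 0, "expert": 0},
]
aucs = leave_one_patient_out(candidates, fit_centroid_scorer)
```

In this sketch, `roc_auc` returns `None` for a patient with no expert-labeled IEDs, since the AUC is undefined without both classes. How such patients were handled when averaging across patients is not stated in the text above.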