The implicit goal of longitudinal observations of animal groups is to identify individuals and to reliably detect their behaviors, including their vocalizations. Yet segmenting fast behaviors and extracting individual vocalizations from sound mixtures remain challenging problems. Promising approaches are multimodal systems that record behaviors with multiple cameras, microphones, and animal-borne wireless sensors. The instrumentation of these systems must be optimized for multimodal signal integration, which is an overlooked stepping stone to successful behavioral tracking. We designed a modular system (BirdPark) for simultaneously recording small animals wearing custom low-power frequency-modulated radio transmitters. Our custom software-defined radio receiver uses a multi-antenna demodulation technique that eliminates data losses due to radio signal fading and increases the signal-to-noise ratio of the received radio signals by 6.5 dB compared to the best single-antenna approaches. Digital acquisition relies on a single clock, allowing us to exploit cross-modal redundancies for dissecting rapid behaviors on time scales well below the video frame period, which we demonstrate by reconstructing the wing stroke phases of free-flying songbirds. By separating the vocalizations of up to eight vocally interacting birds, our work paves the way for dissecting complex social behaviors.