PhD Proposal: Towards Context-Aware and Efficient Audio-Visual Perception
In the rapidly advancing domain of artificial intelligence, integrating heterogeneous modalities, particularly audio and visual streams, has become essential for robust, context-aware understanding. Our work advances multimodal foundation models through computationally efficient and semantically aligned architectures capable of handling the complexity of real-world audio-visual tasks.
We introduce novel task formulations that require joint localization, interpretation, and synthesis of multimodal information under diverse conditions. Our models tightly couple fine-grained audio-visual grounding with adaptive reasoning mechanisms, improving temporal alignment, semantic consistency, and robustness to modality-specific noise. To address the lack of standardized evaluation, we design comprehensive benchmarks with tailored protocols and metrics, enabling rigorous and fair comparisons.
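To make the flavor of such a grounding mechanism concrete, the sketch below shows one plausible way to couple the two streams: audio frames attend over video frames through cross-modal attention to produce temporally aligned embeddings. The module, names, and feature dimensions are illustrative assumptions for exposition, not the architecture proposed here.

```python
# Illustrative sketch only: cross-modal attention for audio-visual temporal grounding.
# Shapes, names, and hyperparameters are hypothetical, standing in for pretrained encoders.
import torch
import torch.nn as nn

class AudioVisualGrounding(nn.Module):
    """Audio queries attend over video frames to produce temporally aligned features."""
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim); video: (batch, T_video, dim)
        grounded, _ = self.cross_attn(query=audio, key=video, value=video)
        return self.norm(audio + grounded)  # residual connection preserves the audio stream

# Toy usage with random tensors in place of real audio/video encoder outputs.
audio_feats = torch.randn(2, 50, 256)   # e.g., 50 audio frames per clip
video_feats = torch.randn(2, 16, 256)   # e.g., 16 video frames per clip
aligned = AudioVisualGrounding()(audio_feats, video_feats)
print(aligned.shape)  # torch.Size([2, 50, 256])
```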
This proposal also addresses efficiency by developing multimodal models that combine parameter-efficient adaptation with computation-aware inference. Our approaches integrate lightweight adaptation techniques with policies that selectively process modalities, enabling models to retain high accuracy while significantly reducing training and deployment costs.
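As a rough sketch of how parameter-efficient adaptation and selective modality processing might fit together, the example below pairs a LoRA-style low-rank adapter over frozen weights with a lightweight gate that can skip the audio branch entirely at inference. All class names, thresholds, and dimensions here are hypothetical assumptions for illustration, not the proposed method.

```python
# Illustrative sketch only: LoRA-style adapters plus a modality-selection gate.
# Names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection augmented with a trainable low-rank update."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)   # base weights stay frozen
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(dim, rank, bias=False)   # low-rank down-projection
        self.up = nn.Linear(rank, dim, bias=False)     # low-rank up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

class GatedFusion(nn.Module):
    """Fuses audio and video, processing audio only when a gate deems it informative."""
    def __init__(self, dim: int = 256, threshold: float = 0.5):
        super().__init__()
        self.audio_adapter = LoRALinear(dim)
        self.video_adapter = LoRALinear(dim)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.threshold = threshold

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        v = self.video_adapter(video).mean(dim=1)      # pooled video embedding
        use_audio = self.gate(v) > self.threshold      # per-sample modality decision
        if not bool(use_audio.any()):
            return v                                   # audio branch skipped, saving compute
        a = self.audio_adapter(audio).mean(dim=1)
        return torch.where(use_audio, (v + a) / 2, v)  # fuse only where the gate fires

fused = GatedFusion()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 256])
```

Only the low-rank adapters and the gate are trainable in this sketch, which is what keeps adaptation cheap; the gate is what allows inference cost to scale down when a modality contributes little to the prediction.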