Sanjoy Chowdhury’s Vision for Smarter, Multimodal AI

The third-year Ph.D. student advances context-aware, efficient multimodal audio-visual learning for fine-grained understanding and reasoning.

Imagine an artificial intelligence system that can not only watch and listen at the same time, but also deeply understand and generate rich, coherent responses by fusing information from both sight and sound. That vision drives the research of University of Maryland Department of Computer Science Ph.D. student Sanjoy Chowdhury, whose work advances audio-visual representation learning, cross-modal generative modeling and the development of next-generation audio-visual large language models.

Beyond his academic work at Maryland, he has collaborated with industry research labs including Meta Reality Labs, Google DeepMind, Apple Research and Adobe Research to explore how machines perceive, understand and generate across modalities.

From Industry to Academia

Chowdhury joined UMD in 2022 after three years in industry, where he worked as a research engineer. While his role involved technically challenging and impactful problems, he sought greater freedom to pursue independent research directions.

“The desire for greater research freedom was one of the primary reasons I decided to pursue a Ph.D.,” Chowdhury said. “I wanted to work on problems that I could explore deeply, in my own way, and not be limited by short-term product goals.”

He also wanted the flexibility to chart his own project directions and collaborate with top researchers in both academia and industry. 

“In industry, you often work within a fixed roadmap,” he said. “A Ph.D. gives you the chance to ask your own questions, experiment with ideas, and publish work that pushes the field forward.”

He earned his bachelor’s and master’s degrees in India, where a strong interest in mathematics and statistics led him toward data mining and machine learning. That interest deepened through early research experiences in computer vision, where he worked on projects such as video super-resolution, video quality assessment and computational photography. In 2019 and 2020, while working on core computer vision problems, he became interested in multimodal AI, which integrates vision with other modalities such as audio to enable more comprehensive understanding.

UMD and Computer Vision

The strength of UMD's computer science program, especially in computer vision, made it a natural fit. Chowdhury joined the lab of Distinguished University Professor Dinesh Manocha, whose group has a culture of open collaboration and frequent partnerships with researchers worldwide.

Working in Manocha’s lab has provided him with opportunities to collaborate across universities and industry labs. These partnerships not only expand the scope of projects but also provide access to computing resources often unavailable in academic settings.

“The culture here allows us to choose our research direction and work with the people who can best support it,” he said. “Collaborations help us move faster and explore ideas we couldn’t pursue on our own.”

He said the ability to partner with outside organizations has been especially valuable. 

“In today’s AI research, access to large-scale computing is critical,” he said. “By working with industry collaborators, we can scale experiments that would be impossible to run with only academic resources.”

Current Focus

While continuing his doctoral work, Chowdhury is interning with Apple’s machine learning research team. There, his projects target multi-speaker audio-visual understanding, teaching AI systems to process conversations where multiple people speak and interact, using both sight and sound.

He described a scenario in which a user wants to find a specific speaker’s comments in a recorded meeting. The model must identify the correct person, track their speech both visually and audibly, and extract only the relevant segment.

“This is not just a transcription problem,” he said. “It’s about grounding what’s being said to who is saying it, at what point in the conversation, and under what context.”

This requires more than simple matching. Models must recognize people by appearance, align their lip movements with speech and understand context. Chowdhury’s work involves building benchmarks for these tasks, testing existing models and designing new ones that better handle multi-speaker scenarios.
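To make the grounding step concrete, here is a minimal, illustrative sketch, not Chowdhury’s actual system. It assumes two hypothetical precomputed inputs: face tracks from a vision model, and diarized, transcribed speech segments from an audio model. Each utterance is then attributed to the requested person by simple temporal overlap.

```python
from dataclasses import dataclass

@dataclass
class FaceTrack:
    """Interval during which a known person is visible on screen."""
    person: str
    start: float
    end: float

@dataclass
class SpeechSegment:
    """Diarized, transcribed stretch of speech from the audio track."""
    speaker_id: str
    start: float
    end: float
    text: str

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Seconds of temporal overlap between two intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def comments_by(person: str, faces: list[FaceTrack],
                speech: list[SpeechSegment], min_overlap: float = 1.0):
    """Return (start, end, text) for speech segments attributed to `person`.

    A segment is attributed if it overlaps one of the person's face tracks by
    at least `min_overlap` seconds, a crude stand-in for the audio-visual
    grounding (lip-sync, voice identity, context) a real model would perform.
    """
    results = []
    for seg in speech:
        visible = sum(overlap(seg.start, seg.end, ft.start, ft.end)
                      for ft in faces if ft.person == person)
        if visible >= min_overlap:
            results.append((seg.start, seg.end, seg.text))
    return results

# Toy meeting: Alice is on camera first, then Bob.
faces = [FaceTrack("alice", 0.0, 9.0), FaceTrack("bob", 10.0, 40.0)]
speech = [
    SpeechSegment("spk0", 2.0, 8.0, "Let's review the quarterly numbers."),
    SpeechSegment("spk1", 12.0, 18.0, "I think we should delay the launch."),
]
print(comments_by("alice", faces, speech))  # only Alice's remark is returned
```

The toy heuristic breaks down exactly where the research gets hard: when several people are visible at once, temporal overlap alone cannot say who is speaking, which is why the models Chowdhury studies must also match lip movements to audio and reason over conversational context.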

Looking Ahead

As his research progresses, he is focused on bridging the gap between academic theory and deployable technology, ensuring that models can reason accurately and operate reliably across different types of information. This includes improving how systems break down multi-step questions, maintain accuracy despite conflicting inputs and process large amounts of multimodal data efficiently.

For Chowdhury, UMD’s academic environment and his industry collaborations have provided the space and resources to pursue these questions. As multimodal AI continues to evolve, he expects the demand for models that can integrate multiple modalities to grow, along with the need for rigorous benchmarks to measure their capabilities.

“My overall goal is to equip the next generation of AI models to understand and reason over multiple signals the way humans do,” he said. “If we can get there, we’ll have systems that are not just more capable, but also more reliable for the people who use them.”

—Story by Samuel Malede Zewdu, CS Communications 

The Department welcomes comments, suggestions and corrections.  Send email to editor [-at-] cs [dot] umd [dot] edu.