PhD Proposal: The Missing Why: Building Generative AI That Understands Purpose, Audience, and Context

Talk
Ishani Mondal
Time: 12.15.2025, 10:00 to 12:00

Modern generative AI systems can draft fluent paragraphs, sketch crisp diagrams, and assemble professional-looking presentations. But can they truly support a researcher preparing for a high-stakes scientific talk—one that demands rapid shifts in intent, adaptation to diverse audiences, and precise multimodal reasoning? I argue “Not yet”. Even the most capable models fail not because their language is clumsy or their images are noisy, but because they lack an understanding of why content is being created, for whom, and how it must evolve as communicative goals change.

As a first step, I have introduced ADAPTIVE IE, a human-in-the-loop framework for flexible, intent-driven information extraction. Rather than relying on fixed schemas or predefined templates, ADAPTIVE IE lets users extract domain-relevant information on the fly. For instance, in a natural disaster scenario, a user might search for “evacuation routes” without using a formal query language. ADAPTIVE IE uses large language models (LLMs) to propose relevant questions, groups the answers by meaning, and lets users refine the clusters interactively, enabling real-time structuring of information around what the user is trying to achieve.

While ADAPTIVE IE addresses what information users seek, effective communication also hinges on who the audience is. In real-world settings, audiences are rarely monolithic. To model this variability, we introduce Group Preference Alignment (GPA), a framework that captures the preferences of diverse user personas, where a persona is an abstraction of an audience member’s background, expectations, and communicative goals. GPA aggregates potentially conflicting feedback from such personas into a unified, compromise-aware representation. By modeling preference overlap and disagreement and summarizing the resulting tradeoffs, it enables generation that is sensitive to group-level communicative intent. This audience modeling becomes especially important when generating scientific content for varied recipients; we apply it in Persona-Aware Slide Generation, where the same scientific input (a research paper) is adapted differently for novice and expert audiences.

However, linguistic fluency alone is insufficient for effective scientific communication. Many scientific ideas are inherently visual or structurally complex, requiring not just accurate language but also precise visual and spatial representations. To address this, I have introduced SciDoc2Diagrammer-MAF, which builds faithful, complete, and aesthetically pleasing diagrams from scientific text through a feedback-driven refinement loop. By detecting and correcting hallucinated or omitted elements, the system keeps visualizations tightly grounded in the source material.

Yet visual accuracy is only part of the challenge: scientific communication also depends on well-structured layouts that adapt gracefully to edits. To this end, I have extended feedback-guided refinement to structural layout editing with SMART-Editor. When users revise a document, for example by inserting a new results section or reordering content, SMART-Editor ensures that the change cascades logically, preserving spatial alignment, narrative coherence, and cross-sectional consistency.
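To make the compromise-aware aggregation behind GPA more concrete, the following is a minimal sketch, not the GPA method itself: each persona rates candidate outputs, and candidates are ranked by mean preference minus a disagreement penalty, so the selected output reflects group-level intent rather than any single persona. The `aggregate` function, the penalty weight, and the example personas are illustrative assumptions.

```python
# Minimal sketch (not the GPA implementation) of compromise-aware aggregation.
from statistics import mean, pstdev
from typing import Dict, List, Tuple

def aggregate(scores: Dict[str, Dict[str, float]], penalty: float = 0.5) -> List[Tuple[str, float]]:
    """scores[persona][candidate] -> rating in [0, 1].
    Rank candidates by mean rating minus a disagreement (std-dev) penalty."""
    candidates = {c for ratings in scores.values() for c in ratings}
    ranked = []
    for cand in candidates:
        ratings = [scores[p][cand] for p in scores if cand in scores[p]]
        ranked.append((cand, round(mean(ratings) - penalty * pstdev(ratings), 3)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Example: the novice persona prefers the intuitive slide, the expert the formal one;
# penalizing disagreement picks the candidate the group can best live with.
feedback = {
    "novice":   {"intuitive_slide": 0.9, "formal_slide": 0.4},
    "expert":   {"intuitive_slide": 0.5, "formal_slide": 0.9},
    "reviewer": {"intuitive_slide": 0.7, "formal_slide": 0.8},
}
print(aggregate(feedback))  # [('intuitive_slide', 0.618), ('formal_slide', 0.592)]
```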
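Cascading layout edits of the kind SMART-Editor handles can likewise be illustrated with a small sketch under simplifying assumptions (a single-column layout whose blocks are sorted by vertical position); the `Block` dataclass and `insert_block` helper here are hypothetical stand-ins, not the system's API.

```python
# Minimal sketch (not the SMART-Editor implementation) of cascading a layout edit:
# inserting a block pushes every later block down by the same amount, so
# vertical spacing and alignment stay consistent after the edit.
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    name: str
    y: float       # top position of the block on the page
    height: float

def insert_block(blocks: List[Block], new: Block, after: str, gap: float = 12.0) -> List[Block]:
    """Insert `new` after the block named `after`; shift all later blocks down."""
    out: List[Block] = []
    shift = new.height + gap
    inserted = False
    for b in blocks:  # assumes blocks are sorted by y
        if inserted:
            b = Block(b.name, b.y + shift, b.height)  # cascade the edit downstream
        out.append(b)
        if not inserted and b.name == after:
            out.append(Block(new.name, b.y + b.height + gap, new.height))
            inserted = True
    return out

# Example: inserting a new results section after "method" pushes "conclusion" down.
layout = [Block("title", 0, 40), Block("method", 52, 120), Block("conclusion", 184, 60)]
print(insert_block(layout, Block("results", 0, 90), after="method"))
```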
Across these systems, I show that successful scientific communication requires the joint modeling of intent, audience, and multimodal structure—and that failing to model any one of these dimensions leads current generative systems to produce outputs that may look polished but communicate poorly.
However, building better generative systems alone is not enough. We must also systematically measure whether these systems genuinely help people learn, understand, and create. In the proposed work, I extend my dissertation toward evaluating and optimizing the real-world usefulness of multimodal generative AI: a) Goal-Driven Utility Modeling: I will develop a framework that evaluates multimodal content generation systems not on surface aesthetics, but on their ability to improve learning and decision-making; b) Multimodal Personalization at Scale: I will study whether tailoring text, diagrams, layouts, and narrative structure to user personas improves both creators’ workflows and audiences’ comprehension; and c) Unified SuperPersonalization Benchmark: I will build a benchmark that lets models learn stable, transferable representations of user preferences, enabling personalized generation that generalizes across tasks, modalities, and domains.
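As one plausible instantiation of goal-driven utility modeling (an illustrative assumption, not a committed design of the proposed framework), a generated artifact could be scored by the normalized learning gain it produces on pre/post comprehension quizzes:

```python
# Illustrative sketch of a goal-driven utility metric: score generated content by
# how much it improves audience comprehension, not by surface aesthetics.
from typing import List

def learning_gain(pre: List[float], post: List[float]) -> float:
    """Mean normalized gain, (post - pre) / (1 - pre), over participants."""
    gains = [(q - p) / (1.0 - p) if p < 1.0 else 0.0 for p, q in zip(pre, post)]
    return sum(gains) / len(gains)

# Example: quiz accuracy before and after studying slides generated from a paper.
print(round(learning_gain(pre=[0.3, 0.5, 0.4], post=[0.7, 0.8, 0.6]), 2))  # ~0.5
```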