CS Ph.D. Student Explores Controllable Generative AI Through Storytelling

Taewon Kang's research examines how generative models can produce coherent, structured, and controllable narratives and content across scenes, characters, and diverse media formats.

A short video clip may capture a moment, but building a full narrative with consistent characters and dialogue remains a challenge for artificial intelligence systems. At the University of Maryland, one doctoral student is working to extend those capabilities toward story-driven content that stays coherent across scenes.

Taewon Kang, a Ph.D. student in the Department of Computer Science, studies generative artificial intelligence with a focus on controllable and multimodal generation. Advised by Distinguished University Professor Ming C. Lin, Kang explores how diffusion-based models integrate visual scenes, character behavior and language to produce consistent outputs across video, documents and other media.

Kang said his work centers on generative models, including video and image diffusion systems. While recent advances can produce high-quality visuals from structured prompts, these systems often do not meaningfully incorporate character-driven dialogue or speech.

“My research focuses on generative AI for storytelling, where dialogue is part of a broader narrative system,” Kang said. “We aim to create long-form content that links scenes, characters and actions.”

Although text-to-video systems have improved in recent years, they typically generate clips that last only a few seconds. These outputs can lack continuity, with inconsistencies in character appearance or scene progression, limiting their use in longer narratives.

Kang presented his work during a recent GAMMA Visit Day, where prospective Ph.D. students visited the department and engaged with ongoing research. He demonstrated a system that generates multi-scene video narratives from structured prompts. Each scene is guided by paired inputs describing the setting and character behavior, while his Action2Dialogue framework produces character-aware dialogue and speech aligned with the visual context.
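
As a rough illustration of how such paired inputs might be organized, the sketch below couples a setting description with a character behavior for each scene. The field names and example prompts are assumptions made for illustration, not details of the published system.

    # Illustrative sketch only: the data layout and names are assumptions,
    # not taken from Kang's system. Each scene pairs a setting description
    # with the character behavior that drives both video and dialogue.
    from dataclasses import dataclass

    @dataclass
    class ScenePrompt:
        setting: str   # describes the visual scene
        behavior: str  # describes what the character does

    storyboard = [
        ScenePrompt("a rainy harbor at dusk",
                    "Mira, the lighthouse keeper, locks the gate"),
        ScenePrompt("the lighthouse kitchen, later that night",
                    "Mira reads an old letter by candlelight"),
    ]

    for scene in storyboard:
        # Each pair would feed a per-scene text-to-video model, while the
        # behavior field also drives character-aware dialogue generation.
        print(f"Scene: {scene.setting} | Action: {scene.behavior}")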

Maintaining continuity across independently generated scenes remains a central challenge in this area of research. When prior context is not preserved, details such as character identity or actions may shift from one segment to the next.

“The challenge is making sure each scene is connected,” Kang said. “If a model changes a character’s appearance or behavior, the story becomes difficult to follow.”

To address this, Kang developed a unified pipeline that connects visual generation, dialogue and speech within a single system. A vision-language model extracts semantic information from images and converts it into captions, which are combined with structured prompts to guide a language model in producing character-specific dialogue. The resulting dialogue then serves as the basis for speech generation, helping maintain alignment with the narrative.
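
The following sketch outlines those stages in simplified form. The functions are placeholders standing in for a vision-language model, a language model and a speech synthesizer; none of the names come from Kang's implementation.

    # Hypothetical sketch of the pipeline stages described above, with
    # placeholder functions in place of the actual models.

    def caption_frame(frame: str) -> str:
        # Stand-in for a vision-language model that describes a keyframe.
        return f"A caption describing: {frame}"

    def write_dialogue(caption: str, scene_prompt: str) -> str:
        # Stand-in for a language model prompted with the caption plus the
        # structured scene and character prompt.
        return f"[Mira] A line grounded in '{caption}' and '{scene_prompt}'"

    def synthesize_speech(dialogue: str) -> bytes:
        # Stand-in for a text-to-speech stage keyed to the character.
        return dialogue.encode("utf-8")

    frame = "keyframe from the generated harbor scene"
    scene_prompt = "rainy harbor at dusk; Mira locks the gate"

    caption = caption_frame(frame)                    # visual semantics -> text
    dialogue = write_dialogue(caption, scene_prompt)  # character-specific lines
    audio = synthesize_speech(dialogue)               # speech aligned with scene
    print(dialogue, f"({len(audio)} bytes of placeholder audio)")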

A key component of this approach is a “recursive narrative bank,” which stores prior dialogue and contextual information and feeds it back into the model as the story progresses. Inspired by Script Theory in cognitive psychology, this method supports continuity across scenes by preserving narrative context over time.
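
One way to picture such a bank is as a rolling buffer of earlier scene summaries and dialogue that is prepended to the prompt for the next scene. The sketch below is a hypothetical illustration of that idea, not the actual component.

    # Hypothetical sketch of a "recursive narrative bank": prior dialogue and
    # scene context accumulate and are fed back in as the story progresses.
    from collections import deque

    class NarrativeBank:
        def __init__(self, max_entries: int = 8):
            # Keep only the most recent entries; the cap is an assumption.
            self.entries = deque(maxlen=max_entries)

        def add(self, scene_summary: str, dialogue: str) -> None:
            self.entries.append(f"{scene_summary}: {dialogue}")

        def as_context(self) -> str:
            # Returned text would be combined with the next scene's prompt.
            return "\n".join(self.entries)

    bank = NarrativeBank()
    bank.add("Scene 1, harbor", "[Mira] I should have left years ago.")
    next_prompt = bank.as_context() + "\nScene 2, kitchen: Mira reads a letter."
    print(next_prompt)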

Kang has extended these ideas beyond video storytelling. During an internship at Adobe Research, he applied similar principles to structured, multi-layered documents, exploring how traditionally static formats such as PDFs could become adaptive and content-aware.

He developed a framework for text-conditioned background generation in multi-page documents, enabling controlled visual enhancements while preserving layout integrity. He later extended this work with trajectory-level control over the diffusion process, allowing foreground content to remain intact while maintaining stylistic consistency across pages.
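
A generic way to keep foreground content fixed while restyling a page is to composite a generated background behind the original content using a mask. The sketch below illustrates only that compositing idea and is an assumption for illustration, not the framework developed at Adobe.

    # Generic illustration: a foreground mask keeps page content (text,
    # figures) intact while a generated background fills the rest. This is
    # an assumption about the general idea, not the actual framework.
    import numpy as np

    def composite(foreground: np.ndarray, background: np.ndarray,
                  mask: np.ndarray) -> np.ndarray:
        # mask is 1.0 where original page content must be preserved.
        return mask[..., None] * foreground + (1.0 - mask[..., None]) * background

    page = np.ones((64, 64, 3))                 # stand-in for a rendered page
    new_background = np.random.rand(64, 64, 3)  # stand-in for a generated style
    text_mask = np.zeros((64, 64))
    text_mask[20:40, 10:50] = 1.0               # region holding page content
    styled_page = composite(page, new_background, text_mask)
    print(styled_page.shape)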

Beyond applied systems, Kang has also examined fundamental challenges in generative AI. He introduced NEGATE, a formulation that treats linguistic negation as a constraint on semantic guidance in diffusion models, enabling more precise outputs under complex linguistic conditions.
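
A loose analogy for treating negation as a guidance constraint is negative prompting, where the denoiser is steered away from a concept's guidance direction. The sketch below shows only that general idea with toy arrays; it is not the NEGATE formulation.

    # Generic sketch of steering a diffusion denoiser toward a positive
    # prompt and away from a negated concept, in the spirit of negative
    # guidance. Illustration only; not the NEGATE method.
    import numpy as np

    def guided_noise(eps_uncond: np.ndarray, eps_pos: np.ndarray,
                     eps_neg: np.ndarray, w_pos: float = 7.5,
                     w_neg: float = 3.0) -> np.ndarray:
        # Add the positive direction and subtract the negated concept's.
        return (eps_uncond
                + w_pos * (eps_pos - eps_uncond)
                - w_neg * (eps_neg - eps_uncond))

    eps_u = np.zeros(4)
    eps_p = np.full(4, 0.10)    # toy prediction for the positive prompt
    eps_n = np.full(4, -0.05)   # toy prediction for the negated concept
    print(guided_noise(eps_u, eps_p, eps_n))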

In parallel, he explored controllable visual generation from minimal inputs, including a method for generating camera-controlled novel views from a single image by combining diffusion models with pretrained 3D priors. This work was accepted to the AAAI 2026 Workshop on AI for Environmental Science.

Together, these projects reflect a broader research direction focused on improving control and consistency in generative models across video, documents and visual media. As these systems become more capable of handling multiple forms of content, they are beginning to move beyond short outputs toward more structured, long-form generation.

Kang said this shift could expand the use of generative AI in video production, digital storytelling and interactive media, including film and gaming, where longer, more cohesive narratives are essential.

“Things are moving quickly,” Kang said. “Models are starting to handle different types of content together more consistently, and we expect these systems to play a larger role in how stories are created and shared.”

—Story by Samuel Malede Zewdu, CS Communications

The Department welcomes comments, suggestions and corrections. Send email to editor [-at-] cs [dot] umd [dot] edu.