PhD Defense: Context-Aware Computational Video Editing and Re-Editing

Talk
Pooja Guhan
Time: 07.01.2025, 10:00 to 12:00
Location: 

Video has emerged as the dominant medium for communication and creative expression in the digital era, fueled by advances in consumer cameras and the ubiquity of content-sharing platforms. This democratization has empowered creators from diverse backgrounds and made capturing video effortless, but editing remains a significant barrier: it demands both technical expertise and nuanced creative judgment about narrative structure, emotional tone, and audience engagement. Current artificial intelligence (AI)-driven tools automate basic editing tasks but fall short in supporting the high-level creative decisions that define compelling video, often neglecting narrative intent, production context, and viewer perception. We address this gap with adaptive, expressive, and accessible editing techniques that bridge automation and artistic intent, presenting computational models that support decision making across key stages of video editing, structured in three parts.

The first part presents two context-aware image editing approaches. The first uses reinforcement learning to automatically analyze images in a way that harmonizes with the broader design or narrative context, rather than applying uniform edits across diverse content. The second, TAME-RD, pioneers AI-based reverse designing, recovering detailed breakdowns of editing operations and their parameter strengths for easy style extraction and transfer. On the GIER dataset, TAME-RD improves various accuracy metrics by 6-10% and the RMSE score by 1.01x to 4x. We also introduce a new dataset, I-MAD. Together, these methods advance automated color grading, enabling personalized and contextually relevant workflows.

The second part tackles context-based adaptation of visual effects and camera motion to diverse narrative and stylistic goals. Our algorithm V-Trans4Style employs a transformer-based encoder-decoder with a style-conditioning module to generate visually seamless, temporally consistent transitions tailored to a target production style, significantly outperforming prior methods: on the AutoTransition dataset, it improves Recall@K and mean rank values by 10%-80% over baselines. We also introduce the AutoTransition++ dataset. Complementing this, CamMimic is a zero-shot algorithm that leverages video diffusion models to transfer camera motion patterns from a reference video to a new scene, allowing creators to emulate complex camera work without additional data or 3D information. Both approaches received strong user preference (at least 70%), underscoring their effectiveness in empowering creative video editing.

The third part focuses on refining edits using audience feedback as new context to guide iterative editing decisions, helping creators identify impactful moments and improve future content delivery. To address the challenge of reliably quantifying audience engagement, we present a machine learning-based approach that estimates viewer engagement levels during video playback, drawing on psychological theories of attention and interaction. The method has been validated through real-world experiments, including a telehealth application for mental health in which the system automatically assessed patient engagement from video sessions, and achieves a 40% improvement in evaluation metrics over state-of-the-art engagement-estimation methods.
By enabling objective, automated measurement of engagement, this approach empowers editors to make data-driven refinements, ultimately improving the effectiveness and resonance of video content.
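
As an illustration of the kind of engagement estimation described above, the sketch below scores short video clips from per-frame behavioral features with a small recurrent regressor. The feature set, sequence length, and GRU-based model are assumptions made for illustration only and do not reproduce the approach developed in the thesis.

```python
# Minimal sketch: clip-level engagement regression from per-frame features
# (e.g., gaze, head pose, facial activity). All dimensions and the GRU
# regressor are illustrative assumptions, not the thesis's actual model.
import torch
import torch.nn as nn

class EngagementRegressor(nn.Module):
    def __init__(self, feat_dim=6, hidden=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):                       # x: (batch, frames, feat_dim)
        _, h = self.encoder(x)                  # h: (1, batch, hidden)
        return torch.sigmoid(self.head(h[-1]))  # engagement score in [0, 1]

# Toy usage: 4 clips, 90 frames each, 6 hypothetical per-frame features.
model = EngagementRegressor()
frames = torch.randn(4, 90, 6)
print(model(frames).squeeze(-1))  # one engagement estimate per clip
```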