PhD Proposal: Long Context Multimodal Understanding
Learning from structured, semi-structured, and unstructured information contained in text, images, videos, documents, or a combination of such modalities is a crucial step in designing intelligent systems. Modern deep learning systems process input data to extract patterns and underlying representations for downstream applications. Most real-world data is characteristically long-range and may require contextualization beyond a fixed input aperture (context window) for effective downstream use. For instance, text documents may comprise several paragraphs, often running into multiple pages; audio and video recordings may span from several minutes to more than a few hours; and digital documents can contain thousands of text tokens and embedded images arranged in specific layouts.

With the advent of self-supervised learning, Transformer models have gained immense popularity across a wide variety of tasks spanning natural language processing, computer vision, speech processing, document intelligence, and beyond. They represent the state of the art across many modalities, from language understanding and document intelligence to image classification and protein sequence modeling. A common weakness of Transformers is the quadratic memory complexity of the self-attention mechanism, which restricts their application to domains requiring longer sequence lengths. Hence, these models and their associated methods suffer from input-length limitations when reasoning over long contexts. Existing research has proposed extensions of the standard Transformer architecture (e.g., Longformer, Big Bird, Reformer) to encode longer input sequences.
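The quadratic memory cost of self-attention can be made concrete with a minimal sketch: standard self-attention forms an n-by-n matrix of query-key dot products, so its memory footprint grows quadratically in the sequence length n. The function below (an illustrative example, not part of the proposed research) measures the size of that score matrix directly.

```python
import numpy as np

def attention_scores_memory(seq_len: int, d_model: int = 64) -> int:
    """Return the bytes consumed by the full self-attention score matrix.

    Standard self-attention computes scores = Q @ K^T, an (n x n) matrix
    for sequence length n; this matrix is the source of the quadratic
    memory complexity discussed above.
    """
    rng = np.random.default_rng(0)
    q = rng.standard_normal((seq_len, d_model)).astype(np.float32)
    k = rng.standard_normal((seq_len, d_model)).astype(np.float32)
    scores = q @ k.T  # shape (seq_len, seq_len): the quadratic term
    return scores.nbytes
```

Doubling the sequence length quadruples the score-matrix memory, which is why a context window of, say, 512 tokens cannot simply be scaled to document-length inputs without architectural changes such as the sparse or low-rank attention variants named above.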
However, such methods are not task-agnostic, trade the ability to model long-form input for reduced performance relative to standard Transformer models, do not show consistent performance gains across different tasks, and require extensive training data and computational resources to be usable in specialized domains such as legal, finance, news, and contracts, where supervised data is scarce.

Our research focuses on building predictive models for long-context (also called document-level) multimedia understanding by extending the capabilities of Transformer language models to capture both local context and long-range global information. It is broadly divided into four parts. First, we design and train supervised methods for document-level text information extraction, addressing tasks such as temporal event relation extraction, temporal dependency parsing, and natural language inference at document scale. Second, we explore multimodal hierarchical structure extraction in visually rich documents and apply visual-linguistic-spatial learning to automated document manipulations. Third, we investigate methods for building text-to-speech systems for semi-structured long-form text and for improving speech recognition systems to handle long-term dependencies, so as to better predict words with domain-specific contexts. Lastly, we develop methods to extract information from multimodal long-form videos (e.g., conference calls) for downstream time-series prediction, studying how document-level transcripts, long-form audio-visual recordings, and tabular information can be combined for financial prediction tasks.
Dr. Dinesh Manocha
Dr. Ming Lin
Dr. Sanghamitra Dutta
Dr. Rajiv Jain (Adobe Research)
Dr. Vlad Morariu (Adobe Research)