PhD Proposal: Entity-centric Understanding of Long Documents

Abhilasha Sancheti
10.11.2023 10:00 to 12:00

Entities and events are the building blocks of language that give language its richness and expressiveness, be it in everyday conversations, narratives, news articles, biographies, or legal contracts. Most of these documents (such as novels and legal contracts) are centered around entities (such as people) and contain rich information about them, the events that they participate in, and their interactions with other entities, which makes it challenging for readers to comprehend them and find the specific information they need. Understanding documents from an entity-centric perspective (i.e., who is participating in an event, and what their attributes and relationships are), as opposed to an event-centric one (i.e., what happens), can improve comprehension of and information extraction from such documents, enabling the development of practical applications that serve the information needs of readers.

Humans make sense of text by combining their linguistic knowledge with world knowledge and contextual information, which enables them to understand not only explicitly stated but also implied information. In this proposal, we focus on documents across two domains (legal contracts and narratives) and on people entities, as most of the content in these documents is centered around them. We study both explicit and implicit, and both static and evolving, information related to entities who can participate in either a single event or multiple (related or unrelated) events. We contribute by designing new tasks, collecting datasets, and proposing models that cover these aspects of information to improve comprehension of long documents.

In the first completed work, we systematically investigate the presence and accessibility of implicit script knowledge (which humans use to understand and reason about the events that an entity can participate in, given a scenario) in pre-trained large language models from a protagonist's perspective, via a proposed generation task.
Based on our findings, which show that these models have limited script knowledge, we propose a script induction framework that is shown to produce meaningful prototypical sequences of events, mitigating the observed errors, which are mostly omitted, irrelevant, repeated, or misordered events.

In the second completed work, we focus on a setting where multiple entities are involved in an event that may not have happened but is necessary or possible to happen. Taking legal contracts as a test case, we introduce tasks and a dataset to identify contracting party-specific obligations, entitlements, prohibitions, and permissions (known as deontic modalities) in lease agreements. We show that transformer-based models trained on this dataset can accurately perform the task, demonstrating that the diverse ways of expressing such modalities are learnable from our dataset. In our final completed work, we extend our previous work by introducing a task to generate a contracting party-specific extractive summary of the most important obligations, entitlements, and prohibitions in a contract. We collect a dataset of party-specific importance ordering (implicit information) among sentences belonging to these categories in a contract, and we propose a pipeline-based summarization system to handle the data annotation and long-context modeling challenges associated with collecting contract-level summary annotations and generating such summaries.

Having designed models that can generate protagonist-oriented prototypical sequences of events that happen in a scenario, and that can extract explicit and implicit static information related to entities from unstructured text into a structured form in the legal domain, we aim in our proposed work to model the fine-grained evolution (dynamics) of interpersonal relationships between entities interacting with each other over a sequence of events, in the much broader world of narratives such as novels.
Additionally, we plan to explore whether relationships between entities are identifiable because they are stated explicitly or because they are governed implicitly by “relationship” scripts, which describe the norms and expectations of people in a particular interpersonal relationship.

Examining Committee


Dr. Rachel Rudinger

Department Representative:

Dr. Abhinav Shrivastava


Dr. Hal Daumé

Dr. Balaji Vasan Srinivasan (Adobe Research)