PhD Proposal: Search Among Sensitive Content

Mahmoud Sayed
12.12.2019 14:00 to 16:00

IRB 5105

Current search engines are designed to find what we want. But with the rapid growth of data, unprocessed archival collections cannot be made available to search engines if they contain sensitive content that needs to be protected. Before release, content should be examined through a sensitivity review process, which becomes more difficult and time-consuming as collections grow. Otherwise, the content provider risks disclosing sensitive content to the public. To make this process faster, search technology should be capable of providing access to relevant content while protecting sensitive content. Successful search among secrets will enable a range of applications, e.g. parental search, archival access to collections, protecting sensitive pages on the web, and E-Discovery.

In this proposal, we present an approach that leverages evaluation-driven information retrieval (IR) techniques. These techniques optimize an objective function that balances the value of finding relevant content with the imperative to protect sensitive information. This leads to the design of a new evaluation metric that balances relevance and sensitivity. We then introduce several baselines for the problem, along with a proposed approach based on a listwise learning to rank (LtR) model. The model is trained with a modified loss function to optimize for the new evaluation metric. In the experiments, one of the LETOR benchmark datasets, OHSUMED, is used, with a subset of the Medical Subject Headings (MeSH) labels serving as a surrogate for sensitive documents. Results show the efficacy of the proposed approach when evaluated using the new evaluation metric.

This work leads to two challenges to be addressed by my proposed future work.

First, our experiments were done on OHSUMED, which contains medical documents, and we used metadata from that collection to treat some categories as if they represented sensitive content. This motivates us to develop a new test collection that has realistic sensitive content, e.g. personal information or private conversations. The target test collection should have four components: 1) a set of documents, 2) search topics that represent information needs, 3) relevance judgments, and 4) sensitivity annotations. We propose to work on corporate email datasets, e.g. Avocado. This test collection will help us understand how sensitive content is represented, so that we can build a learning model to classify emails containing sensitive information. The resulting model will be integrated with an LtR model to rank documents based on relevance and sensitivity.

Second, our fully automatic approaches may be risky because they may still make mistakes by placing sensitive content in the search result list. However, since people are far better than machines at drawing inferences from running text, we propose an active learning strategy in which an archivist intervenes to manually review the content about which the sensitivity classifier is most uncertain. Assuming the archivist is perfect in deciding the relevance and sensitivity of a document, a relevant and non-sensitive document is sent as an additional result to the searcher, with the aim of improving their future queries. The archivist's feedback will enable the sensitivity classifier to adapt to the sensitivities within the collection. To measure the quality of the proposed system, we propose a new evaluation metric that measures the gain the searcher obtains from receiving at least one relevant document, while minimizing the archivist's review effort.

Examining Committee:

Chair: Dr. Douglas W. Oard
Dept rep: Dr. Ashok Agrawala
Members: Dr. Marine Carpuat