PhD Proposal: Effective and Efficient Search Models across Languages

Suraj Nair
11.18.2021 10:00 to 12:00

IRB 4105

Recent developments in transformer-based architectures have improved the quality of text ranking systems beyond what traditional information retrieval (IR) systems could achieve. While these techniques were built to support search in English, less emphasis has been placed on cross-language information retrieval (CLIR), where the query and document languages differ. In this proposal, we aim to extend these developments to build effective and efficient CLIR systems.

First, we focus on improving effectiveness. A well-known traditional approach to CLIR, Probabilistic Structured Queries (PSQ), uses translation probabilities to match query and document terms in different languages. These translation probabilities are typically estimated from a sentence-aligned corpus on a word-to-word basis, without taking context into account. Neural methods, by contrast, can learn to translate using the context around each word, which serves as a basis for estimating context-dependent translation probabilities. To this end, we explore different ways of combining context-dependent translation probabilities with context-independent translation probabilities to improve the effectiveness of cross-language ranked retrieval.

Retrieve-and-rerank pipelines have found widespread adoption in monolingual retrieval applications. Typically, the first stage of the pipeline uses a traditional retrieval method such as BM25 to find documents relevant to a query. This is followed by a reranking stage, in which a neural model reorders the documents found by the first-stage retrieval model. In this work, we explore building a similar pipeline for ad hoc document ranking in CLIR. Using large neural models to rerank many documents can be time-consuming, so the reranking depth usually needs to be tuned to balance the tradeoff between effectiveness and efficiency.
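
As a rough illustration of the retrieve-and-rerank idea and the role of reranking depth, here is a minimal monolingual sketch in Python. It is not the system proposed in this work: the BM25 scorer is a simplified textbook variant, and `toy_reranker`, `retrieve_and_rerank`, the `depth` parameter, and the example documents are all hypothetical names and data introduced for illustration. In a real pipeline the reranker would be a neural model, and in CLIR the query and documents would be in different languages.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a basic BM25 formula."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores


def retrieve_and_rerank(query_terms, docs, rerank_fn, depth):
    """Rank all documents with BM25, then reorder only the top `depth` with rerank_fn."""
    first_stage = bm25_scores(query_terms, docs)
    ranked = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)
    head, tail = ranked[:depth], ranked[depth:]
    # Only `depth` documents are passed to the (expensive) reranker.
    head = sorted(head, key=lambda i: rerank_fn(query_terms, docs[i]), reverse=True)
    return head + tail


def toy_reranker(query_terms, doc):
    """Hypothetical stand-in for a neural reranker: rewards the token 'models'."""
    return doc.count("models")


docs = [
    ["neural", "ranking", "models"],
    ["neural", "neural", "search"],
    ["cooking", "recipes"],
]
query = ["neural", "search"]

for depth in (0, 1, 2):
    print(depth, retrieve_and_rerank(query, docs, toy_reranker, depth))
```

The reranker is called once per document in the top `depth` results, so its cost grows linearly with the depth, while documents below the cutoff keep their first-stage order. That is the effectiveness and efficiency tradeoff described above.
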
To develop an efficient search system with low query latency, our proposed work focuses on representation-based models that match query and document terms in a shared vector space. The key challenge lies in learning meaningful task-specific representations of queries and documents without sacrificing overall effectiveness, and in doing so across multiple languages in the context of CLIR.

Examining Committee:

Chair: Dr. Douglas W. Oard
Department Representative: Dr. John Dickerson
Members: Dr. Marine Carpuat