PhD Proposal: Interactive Machine Learning for Low-Resource Languages

Mozhi Zhang
01.21.2021 10:00 to 12:00


Modern machine learning methods in natural language processing (NLP) can learn accurate classifiers from large labeled corpora. Unfortunately, many languages are left behind due to the scarcity of textual data. In this proposal, we bring a human in the loop to quickly improve NLP models for low-resource languages.

To bridge the resource gap across languages, one popular strategy is to use cross-lingual word embeddings (CLWE) as features. CLWE allow models to be trained in a high-resource language (such as English) and to predict in a low-resource language. Recent CLWE methods follow a projection-based pipeline: independently train monolingual word embeddings, then align them with orthogonal projections. We identify two problems with this pipeline. First, orthogonal mappings require the monolingual embedding spaces to be approximately isomorphic, which does not always hold. To improve the suitability of orthogonal alignment, we introduce a preprocessing technique with theoretical justification. Second, recent work on CLWE focuses almost exclusively on bilingual lexicon induction (BLI), but BLI scores do not always correlate with downstream task accuracy. We explain this mismatch and introduce a post-processing technique that helps downstream models.

Inspired by these findings, we build CLIME, a system that asks a bilingual speaker to tailor pre-trained CLWE for a downstream task. CLIME works in three steps. First, we extract a list of keywords for the given task. Next, a user marks word similarities between each keyword and its nearest neighbors in the embedding space. Finally, we update the embeddings to reflect the feedback. Empirically, users can significantly improve a cross-lingual classifier in thirty to sixty minutes.

Another limitation of previous work is the common assumption that annotations are only available in English. While it is hard to find annotators for low-resource languages, annotators may be available in an orthographically related language.
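As background, the orthogonal alignment step in the projection-based pipeline above is commonly solved with orthogonal Procrustes. The following is a minimal sketch on synthetic data, not the proposal's implementation: it assumes the rows of X and Y are embeddings of seed translation pairs and recovers the orthogonal map between them via an SVD.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Solve min_W ||X W - Y||_F over orthogonal W via SVD.

    X, Y: (n, d) arrays whose rows are embeddings of seed translation pairs.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy data: Y is X rotated by a known orthogonal matrix Q.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ Q

W = orthogonal_procrustes(X, Y)
print(np.allclose(X @ W, Y))  # True: the rotation is recovered
```

When the two monolingual spaces are not (approximately) related by such a rotation, i.e. not approximately isomorphic, no orthogonal W fits well, which is the first problem identified above.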
Transferring between related language pairs is easier because they often share scripts, cognates, and morphological patterns. To exploit these subword similarities, we use a character-based model with shared character representations. We use multi-task learning to learn from different types of feedback, including word translations, parallel sentences, and labeled documents.

For proposed work, we investigate two directions. First, we introduce a cross-lingual reply suggestion dataset. Previous work in cross-lingual NLP focuses on classification and sequence labeling tasks. Reply suggestion is more challenging and requires open-ended text output. Second, we improve information retrieval in low-resource languages with human feedback.

Examining Committee:

Chair: Dr. Jordan Boyd-Graber
Dept rep: Dr. Huaishu Peng
Members: Dr. Philip Resnik