PhD Defense: Connecting Documents, Words, and Languages using Topic Models

Weiwei Yang
05.09.2019 11:00 to 13:30
IRB 4105

Topic models discover latent topics in documents and summarize documents at a high level. To improve topic models' topic quality and extrinsic performance, external knowledge is often incorporated as part of the generative story. One form of external knowledge is weighted text links that indicate similarity or relatedness between the connected objects. This dissertation 1) uncovers the underlying structures in observed weighted links and integrates them into topic modeling, and 2) learns latent weighted links from other external knowledge to improve topic modeling.

We consider incorporating links at three levels: documents, words, and topics. We first look at binary document links, e.g., citation links between papers. Document links indicate topic similarity between the connected documents. Past methods model document links in isolation, ignoring the overall link density. We instead uncover latent document blocks in which documents are densely connected and tend to discuss similar topics. We introduce LBH-RTM, a relational topic model with lexical weights, block priors, and hinge loss. It extracts informative topic priors from the document blocks for documents' topic generation, and it predicts unseen document links using block and lexical features, in addition to topical features, trained with hinge loss. It outperforms past methods in link prediction and yields more coherent topics.

Like documents, words are also linked, but usually with real-valued weights. Word links are known as word associations and indicate the semantic relatedness of the connected words. They provide information about word relationships beyond the co-occurrence patterns in the training corpora. To extract and incorporate the knowledge in word associations, we introduce methods to find the most salient word pairs. The methods organize the words in a tree structure, which serves as a prior, i.e., a tree prior, for tree LDA.
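As a rough illustration of how salient word pairs can yield a tree prior (a simplified sketch under assumed names and a greedy pairing rule, not the dissertation's actual method), one could keep each word's strongest association and group mutually most-salient pairs under a shared internal node, attaching everything to a common root:

```python
def build_tree_prior(associations):
    """Sketch: build a two-level tree from weighted word associations.

    associations: dict mapping (word_a, word_b) -> real-valued weight.
    Returns a dict with a single "root" whose children are either
    salient word pairs (one internal node each) or singleton words.
    This greedy pairing is an illustrative assumption, not the
    dissertation's exact salience criterion.
    """
    # For each word, remember its single strongest association.
    best = {}  # word -> (partner, weight)
    for (a, b), w in associations.items():
        for x, y in ((a, b), (b, a)):
            if x not in best or w > best[x][1]:
                best[x] = (y, w)

    tree = {"root": []}
    grouped = set()
    for word, (partner, _) in sorted(best.items()):
        if word in grouped:
            continue
        if best.get(partner, (None,))[0] == word:
            # Mutually most-salient pair: share one internal node.
            tree["root"].append([word, partner])
            grouped.update({word, partner})
        else:
            # No reciprocal partner: attach the word directly to the root.
            tree["root"].append([word])
            grouped.add(word)
    return tree

pairs = {("cat", "dog"): 0.9, ("cat", "pet"): 0.4, ("dog", "bone"): 0.3}
print(build_tree_prior(pairs))
```

In a tree prior of this shape, words grouped under the same internal node share probability mass, so tree LDA is encouraged to place associated words in the same topic.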
The methods are straightforward but effective, yielding more coherent topics than vanilla LDA and slightly improving extrinsic classification performance.

Weighted topic links are different: topics are latent, so ground-truth topic links are difficult to obtain, but learned weighted topic links can bridge topics across languages. We introduce a multilingual topic model (MTM) in which each language has its own topic distributions over only that language's words, and which learns weighted topic links based on word translations and the words' topic distributions. It does not force the topic spaces of different languages to align, making it more robust than previous MTMs that do. It outperforms past MTMs in classification while still producing coherent topics on smaller, less comparable corpora.
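To illustrate the general idea of a weighted link between topics in two languages (a minimal sketch under assumed names and scoring, not the MTM's actual learning procedure), one could project a topic's word distribution through a translation dictionary and compare it to a topic in the other language:

```python
import math

def topic_link_weight(topic_a, topic_b, translations):
    """Sketch: score a cross-lingual topic pair in [0, 1].

    topic_a, topic_b: dicts mapping word -> probability, each over the
    vocabulary of one language. translations: dict mapping words of
    language A to words of language B. Cosine similarity here is an
    illustrative assumption, not the dissertation's exact weighting.
    """
    # Project topic_a into language B's vocabulary via the dictionary.
    projected = {}
    for word, p in topic_a.items():
        target = translations.get(word)
        if target is not None:
            projected[target] = projected.get(target, 0.0) + p

    # Cosine similarity between the projected and target distributions.
    dot = sum(projected.get(w, 0.0) * p for w, p in topic_b.items())
    norm_a = math.sqrt(sum(v * v for v in projected.values()))
    norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

en_topic = {"dog": 0.6, "cat": 0.4}
fr_topic = {"chien": 0.7, "chat": 0.3}
print(topic_link_weight(en_topic, fr_topic, {"dog": "chien", "cat": "chat"}))
```

Because each language keeps its own topic space, such weights can stay low for topic pairs with no good cross-lingual counterpart instead of being forced into a one-to-one alignment.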
Examining Committee:

Co-Chairs: Dr. Philip Resnik & Dr. Jordan Boyd-Graber
Dean's Representative: Dr. Douglas Oard
Members: Dr. Marine Carpuat, Dr. Max Leiserson, Dr. Mark Dredze