Diversity and Fairness in Data Summarization Algorithms

Talk
Sepideh Mahabadi
Talk Series: 
Time: 
03.05.2021 13:00 to 14:00

Searching and summarization are two of the most fundamental tasks in massive data analysis. In this talk, I will focus on these two tasks from the perspective of diversity and fairness. Search is often formalized as the (approximate) nearest neighbor problem. Despite an extensive research on this topic, its basic formulation is insufficient for many applications. In this talk, I will describe such applications and our approaches to address them. For example, we show how to incorporate diversity or fairness in the results of a search query. A prominent approach to summarize the data is to compute a small "core-set": a subset of the data that is sufficient for approximating the solution of a given task. We introduce the notion of "composable core-sets" as core-sets with the composability property: the union of multiple core-sets should form a good summary for the union of the original data sets. This composability property enables efficient solutions to a wide variety of massive data processing applications, including distributed computation (e.g. Map-Reduce model), streaming algorithms, and similarity search. We show how to produce such efficient summaries of the data while preserving the diversity in the data set. I will describe several metrics for capturing the notion of diversity, and present efficient algorithms for construction of composable core-sets with respect to those metrics.