TTIC Summer Workshop: Data Center Scheduling from Theory to Practice

Time: July 30 - August 2 (Mon-Thu)

Location: TTI-Chicago

Organizers: Leana Golubchik, Samir Khuller, Barna Saha, Cliff Stein

Program: TTI-Chicago Summer Workshop Program

Main Outcomes: Closer collaboration between systems and theory researchers, bringing algorithm designers to work more closely with systems researchers to understand the "gaps" and shortcomings of proposed solution approaches, with a focus on developing collaborations to bridge those gaps. Expected benefits include new problem domains, issues, and constraints, and potentially usable solutions.

Tentative Schedule

Sunday
Dinner and reception (information will be emailed to attendees shortly)
Monday
8:30 a.m. Meet at hotel to take the bus to TTIC
9:45 a.m. - 10:00 a.m. Opening remarks; discussion of workshop goals
10:00 a.m. - 10:45 a.m. SOAP: One Clean Analysis of all Age-based Scheduling Policies by Mor Harchol-Balter
11:00 a.m. - 11:45 a.m. On Some Stochastic Load-Balancing Problems by Anupam Gupta
12:00 p.m. - 4:00 p.m. Lunch and working groups
4:00 p.m. - 4:15 p.m. Systems Algorithmics for Next-Generation Data Center Networks by Rachit Agarwal
4:20 p.m. - 4:35 p.m. Algorithms for Right-Sizing Data Centers by Susanne Albers
4:40 p.m. - 4:55 p.m. Toward Unified, Transparent Memory Disaggregation by Mosharaf Chowdhury
5:00 p.m. - 5:15 p.m. Programming the Topology of Networks: Technology and Algorithms by Monia Ghobadi
5:30 p.m. Bus back to hotel
7:30 p.m. Dinner
Tuesday
8:45 a.m. Meet at hotel to take the bus to TTIC
10:00 a.m. - 10:45 a.m. Scheduling Jobs With Dependencies by Janardhan Kulkarni
11:00 a.m. - 11:45 a.m. A generalized blind scheduling policy by Vishal Misra
12:00 p.m. - 4:00 p.m. Lunch and working groups
4:00 p.m. - 4:15 p.m. Expanders as Datacenter Topologies by Michael Dinitz
4:20 p.m. - 4:35 p.m. Scheduling Parallel DAG Jobs Online by Benjamin Moseley
4:40 p.m. - 4:55 p.m. Non-clairvoyant job scheduling with prediction by Zoya Svitkina
5:00 p.m. - 5:15 p.m. Online Load Balancing on Related Machines by Sungjin Im
5:30 p.m. Bus back to hotel
7:30 p.m. Dinner on your own
Wednesday
8:45 a.m. Meet at hotel to take the bus to TTIC
10:00 a.m. - 10:45 a.m. Using PubSub for Scheduling in Data Centers by Qi Zhang
11:00 a.m. - 11:15 a.m. Coflow Scheduling by Manish Purohit
11:20 a.m. - 11:35 a.m. Scheduling Flows and Co-flows in Networks by Rajmohan Rajaraman
11:40 a.m. - 11:55 a.m. Algorithms for Dynamic NFV Workload by Seffi Naor
12:00 p.m. - 12:15 p.m. Sincronia: Coflow scheduling from practice to theory and back! by Shijin Rajakrishnan
12:20 p.m. - 4:00 p.m. Lunch and working groups
4:00 p.m. Bus back to hotel
5:00 p.m. Social event
Thursday
8:45 a.m. Meet at hotel to take the bus to TTIC
10:00 a.m. - 10:45 a.m. Vector Scheduling by Debmalya Panigrahi
11:00 a.m. - 11:15 a.m. Job delay analysis in data centers by Weina Wang
11:20 a.m. - 11:35 a.m. Domain Wall Memory Management by Kirk Pruhs
11:40 a.m. - 11:55 a.m. Scheduling with Network Constraints: Open Problems by Barna Saha
12:00 p.m. - 1:00 p.m. Lunch
1:00 p.m. - Time to collaborate.

Overview:

Modern data centers form the backbone on which cloud services run, and most modern applications run in one. Data centers themselves are incredibly complex: hundreds of thousands of machines connected by high-speed networks, running thousands of applications, each spanning hundreds of machines. Given the rapid pace of development, many of the central scheduling and resource-management problems have been solved hurriedly, with quick-and-dirty solutions and little evaluation of their efficacy. Algorithms that better manage this scale and complexity are critical to improving scheduling and resource-allocation policies; in turn, the efficiency of those policies drives both user happiness (e.g., response time) and running costs (e.g., energy usage), both critically important for the future. Several questions need to be addressed, such as job scheduling across multiple clusters (or multiple data centers), sharing resources among competing applications, managing communication and I/O needs, and multi-resource-aware job placement.

This workshop will bring together researchers with complementary skills from both theoretical computer science and systems, with the ultimate goal of designing scheduling and resource-allocation policies for the next generation of Data Center Resource Management Systems (DRMS). While the theory community has a huge scheduling literature, success in translating theoretical results to real systems has been limited. On one hand, the theory community needs a deeper understanding of the real issues and constraints facing system builders and designers, along with better models of modern applications. On the other hand, system builders need to take a more principled approach to scheduling, using optimal or near-optimal strategies that can significantly inform the design of the heuristics used in resource management.

The workshop will give both communities a unique opportunity to spend a week together, identify the major issues, start collaborations, and tackle challenging technical questions. Data center scheduling is a complex problem, and many scheduling questions that we consider solved in theory need to be revisited in light of that complexity. Here we point out a few example questions that we propose to address during the workshop.

  1. Understanding the performance of competing algorithms on different workloads, going beyond the traditional worst-case guarantees studied in theoretical computer science. Are there observable properties of the inputs that might allow the algorithm designer to develop significantly improved methods? While some average-case performance has been studied, the goal will be to significantly extend the reach of those methods. Considering just the network alone or just the CPU is often not the right strategy; instead, we want to consider all resources at the same time. The key difficulty is that the network is a shared and fungible resource, whereas CPU and memory are not (i.e., they are independent and non-fungible).
  2. The quality of task placement often dictates the quality of scheduling. This is especially true when placement and scheduling decisions are made in complete isolation, as is often the case: for example, EC2 allocates machines before the scheduler makes any scheduling decisions. A bad placement may force us to overload the communication network. At the same time, coordinating the communication requirements (coflows) across applications can yield significantly better utilization of resources.
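The placement point above can be made concrete with a toy sketch (not from any workshop talk; rack counts, task names, and the two placement strategies are illustrative assumptions): a placement-oblivious scheduler may split communicating task pairs across racks, forcing their traffic onto the shared inter-rack link, while a simple locality-aware greedy keeps each pair on one rack.

```python
# Toy model: jobs consist of pairs of tasks that communicate with each other.
# We place all tasks on two racks and count how many pairs end up split
# across racks (each split pair loads the shared inter-rack link).

RACKS = 2  # hypothetical two-rack cluster

def cross_rack_traffic(placement, pairs):
    """Number of communicating pairs whose tasks landed on different racks."""
    return sum(1 for a, b in pairs if placement[a] != placement[b])

def oblivious_placement(tasks):
    """Round-robin over racks, ignoring which tasks talk to each other."""
    return {t: i % RACKS for i, t in enumerate(tasks)}

def locality_aware_placement(pairs):
    """Greedy: put both tasks of a pair on the currently least-loaded rack."""
    load = [0] * RACKS
    placement = {}
    for a, b in pairs:
        r = min(range(RACKS), key=lambda r: load[r])
        placement[a] = placement[b] = r
        load[r] += 2
    return placement

pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]
tasks = [t for p in pairs for t in p]

print(cross_rack_traffic(oblivious_placement(tasks), pairs))    # 4: every pair split
print(cross_rack_traffic(locality_aware_placement(pairs), pairs))  # 0: all pairs co-located
```

Both placements use the same number of slots per rack; only the network load differs, which is the sense in which placement quality dictates scheduling quality.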

Participants


  • Rachit Agarwal (Cornell)
  • Saksham Agarwal (Cornell)
  • Susanne Albers (TUM)
  • Mosharaf Chowdhury* (Michigan)
  • Julia Chuzhoy (TTI)
  • Michael Dinitz (JHU)
  • Monia Ghobadi (Microsoft)
  • Leana Golubchik (USC)
  • Albert Greenberg (Microsoft)
  • Anupam Gupta (CMU)
  • Mor Harchol-Balter (CMU)
  • Longbo Huang (Tsinghua)
  • Sungjin Im (UC Merced)
  • Samir Khuller (UMD)
  • Janardhan Kulkarni (U Minnesota)
  • Thomas Lavastida (CMU)
  • Roei Levin (CMU)
  • Kunal Mahajan (Columbia)
  • Biswaroop Maiti (Northeastern U)
  • Sai Mali (Columbia)
  • Vishal Misra (Columbia)
  • Benjamin Moseley (CMU)
  • Seffi Naor (Technion)
  • Yasamin Nazari (JHU)
  • Debmalya Panigrahi (Duke)
  • Kirk Pruhs (Pittsburgh)
  • Manish Purohit (Google)
  • Shijin Rajakrishnan (Cornell)
  • Rajmohan Rajaraman (Northeastern U)
  • Barna Saha (UMass Amherst)
  • David Shmoys (Cornell)
  • Cliff Stein (Columbia)
  • Zoya Svitkina (Google)
  • Weina Wang (UIUC)
  • Sheng Yang (UMD)
  • Qi Zhang (Microsoft)
  • Zhao Zhang (UIUC)