TTIC Summer Workshop: Data Center Scheduling from Theory to Practice

Time: July 30 - August 2 (Mon-Thu)

Location: TTI-Chicago

Organizers: Leana Golubchik, Samir Khuller, Barna Saha, Cliff Stein

Program: TTI-Chicago Summer Workshop Program

Main Outcomes: Closer collaboration between systems and theory researchers, bringing algorithm designers to work with systems researchers to understand the “gaps” and shortcomings in proposed solution approaches, with a focus on developing collaborations that bridge those gaps. Expected benefits include new problem domains, a better understanding of issues and constraints, and potentially usable solutions.

Tentative Schedule

Sunday
6:00 p.m. - 8:00 p.m. Evening reception
Monday - Thursday (may end by 5:00 p.m. on Thursday)
9:00 a.m. - 10:00 a.m. Plenary
10:00 a.m. - 10:30 a.m. Coffee break
10:30 a.m. - 12:30 p.m. 4 short talks
12:30 p.m. - 3:30 p.m. Lunch break, working sessions (there will be lunch on Monday)
3:30 p.m. - 4:00 p.m. Tea/coffee
4:00 p.m. - 5:30 p.m. 3 short talks
6:30 p.m. Group dinner to be organized

Overview:

Modern data centers form the backbone on which all cloud services run, and most modern applications run in a data center. Data centers themselves are incredibly complex: hundreds of thousands of machines, connected by high-speed networks, run thousands of applications, each spanning hundreds of machines. Given the rapid pace of development, many of the central scheduling and resource management problems have been solved hurriedly, with quick-and-dirty solutions whose efficacy has not been carefully evaluated. Algorithms that better manage this scale and complexity are critical to improving scheduling and resource allocation policies. In turn, the efficiency of those policies drives both user happiness (e.g., response time) and running costs (e.g., energy usage), both critically important issues for the future. Several questions need to be addressed, such as job scheduling across multiple clusters (or multiple data centers), sharing resources among competing applications, management of communication and I/O needs, and multi-resource-aware job placement.

This workshop will bring together a team of researchers with complementary skills, from both theoretical computer science and systems, with the ultimate goal of designing scheduling and resource allocation policies for the next generation of Data center Resource Management Systems (DRMS). While there is a huge scheduling literature in the theory community, success in translating theoretical results to real systems has been limited. On one hand, the theory community needs a deeper understanding of the real issues and constraints facing system builders and designers, and better models of modern applications. On the other hand, system builders need to consider a more principled approach to tackling scheduling problems, using optimal or close-to-optimal strategies that can significantly impact the design of heuristics used in resource management.

The workshop will give both communities a unique opportunity to spend a week together, identify the major issues, start collaborations, and solve challenging technical questions. Data center scheduling is a complex problem, and many scheduling questions that we consider solved in theory need to be revisited in light of it. Here we point out a few example questions that we propose to address during the workshop.

  1. Understanding the performance of various competing algorithms on different workloads, going beyond the traditional worst-case guarantees studied in theoretical computer science. Are there observable properties of the inputs that might allow the algorithm designer to develop significantly improved methods? While some average-case performance has been studied, the goal will be to significantly extend the reach of those methods. Considering just the network alone or just the CPU is often not the right strategy; instead, we want to consider all resources at the same time. The key problem arises from the fact that the network is a shared and fungible resource, whereas CPU and memory are not (i.e., they are independent and non-fungible).
  2. The quality of task placement often dictates the quality of scheduling. This is especially true when placement and scheduling decisions are completely isolated, which is often the case. For example, EC2 allocates machines before the scheduler makes scheduling decisions. A bad placement may force us to overload the communication network. At the same time, coordinating the communication requirements/coflows among applications can result in significantly better utilization of resources.

Participants

  • Rachit Agarwal (Cornell)
  • Susanne Albers (TUM)
  • Sem Borst* (TU/E, Bell Labs)
  • Mosharaf Chowdhury* (Michigan)
  • Julia Chuzhoy (TTI)
  • Michael Dinitz (JHU)
  • Manya Ghobadi (Microsoft)
  • Albert Greenberg (Microsoft)
  • Anupam Gupta (CMU)
  • Mor Harchol-Balter (CMU)
  • Longbo Huang (Tsinghua)
  • Sungjin Im (UC Merced)
  • Janardhan Kulkarni (U Minnesota)
  • Vishal Misra (Columbia)
  • Benjamin Moseley (CMU)
  • Seffi Naor (Technion)
  • Debmalya Panigrahi (Duke)
  • Kirk Pruhs (Pittsburgh)
  • Manish Purohit (Google)
  • Rajmohan Rajaraman (Northeastern U)
  • David Shmoys (Cornell)
  • Evgenia Smirni* (William & Mary)
  • Ola Svensson* (EPFL)