## Near Optimal Coflow Scheduling in Networks

Mosharaf Chowdhury*, Samir Khuller$${}^{\dagger}$$, Manish Purohit$${}^{\ddagger}$$, Sheng Yang**, Jie You*

*: University of Michigan, $${}^{\dagger}$$: Northwestern University, $${}^{\ddagger}$$: Google Research, **: University of Maryland, College Park

styang@cs.umd.edu

## Introduction

### Global data center traffic by destination in 2021 From Cisco Global Cloud Index: Forecast and Methodology, 2016–2021 White Paper: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf

## Coflow Scheduling*

• Optimize data transmission for jobs within a data center.
• Machines can communicate with each other directly
• Uniform machine/capacity.

*: first modeled by Chowdhury and Stoica in Coflow: A networking abstraction for cluster applications, HotNets 2012

### Problem Description

• $$n$$ coflows, each represented as matrix $$D^j = [d^j_{io}]$$.
• $$d^j_{io}$$: demand of job $$j$$ from input port $$i$$ to output port $$o$$.
• For every coflow $$j$$:
• $$r_j$$: release time
• $$w_j$$: weight
• Minimize: $$\sum_{i=1}^n w_j\cdot C_j$$.
• $$C_j$$: completion time. A coflow is finished when all its flows are finished.
• Discrete time.
• At each time slot an input/output port can send/receive at most one unit of flow.

## Concurrent Open Shop

• $$n$$ jobs to be scheduled on $$m$$ different machines.
• Each job $$j$$ consists of $$m$$ tasks for each machine. For machine $$i$$, there is an task that takes $$d^j_i$$ time to finish.
• Different tasks of the same job can take place at the same time.
• A job is finished when all its tasks are finished.
• Minimize weighted completion time.

### Hardness for Concurrent Open Shop

• NP-hard* (UGC-hard**) to get $$(2 - \epsilon)$$ approximation for any $$\epsilon > 0$$.
• Implies same lower bound for coflow scheduling.

*: Sushant Sachdeva, Rishi Saket, Optimal inapproximability for scheduling problems via structural hardness for hypergraph vertex cover, CCC 2013

**: Nikhil Bansal, Subhash Khot, Inapproximability of hypergraph vertex cover and applications to scheduling problems, ICALP 2010

## A Brief History

### A Brief History

• Chowdhury and Stoica* first model the coflow scheduling problem.
• Chowdhury, Zhong, and Stoica$${}^{\dagger}$$ give effective heuristic to optimize coflow completion time.

*: Mosharaf Chowdhury, Ion Stoica, Coflow: A networking abstraction for cluster applications, HotNets 2012

$$\dagger$$: Mosharaf Chowdhury, Yuan Zhong, Ion Stoica, Efficient Coflow Scheduling with Varys, SIGCOMM 2014

### A Brief History, Theoretical Side

Qiu, Stein, and Zhong proved the following for coflow scheduling in SPAA 2015:

 Zero release time Arbitrary release time Randomized $$8 + \frac{16\sqrt{2}}{3}$$ $$9 + \frac{16\sqrt{2}}{3}$$ Deterministic $$\frac{64}{3}$$ $$\frac{76}{3}^{*}$$

Z. Qiu, C. Stein, and Y. Zhong, Minimizing the total weighted completion time of coflows in datacenter networks, SPAA 2015. *:The authors claimed a ratio of $$\frac{67}{3}$$, but the proof only holds when all release times are the same.

### A Brief History, Theoretical Side

In SPAA 2016, Khuller and Purohit* improve upon this via a black box reduction to Concurrent Open Shop Problem and get

• Zero release time (deterministic): 8-approx
• Arbitrary release time (deterministic): 12-approx

*: S. Khuller and M. Purohit, Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows, SPAA 2016

### A Brief History, Current Best Results

• Zero release time*, $${}^{\dagger}$$: 4-approx
• Arbitrary release time*, $${}^{\dagger}$$: 5-approx
• Combinatorial using primal-dual*
• : joint work with Saba Ahmadi, Samir Khuller, and Manish Purohit, On Scheduling Coflows, IPCO 2017.

$$\dagger$$ : an independent work: A New Improved Bound for Coflow Scheduling by Mehrnoosh Shafiee, Javad Ghaderi in SPAA 2017, give the same bound but used a different LP.

### A Brief History, Implementations

In SIGCOMM 2018, Agarwal et al. implemented Sincronia which uses a similar primal-dual algorithm without release time. Evaluation results suggest that it not only admits a practical, near-optimal design but also improves upon state-of-the-art network designs for coflows.

Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David Shmoys, Amin Vahdat, Sincronia: near-optimal network design for coflows, SIGCOMM 2018.

## Coflow Scheduling in Networks

• Optimize data transmission for jobs across data centers.
• Data centers may be directly connected, or connected via multiple routers.
• Network modeled as an directed graph with non-uniform structure.
• Sharing a link within a coflow or between coflows is allowed.
• Spliting and merging a flow is also allowed.

### Problem Description

• A weighted directed graph $$G = (V, E)$$ representing the network.
• Weight of an edge is its capacity.
• $$n$$ coflows, each represented as a matrix $$D^j = [d^j_{io}]$$.
• $$d^j_{io}$$: demand of job $$j$$ from node $$i$$ to node $$o$$.
• Either a path is given, or there is no limitation on paths.
• For every coflow $$j$$:
• $$r_j$$: release time
• $$w_j$$: weight
• Minimize: $$\sum_{i=1}^n w_j\cdot C_j$$.
• $$C_j$$: completion time. A coflow is finished when all its flows are finished.
• Discrete time.
• An edge can be shared within a coflow or between coflows, as long as the total bandwidth is within its capacity.

### Problem Description, Continued

• A schedule would decide at each time slot:
• Which flows to send.
• How shall we route each flow.
• At what rate.
• Solution can be fractional.
• Naturally required by data transmission.
• Different from the switch based coflow scheduling problem, where the main difficulty is finding an integral solution.
• The main difficulty comes from coordinating multiple coflows.

### Single Path Model and Free Path Model

• Single path model:
• Path for each flow $$f^i_j$$ given as $$p^i_j$$.
• Only need to decide the rate for each flow at each time.
• Edge bandwidths need to be respected.
• Free path model:
• Path for a flow $$f^i_j$$ not specified.
• Need to find routes for every flow.
• Data can split and merge at vertices to utilize all possible links and their capacities.
• Edge bandwidths need to be respected.

### Existing Results

• Jahanjou, Kantor, and Rajaraman* give a constant approximation ($$17.53$$) algorithm for the single path model.
• In the same paper*, the authors give a tight $$O\left(\frac{\log |E|}{\log \log |E|}\right)$$ algorithm for the case where you need to pick a single path for each demand.
• You and Mosharaf$${}^{\dagger}$$ introduce the free path model, and give a heuristic for the unweighted case without theoretical bound.

*: Hamidreza Jahanjou, Erez Kantor, Rajmohan Rajaraman, Asymptotically Optimal Approximation Algorithms for Coflow Scheduling, 2017 SPAA $$\dagger$$: Jie You, Mosharaf Chowdhury, Terra: Scalable Cross-Layer GDA Optimizations, 2019 in progress

## Our Contributions

• A tight $$2$$ approximation randomized algorithm for both models when release times and demands are polynomial sized.
• NP-hard to get $$(2-\epsilon)$$ approximation for any $$\epsilon > 0$$.
• A $$(2 + \epsilon)$$ randomized approximation algorithm when release times and demands are super-polynomial sized.
• Experiments show significant improvements on existing algorithms.

## Algorithm

### Algorithm Sketch

• Single Coflow:
• Multi commodity flow
• Can be solved optimally in polynomial time.
• Multiple Coflows:
• Time indexed LP
• Stretching to round the solution.

### Time Indexed LP

$$x_j^i(t)$$ is the fraction of flow $$i$$ in job $$j$$ that is finished at time $$t$$. $$X_j(t)$$ is the total fraction of job $$j$$ that is finished by time $$t$$.

\begin{align} \text{Minimize} \quad &\sum_jw_jC_j\\ \text{Subject to} \quad &\sum_tx_{j}^i(t) = 1 && \forall j \in [n], \forall i \in [n_j]\\ &X_j(t) \leq \sum_{\ell = 1}^t x_j^i(\ell) && \forall j \in [n], \forall i \in [n_j], \forall t \in T \\ &C_j \geq 1 + \sum_t (1 - X_j(t)) && \forall j\in [n] \\ &r_j^i \geq t \Rightarrow x_{j}^i(t) = 0 && \forall j \in [n],\forall i \in [n_j], \forall t\in T\\ &x_{j}^i(t) \geq 0 && \forall j \in [n],\forall i \in [n_j], \forall t\in T \end{align}

### Time Indexed LP, Completion Time

Start with thinking $$x_j(t)\in \{0, 1\}$$.

\begin{align*} C_j &\geq \sum_{t = 1}^{T}t\cdot x_j(t) = \sum_{t = 1}^{T}x_j(t)\sum_{\tau=1}^t1\\ &= \sum_{\tau=1}^{T}\sum_{t\geq \tau}^{T}x_j(t) = \sum_{\tau=1}^{T}\left(\sum_{t = 1}^{T}x_j(t) - \sum_{t = 1}^{\tau - 1}x_j(t) \right)\\ &= \sum_{\tau=1}^{T}(1 - X_j(\tau - 1)) = \sum_{\tau=0}^{T- 1}(1 - X_j(\tau)) \\ &= 1 + \sum_{\tau=1}^{T- 1}(1 - X_j(\tau)) \end{align*}

### Constraints for Single Path Model

$\sum_{p_j^i\ni e} x^i_j(t)\cdot \sigma^i_j \leq c(e), \forall e\in E, \forall t\in T$

• $$x^i_j(t, e)$$: the fraction of flow $$f^i_j$$ transmitted through edge $$e$$ in time slot $$t$$.
• $$x^i_j(t)$$: the total fraction of flow $$f^i_j$$ that is transmitted in time slot $$t$$.
\begin{align} \\ &\sum_{e \in \delta_{out}(s_j^i)}x^i_j(t, e) = x^i_j(t), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T\\ &\sum_{e \in \delta_{in}(t_j^i)}x^i_j(t, e) = x^i_j(t), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T\\ &\sum_{e \in \delta_{in}(v)}x^i_j(t, e) = \sum_{e \in \delta_{out}(v)}x^i_j(t, e), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T, \forall v\in V\backslash \{s_j^i, t_j^i\}\\ &\sum_{j \in [n],i \in [n_j]} x^i_j(t, e)\cdot \sigma^i_j \leq c(e), &&\forall t\in T, \forall e\in E%\\ \end{align}

## Future Directions

• Improve Efficiency
• It takes quite long to solve the linear program.
• Is it possible to avoid solving LP?
• Or limit the LP without sacrificing too much on approximation ratio?
• Shed light on switch model
• Model precedence constraints