Mosharaf Chowdhury*, Samir Khuller\({}^{\dagger}\), Manish Purohit\({}^{\ddagger}\), Sheng Yang**, Jie You*

*: University of Michigan, \({}^{\dagger}\): Northwestern University, \({}^{\ddagger}\): Google Research, **: University of Maryland, College Park

From Cisco Global Cloud Index: Forecast and Methodology, 2016–2021 White Paper: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/global-cloud-index-gci/white-paper-c11-738085.pdf

- Optimize data transmission for jobs within a data center.
- Machines can communicate with each other directly
- Uniform machine/capacity.

*: first modeled by Chowdhury and Stoica in Coflow: A networking abstraction for cluster applications, HotNets 2012

- \(n\) coflows, each represented as matrix \(D^j = [d^j_{io}]\).
- \(d^j_{io}\): demand of job \(j\) from input port \(i\) to output port \(o\).

- For every coflow \(j\):
- \(r_j\): release time
- \(w_j\): weight

- Minimize: \(\sum_{i=1}^n w_j\cdot C_j\).
- \(C_j\): completion time. A coflow is finished when all its flows are finished.

- Discrete time.
- At each time slot an input/output port can send/receive at most one unit of flow.

- \(n\) jobs to be scheduled on \(m\) different machines.
- Each job \(j\) consists of \(m\) tasks for each machine. For machine \(i\), there is an task that takes \(d^j_i\) time to finish.
- Different tasks of the same job can take place at the same time.
- A job is finished when all its tasks are finished.
- Minimize weighted completion time.

- NP-hard* (UGC-hard**) to get \((2 - \epsilon)\) approximation for any \(\epsilon > 0\).
- Implies same lower bound for coflow scheduling.

*: Sushant Sachdeva, Rishi Saket, Optimal inapproximability for scheduling problems via structural hardness for hypergraph vertex cover, CCC 2013

**: Nikhil Bansal, Subhash Khot, Inapproximability of hypergraph vertex cover and applications to scheduling problems, ICALP 2010

- Chowdhury and Stoica* first model the coflow scheduling problem.
- Chowdhury, Zhong, and Stoica\({}^{\dagger}\) give effective heuristic to optimize coflow completion time.

*: Mosharaf Chowdhury, Ion Stoica, Coflow: A networking abstraction for cluster applications, HotNets 2012

\(\dagger\): Mosharaf Chowdhury, Yuan Zhong, Ion Stoica, Efficient Coflow Scheduling with Varys, SIGCOMM 2014

Qiu, Stein, and Zhong proved the following for coflow scheduling in SPAA 2015:

Zero release time | Arbitrary release time | |

Randomized | \(8 + \frac{16\sqrt{2}}{3}\) | \(9 + \frac{16\sqrt{2}}{3}\) |

Deterministic | \(\frac{64}{3}\) | \(\frac{76}{3}^{*}\) |

Z. Qiu, C. Stein, and Y. Zhong, Minimizing the total weighted completion time of coflows in datacenter networks, SPAA 2015. *:The authors claimed a ratio of \(\frac{67}{3}\), but the proof only holds when all release times are the same.

In SPAA 2016, Khuller and Purohit* improve upon this via a black box reduction to Concurrent Open Shop Problem and get

- Zero release time (deterministic): 8-approx
- Arbitrary release time (deterministic): 12-approx

*: S. Khuller and M. Purohit, Brief Announcement: Improved Approximation Algorithms for Scheduling Co-Flows, SPAA 2016

- Zero release time*, \({}^{\dagger}\): 4-approx
- Arbitrary release time*, \({}^{\dagger}\): 5-approx
- Combinatorial using primal-dual*

- : joint work with Saba Ahmadi, Samir Khuller, and Manish Purohit, On Scheduling Coflows, IPCO 2017.

\(\dagger\) : an independent work: A New Improved Bound for Coflow Scheduling by Mehrnoosh Shafiee, Javad Ghaderi in SPAA 2017, give the same bound but used a different LP.

In SIGCOMM 2018, Agarwal et al. implemented Sincronia which uses a similar primal-dual algorithm without release time. Evaluation results suggest that it not only admits a practical, near-optimal design but also improves upon state-of-the-art network designs for coflows.

Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David Shmoys, Amin Vahdat, Sincronia: near-optimal network design for coflows, SIGCOMM 2018.

- Optimize data transmission for jobs across data centers.
- Data centers may be directly connected, or connected via multiple routers.
- Network modeled as an directed graph with non-uniform structure.
- Sharing a link within a coflow or between coflows is allowed.
- Spliting and merging a flow is also allowed.

- A weighted directed graph \(G = (V, E)\) representing the network.
- Weight of an edge is its capacity.

- \(n\) coflows, each represented as a matrix \(D^j = [d^j_{io}]\).
- \(d^j_{io}\): demand of job \(j\) from node \(i\) to node \(o\).
- Either a path is given, or there is no limitation on paths.

- For every coflow \(j\):
- \(r_j\): release time
- \(w_j\): weight

- Minimize: \(\sum_{i=1}^n w_j\cdot C_j\).
- \(C_j\): completion time. A coflow is finished when all its flows are finished.

- Discrete time.
- An edge can be shared within a coflow or between coflows, as long as the total bandwidth is within its capacity.

- A schedule would decide at each time slot:
- Which flows to send.
- How shall we route each flow.
- At what rate.

- Solution can be fractional.
- Naturally required by data transmission.
- Different from the switch based coflow scheduling problem, where the main difficulty is finding an integral solution.
- The main difficulty comes from coordinating multiple coflows.

- Single path model:
- Path for each flow \(f^i_j\) given as \(p^i_j\).
- Only need to decide the rate for each flow at each time.
- Edge bandwidths need to be respected.

- Free path model:
- Path for a flow \(f^i_j\) not specified.
- Need to find routes for every flow.
- Data can split and merge at vertices to utilize all possible links and their capacities.
- Edge bandwidths need to be respected.

- Jahanjou, Kantor, and Rajaraman* give a constant approximation (\(17.53\)) algorithm for the single path model.
- In the same paper*, the authors give a tight \(O\left(\frac{\log |E|}{\log \log |E|}\right)\) algorithm for the case where you need to pick a single path for each demand.
- You and Mosharaf\({}^{\dagger}\) introduce the free path model, and give a heuristic for the unweighted case without theoretical bound.

*: Hamidreza Jahanjou, Erez Kantor, Rajmohan Rajaraman, Asymptotically Optimal Approximation Algorithms for Coflow Scheduling, 2017 SPAA \(\dagger\): Jie You, Mosharaf Chowdhury, Terra: Scalable Cross-Layer GDA Optimizations, 2019 in progress

- A tight \(2\) approximation randomized algorithm for both models when release times and demands are polynomial sized.
- NP-hard to get \((2-\epsilon)\) approximation for any \(\epsilon > 0\).
- A \((2 + \epsilon)\) randomized approximation algorithm when release times and demands are super-polynomial sized.
- Experiments show significant improvements on existing algorithms.

- Single Coflow:
- Multi commodity flow
- Can be solved optimally in polynomial time.

- Multiple Coflows:
- Time indexed LP
- Stretching to round the solution.

\(x_j^i(t)\) is the fraction of flow \(i\) in job \(j\) that is finished at time \(t\). \(X_j(t)\) is the total fraction of job \(j\) that is finished by time \(t\).

\begin{align}
\text{Minimize} \quad &\sum_jw_jC_j\\
\text{Subject to} \quad &\sum_tx_{j}^i(t) = 1 && \forall j \in [n], \forall i \in [n_j]\\
&X_j(t) \leq \sum_{\ell = 1}^t x_j^i(\ell) && \forall j \in [n], \forall i \in [n_j], \forall t \in T \\
&C_j \geq 1 + \sum_t (1 - X_j(t)) && \forall j\in [n] \\
&r_j^i \geq t \Rightarrow x_{j}^i(t) = 0 && \forall j \in [n],\forall i \in [n_j], \forall t\in T\\
&x_{j}^i(t) \geq 0 && \forall j \in [n],\forall i \in [n_j], \forall t\in T
\end{align}

Start with thinking \(x_j(t)\in \{0, 1\}\).

\begin{align*}
C_j &\geq \sum_{t = 1}^{T}t\cdot x_j(t) = \sum_{t = 1}^{T}x_j(t)\sum_{\tau=1}^t1\\
&= \sum_{\tau=1}^{T}\sum_{t\geq \tau}^{T}x_j(t) = \sum_{\tau=1}^{T}\left(\sum_{t = 1}^{T}x_j(t) - \sum_{t = 1}^{\tau - 1}x_j(t) \right)\\
&= \sum_{\tau=1}^{T}(1 - X_j(\tau - 1)) = \sum_{\tau=0}^{T- 1}(1 - X_j(\tau)) \\
&= 1 + \sum_{\tau=1}^{T- 1}(1 - X_j(\tau))
\end{align*}

\[\sum_{p_j^i\ni e} x^i_j(t)\cdot \sigma^i_j \leq c(e), \forall e\in E, \forall t\in T\]

- \(x^i_j(t, e)\): the fraction of flow \(f^i_j\) transmitted through edge \(e\) in time slot \(t\).
- \(x^i_j(t)\): the total fraction of flow \(f^i_j\) that is transmitted in time slot \(t\).

\begin{align}
\\
&\sum_{e \in \delta_{out}(s_j^i)}x^i_j(t, e) = x^i_j(t), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T\\
&\sum_{e \in \delta_{in}(t_j^i)}x^i_j(t, e) = x^i_j(t), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T\\
&\sum_{e \in \delta_{in}(v)}x^i_j(t, e) = \sum_{e \in \delta_{out}(v)}x^i_j(t, e), &&\forall j \in [n],\forall i \in [n_j], \forall t\in T, \forall v\in V\backslash \{s_j^i, t_j^i\}\\
&\sum_{j \in [n],i \in [n_j]} x^i_j(t, e)\cdot \sigma^i_j \leq c(e), &&\forall t\in T, \forall e\in E%\\
\end{align}

- Improve Efficiency
- It takes quite long to solve the linear program.
- Is it possible to avoid solving LP?
- Or limit the LP without sacrificing too much on approximation ratio?

- Shed light on switch model
- Model precedence constraints

Q & A

- Reveal.js: the slide framework
- magic.css: some animations
- org-mode and org-reveal: for making the slides