Beyond Binary Failures: A New Framework for Networks
Fiber-optic cables are the workhorses of today's Internet services, but they are expensive and require significant investment. Their importance has driven a conservative deployment approach, with redundancy baked into multiple layers of the network under the assumption that links have a constant reliability status and operate at a fixed capacity. In this talk, I take an unconventional approach and argue that link failures should not always be treated as binary events; this view lays the foundation for a framework in which network links have dynamic capacity and reliability. I investigated this idea by conducting the first-ever large-scale study of operational optical signals, analyzing over 2,000 channels in a wide-area network over a period of three years, as well as 350,000 links in 20 data center networks worldwide. My analysis uncovered several findings that enable cross-layer optimizations and smart algorithms to improve traffic engineering, increase capacity, and reduce cost. First, the capacity of 99% of wide-area links can be augmented by at least 50 Gbps, leading to an overall capacity gain of more than 100 Tbps. This means we get higher capacity and better availability using the same links. Second, I will show that 99.99% of data center links have an incoming optical power level that is higher than the design threshold; by allowing links to have multiple reliability levels, we can cut the cost of data center networks by nearly half. Finally, the framework opens the door to revisiting several classical networking problems, such as the maximum-flow problem and graph abstractions. Microsoft has invested in this new framework and is rolling out the necessary infrastructure for deployment.