PhD Proposal: Probing-based measurement of last-mile Internet reliability
Internet reliability is increasingly important as a variety of services that we use migrate to the Internet. Yet, we lack authoritative measures of last-mile Internet reliability. The first step towards measuring last-mile reliability is to detect Internet outage events experienced by users. Since Internet outages are rare events, detecting them requires broad and longitudinal measurements; however, such measurements of Internet reliability at the individual user level are challenging to obtain accurately and at scale. The second step is to determine which of the detected outages are consistent with the failure of an ISP's operated equipment, and which outages can be attributed to other causes such as power, and use these categorized outages to reason about Internet reliability across different dimensions such as ISPs, media-types, and geographical areas.
I investigate probing-based remote outage detection techniques because of their ability to scale, and propose approaches to improve their accuracy. These techniques detect Internet outages across time as well as across the IPv4 address space by sending active probes, such as pings and traceroutes, to users' IP addresses and using the probe responses to infer Internet connectivity. However, they can infer false outages because their foundational assumption, that the lack of a response to an active probe indicates failure, is sometimes invalid. I illustrate two scenarios where this assumption is invalid. In the first scenario, responses are delayed beyond the prober's timeout, leading these techniques to infer packet loss instead of delay. In the second scenario, these techniques can falsely infer packet loss when the address they are probing is dynamically reassigned. I examine how commonly delayed responses and dynamic reassignment occur across ISPs and propose enhancements to probing-based remote outage detection techniques so that they infer outages and their durations accurately even when these scenarios occur.
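The first scenario can be illustrated with a minimal sketch. It is not the proposal's measurement code: the function names, the sample round-trip times, and the 3-second timeout are all assumptions chosen for illustration. The sketch compares the loss rate that a prober with a fixed timeout would report against the loss rate obtained when late responses are still counted as responses.

```python
from typing import List, Optional

# Round-trip times in seconds; None means no response ever arrived.
# All values here are made up for illustration.
Rtts = List[Optional[float]]


def loss_rate_with_timeout(rtts: Rtts, timeout: float) -> float:
    """Loss rate seen by a prober that gives up after `timeout` seconds.

    A response that arrives after the timeout is indistinguishable from
    a lost probe, so both are counted as loss.
    """
    lost = sum(1 for rtt in rtts if rtt is None or rtt > timeout)
    return lost / len(rtts)


def loss_rate_ignoring_delay(rtts: Rtts) -> float:
    """Loss rate when only probes that never got a response count as lost."""
    lost = sum(1 for rtt in rtts if rtt is None)
    return lost / len(rtts)


# Hypothetical probe results for one address: two responses are slow,
# one probe is never answered.
sample = [0.05, 4.2, None, 0.08, 7.5]

with_timeout = loss_rate_with_timeout(sample, timeout=3.0)
without_timeout = loss_rate_ignoring_delay(sample)
print(with_timeout, without_timeout)
```

With these made-up values, the prober with a timeout counts three of the five probes as lost, while only one probe actually went unanswered. A detector that treats consecutive losses as an outage would be correspondingly more likely to report a false outage for this address.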
After obtaining accurate outage measurements, the next step in performing meaningful assessments of reliability is the segregation of detected outages into categories that suggest their cause. Outages could result from a variety of causes, such as power outages, voluntary shutdown of users' home Internet equipment, and network outages due to the failure of an ISP's infrastructure. When assessing the reliability of a particular ISP, we would ideally consider only the subset of outages that affect solely that ISP. I propose a technique to segregate outages by instrumenting probing-based remote outage detection techniques to also probe "related" addresses that share geography, ISP, or network topology. Simultaneous outages in related addresses can serve as evidence that a detected outage affected multiple users in a particular ISP.
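The idea of using simultaneous outages in related addresses as evidence can be sketched as follows. This is only an illustration under stated assumptions, not the proposed technique itself: outages are modeled as (start, end) time intervals, the helper names and the `min_corroborating` threshold are hypothetical, and the addresses come from the documentation range 192.0.2.0/24.

```python
from typing import Dict, List, Tuple

# An outage is a half-open interval (start, end) in seconds.
Interval = Tuple[float, float]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share any span of time."""
    return a[0] < b[1] and b[0] < a[1]


def corroborating_addresses(
    outage: Interval, related_outages: Dict[str, List[Interval]]
) -> List[str]:
    """Related addresses that had an outage overlapping the given one."""
    return [
        addr
        for addr, intervals in related_outages.items()
        if any(overlaps(outage, iv) for iv in intervals)
    ]


def label_outage(
    outage: Interval,
    related_outages: Dict[str, List[Interval]],
    min_corroborating: int = 2,  # assumed threshold, not from the proposal
) -> str:
    """Label an outage 'shared' if enough related addresses failed with it."""
    count = len(corroborating_addresses(outage, related_outages))
    return "shared" if count >= min_corroborating else "isolated"


# Hypothetical outages observed on addresses related to the probed one.
related = {
    "192.0.2.10": [(100.0, 200.0)],
    "192.0.2.11": [(150.0, 400.0)],
    "192.0.2.12": [(500.0, 600.0)],
}

print(label_outage((120.0, 180.0), related))
print(label_outage((450.0, 520.0), related))
```

In this sketch, an outage labeled "shared" is consistent with a failure affecting multiple users, while an "isolated" outage is more consistent with a cause local to one user, such as equipment being switched off. How related addresses are selected and which threshold is appropriate are left open here.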
Implementing these proposed techniques will help achieve comprehensive measurements of Internet reliability that can be used to identify vulnerable networks and their challenges, inform which enhancements can help networks improve reliability, and evaluate the efficacy of deployed enhancements over time.
Chair: Dr. Neil Spring
Dept rep: Dr. Atif Memon
Member: Dr. Bobby Bhattacharjee