Application Project - Analyzing Airline On-Time Data

Nicholas Chen

The US Department of Transportation (DOT) has a Bureau of Transportation Statistics (BTS) that keeps meticulous statistics about most modes of transportation. Among these statistics is a comprehensive set of data documenting the on-time statistics of every domestic flight. For every flight, many statistics, including flight origin and destination, scheduled departure and arrival time, and even the time spent taxiing in and out, are recorded by airlines and submitted to the BTS.

Few things are as frustrating as a delayed flight. As someone who has always been interested in commercial aviation and flies on a regular basis, I chose to visualize the airline on-time statistics for my application project, with the goal of easily identifying which airlines or airports are prone to delays, along with any other trends. Using this knowledge, travelers would better know when to expect delays and how to avoid them.

First Step - Reducing and Aggregating the Data

One interesting bit of information I acquired prior to using any of the visualization tools was that there are over 500,000 domestic flights per month. Accordingly, the data files, which contain one line for each flight, are rather massive, with file sizes on the order of 200 MB per month. As the demo datasets for the various visualization tools are on the order of kilobytes, I was confident that the size of the raw data far exceeded the capabilities of the software. Therefore, the first step in the analysis was to create some scripts that filtered the data down to a more manageable size. Instead of looking at all airports, I used a separate database available from the BTS (the Air Carrier Summary Data) to find the top 30 airports and top 15 carriers in terms of number of passengers served. The busiest airports and carriers served as my initial filtering criteria and halved the number of flights. For the Treemap analyses, for each month, I aggregated individual flights into bins by day of the week, summing up indicator variables and averaging times to produce a final dataset to analyze.
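The filtering and binning described above can be sketched in a few lines of Python. This is only an illustration, not the scripts actually used for the project: the field names (origin, carrier, day_of_week, arr_delay), the choice to filter on the origin airport, and the late_threshold parameter are all assumptions made for the example, and they are not the BTS column names.

```python
from collections import defaultdict


def aggregate_by_weekday(flights, top_airports, top_carriers, late_threshold=30):
    """Filter flights to the chosen airports and carriers, then bin by weekday.

    `flights` is an iterable of dicts with hypothetical keys:
    origin, carrier, day_of_week, arr_delay (minutes).
    """
    bins = defaultdict(lambda: {"flights": 0, "late": 0, "delay_sum": 0.0})
    for f in flights:
        # Assumed filter: keep a flight only if both its origin airport
        # and its carrier are in the "busiest" sets.
        if f["origin"] not in top_airports or f["carrier"] not in top_carriers:
            continue
        b = bins[(f["origin"], f["carrier"], f["day_of_week"])]
        b["flights"] += 1
        # Indicator variable: 1 if the flight counts as late, else 0.
        b["late"] += 1 if f["arr_delay"] > late_threshold else 0
        b["delay_sum"] += f["arr_delay"]

    # Sum the indicators and average the times for each bin.
    return {
        key: {
            "flights": b["flights"],
            "late": b["late"],
            "avg_arr_delay": b["delay_sum"] / b["flights"],
        }
        for key, b in bins.items()
    }
```

Because the function consumes flights one at a time, it could be fed from a csv.DictReader over a monthly file (after converting the delay field to a number) without loading the whole file into memory.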

Part 1 - Treemap

Treemap Visualization 1 - Distribution of Departures at Major Airports

The first visualization (Figure 1) tracks the number of departures made by each carrier at each airport. The size of the cells represents the number of departures. For easier identification, the cells in the treemap are color-coded in a color similar to the carrier's corporate color scheme (this is done where possible, as many airlines share similar colors). For carriers that are contracted by major carriers, such as American Eagle (affiliated with American Airlines) or ComAir (affiliated with Delta), a color was selected that was similar to that of its affiliate.

Figure 1 - Distribution of Airlines at Airports

One interesting feature of the visualization is that it is very easy to determine which carriers dominate certain airports. Based on this, one can quickly see which airports serve as hubs for a particular airline. For example, DFW (Dallas/Ft. Worth) is clearly a hub for American Airlines (AA), with flights by AA and its contract carrier American Eagle accounting for over 80% of all departures. The massive skew in distribution toward a single carrier at some airports was definitely a surprise.

Treemap Visualization 2 – Fraction of Departures that Arrived Late

Figure 2 is designed to show how punctual flights departing from a particular airport are. Since more departures naturally result in more late arrivals, raw counts of late flights are not comparable. To compare carriers at each airport, the colors of the cells are therefore mapped to the percentage of flights that arrived at the destination more than 30 minutes late. The size of the cells remains mapped to the number of departures.
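As a rough sketch of how the two cell attributes could be derived, the Python function below computes a size (number of departures) and a color value (percentage arriving more than 30 minutes late) for each airport and carrier pair. The function name, the tuple layout of the input, and the output keys are hypothetical and chosen only for this example; the Treemap tool's own input format is not shown here.

```python
def treemap_cells(flights, late_threshold=30):
    """Compute per-cell size and color value for an airport/carrier treemap.

    `flights` is an iterable of (origin, carrier, arr_delay_minutes) tuples
    (a hypothetical layout). A flight is late when its arrival delay is
    strictly greater than `late_threshold` minutes.
    """
    counts = {}
    for origin, carrier, arr_delay in flights:
        departures, late = counts.get((origin, carrier), (0, 0))
        is_late = 1 if arr_delay > late_threshold else 0
        counts[(origin, carrier)] = (departures + 1, late + is_late)

    return {
        key: {"size": departures, "pct_late": 100.0 * late / departures}
        for key, (departures, late) in counts.items()
    }
```

Normalizing by the number of departures is what makes a small carrier's cell comparable in color to a large carrier's cell at the same airport.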

Figure 2 - Percentage of Flights Resulting in Delays

Several discoveries can be made using the Treemap. First, it is easy to see that the three busiest airports, ORD (Chicago O’Hare), ATL (Atlanta), and DFW, generally do not fare well in getting flights to their destinations on time. However, one can pick out the green outliers to determine that America West flights leaving ORD have a good track record for being on time. One can do the same at the other airports shown.

Treemap Visualization 3 – Hierarchy by Airline

Figure 3 has the same settings as the previous Treemap, except the cells are grouped by airline. One can, at a glance, see the overall on-time performance for the different airlines. One can also examine whether an airline's delays are characteristic of the carrier (due to issues like poor scheduling and slow turnaround time) or caused by a particular airport.

Figure 3 - Airport delays, grouped by airline

As shown by the large swath of green, America West has the best on-time performance of the airlines, a fact that is proudly advertised by the airline. Another interesting feature is that one can see how heavily various carriers rely on hubs. Southwest, unlike most other airlines, does not have a significant portion of its flights originating from one or two airports. The point-to-point approach that Southwest has adopted for most routes is often credited as one reason Southwest has strong on-time performance.
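Hub reliance, which the Treemap shows visually, could also be expressed as a single number: the share of a carrier's departures that originate from its one or two busiest airports. The sketch below is one possible way to compute it; the function name, the (origin, carrier) tuple layout, and the default of two airports are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter


def hub_share(departures, carrier, top_n=2):
    """Fraction of `carrier`'s departures leaving from its `top_n` busiest airports.

    `departures` is an iterable of (origin, carrier) tuples (hypothetical layout).
    Returns 0.0 when the carrier has no departures in the data.
    """
    counts = Counter(origin for origin, c in departures if c == carrier)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    busiest = sum(n for _, n in counts.most_common(top_n))
    return busiest / total
```

Under this measure, a carrier built around one or two hubs would score close to 1, while a point-to-point carrier would score noticeably lower.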

Part 2 - Spotfire

I wished to explore the data in a little more detail and elected to use Spotfire to examine aspects of the data not well suited to a Treemap.

Spotfire Visualization 1 – Late Aircraft

One question I wanted to resolve was whether my Treemap analyses were valid, specifically whether the departing airport played a role in whether a flight arrived late. Intuitively, this seems to be the case, but I wished to make sure that flights were not arriving late because they were held up after departure, prior to reaching the destination airport. Figure 4 shows a strong linear correlation between departure delay time and arrival delay time, suggesting that the previous analyses were valid.
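The strength of a linear relationship like the one visible in Figure 4 can be summarized with a Pearson correlation coefficient. The function below is a minimal sketch of that calculation using only the standard library. It is not what Spotfire does internally, and the function name and the idea of passing two parallel lists of delay minutes are assumptions for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    For this report, `xs` would hold departure delays and `ys` the matching
    arrival delays, both in minutes.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return sxy / math.sqrt(sxx * syy)
```

A value near 1 would indicate that arrival delay rises almost in lockstep with departure delay, which is the pattern the scatter plot suggests.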

Figure 4 - Departure Delay causes Arrival Delay

Spotfire Visualization 2 – Cause of Late Aircraft

As Figure 5 shows, the majority of the arrival delay time is due to the aircraft arriving late at the departure location. Other causes of delay, such as taxi time and security issues, do not show this correlation. This suggests a cascade effect where one late flight can cause many more late flights. For travelers connecting at hubs, this is especially problematic because the scheduling is often done with the expectation that many flights arrive from different locations at the same time, and passengers simply switch planes. However, if one flight is slightly late, it may end up delaying a handful of planes.

Figure 5 - Late arrival correlates with arrival delay

Tool Evaluation

I was impressed by the clarity with which the Treemap was able to present the data. However, I believe that the filtering and pruning needed to put data into Treemap format somewhat limits the free exploration of data. One must have an idea of what pieces of data are likely to be interesting prior to creating the Treemap. Once the Treemap is built, though, the results are quite stunning. Spotfire’s main strength is its ability to rapidly allow the exploration of different dimensions of data, and it was a joy to use for going through all the different columns present in the data set. The charts Spotfire generates are not as exciting, however. The ability to handle massive amounts of data would have been a tremendous time-saver. But since even Excel and Access were choking on the raw data, I have a feeling that custom pre-processing scripts may be around for a while.