Entry Name:  "ICL-DSI-MC1"

VAST Challenge 2019
Mini-Challenge 1

 

 

Team Members:

James Scott-Brown, Data Science Institute, Imperial College London, james@jamesscottbrown.com.

Student Team:  NO

 

Tools Used:

D3.js (for creating interactive visualizations)

Python (for initial data preprocessing/reformatting)

 

Approximately how many hours were spent working on this submission in total?

30 (?).

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2019 is complete? YES

 

Video

https://vimeo.com/347855541

 

 

 

Questions

1. Emergency responders will base their initial response on the earthquake shake map. Use visual analytics to determine how their response should change based on damage reports from citizens on the ground. How would you prioritize neighborhoods for response? Which parts of the city are hardest hit? Limit your response to 1000 words and 10 images.

This question refers to the ‘initial’ response: I interpret this as referring to the conditions on Wednesday morning.

Comparing shakemaps and reports

The earthquake shake maps suggest that the earthquake’s effects would be localized to the North East: the prequake shakemap suggests that the preshake would be felt only on the very northeastern coast, if at all, and the main quake shakemap suggests that a perceived shaking stronger than ‘moderate’ would be mostly confined to districts 3, 4, 7 and 12, with ‘light’ shaking in districts 2, 14, 18, 13, 11.

Comparing the shakemap to the reports, it appears that:

  • neighborhood 4 is reported to be less seriously affected than would be predicted
  • the earthquake’s impact extends across the whole island, with high degrees of damage to neighborhoods around the far edge of the island (2, 5, 9, 8, 11)
A map showing the damage reports received at 9.00 on Wednesday morning. The rainbow colormap is the shakemap for the main quake. Rectangular heatmaps show the distribution of reports for each district: within these, horizontal position represents damage category, vertical position represents degree of damage (higher up is worse), and color represents the number of reports. It can be seen that some neighbourhoods around the edge of the island (2, 5, 9, 8, 11) have much more severe damage reported than would be expected from the shakemap.

Prioritising neighbourhoods

Neighborhood 11 reports more shaking (an average score of 3 on the map) than would be expected based on the shakemap (‘not felt’).

Neighborhoods 2, 5, 9, 8 report low levels of shaking (as expected), but high degrees of damage.

The most seriously affected districts seem to be 3, 8, 9, 11, 14, 1. Based on this view, it would seem reasonable to alter an assessment based only on the shakemap by:

  • de-prioritizing responding to neighborhood 4
  • retaining the prioritization of 3
  • increasing the prioritisation of 11 and 14
  • significantly increasing the prioritisation of 8, 9 and 1
A matrix view of the damage reports received at 9.00 on Wednesday morning. Rows correspond to damage categories, and columns correspond to neighborhoods. The background of each cell is the average score for the corresponding damage category and location, encoded using a single-hue red colormap: darker red indicates more serious damage. Within each cell, a barchart shows the distribution of reports: the vertical position of each bar indicates the damage level, and its length indicates the number of reports, which is also redundantly encoded as color using a Viridis colormap.
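
The per-cell averages and report counts that a matrix view of this kind encodes could be computed along the following lines. This is a minimal sketch rather than the code used for this submission: the row layout, the field names (`location`, `buildings`, `power`) and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; field names and values are illustrative assumptions.
reports = [
    {"location": 3, "buildings": 8, "power": 9},
    {"location": 3, "buildings": 6, "power": None},  # a report with no power rating
    {"location": 4, "buildings": 2, "power": 1},
]
CATEGORIES = ["buildings", "power"]


def summarise(rows, categories):
    """Return {(category, location): (mean_score, n_reports)}, skipping missing ratings."""
    scores = defaultdict(list)
    for row in rows:
        for category in categories:
            value = row.get(category)
            if value is not None:
                scores[(category, row["location"])].append(value)
    return {key: (mean(values), len(values)) for key, values in scores.items()}


cells = summarise(reports, CATEGORIES)
for (category, location), (average, count) in sorted(cells.items()):
    print(f"{category:>10} / neighborhood {location}: mean={average:.1f}, n={count}")
```

Keeping the count next to the mean matters because a cell backed by one or two reports deserves less weight than one backed by hundreds.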

Power outages

Whilst ‘power’ is one of the categories of damage that users can report, the most serious power damage will prevent the receipt of all reports from a district.

On Wednesday morning, soon after the initial burst of reports, there are gaps during which no reports are received from neighborhoods 3, 8, 9, and 10; there is also a much shorter gap for neighborhood 11. At the end of each of these gaps, there is a 5-minute window in which a very large number of reports are received, suggesting that reports sent during the gap were queued and then delivered once power or communications were restored.

There is also a gap in neighborhood 7, but as it is not followed by a burst this probably simply corresponds to a period in which no attempts to submit reports were made; this neighborhood is described as ‘single-family homes in a beautiful, tranquil, wooded area’, so probably has only a small number of residents.

It is impossible to determine from this dataset alone whether these gaps are due to failures of the power supply or simply the communication network; the former might be more serious, as electricity may be required for lighting, heating, cooking, and running medical equipment.

A heat-map view of damage reports. This consists of multiple subplots with a shared horizontal time axis. Reports are subdivided into subplots first by district (with the same background color), and then by damage category. Within each subplot, a matrix is formed by dividing time into 5-minute windows corresponding to the provided bins of reports, and dividing horizontally into the different levels of damage; each element of this matrix is colored according to the number of reports received using a Viridis colormap. There are several apparent power/communications outages, which appear as long gaps during which no messages are received from particular districts, each followed by a single 5-minute window in which a large number of messages are received.
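
The gap-then-burst pattern described above was identified visually, but a similar check could be automated. The sketch below assumes a list of report counts per consecutive 5-minute bin for one neighborhood; the function name `find_outages` and the thresholds `min_gap` and `burst_factor` are hypothetical choices, not values taken from the analysis.

```python
from statistics import median


def find_outages(counts, min_gap=3, burst_factor=3.0):
    """Find runs of empty bins that end in an unusually large bin.

    counts: report counts per consecutive 5-minute bin for one neighborhood.
    Returns a list of (gap_start, gap_end, burst_index) tuples of bin indices.
    """
    nonzero = [c for c in counts if c > 0]
    if not nonzero:
        return []
    typical = median(nonzero)  # a "normal" bin size for this neighborhood
    outages = []
    i = 0
    while i < len(counts):
        if counts[i] == 0:
            start = i
            while i < len(counts) and counts[i] == 0:
                i += 1
            # i now points at the first non-empty bin after the gap (if any)
            long_enough = (i - start) >= min_gap
            if long_enough and i < len(counts) and counts[i] >= burst_factor * typical:
                outages.append((start, i - 1, i))
        else:
            i += 1
    return outages


# Hypothetical counts: a 4-bin gap followed by a burst, then a quiet gap with no burst.
example = [5, 4, 6, 0, 0, 0, 0, 40, 5, 0, 0, 0, 3]
detected = find_outages(example)
print(detected)
```

A gap without a following burst, like the second one in the example, is left unflagged, which matches the reasoning applied to neighborhood 7.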

2. Use visual analytics to show uncertainty in the data. Compare the reliability of neighborhood reports. Which neighborhoods are providing reliable reports? Provide a rationale for your response. Limit your response to 1000 words and 10 images.

There are a number of issues that could affect the reliability of conclusions based on reports from particular regions.

Report numbers

District 7 seems to have only a few report submitters. This creates two problems:

  • the effect of outliers is larger: for several categories most people report the level of damage as 0, but a single outlier submits a much larger score, and as there are few reports this has a larger effect on the average score for these categories than it would for other neighborhoods (e.g. at 9.30, 1 person reports level 10 sewer/water damage whilst 5 report level 0, and 1 person reports level 8 building damage whilst 5 report level 0)

  • there are long periods in which no reports are received, and during these periods it would be difficult to tell whether there is a power/communication outage, or just no attempts to submit reports

Heat map of reports from neighborhood 7, showing their sparsity
The matrix view shows not only the average damage score for each location/category pair, but also shows the number of reports to give an indication of reliability
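
The outlier effect for neighborhood 7 can be worked through numerically using the 9.30 figures quoted above (one report of 10 or 8 against five reports of 0). The comparison with 50 zero-reports is a hypothetical contrast added for illustration, not a figure from the data.

```python
from statistics import mean, median

# Figures quoted in the text for neighborhood 7 at 9.30.
sewer_and_water = [0] * 5 + [10]
buildings = [0] * 5 + [8]

sewer_mean = mean(sewer_and_water)    # 10 / 6
buildings_mean = mean(buildings)      # 8 / 6
sewer_median = median(sewer_and_water)

# Hypothetical contrast: the same single outlier among 50 zero-reports.
busy_mean = mean([0] * 50 + [10])     # 10 / 51

print(f"sewer/water mean: {sewer_mean:.2f}, median: {sewer_median}")
print(f"buildings mean:   {buildings_mean:.2f}")
print(f"same outlier among 50 zero-reports: {busy_mean:.2f}")
```

With only six reports, one outlier moves the mean by more than a full point on the scale, whereas the median stays at 0.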

Bimodality

Neighborhood 1 seems to have a strongly bimodal response for several damage categories (sewer/water, power, medical, buildings), with a larger number of reports of more serious damage, but also a smaller number of reports of less damage. This may indicate that parts of this neighborhood are much more seriously affected than others, and this could be investigated further if more granular report locations were available (it could also have other causes, such as reports being mistakenly assigned to the wrong neighborhood, e.g. by replacing missing locations with neighborhood 1).

This does not necessarily indicate that the data contains errors, but may indicate that the analysis is being performed at an inappropriate spatial level.

Neighborhood 16 also shows bimodality.

Zoomed in view of the heat-map shows bimodality in the damage scores for sewer/water (top row), power (second row), medical (fourth row), and buildings (fifth row) in neighborhood 1
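
Bimodality was judged by eye from the heat-map, but a crude automated check is possible. The sketch below counts local maxima in the histogram of scores, assuming an integer scale from 0 to 10; the function name `histogram_peaks`, the `min_share` threshold and the sample scores are hypothetical, and a proper statistical test would be more robust.

```python
from collections import Counter


def histogram_peaks(scores, min_share=0.1, levels=11):
    """Return the damage levels that are local maxima of the score histogram.

    A run of equal-height bins counts as one peak (reported at its first level).
    Peaks holding less than `min_share` of all reports are ignored as noise.
    """
    counts = Counter(scores)
    hist = [counts.get(level, 0) for level in range(levels)]
    total = sum(hist)
    peaks = []
    start = 0
    while start < levels:
        end = start
        while end + 1 < levels and hist[end + 1] == hist[start]:
            end += 1  # extend over a plateau of equal-height bins
        left = hist[start - 1] if start > 0 else 0
        right = hist[end + 1] if end + 1 < levels else 0
        height = hist[start]
        if total and height > left and height > right and height / total >= min_share:
            peaks.append(start)
        start = end + 1
    return peaks


# Hypothetical scores: a small cluster of low ratings and a larger cluster of high ones.
bimodal = [1] * 3 + [2] * 6 + [3] * 2 + [8] * 10 + [9] * 14 + [10] * 5
unimodal = [0] * 20 + [1] * 5

bimodal_peaks = histogram_peaks(bimodal)
unimodal_peaks = histogram_peaks(unimodal)
print(bimodal_peaks, unimodal_peaks)
```

Two or more well-separated peaks would flag a neighborhood and category pair for closer inspection.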

Missing data

There are several gaps that are presumably due to electricity/communications outages (most seriously affecting neighborhoods 3, 8, 9, 10; but also affecting 12, 14, 17). These are clearly shown in the large heatmap included in the answer to Question 1 above.

During these gaps there are no timely reports; the latest available reports are old and thus do not give a reliable indication of the current situation.
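
The staleness of the latest report could be made explicit by tracking its age per neighborhood. This is a minimal sketch under stated assumptions: the timestamps are hypothetical, and the 30-minute threshold (`max_age_minutes`) is an arbitrary illustrative choice.

```python
from datetime import datetime


def report_ages(last_received, now):
    """Return minutes elapsed since the latest report from each neighborhood."""
    return {
        location: (now - received).total_seconds() / 60
        for location, received in last_received.items()
    }


def stale_neighborhoods(last_received, now, max_age_minutes=30):
    """Return the neighborhoods whose latest report is older than the threshold."""
    ages = report_ages(last_received, now)
    return sorted(loc for loc, age in ages.items() if age > max_age_minutes)


# Hypothetical receipt times of the most recent report per neighborhood.
now = datetime(2020, 4, 8, 10, 0)
last_received = {
    3: datetime(2020, 4, 8, 8, 40),
    4: datetime(2020, 4, 8, 9, 55),
    8: datetime(2020, 4, 8, 9, 10),
}

ages = report_ages(last_received, now)
stale = stale_neighborhoods(last_received, now)
print(ages, stale)
```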

Additionally, whilst many messages are received in the 5-minute interval after the outage ends, it is unclear whether some messages were lost, and if so whether some messages (e.g. those sent earlier or later) are more likely to have been lost.


3. How do conditions change over time? How does uncertainty in change over time? Describe the key changes you see. Limit your response to 500 words and 8 images.

Changing conditions over time

Looking at the number of reports over time (see large heatmap in answer to Question 1), we can see three main bursts of reports, and a number of gaps.

The first burst of reports occurs on Monday afternoon: these reports describe only a small amount of shaking, and little damage.

Matrix view of reports from the first burst of reports on Monday afternoon

The second burst of reports occurs on Wednesday morning. This includes a much larger number of reports.

Matrix view of reports from the second burst of reports on Wednesday morning

This is quickly followed by outages in neighborhoods 3, 10 and 11; outages affect 8 and 9 soon afterwards.

These outages are resolved by the time of the third burst of reports. This quake quickly causes an outage in neighborhood 4; outages affecting 4, 8, 12, 14, and 17 occur later.

Matrix view of reports from early in the third burst of reports on Thursday afternoon

Interestingly, the peak in the number of reports occurs later for neighborhoods 8 and 9 than for the other neighborhoods.

Matrix view of reports from late in the third burst of reports on Thursday afternoon (during peak in reporting frequency for neighborhoods 8 and 9)

Changing uncertainty over time

I interpret 'uncertainty' as referring generically to a lack of precise knowledge about the actual conditions on the ground.

This uncertainty has a number of forms:

  • there is uncertainty about the physical meaning of a numerical rating, as there is no data dictionary that defines exactly what is meant by each number: different users may therefore have different views about what constitutes ‘sewer_and_water’ damage of 3, and this might depend on the worst damage that they have seen (e.g. someone might assign a lower score to their home location if their workplace was severely damaged)

  • there is uncertainty about the location of reports, as these are aggregated into neighborhoods. Conditions might vary across a neighborhood, and neighborhood boundaries are unlikely to neatly coincide with natural breaks in the severity of damage. The potential for errors in location to place a user in the wrong neighborhood is unknown.

  • there is uncertainty about time due to two major effects: reports are binned into 5-minute windows, limiting temporal resolution; and there are windows during which no reports are received due to power and/or communications outages, and it is the time of receipt (rather than time of sending) that is recorded.

  • there is uncertainty due to data gaps: during a power/communications outage there is no information about current conditions (except that communication is impossible)

  • there is uncertainty about data loss: periods in which no reports are received are followed by a much higher number of reports in the following 5-minute window, indicating that some reports are delayed, but there is no way of knowing whether some reports have been lost (and if so, whether the lost messages are the oldest reports, the reports sent after a messaging queue filled up, or are randomly distributed).

  • there is uncertainty about sampling bias: users who have a smartphone capable of running the app and who also choose to submit reports might not have the same spatial distribution as the wider population (for example, it is possible that they might be in wealthier sub-regions, which could have newer/older infrastructure that is more/less susceptible to damage)

  • there is uncertainty due to sampling noise: whilst there are large numbers of reports submitted after a significant change in conditions, far fewer reports are made at intermediate times when conditions are more stable. A small number of unreliable reports at these times can thus have a large effect on the average rating.

The major changes in uncertainty over time are due to increased uncertainty during outages, and in the gaps between significant events when relatively few reports are submitted.
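
The sampling-noise effect can be illustrated with the standard error of the mean, which shrinks as the number of reports grows. This is a sketch under assumptions: the two sets of scores are hypothetical, and treating reports as independent samples is a simplification.

```python
from math import sqrt
from statistics import stdev


def standard_error(scores):
    """Standard error of the mean; infinite when fewer than two reports exist."""
    if len(scores) < 2:
        return float("inf")
    return stdev(scores) / sqrt(len(scores))


# Hypothetical quiet period: six reports, one of them an outlier.
quiet_period = [0, 0, 0, 0, 0, 10]
# Hypothetical busy period: the same mix of scores, but ten times as many reports.
busy_period = quiet_period * 10

se_quiet = standard_error(quiet_period)
se_busy = standard_error(busy_period)
print(f"quiet period: n={len(quiet_period)}, standard error={se_quiet:.2f}")
print(f"busy period:  n={len(busy_period)}, standard error={se_busy:.2f}")
```

Both periods have the same mean, but the estimate from the quiet period is several times less precise.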


4. The data for this challenge can be analyzed either as a static collection or as a dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data - as a static collection or a stream. How do you think this choice affected your analysis? Limit your response to 200 words and 3 images.

In a real disaster-response scenario, planners would benefit from having access to the latest available information.

However, it is not sufficient to display only the most recent reports, for a number of reasons (e.g., if a power/network outage is preventing the transmission of reports from a region, the most recent available reports from that region may be old, and most reports are made after a significant change in conditions).

I therefore designed a tool that showed all of the available reports. For convenience, it ingests the provided CSV file as a static collection.

Adapting this to act in a streaming mode would not require any changes to the visual representations used: the view showing temporal trends could always display the last ~3 days of reports, and the maps and matrix views could always show the most recent reports (unless the analyst has selected a particular time).

However, the data description document states that data is only available every 5 minutes, due to batching resulting from 'the server configuration'. As pre-processing of the data takes under 5 minutes, there seems to be no real disadvantage to simply reloading the tool each time a new batch of reports is released.
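
If a streaming mode were wanted, the rolling window of roughly 3 days mentioned above could be maintained as batches arrive. The sketch below was not part of the submitted tool: the class name `RollingReports`, its methods and the sample batches are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class RollingReports:
    """Keep only the batches received within a fixed window before the newest batch."""

    def __init__(self, window=timedelta(days=3)):
        self.window = window
        self.batches = deque()  # (received_at, rows) in arrival order

    def add_batch(self, received_at, rows):
        """Append a new 5-minute batch and drop batches older than the window."""
        self.batches.append((received_at, rows))
        cutoff = received_at - self.window
        while self.batches and self.batches[0][0] < cutoff:
            self.batches.popleft()

    def latest(self):
        """Return the most recent batch, or None if nothing has arrived yet."""
        return self.batches[-1] if self.batches else None


# Hypothetical arrival times and payloads.
start = datetime(2020, 4, 6, 12, 0)
store = RollingReports()
store.add_batch(start, ["batch A"])
store.add_batch(start + timedelta(days=1), ["batch B"])
store.add_batch(start + timedelta(days=4), ["batch C"])  # pushes batch A out of the window

kept = [rows[0] for _, rows in store.batches]
print(kept, store.latest())
```

The temporal view would draw everything in `batches`, while the map and matrix views would default to `latest()`.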