Student Team: NO
D3.js (for creating interactive visualizations)
Python (for initial data preprocessing/reformatting)
Approximately how many hours were spent working on this submission in total?
30 (?).
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2019 is complete? YES
Video
Questions
1 – Emergency responders will base their initial response on the earthquake shake map. Use visual analytics to determine how their response should change based on damage reports from citizens on the ground. How would you prioritize neighborhoods for response? Which parts of the city are hardest hit? Limit your response to 1000 words and 10 images.
This question refers to the ‘initial’ response: I interpret this as referring to the conditions on Wednesday morning.
The earthquake shake maps suggest that the earthquake’s effects would be localized to the northeast: the pre-quake shakemap suggests that the pre-quake would be felt only on the very northeastern coast, if at all, and the main-quake shakemap suggests that perceived shaking stronger than ‘moderate’ would be mostly confined to districts 3, 4, 7 and 12, with ‘light’ shaking in districts 2, 11, 13, 14 and 18.
Comparing the shakemap to the reports, it appears that:

Neighborhood 11 reports more shaking (an average score of 3 on the map) than would be expected based on the shakemap (‘not felt’).
Neighborhoods 2, 5, 9, 8 report low levels of shaking (as expected), but high degrees of damage.
The most seriously affected districts seem to be 1, 3, 8, 9, 11 and 14. Based on this view, it would seem reasonable to alter an assessment based only on the shakemap by raising the priority of neighborhoods 2, 5, 8, 9 and 11, which report more shaking or damage than the shakemap alone would suggest.

Whilst ‘power’ is one of the categories of damage that users can report, the most serious power damage may prevent the receipt of any reports from a district.
On Wednesday morning, soon after the initial burst of reports, there are gaps during which no reports are received from neighborhoods 3, 8, 9, and 10; there is also a much shorter gap for neighborhood 11. At the end of each of these gaps there is a 5-minute window in which a very large number of reports are received, suggesting that reports sent during the gap were queued and then delivered in a batch once the power or communications outage ended.
There is also a gap in neighborhood 7, but as it is not followed by a burst this probably simply corresponds to a period in which no attempts to submit reports were made; this neighborhood is described as ‘single-family homes in a beautiful, tranquil, wooded area’, so probably has only a small number of residents.
It is impossible to determine from this dataset alone whether these gaps are due to failures of the power supply or simply the communication network; the former might be more serious, as electricity may be required for lighting, heating, cooking, and running medical equipment.
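The gap-and-burst pattern described above can be detected automatically. A minimal sketch, assuming per-neighborhood report counts have already been binned into consecutive 5-minute windows (the `bin_counts` input and the thresholds are illustrative assumptions, not part of the submitted tool):

```python
def find_gaps(bin_counts, min_gap_bins=3, burst_factor=3):
    """Flag runs of empty 5-minute bins that end in a burst of reports.

    bin_counts: report counts per consecutive 5-minute window for one
    neighborhood. A gap followed by a count >= burst_factor * the pre-gap
    average suggests queued reports delivered after an outage ended;
    a gap with no burst may simply be a quiet period (as in district 7).
    """
    gaps = []
    i = 0
    while i < len(bin_counts):
        if bin_counts[i] == 0:
            start = i
            while i < len(bin_counts) and bin_counts[i] == 0:
                i += 1
            length = i - start
            if length >= min_gap_bins:
                # Baseline rate from (up to) the six bins before the gap.
                before = bin_counts[max(0, start - 6):start]
                baseline = max(1, sum(before) / max(1, len(before)))
                burst = i < len(bin_counts) and bin_counts[i] >= burst_factor * baseline
                gaps.append({"start_bin": start, "bins": length, "burst_after": burst})
        else:
            i += 1
    return gaps
```

A gap flagged with `burst_after` would correspond to the outage-then-batch-delivery pattern seen for neighborhoods 3, 8, 9 and 10; one without it to the quiet-period pattern seen for neighborhood 7.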

2 – Use visual analytics to show uncertainty in the data. Compare the reliability of neighborhood reports. Which neighborhoods are providing reliable reports? Provide a rationale for your response. Limit your response to 1000 words and 10 images.
There are a number of issues that could affect the reliability of conclusions based on reports from particular regions.
District 7 seems to have only a few report submitters. This creates two problems:
the effect of outliers is larger: for several categories most people report the damage level as 0, but a single outlier submits a much higher score, and because there are so few reports this outlier has a larger effect on the average score than it would in other neighborhoods (e.g. at 9:30, one person reports level-10 sewer/water damage whilst five report level 0, and one person reports level-8 building damage whilst five report level 0)
there are long periods in which no reports are received, and during these periods it would be difficult to tell whether there is a power/communication outage, or just no attempts to submit reports
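The outlier effect in the first point can be illustrated with the figures reported above (one level-10 sewer/water report against five level-0 reports); a robust statistic such as the median is far less affected:

```python
from statistics import mean, median

# District 7, 9:30 Wednesday (values as described in the text):
# one level-10 sewer/water report against five level-0 reports.
sewer = [10, 0, 0, 0, 0, 0]

print(round(mean(sewer), 2))  # 1.67 — the single outlier dominates the mean
print(median(sewer))          # 0 — the median reflects the majority view
```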


Neighborhood 1 shows a strongly bimodal distribution of responses for several damage categories (sewer/water, power, medical, buildings), with a larger number of reports of more serious damage but also a smaller number of reports of less damage. This may indicate that parts of this neighborhood are much more seriously affected than others, which could be investigated further if more granular report locations were available (it could also have other causes, such as reports being mistakenly assigned to the wrong neighborhood, e.g. by replacing missing locations with neighborhood 1).
This does not necessarily indicate that the data contains errors, but may indicate that the analysis is being performed at an inappropriate spatial level.
Neighborhood 16 also shows bimodality.
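One way to screen every neighborhood/category pair for this kind of bimodality is Sarle's bimodality coefficient — a heuristic not used in the submitted tool, added here only as a sketch. Values above roughly 0.555 (the uniform distribution's value) suggest a bimodal shape:

```python
def bimodality_coefficient(xs):
    """Sarle's bimodality coefficient from population moments.

    Returns (skewness^2 + 1) / (excess kurtosis + 3); values above
    ~0.555 hint at bimodality. A screening heuristic only.
    """
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0  # all reports identical: trivially unimodal
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3  # excess kurtosis
    return (skew ** 2 + 1) / (kurt + 3)
```

Two tightly separated clusters of damage scores (as in neighborhoods 1 and 16) score near 1, well above the 0.555 threshold, whereas a single symmetric mound scores well below it.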

There are several gaps that are presumably due to electricity/communications outages (mostly seriously affecting neighborhoods 3, 8, 9, 10; but also affecting 12, 14, 17). These are clearly shown in the large heatmap included in the answer to Question 1 above.
During these gaps there are no timely reports; the latest available reports are old and thus do not give a reliable indication of the current situation.
Additionally, whilst many messages are received in the 5-minute interval after the outage ends, it is unclear whether some messages were lost, and if so whether some messages (e.g. those sent earlier or later) are more likely to have been lost.
3 – How do conditions change over time? How does uncertainty in the data change over time? Describe the key changes you see. Limit your response to 500 words and 8 images.
Looking at the number of reports over time (see large heatmap in answer to Question 1), we can see three main bursts of reports, and a number of gaps.
The first burst of reports occurs on Monday afternoon and corresponds to the pre-quake, which causes only a small amount of shaking and little damage.

The second burst of reports occurs on Wednesday morning. This includes a much larger number of reports.

This is quickly followed by outages in neighborhoods 3, 10 and 11; outages affect 8 and 9 soon afterwards.
These outages are resolved by the time of the third burst of reports. The quake responsible for this third burst quickly causes an outage in neighborhood 4; outages affecting neighborhoods 4, 8, 12, 14, and 17 occur later.

Interestingly, the peak in the number of reports occurs later for neighborhoods 8 and 9 than for the other neighborhoods.

I interpret 'uncertainty' as referring generically to a lack of precise knowledge about the actual conditions on the ground.
This uncertainty has a number of forms:
there is uncertainty about the physical meaning of a numerical rating, as there is no data dictionary that defines exactly what is meant by each number: different users may therefore have different views about what constitutes ‘sewer_and_water’ damage of 3, and this might depend on the worst damage that they have seen (e.g. someone might assign a lower score to their home location if their workplace was severely damaged)
there is uncertainty about the location of reports, as these are aggregated into neighborhoods. Conditions might vary across a neighborhood, and neighborhood boundaries are unlikely to neatly coincide with natural breaks in the severity of damage. The potential for errors in location to place a user in the wrong neighborhood is unknown.
there is uncertainty about time, due to two major effects: reports are binned into 5-minute windows, limiting temporal resolution; and it is the time of receipt (rather than the time of sending) that is recorded, so during power and/or communications outages no reports arrive and delayed reports appear later than the conditions they describe.
there is uncertainty due to data gaps: during a power/communications outage there is no information about current conditions (except that communication is impossible)
there is uncertainty about data loss: periods in which no reports are received are followed by a much higher number of reports in the following 5-minute window, indicating that some reports are delayed, but there is no way of knowing whether some reports have been lost entirely (and, if so, whether the lost messages are the oldest reports, the reports sent after a messaging queue filled up, or randomly distributed).
there is uncertainty about sampling bias: users who have a smartphone capable of running the app, and who also choose to submit reports, might not have the same spatial distribution as the wider population (for example, they might be concentrated in wealthier sub-regions, whose newer or older infrastructure could be more or less susceptible to damage)
there is uncertainty due to sampling noise: whilst there are large numbers of reports submitted after a significant change in conditions, far fewer reports are made at intermediate times when conditions are more stable. A small number of unreliable reports at these times can thus have a large effect on the average rating.
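The 5-minute receipt-time binning mentioned above can be sketched as follows (a simplified illustration, not the tool's actual preprocessing code):

```python
from datetime import datetime, timedelta

def bin_start(received: datetime, minutes: int = 5) -> datetime:
    """Floor a receipt timestamp to the start of its 5-minute window.

    Note this keys on the time of *receipt*, not of sending: reports
    delayed by an outage land in the window in which they finally arrive,
    which is one source of the temporal uncertainty discussed above.
    """
    return received - timedelta(minutes=received.minute % minutes,
                                seconds=received.second,
                                microseconds=received.microsecond)
```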
The major changes in uncertainty over time are due to increased uncertainty during outages, and in the gaps between significant events when relatively few reports are submitted.
4 – The data for this challenge can be analyzed either as a static collection or as a dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data – as a static collection or a stream. How do you think this choice affected your analysis? Limit your response to 200 words and 3 images.
In a real disaster-response scenario, planners would benefit from having access to the latest available information.
However, it is not sufficient to display only the most recent reports, for a number of reasons (e.g., if a power/network outage is preventing the transmission of reports from a region, the most recent available reports from that region may be old, and most reports are made soon after a significant change in conditions).
I therefore designed a tool that showed all of the available reports. For convenience, it ingests the provided CSV file as a static collection.
Adapting this to act in a streaming mode would not require any changes to the visual representations used: the view showing temporal trends could always display the last ~3 days of reports, and the maps and matrix views could always show the most recent reports (unless the analyst has selected a particular time).
However, the data description document states that data only becomes available every 5 minutes, due to batching caused by the server configuration. As pre-processing of the data takes under 5 minutes, there seems to be no real disadvantage to simply reloading the tool each time a new batch of reports is released.
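The trailing ~3-day window described above could be maintained in a streaming deployment with a simple filter applied after each batch reload. A sketch, assuming reports are held as `(received_at, record)` pairs (an illustrative representation, not the tool's actual data model):

```python
from datetime import datetime, timedelta

def trailing_window(reports, now, days=3):
    """Keep only reports received in the last `days` days.

    `reports` is an assumed list of (received_at, record) pairs; this
    would be called after ingesting each 5-minute batch, so the temporal
    views always show the most recent ~3 days.
    """
    cutoff = now - timedelta(days=days)
    return [(t, r) for t, r in reports if t >= cutoff]
```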