Mini-Challenge 3
Johannes Knittel, University of Stuttgart, johannes.knittel@vis.uni-stuttgart.de PRIMARY
Steffen Koch, University of Stuttgart, steffen.koch@vis.uni-stuttgart.de
Thomas Ertl, University of Stuttgart, thomas.ertl@vis.uni-stuttgart.de
Student Team: NO
Self-developed tool for extracting quotes [1], adapted for the VAST challenge
[1] Knittel, J., Koch, S. & Ertl, T. (2019). Interactive Hierarchical Quote Extraction for Content Insights. In J. Madeiras Pereira & R. G. Raidou (eds.), EuroVis 2019 - Posters. The Eurographics Association. ISBN: 978-3-03868-088-8
Approximately how many hours were spent working on this submission in total?
80
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2019 is complete? YES
Video
Questions
The City has been using Y*INT to communicate with its citizens, even post-earthquake. However, City officials need additional information to determine the best way to allocate emergency resources across all neighborhoods of St. Himark. Your task, using your visual analytics on the community Y*INT data, is to determine the types of problems that are occurring across St. Himark. Then, advise the City on how to prioritize the distribution of resources. Keep in mind that not all sources on Y*INT are reliable, and that priorities may change over time as the state of neighborhoods also changes.
1 Using visual analytics, characterize conditions across the city and recommend how resources should be allocated at 5 hours and 30 hours after the earthquake. Include evidence from the data to support these recommendations. Consider how to allocate resources such as road crews, sewer repair crews, power, and rescue teams. Limit your response to 1000 words and 12 images.
Our tool [1] analyzes micro-documents to extract shortened quotes that represent common patterns within the data set. If 100 users post a message that starts with “i love ...”, for instance, we can aggregate these messages to the pattern “[...] i love [...]”. Our assumption is that posts about similar things often share chunks of words. Preserving the order of these chunks helps to make sense of the content. Our method is threshold-based, i.e. it will find patterns that occur at least as often as some adjustable threshold. If the analyst double-clicks on an item, more fine-grained quotes matching the clicked pattern are extracted using a lower threshold, all the way down to individual documents.
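The threshold idea and the drill-down step can be illustrated with a deliberately simplified sketch. It is not the algorithm of [1]: contiguous word n-grams stand in for the tool's ordered chunks, and the function names, the `max_len` parameter and the sample posts are assumptions made for illustration only.

```python
from collections import Counter

def frequent_chunks(posts, threshold, max_len=3):
    """Return word chunks (n-grams up to max_len words) that occur in at
    least `threshold` posts. Simplified stand-in for the pattern extraction."""
    counts = Counter()
    for post in posts:
        words = post.lower().split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(tuple(words[i:i + n]))
        counts.update(seen)  # each post counts at most once per chunk
    return {" ".join(c): k for c, k in counts.items() if k >= threshold}

def drill_down(posts, chunk, lower_threshold, max_len=3):
    """Mimic a double-click: keep only posts containing `chunk` and
    re-extract chunks with a lower threshold."""
    needle = f" {chunk} "
    subset = [p for p in posts if needle in f" {' '.join(p.lower().split())} "]
    return frequent_chunks(subset, lower_threshold, max_len)

# Hypothetical sample posts
posts = ["i love pizza", "I love pasta", "we love pizza", "hello world"]
top_level = frequent_chunks(posts, threshold=2)
details = drill_down(posts, "i love", lower_threshold=1)
```

With a threshold of 2 the top level keeps only the shared chunks such as "i love", while drilling down with a threshold of 1 reaches individual posts such as "i love pizza".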
We’ve updated our tool to support the Y*INT data set. We added a timeline that shows the distribution of messages over time, and we now parse the location (i.e. district) of each message to display the (relative) number of messages per district using both a bar chart and a shaded city map.
After loading the data set the initial view looks like this:
We have parsed 41862 posts and extracted patterns (words, phrases, connected phrases, …) that occur at least 125 times (current threshold). In this top-level view the quotes are ranked according to the number of unique posts and the likelihood of the word constellation. This likelihood is calculated using a simple language model that we have built from millions of tweets. We assume that text patterns that occur unusually often have a higher chance of containing interesting information. On the Y*INT data set this leads to some artefacts such as high-ranked misspellings of “the” (“thgehe”, “tehhe”, …), because these ‘variants’ usually don’t occur on real-life Twitter. Furthermore, we experienced some encoding issues even though we tried several common (western) character encodings, so we assume that this is a deliberate attempt to introduce some data quality challenges.
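How the two ranking criteria are combined is not spelled out here, so the following is only a sketch under stated assumptions: a smoothed unigram model built from a tiny made-up background corpus replaces the tweet-based language model, and the score is simply the number of unique posts multiplied by the average per-word surprisal. All names are hypothetical.

```python
import math
from collections import Counter

def build_unigram_model(background_texts):
    """Return a log-probability function from a background corpus
    (assumption: add-one smoothed unigram model)."""
    counts = Counter(w for t in background_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen words

    def log_prob(word):
        return math.log((counts[word] + 1) / (total + vocab))

    return log_prob

def surprisal(pattern, log_prob):
    """Average negative log-probability per word of a non-empty pattern."""
    words = pattern.lower().split()
    return -sum(log_prob(w) for w in words) / len(words)

def rank_patterns(unique_counts, log_prob):
    """Sort patterns by unique-post count times surprisal (assumed scoring)."""
    scored = {p: n * surprisal(p, log_prob) for p, n in unique_counts.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical background corpus and pattern counts
log_prob = build_unigram_model(["the cat sat on the mat", "the dog ate the food"])
ranking = rank_patterns({"the": 100, "thgehe": 100}, log_prob)
```

Because "thgehe" never appears in the background corpus, it is far more surprising than "the" and ranks first at equal post counts, which mirrors the misspelling artefact described above.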
To the left in each row the number of matching posts is displayed, and the thin bar in darker gray represents the proportion of unique posts. This helps in determining the validity of certain statements. On the bottom there is a histogram (magenta) showing the number of posts per hour. The legend indicates the date of important bars (first one, peaks). To the right, the bar chart in dark-orange visualizes the number of posts per district. To make this more accessible, we also accordingly shade the districts on the city map. The views are linked, i.e., if the analyst double-clicks on a pattern to retrieve more fine-grained quotes, the charts and map are updated accordingly to reflect the currently selected subset of the data.
The timeline and location bar chart already reveal interesting aspects of the data set: the day-night patterns, some irregularities starting with the third day, that the most messages per hour were published on April 08 between 2 and 3pm, that hardly any message was sent from Wilson Forest, and that only very few messages are not associated with a location (we collect them in location “20”).
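The aggregation behind the timeline and the district bar chart, including an optional time window, can be sketched as follows. The tuple layout `(timestamp, district, message)`, the timestamp format and the sample rows are assumptions for illustration, not the tool's actual data model.

```python
from collections import Counter
from datetime import datetime

def summarize(posts, start=None, end=None):
    """Count posts per hour and per district, optionally restricted to
    the half-open time window [start, end)."""
    per_hour, per_district = Counter(), Counter()
    for timestamp, district, _message in posts:
        t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        if start is not None and t < start:
            continue
        if end is not None and t >= end:
            continue
        per_hour[t.replace(minute=0, second=0)] += 1  # bucket by full hour
        per_district[district] += 1
    return per_hour, per_district

# Hypothetical sample rows (dates are placeholders)
sample = [
    ("2020-04-08 08:36:00", "Downtown", "x"),
    ("2020-04-08 08:50:00", "Downtown", "y"),
    ("2020-04-08 14:10:00", "Old Town", "z"),
]
per_hour, per_district = summarize(
    sample, start=datetime(2020, 4, 8, 8), end=datetime(2020, 4, 8, 13)
)
```

Calling `summarize` again with a different window corresponds to narrowing or shifting the time range in the linked views.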
Looking at the top-level patterns we can also deduce some initial findings: there seem to be fatalities, people begin to panic about groceries, the city’s evacuating, some have trouble finding the disaster shelters, and there are a bunch of messages that apparently consist of randomly concatenated words. (Less importantly, there are also neighbor issues, just like everywhere.) Approaches using n-grams may run into trouble with this kind of noise, but because our method is able to capture relatively long patterns, analysts can quickly see whether the context is relevant and contains useful information.
We first want to know when the earthquake actually happened. Hence, we double-click on the “... earthquake …” pattern and then on the “...earthquake … st himark …” pattern. The resulting messages suggest that there was a mild quake on April 6 that does not seem to be relevant here, and a major one with expected damage on April 8 at around 8am:
To further validate that hypothesis we look at messages from potential eyewitnesses and the temporal evolution of messages concerning food stocking:
These findings confirm the hypothesis that the relevant major earthquake happened around April 8 at 8am.
To better assess the initial situation as requested, we narrow the time window to the first five hours (08:00-13:00) after the quake:
Several expressions regarding earthquake-related symptoms are mainly sent from areas around Downtown (1,2,6,15), indicating that these could be the regions most severely affected, where rescue efforts should focus. However, this should be taken with a grain of salt (as always), because people in other regions may no longer have internet access or may have died. One exemplary result concerning “moving”:
Except for the Himark Bridge, every bridge is closed, basically cutting off the island from the mainland. Reopening the bridges should be a top priority for road crews to ensure the flow of supplies and enable help from the mainland:
However, this is apparently already underway, as at least bridges A and B seem to be open shortly afterwards for emergency operations:
The location patterns do not show many earthquake-related observations from district 4, but because of the nuclear power plant extra caution is needed. Some damage is reported, but the plant has shut down:
People complain about missing power and there are many reports about collapsed buildings in various districts, especially in the north-west and around Downtown:
Besides fixing the bridges to allow for external supplies and help, rescue teams should be sent as early as possible to find trapped and potentially injured people within damaged buildings. Unfortunately, due to the number of affected regions, it will be difficult to assign rescue workers to every affected place. Some locals already started community-based efforts.
We now shift the time window to 04-08 13:00 - 04-09 14:00. Many people, especially in Southwest, are desperately trying to find shelters or report that they are really crowded. Officials should clearly communicate where shelters can be found and whether they are already at capacity. There are multiple reports that the city’s evacuating, which seems challenging with the road and bridge situation. Reports about casualties differ a lot and range from 9 to around 600 fatalities. People feel “alone right now” due to missing and dead relatives.
The hospitals cry for help, because they have trouble doing their job due to piles of brick and “running out of critical medical supplies”. All hospitals seem to be (partly) affected according to the locations of the messages:
Furthermore, people wonder why the fire station is closed.
For about 80%, internet and power are restored at around 7am on the day after the earthquake, but people are reminded to reduce the load on the system and to stop uploading videos etc.
Around 5 hours after the earthquake, severe damage to the water and sewer pipes is reported in neighborhoods 4, 8, 9, 10 and 14, and people are urged to drink bottled water or boil their water:
2 Identify at least 3 times when conditions change in a way that warrants a re-allocation of city resources. What were the conditions before and after the inflection point? What locations were affected? Which resources are involved? Limit your response to 1000 words and 10 images.
Initially, all major bridges were closed. Some emergency operations were allowed on bridges A and B shortly afterwards, but hospitals were still running out of medicines. Five hours later Magritte (A) was opened, and on April 9 the Jade and Friday bridges. This means that road crews initially working on the bridges can now join efforts in cleaning up the island itself to enable crucial transports, for instance for the hospital.
This is especially important, because on April 9 at 3pm there are reports that rescue efforts had to be suspended “until we can determine the stability of the rubble”, mainly in districts 5, 6, 13, 15 and 19.
The damage to the pipes was reported around 5 hours after the earthquake, and on April 9 the department of health announced that the “extent of damage [...] is more than we initial[ly] thought”. The damage requires urgent action by sewer repair teams to avoid the spread of diseases and to provide drinking water for those without access to power (for boiling) or bottled water. Depending on the infrastructure of the nuclear power plant this could also turn out to be extremely critical, because the plant uses water for cooling and it usually takes a while until the nuclear chain reactions largely stop after a controlled shutdown. Furthermore, the prospect of contaminated water also causes problems for the hospitals. Around Southton “... all patients in the neonatal unit are being evacuated to other units …”. There are also reports about food poisoning.
For districts 1, 2, 5, 6, 16 and 19 there are indications that sewer service can be restored on April 10:
Rescuing people trapped in buildings is obviously important and time-critical. First, the elephants from the circus were ‘used’ to “move heavy debris” (resulting in complaints that “this is coercion and clearly unethical”). Bigger construction equipment as well as the rescue advance team arrived April 9, but they first have to get an overview of the situation to plan for the arrival of 12 building rescue teams.
Those teams finally arrived on April 10, two days after the earthquake.
Reports are coming in about a missing “famous singer”, but after one day there is good news that she is ok:
Reports by local media about heavy rain “that should develop on the morning of the 10th” are worrying, because rain would make rescue efforts even more difficult.
People camp in the park, which is prohibited. In the beginning, the authorities still enforce this rule, but later on they change their minds, which seems highly reasonable given the circumstances:
Following that policy, some move from the shelter to camp in the park.
3 Take the pulse of the community. How has the earthquake affected life in St. Himark? What is the community experiencing outside the realm of the first two questions? Show decision makers summary information and relevant/characteristic examples. Limit your response to 800 words and 8 images.
The events motivated many people in the affected regions to download and start using the Rumble App:
Others have had enough and are moving away:
Radiation monitoring by HSS is generally well received:
Celebrities support St. Himark: Lacki performs a free concert and calls for a “moment of silence for all those and their animals who have been negatively affected”. There are also too many donations to be processed:
Politicians have a hard time: people express their frustration about the mayor and about the “red tags” on their homes after building inspection:
Some continue to enjoy life in the affected regions as if nothing had happened:
4 The data for this challenge can be analyzed either as a static collection or as a dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data - as a static collection or a stream. How do you think this choice affected your analysis? Limit your response to 200 words and 3 images.
We developed our tool to support live streaming from Twitter, but for this challenge we opted to analyze the data as a static collection, with the possibility to narrow down the time range. The streaming mode may lead to a certain delay in detecting important patterns, because there has to be a number of posts talking about similar things to trigger the threshold. On Twitter, this usually happens pretty fast, but the size of this data set is rather small for real-time observations with only around 7 posts per minute on average (analysts could just read all of them).
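The delay argument can be made concrete with a back-of-the-envelope estimate. The sketch assumes a constant posting rate and a fixed share of posts matching a pattern. The function name and the 10% share are made up for illustration. The rate of 7 posts per minute and the threshold of 125 are the figures quoted for this data set.

```python
def minutes_until_detection(threshold, posts_per_minute, matching_fraction):
    """Expected minutes until `threshold` matching posts have arrived,
    assuming a constant rate and a constant share of matching posts."""
    return threshold / (posts_per_minute * matching_fraction)

# Top-level threshold of 125, 7 posts per minute, assumed 10% matching posts
slow = minutes_until_detection(125, 7, 0.10)

# A lower, hypothetical streaming threshold of 10 under the same assumptions
fast = minutes_until_detection(10, 7, 0.10)
```

Under these assumptions the top-level threshold would only trigger after roughly three hours, whereas a threshold of 10 would trigger after about a quarter of an hour, which is why a streaming setup on such a small data set would need much lower thresholds.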
Analyzing the messages statically makes it more challenging to perceive how events unfolded over time, but it is easier to see the whole picture. One is less likely to be “fooled” by some initial reports, because one can see the total number of (re)posts and related items that may contradict those reports. Furthermore, on a bigger data set word and pattern counts become more useful to assess the overall significance and to what extent different regions are affected.