Student Team: YES
KNIME
Tableau
Gensim (Python library for natural language processing)
Natural Language Toolkit (NLTK) for Python
Python 3: Matplotlib, Pandas
D3.js
Approximately how many hours were spent working on this submission in
total?
500
May we post your submission in the Visual Analytics Benchmark
Repository after VAST Challenge 2019 is complete? YES
Video
YouTube video: IEEE VAST Challenge 2019 MC3 – RescueMark | Walk-through
Web tool:
https://vast19.dbvis.de/vast-mc3/
Questions
The City has been using
Y*INT to communicate with its citizens, even post-earthquake. However, City
officials need additional information to determine the best way to allocate
emergency resources across all neighborhoods of St. Himark.
Your task, using your visual analytics on the community Y*INT data, is to
determine the types of problems that are occurring across St. Himark. Then, advise the City on how to prioritize the
distribution of resources. Keep in mind
that not all sources on Y*INT are reliable, and that priorities may change over
time as the state of neighborhoods also changes.
1 – Using visual analytics, characterize conditions
across the city and recommend how resources should be allocated at 5 hours and
30 hours after the earthquake. Include
evidence from the data to support these recommendations. Consider how to allocate resources such as
road crews, sewer repair crews, power, and rescue teams. Limit your response to
1000 words and 12 images.
METHODOLOGY
A characteristic of text data, especially from social media, is that it contains a great deal of noise and inaccuracy. We therefore had to apply an extensive set of data preprocessing and data mining methods to extract meaningful information from the given dataset. First, we used a spell-checking algorithm to correct out-of-dictionary words contained in some messages. Correcting misspellings and controlling for different languages led to a cleaner dataset. However, it still included a large share of unrelated messages that could distort the results. One example is the message “appealing, trembling. What better deals can you ask for?”. It would be misleading to attribute the word “trembling” to the earthquake here because it is used in a completely unrelated context, namely advertising. To minimize such distortions, we transformed the messages into word embeddings using fastText and applied k-means clustering to remove clusters of unrelated messages from the dataset. The last processing step was to assign messages to categories. For this, we applied the Latent Dirichlet Allocation (LDA) algorithm, which assigns a topic to each message. For the top 20 topics, the 24 highest-weighted words were selected as query words for message retrieval in the fastText vector space.
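The noise-filtering step described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the two-dimensional vectors are hand-made stand-ins for fastText embeddings, the function names `kmeans` and `drop_clusters` are hypothetical, and the decision about which cluster counts as "unrelated" is assumed to come from an analyst inspecting example messages.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


def drop_clusters(texts, labels, noise_labels):
    """Keep only messages whose cluster was not flagged as unrelated."""
    return [t for t, l in zip(texts, labels) if l not in noise_labels]


# Toy data: (message, made-up embedding). Real embeddings have many more dimensions.
messages = [
    ("bridge closed after the quake", [0.90, 0.10]),
    ("power is out in Weston", [0.80, 0.20]),
    ("need water near the shelter", [0.85, 0.15]),
    ("appealing, trembling. What better deals can you ask for?", [0.10, 0.90]),
    ("huge sale this weekend", [0.15, 0.95]),
]
texts = [t for t, _ in messages]
labels = kmeans([v for _, v in messages], k=2)

# Assumption: an analyst flags the cluster containing the advertisement (index 3).
noise = {labels[3]}
kept = drop_clusters(texts, labels, noise)
```

In this toy setup the two advertising messages end up in one cluster and are removed together, which is the effect the clustering step is meant to achieve on the full dataset.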
VISUAL ANALYTICS TOOL
Our visual analytics tool RescueMark enables emergency responders to monitor the ten major resources and threats in the community. These are food, gas, medical assistance, nuclear radiation, power supply, rescue teams, transport infrastructure, sewer system, shelters and housing, and volunteers. The sunburst diagram in the left section of the dashboard displays the percentage of messages related to these categories. By default, the distribution for the entire city is plotted. It can optionally be filtered by district.
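The percentages behind the sunburst diagram amount to a simple share computation. The sketch below assumes each message has already been assigned a district and one resource category; the function name `category_shares` and the toy records are illustrative only.

```python
from collections import Counter


def category_shares(messages, district=None):
    """Percentage of messages per category, optionally for one district.

    messages: iterable of (district, category) pairs.
    """
    selected = [c for d, c in messages if district is None or d == district]
    total = len(selected)
    if total == 0:
        return {}
    return {c: 100 * n / total for c, n in Counter(selected).items()}


# Made-up records for illustration.
records = [
    ("Weston", "power"),
    ("Weston", "shelter"),
    ("Downtown", "power"),
    ("Weston", "power"),
]
city_wide = category_shares(records)             # default view: entire city
weston_only = category_shares(records, "Weston")  # filtered by district
```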
The districts are accessible via the city map in the center of the dashboard. The color of the districts encodes the number of damage reports sent from there. At the bottom, there is a slider that visualizes the overall message activity of St. Himark over time. Operators can select a specific time range for analysis. Important events are highlighted by the colored dots. Below the sunburst diagram, word clouds allow users to observe trending topics in the form of frequent terms and hashtags. These are also filtered by district if desired. The right panel lists important events for St. Himark. Clicking one of the events shifts the time frame accordingly, so that information related to this point in time is shown. The buttons in the top bar are the access points for the operators to read the original messages. They are divided into messages that were identified as directly related to the disaster situation and those that belong to more general topics. We called the latter “voice of the community”. The district filter also applies to the messages.
Figure 1 - General overview of the data 5 hours after the strongest earthquake
OBSERVATIONS
First, we focused on identifying exact earthquake times in order to detect fatalities, damages, and condition changes within the community. We identified three earthquakes in St. Himark. The first one was a mild earthquake on the 6th of April around 1 pm. The second earthquake happened on the 8th of April around 8 pm and was the strongest and most destructive one. The last earthquake happened on the 9th of April at 2 pm. In the rest of the analysis, we refer to the earthquake of the 8th of April.
The dashboard at 5 hours after the strongest earthquake is shown in figure 1. We can easily detect the most affected districts. The five districts of Weston, Downtown, Northwest, Southton, and Southwest, for example, can be prioritized based on their color. To be more precise, the community was affected by many accidents such as multiple fires while lacking water, communication system overloads, trapped cars, closed bridges, damaged roads, floods, and contaminated water and food. These categories are shown in figure 2.
Figure 2 - General view of St. Himark 5 hours after the strongest earthquake, by resource
When comparing the conditions at 30 hours and 5 hours after the earthquake, we can observe that food and gas problems occurred in addition to the initial issues. For this reason, food distribution and repair crews for the gas leaks are necessary to keep the community safe. By contrast, the need for four other resources decreased: power, rescue, transport, and volunteers. According to community reports, electrical power has been restored to 80% after the earthquake, and the rescue teams have found many missing people, among them the famous singer Lacki Dasical. Furthermore, aid and emergency centers, free food supply, and family assistance centers have been established. Most bridges have reopened but are heavily damaged. For the remaining resources, however, we observe an increase in importance. For example, the need for shelter almost doubles from 11% at 5 hours after the earthquake to 20% 25 hours later. This is due to collapsed buildings such as the High School, Ommic Elementary School, and libraries, and to the many people who lost their homes. At the same time, there is heavy damage to the sewer pipes, water contamination, and flooding in several districts. These factors increase the need for sewer repair crews. Medical needs in the community also increase, so ambulances and first aid teams should be sent to the affected areas. These changes can be seen in figure 3.
Figure 3 - General view of St. Himark 30 hours after the strongest earthquake
Moreover, the total resource needs of districts over time can be seen in figure 4.
Figure 4 - Resource needs of districts in St. Himark over time
2 –
Identify at least 3 times when conditions change in a way that warrants a
re-allocation of city resources. What
were the conditions before and after the inflection point? What locations were affected? Which resources are involved? Limit your
response to 1000 words and 10 images.
METHOD DESCRIPTION
To find the changes in conditions, we applied the following methodology: After retrieving the data that matches the various resource categories, we calculated the aggregated message counts in one-hour bins. We then used this frequency representation to extract events with the help of Kleinberg's burst detection algorithm. We treat peaks that occur in multiple categories at the same time as an indication that conditions are changing. The result of the burst analysis can be observed in figure 5.
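The binning and the "peaks in multiple categories at the same time" rule can be sketched as follows. Note that this sketch deliberately swaps in a simple z-score threshold for the peak test; it is not Kleinberg's burst detection algorithm, which models bursts as state transitions of an automaton. The function names, the threshold `z=1.5`, and the timestamps are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean, pstdev


def hourly_counts(messages):
    """messages: iterable of (timestamp, category). Returns {category: Counter{hour: n}}."""
    counts = defaultdict(Counter)
    for ts, category in messages:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[category][hour] += 1
    return counts


def peak_hours(counter, hours, z=1.5):
    """Hours whose count lies more than z standard deviations above the mean."""
    series = [counter.get(h, 0) for h in hours]
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return set()
    return {h for h, c in zip(hours, series) if (c - mu) / sigma > z}


def simultaneous_peaks(counts, hours, min_categories=2, z=1.5):
    """Hours in which at least min_categories categories peak together."""
    votes = Counter()
    for counter in counts.values():
        votes.update(peak_hours(counter, hours, z))
    return sorted(h for h, n in votes.items() if n >= min_categories)


# Made-up toy data: one message per category and hour, plus a joint spike at hour 5.
hours = [datetime(2020, 4, 8, h) for h in range(10)]
messages = []
for h in hours:
    for category in ("power", "shelter", "food"):
        messages.append((h.replace(minute=15), category))
spike = hours[5]
messages += [(spike.replace(minute=40), "power")] * 5
messages += [(spike.replace(minute=50), "shelter")] * 4

counts = hourly_counts(messages)
change_points = simultaneous_peaks(counts, hours)
```

Requiring agreement between at least two categories keeps a spike in a single category, which may just be noise, from being reported as a change of conditions.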
Figure 5 - Message burst analysis on messages queried with 24 topic terms
OBSERVATION DESCRIPTION
In the aftermath of the strongest earthquake, which happened on the 8th of April at 8:36 am, there are three important turning points in the conditions in St. Himark. The first change of conditions can be observed on the 8th of April at 9:36 am. The second change happened on the 8th of April at 1:00 pm, and the last one on the 9th of April at 9:00 am. These changes are shown in figure 6.
Figure 6 - Labeled event plot with most important filtered events by category
During the first change, we do not observe much variation in the percentage of overall resource needs in St. Himark. However, we observe that the district priorities have changed, which is visible in the chord charts. For example, Palace Hills needs more help and medical assistance for the citizens. There are more sewer and pipe damages in Weston, Downtown and Palace Hills compared to previous conditions. Furthermore, in Old Town, we can observe that power needs are replaced by medical needs.
As for the second condition change, the chord charts of the different districts reveal several changes in resource needs. For example, shelter needs are emerging in the following districts: Safe Town, Old Town, Southwest, Weston, and Palace Hills. Sharp increases in power shortages can also be observed in Pepper Mill, Palace Hills, Downtown, and Weston. In Cheddarford, the water percentage is increasing, as can be seen in figure 7.
Figure 7 - Condition changes in RescueMark
The third condition change is characterized by an increasing shortage of food in most districts. In contrast, transportation infrastructure reports are decreasing because bridges have reopened. To sum up, the general pattern is that resource priorities do not change on the city-wide scale, but their distribution between districts does.
3 – Take
the pulse of the community. How has the
earthquake affected life in St. Himark? What is the
community experiencing outside the realm of the first two questions? Show
decision makers summary information and relevant/characteristic examples. Limit
your response to 800 words and 8 images.
Life in St. Himark has been gravely affected by the crisis. Many citizens are considering moving out of the city because of the traumatizing succession of events. At least 0.6% of the population report that they are going to leave the town permanently. The disaster has caused many fatalities. Few of these numbers, however, are confirmed by official sources, so the number of unconfirmed fatalities is high (about five times the confirmed fatalities). This is due to unreliable reports of missing people and fatalities, as shown in the figure below. The confirmed fatalities amount to 1.6% of the population, whereas the unconfirmed fatalities sum up to 8.5% of the population.
The fatalities were detected through clustering of the messages after transformation into word embeddings using the fastText algorithm. From the respective cluster, we extracted the number of fatalities by analyzing the textual patterns. We found five different patterns in the “fatalities” dataset, as shown in figure 8.
Figure 8 - Fatality report patterns
The patterns reveal that some fatality reports are not formulated very confidently, which is to be expected in chaotic emergency situations. We have incorporated this kind of uncertainty into our tool by separating the data into “confirmed” and “unconfirmed” reports. The two classes represent the upper and lower bounds of the fatality estimate. The “confirmed” reports include messages of patterns 1, 2 and 5. In the case of pattern 1, we have selected the number that is reported by the news. The number given by a named person is categorized into the “unconfirmed” reports. The “unconfirmed” reports can be thought of as rumours or hearsay.
The chart shows the evolution of the summed fatality reports for each class over time.
Figure 9 - Evolution of fatality reports over time, confirmed and unconfirmed
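The split into the two classes and the running sums plotted over time can be sketched as follows. The report tuples are made up, the function name `running_totals` is hypothetical, and the sketch assumes that every report has already been matched to one of the five patterns; treating patterns 3 and 4 as “unconfirmed” is our reading of the rule above, and the special handling of the two numbers in pattern 1 is not modeled here.

```python
from itertools import accumulate

# Per the text, patterns 1, 2 and 5 count as confirmed.
# Assumption: all remaining patterns (3 and 4) count as unconfirmed.
CONFIRMED_PATTERNS = {1, 2, 5}

# Made-up reports: (hours after the earthquake, pattern id, reported fatalities).
reports = [
    (1, 2, 3),
    (2, 3, 10),
    (3, 1, 2),
    (5, 4, 7),
    (6, 5, 1),
]


def running_totals(reports):
    """Return (hours, cumulative confirmed, cumulative unconfirmed)."""
    ordered = sorted(reports)  # chronological order
    hours = [h for h, _, _ in ordered]
    confirmed = [n if p in CONFIRMED_PATTERNS else 0 for _, p, n in ordered]
    unconfirmed = [0 if p in CONFIRMED_PATTERNS else n for _, p, n in ordered]
    return hours, list(accumulate(confirmed)), list(accumulate(unconfirmed))


hours, confirmed_total, unconfirmed_total = running_totals(reports)
```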
The tool RescueMark shows the current summed fatalities for the selected time period in the top bar of the window (see figure 10).
Figure 10 - Summed fatality reports for the selected time period in the top bar of RescueMark
Due to the high number of casualties, the St. Himark community is rallying to support the victims with different social activities. The messages related to social events in St. Himark are shown in the tool in the “Voice of St. Himark” message section, which features important social announcements. To mention a few examples, on the evening of the strongest earthquake on the 8th of April, families and friends of missing people held a vigil at the statue of the founder of the city of St. Himark, Joseph Kibble. On top of that, the famous singer Lacki Dasical and other celebrities offered a free concert on the evening of the 10th of April at Union Hall in Cheddarford. Lacki went missing during the earthquake, which worried many of her fans, but she was rescued.
Figure 11 - Message section: Voice of St. Himark
The different departments of the city of St. Himark also share relevant information with the inhabitants through Y*INT. Official accounts are used to propagate important warnings or announcements to guide the population. The table below shows the account and the number of messages sent through the account.
We determined these accounts by using the provided information about the city of St. Himark to query the account names and extract the number of direct messages sent via official accounts. During the data exploration, we encountered 95 messages that were sent from accounts not disclosing their location for “contractual reasons”. After analyzing this subset of messages, we discovered that these accounts correspond to members of the Himark Science Society (HSS) who sometimes use their personal account names to send messages via Y*INT. Most messages from this subset start by mentioning where the information in the message is coming from. We have counted those messages in the “indirect” message section of official accounts.
Figure 12 - Number of messages per official account
These messages have been summarized in the tool in the “event” section on the right side of the dashboard. They are listed in chronological order and allow selection to filter the time interval according to the time of the announcement. The figure below shows the events from the dashboard.
Figure 13 - Most important events section
4 – The
data for this challenge can be analyzed either as a static collection or as a
dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data - as a
static collection or a stream. How do
you think this choice affected your analysis? Limit your response to 200 words
and 3 images.
We analyzed the data as a static collection. There are two main reasons for this decision. First, the availability of algorithms is much greater in the static case. Among the algorithms we used (e.g. Latent Dirichlet Allocation, fastText, k-means clustering, named entity recognition and logistic regression) only a few are available as online implementations. With a wider range of applicable computational methods at hand, having analyzed the data in this way had a positive effect on the quality of our results.
From a practical perspective, we argue that the static approach is a valid solution because, in a real-world scenario, our product would be trained on historical data before deployment. Once all available data is labeled and the models, such as word embeddings and classifiers, are optimized, our system is ready to handle an incoming data stream. This includes the extraction of relevant messages and assigning these to a resource category, based on the knowledge learned from the training data.
We refrained from simulating such a stream with the provided dataset because it would take away too much training data, considering that the total number of messages (without “retweets”) is only about 13,000.
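The deployment scenario described above, in which a trained system handles incoming messages one at a time, could look roughly like the following sketch. It is an illustration under stated assumptions rather than our implementation: the three-dimensional vectors stand in for fastText embeddings, `CATEGORY_VECTORS` stands in for query vectors learned from historical data, the nearest-vector rule is a simple substitute for a trained classifier, and the threshold of 0.7 is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical category vectors, assumed to be learned before deployment.
CATEGORY_VECTORS = {
    "power": [1.0, 0.0, 0.0],
    "shelter": [0.0, 1.0, 0.0],
    "food": [0.0, 0.0, 1.0],
}


def assign_category(vector, threshold=0.7):
    """Return the most similar category, or None for unrelated messages."""
    best, score = max(
        ((name, cosine(vector, query)) for name, query in CATEGORY_VECTORS.items()),
        key=lambda pair: pair[1],
    )
    return best if score >= threshold else None


def process_stream(stream):
    """Lazily label each incoming (text, vector) pair as it arrives."""
    for text, vector in stream:
        yield text, assign_category(vector)


# Made-up incoming messages with hand-made embeddings.
incoming = [
    ("lights are out again", [0.9, 0.1, 0.0]),
    ("we lost our house", [0.1, 0.8, 0.2]),
    ("great deals today", [0.4, 0.4, 0.4]),
]
labeled = list(process_stream(incoming))
```

Because `process_stream` is a generator, each message is labeled as soon as it arrives, without waiting for the rest of the stream.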