Entry Name: "UKON-Arjun-MC1"

VAST Challenge 2019
Mini-Challenge 1

 

 

Team Members:

Arjun Majumdar, University of Konstanz, arjun.majumdar@uni-konstanz.de PRIMARY
Gent Ymeri, University of Konstanz,
gent.ymeri@uni-konstanz.de
Sebastian Strumbelj, University of Konstanz, sebastian.strumbelj@uni-konstanz.de
Juri Buchmueller, University of Konstanz, buchmueller@dbvis.inf.uni-konstanz.de
Udo Schlegel, University of Konstanz, schlegel@dbvis.inf.uni-konstanz.de
Prof. Dr. Daniel Keim, University of Konstanz, keim@uni-konstanz.de

Student Team: YES

 

Tools Used:

Tableau

Python

PostgreSQL

Flask

JavaScript

D3

sklearn

pandas

matplotlib

numpy

LightGBM

XGBoost

hyperopt

scipy

Psycopg2

Earthquake Emergency Response Resolution Tool, developed by the University of Konstanz Visual Data Analysis class, taught Spring 2019

 

Approximately how many hours were spent working on this submission in total?

500

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2019 is complete? YES

 

Video

https://youtu.be/sivii5cKNX8

 

Host

https://vast19.dbvis.de/vast-mc1/

 

 

Questions

1 Emergency responders will base their initial response on the earthquake shake map. Use visual analytics to determine how their response should change based on damage reports from citizens on the ground. How would you prioritize neighborhoods for response? Which parts of the city are hardest hit? Limit your response to 1000 words and 10 images.


At the heart of our tool is a substantial machine learning pipeline whose results are mapped onto the shake map, which is color coded to rank the city's locations from most to least damaged. To measure the overall damage at a location, we computed an Estimated Damage metric. To compute it, we trained several competing ML models and picked the one with the best model metrics (accuracy, precision and recall). Since we did not know which attribute should serve as the target variable, we brute-forced every existing variable as the target. The target attribute yielding the highest model metrics is location, and the best-performing ML model is the LightGBM classifier, tuned with hyper-parameter optimization: Bayesian optimization using the Tree-structured Parzen Estimator, followed by RandomizedSearchCV and finally GridSearchCV, to obtain the 'best' parameters for the dataset. This model was then used to generate feature importance scores for each attribute in the dataset.
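The target brute-force step can be sketched as follows. This is a minimal sketch on synthetic data; the column names are assumptions based on the challenge dataset, and sklearn's RandomForestClassifier stands in for the lightgbm.LGBMClassifier actually used:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # stand-in for lightgbm.LGBMClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the citizen report data; column names are assumptions.
rng = np.random.default_rng(0)
cols = ["location", "medical", "shake_intensity", "sewer_and_water",
        "buildings", "power", "roads_and_bridges"]
df = pd.DataFrame({c: rng.integers(0, 5, size=300) for c in cols})

# Brute-force every attribute as the target; keep the one the model predicts best.
scores = {}
for target in cols:
    X, y = df.drop(columns=[target]), df[target]
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores[target] = cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()

best_target = max(scores, key=scores.get)
```

On the real data this loop, combined with hyperopt's TPE search, RandomizedSearchCV and GridSearchCV, selects both the target attribute and the model parameters.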

The original dataset had many missing values across the attributes medical, shake_intensity, sewer_and_water and buildings. To impute them, instead of using matrix completion techniques, we used an ML model in which the attribute being imputed serves as the target. To split the data into training and test sets, the rows with incomplete data became the test set, to be predicted by the ML model, while the rows with complete data became the training set on which the model was trained. In our experiments, LightGBM and XGBoost outperformed the Random Forest classifier, so the LightGBM classifier was used for data imputation.

An important lesson from the imputation step is that, while imputing a particular attribute, we should not include attributes already imputed by a previous model, as this biases the results. For example, when imputing the medical attribute, we should not include values already imputed for other attributes such as buildings.
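The impute-by-prediction scheme, including the rule of only using original (never-imputed) attributes as features, can be sketched like this. Synthetic data and attribute names are illustrative assumptions, and RandomForestClassifier again stands in for the LightGBM classifier:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # stand-in for lightgbm.LGBMClassifier

# Synthetic reports with simulated gaps in the 'medical' attribute.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "shake_intensity": rng.integers(0, 10, 200).astype(float),
    "sewer_and_water": rng.integers(0, 10, 200).astype(float),
    "medical": rng.integers(0, 10, 200).astype(float),
})
df.loc[rng.choice(200, 40, replace=False), "medical"] = np.nan

def impute_column(df, target, feature_cols):
    """Rows with a known target form the training set; rows where it is
    missing form the test set. feature_cols must contain only original,
    never-imputed attributes, to avoid biasing the result."""
    known = df[target].notna()
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(df.loc[known, feature_cols], df.loc[known, target].astype(int))
    filled = df[target].copy()
    filled[~known] = clf.predict(df.loc[~known, feature_cols])
    return filled

df["medical"] = impute_column(df, "medical", ["shake_intensity", "sewer_and_water"])
```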

 

The slider at the bottom represents the mean of the Shake Intensity attribute of the dataset.


The slider and the Set time window (in min) text box in our visualization define a From & To time window that the user can change dynamically. To generate the final ranking, the Estimated Damage metric is aggregated per location over the selected time window, and the locations are ranked in descending order of damage, which is conveyed on the shake map through color coding. This is how the prioritization of the neighborhoods is done.
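The per-window aggregation and ranking can be sketched as follows; the toy reports, column names and the use of the mean as the aggregate are assumptions for illustration:

```python
import pandas as pd

# Toy reports; timestamps follow the challenge's fictional April 2020 timeline.
reports = pd.DataFrame({
    "time": pd.to_datetime(["2020-04-06 05:41", "2020-04-06 05:45",
                            "2020-04-06 05:50", "2020-04-06 06:10"]),
    "location": [3, 8, 8, 3],
    "estimated_damage": [4.0, 9.0, 7.0, 2.0],
})

def rank_locations(reports, t_from, t_to):
    """Aggregate Estimated Damage per location inside the From & To window
    and rank the locations in descending order of damage."""
    win = reports[(reports["time"] >= t_from) & (reports["time"] < t_to)]
    return (win.groupby("location")["estimated_damage"]
               .mean()
               .sort_values(ascending=False))

ranking = rank_locations(reports,
                         pd.Timestamp("2020-04-06 05:40"),
                         pd.Timestamp("2020-04-06 06:00"))
```

The resulting ordering is what drives the red-to-green color coding of the shake map.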

The locations hardest hit are 8, 3, 9, 14 and 2. 

The image above shows a 20-minute time window from April 6th 05:40 until 06:00. The neighborhoods are prioritized as follows: Safe Town, East Parton, Downtown, Southton, Broadview and so on. Emergency responders should respond to the five hardest-hit locations, which in this case are Safe Town, East Parton, Downtown, Southton and Southwest. Looking closer at Safe Town, the reports from citizens on the ground suggest that Buildings are hit hardest, then Power and Medical, followed by Roads and Bridges.

In this image the neighborhoods are prioritized with Weston hit hardest, followed by West Parton, Old Town and so on. The response should change accordingly: responders need to attend first to Weston, then West Parton, and so on. The primary concern in this time window is Weston, which is affected the most.

Here, 25 minutes later, the response should shift away from Weston, which has dropped to 6th position, towards Downtown, the hardest hit in this time window, followed by West Parton, Southwest, Northwest and so on.

Moving on, 35 minutes later with a 20-minute time window (April 6th 23:25 until 23:45), different locations need the most help, starting with Northwest, Oak Willow, West Parton, Safe Town and so on.

 

Moving ahead in the time window, on April 8th from 02:55 until 03:15 the situation changes: the neighborhoods are now greener. The five hardest-hit locations are Pepper Mill, Southwest, Northwest, Chapparal and Downtown, so the rescue teams should prioritize these neighborhoods for the time being. In Northwest, people mostly need Medical help, followed by Roads and Bridges and Electricity.

 

Furthermore, in this image conditions change again and become worse than in the previous one. Some neighborhoods, such as Old Town, Scenic Vista, Broadview and Chapparal, are not reporting at all, whereas East Parton is listed on top as the hardest hit. We also notice a rise in the average Shake Intensity.

 

Approximately 30 minutes later, conditions become 'greener', as shown in the following image:

East Parton drops to 3rd place and Oak Willow rises to 1st, but overall conditions are considerably better than in the previous hour.

 

Approximately 5 hours later the scenario differs: some neighborhoods have turned 'dark red', with East Parton at the top. The three hardest-hit locations therefore need immediate help.

 

Next day:

From April 9th 11:45 until 12:05, conditions have changed: the neighborhood needing the most help is Palace Hills, followed by Easton, Southwest and so on.

 

Lastly, towards the end, from April 10th 22:50 until 23:10, Downtown needs the most help, followed by Palace Hills and so on. The east of the city seems to be okay.

2 Use visual analytics to show uncertainty in the data. Compare the reliability of neighborhood reports. Which neighborhoods are providing reliable reports? Provide a rationale for your response. Limit your response to 1000 words and 10 images.


Our tool computes and quantifies uncertainty in an easy-to-understand manner. We plot the uncertainty on top of each neighborhood in the existing visualization to keep things simple.

We encode uncertainty using 'Extrinsic Uncertainty Visualization' [1], which uses the occlusion metaphor: in our visualization, the more occluded an area is, the more uncertain it is, and vice versa.

 

We again used machine learning models to compute the feature importance of the 6 attributes for each of the 19 locations: for each location, the attribute yielding the highest model metrics (accuracy, precision and recall) was chosen as that location's target attribute.

For example, for locations 1, 3 and 13, the target attributes are medical, shake_intensity and medical, respectively.

We used LightGBM gradient boosting classifier to compute the feature importance rankings for the 6 attributes within each location.

During this ML model training, we found the following observations for the 19 locations:

  1. The most common target attribute was medical, occurring in 13 locations, followed by shake_intensity, occurring in 3 locations, and roads_and_bridges and power, each occurring once.
  2. Locations 5, 6, 9, 11, 16 and 17 have low model metric scores in terms of accuracy, precision and recall, hinting that for these locations the ML model was unable to capture the underlying patterns needed to predict the target attribute. This may mean that the data from these locations is unreliable or varies a lot.

 

The mathematical model we have used to capture and compute uncertainty is as follows:

  • Coefficient of variation (CV):

“The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the standard deviation to the mean (average).

For example, the expression “The standard deviation is 28% of the mean” is a CV.

The CV is particularly useful when you want to compare results from two different surveys or tests that have different measures or values.

For example, if you are comparing the results from two tests that have different scoring mechanisms. If sample A has a CV of 12% and sample B has a CV of 25%, you would say that sample B has more variation, relative to its mean.”[2]

  • Medians for each of the 6 attributes for a location
  • The uncertainty in a given time window t at a location loc is computed as follows:

In words: the median (chosen because it is very robust against outliers) and the Coefficient of Variation of each attribute are computed for the selected time window, while the feature importance of each attribute within a location stays fixed. These per-attribute terms are summed to combine all values into one final score. Different value distributions can result in the same sum, but these can be viewed in detail on demand in the radial bar charts for each location. Finally, the result is multiplied by the number of entries in the given time window for each location, because the more data we have, the more confident our statements about it become. This yields a non-uniform distribution of the uncertainty values, which would not be possible without the entry count, as shown in the following graph.
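One plausible reading of this score can be sketched in code. Note that the exact combination of the terms is an assumption, since the formula is only described in words:

```python
import numpy as np

def uncertainty(values_by_attr, importance):
    """values_by_attr: dict attr -> report values in the time window.
    importance: dict attr -> fixed per-location feature importance.
    Assumed combination: sum over attributes of importance * median * CV,
    scaled by the number of entries in the window."""
    n = sum(len(v) for v in values_by_attr.values())
    total = 0.0
    for attr, v in values_by_attr.items():
        v = np.asarray(v, dtype=float)
        if v.size == 0 or v.mean() == 0:
            continue  # CV is undefined for an empty window or a zero mean
        cv = v.std() / v.mean()  # coefficient of variation
        total += importance[attr] * np.median(v) * cv
    return n * total
```

With constant report values the CV is 0, so a stream of identical reports contributes no uncertainty under this reading.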

According to our computations, the top 5 neighborhoods ranked by decreasing uncertainty are: 8, 3, 9, 14 and 2.

Neighborhood 8 is the most uncertain, followed by 3 (20% less than 8), then 9 (23% less than 3), then 14 (34% less than 9) and finally 2 (19% less than 14).



To make it simpler for the rescue teams, we show the results in 3 bins: 1) certain, 2) medium and 3) uncertain. This makes comparing the reliability of reports between neighbourhoods easier. As shown in the following figure for the time interval April 7th from 12:40 until 13:40, the neighbourhoods Old Town, Northwest, Downtown, Weston, Southwest, Southton, Pepper Mill and Broadview provide the most reliable data, followed by Easton and Scenic Vista. The remaining neighborhoods are the most uncertain, excluding Wilson Forest, which provides no data or only one entry (denoted by a light gray color).
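The three-way binning of uncertainty scores can be done, for instance, with pandas; the per-neighborhood scores below are made-up values for illustration:

```python
import pandas as pd

# Hypothetical per-neighborhood uncertainty scores (illustrative values only).
scores = pd.Series({"Old Town": 0.10, "Northwest": 0.20, "Easton": 0.45,
                    "Scenic Vista": 0.55, "Palace Hills": 0.85, "Broadview": 0.95})

# Three equal-width bins from least to most uncertain.
bins = pd.cut(scores, 3, labels=["certain", "medium", "uncertain"])
```

Equal-width bins are one choice; quantile bins (pd.qcut) would instead balance the number of neighborhoods per class.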

Next, another image shows a 20-minute time window, April 9th 17:50 until 18:10. During this window most neighborhoods are reliable (certain), whereas Safe Town, Easton, Pepper Mill, Weston and Southton provide reports of medium reliability. Broadview, on the other hand, provides the least reliable reports (uncertain). Lastly, Old Town and Scenic Vista have no data entries for this time window.

3 How do conditions change over time? How does uncertainty in the data change over time? Describe the key changes you see. Limit your response to 500 words and 8 images.

As can be seen from the first reports received from the people of St. Himark, everything is good in the beginning, but about half of the locations then become uncertain. On April 6th 2020 at around 08:00, the medical, roads and bridges, and sewer and water attributes are the most affected, while the buildings attribute becomes affected over time.

In addition to these observations, the southern part of the city is more certain than the northern part. As time passes, neighborhoods become more uncertain. Before the first earthquake, almost all neighborhoods are uncertain, and the power and roads and bridges attributes are the most affected. During the first earthquake, every neighborhood is green in terms of estimated damage and thereby seems to be fine.

Right after an earthquake, in the period from April 6th 2020 20:40 until 21:00, conditions change more dramatically. Some neighborhoods turn red and orange, with every attribute affected except shake intensity. 10 locations provide uncertain data, 3 locations are between certain and uncertain (medium), 3 locations are certain, and another 3 locations provide no data at all.

A bit later, from April 6th 23:30 until 23:50, more neighborhoods are affected in attributes such as Roads and Bridges, Medical, Power, Sewer and Water and Buildings, with the exception of Shake Intensity. In terms of uncertainty, 6 neighborhoods provide certain data, 3 have medium uncertainty, 5 are uncertain and 4 provide no data.

Further on, from April 7th 9:00 until 9:20, the situation differs: neighborhoods turn green, with all attributes only slightly affected except Shake Intensity. 7 neighborhoods provide no data at all, 3 are certain, 3 are medium and 6 are uncertain.

4 The data for this challenge can be analyzed either as a static collection or as a dynamic stream of data, as it would occur in a real emergency. Describe how you analyzed the data - as a static collection or a stream. How do you think this choice affected your analysis? Limit your response to 200 words and 3 images.

We chose to analyze the data as a static collection, as this approach grants us a variable time window, which would not be possible in a stream-based approach. The other reason is that in a stream-based approach there can be very little data in a given time window, which makes training ML models meaningless; for example, it does not make sense to train a model on only 15 data points.

As seen in the image, we use a time window of 20 minutes, which makes analysis easier because we fetch more data than we would in, for example, a 5-minute batch, where many locations report no data at all. The shorter the time interval, the less data we get.

One technique for approximating the actual value(s) in a data-stream query is to evaluate the query not over the entire history of the stream, but over a sliding window of the most recent data.
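Such a sliding-window evaluation can be sketched like this; the sketch is purely illustrative, since we ultimately chose the static approach:

```python
from collections import deque

class SlidingWindowMean:
    """Keep only the most recent `width` seconds of (timestamp, value)
    pairs and answer an aggregate query over that window, instead of
    over the entire stream history."""

    def __init__(self, width):
        self.width = width
        self.buf = deque()

    def push(self, ts, value):
        self.buf.append((ts, value))
        # Evict reports that have fallen out of the window.
        while self.buf and self.buf[0][0] <= ts - self.width:
            self.buf.popleft()

    def mean(self):
        if not self.buf:
            return None
        return sum(v for _, v in self.buf) / len(self.buf)
```

For example, with a 60-second window, a report pushed at t=90 evicts everything received at or before t=30, so the mean reflects only the most recent minute of the stream.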

 

 

References:

[1] Jäckle, D., Senaratne, H., Buchmüller, J., & Keim, D. A. (2015). Integrated Spatial Uncertainty Visualization using Off-screen Aggregation. In EuroVA@ EuroVis (pp. 49-53).

[2] https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/how-to-find-a-coefficient-of-variation/