Entry Name: “DPST-Natthawut-MC1”
Team Members:
Natthawut Adulyanukosol, Development and Promotion of Science and Technology Talents Project, na399@cantab.net PRIMARY
Student Team: NO
Tools Used:
Data transformation and analysis: R (bsts, tidyverse, coda, bayestestR, doFuture, zoo), mapshaper (for geographical file conversion), Python (PyMC3 was tried initially but dropped, as R and its libraries proved more suitable for this project)
Visualization: Vega, VSUP, Tableau, ggplot2
Web application: Vue, Nuxt, Vuex, Element UI
Deployment & Hosting: Netlify, Tableau Public
Cloud services: Google Compute Engine, Google Cloud Storage
Version control system (VCS): Git, GitHub
Integrated development environment (IDE): VS Code, RStudio, Jupyter on Colaboratory
Prototyping & Image editing: Sketch
Data sonification: TwoTone
Approximately how many hours were spent working on this submission in total?
200 hours
May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2019 is complete?
YES
Video
4-min summary: https://www.youtube.com/watch?v=s28CX9V8pRI
Data sonification: https://www.youtube.com/watch?v=U9pX0mjLUSo
Live Dashboard
https://vast2019.maxnadul.com or https://vast2019.netlify.com
Visual Analytics on Tableau
Repository
https://github.com/na399/VAST-Challenge-2019-MC1
Please click on the figures to view them in full size.
The answers are structured following the four-level nested model for visualization design and validation [Munzner2009].
Fig 0.1 The four nested levels of visualization design from Munzner2014, licensed under CC BY 4.0
Limit your response to 1000 words and 10 images. (1000/1000 & 8/10)
Emergency responders receive damage intensity reports from citizens over time. They need to identify, in real time, the neighbourhoods that most certainly need help.


y_t = μ_t + ε_t,   ε_t ~ N(0, σ²_ε)
μ_{t+1} = μ_t + η_t,   η_t ~ N(0, σ²_η)      (Eq 1)
The reported ratings are modelled by Bayesian structural time series (BSTS) with a local level state (Eq 1) [Scott2013]. The model yields a posterior probability distribution of the mean rating (μ_t) at each time point in the series. The mode of the posterior distribution is the maximum a posteriori (MAP) estimate, and the highest posterior density interval specifies a credible interval (CI). Put simply, the MAP is the most likely value of the actual mean, and the CI is the interval that contains the mean with a given subjective probability.
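As an illustration of these two estimates (a hedged sketch, not the submission's actual R code, which used bsts and bayestestR; the function name is hypothetical), the MAP and the narrowest 95% HPD interval can be computed directly from posterior samples:

```python
import numpy as np

def map_and_hpd(samples, cred_mass=0.95):
    """Estimate the MAP (mode of a histogram of the samples) and the
    highest posterior density (HPD) interval, i.e. the narrowest window
    that contains cred_mass of the posterior samples."""
    samples = np.sort(np.asarray(samples))
    # MAP: centre of the fullest histogram bin
    counts, edges = np.histogram(samples, bins=50)
    i = int(np.argmax(counts))
    map_est = 0.5 * (edges[i] + edges[i + 1])
    # HPD: narrowest interval covering k of the n sorted samples
    n = len(samples)
    k = int(np.ceil(cred_mass * n))
    widths = samples[k - 1:] - samples[: n - k + 1]
    j = int(np.argmin(widths))
    return map_est, (samples[j], samples[j + k - 1])
```

For a roughly normal posterior, the HPD interval coincides with the familiar symmetric credible interval; for skewed posteriors of bounded ratings, the two can differ noticeably.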
The main tasks of the visualization include discovery and summarization.
On this dashboard, the values and uncertainties of the MAPs are encoded with colours from the value-suppressing uncertainty palettes (VSUPs) [Correll2018]. (More details in the Q2 answer)
There are four visual representations.
All representations are assembled into the dashboard as in Fig 1.1. They stay in sync and interact with one another.
Fig 1.1 The Dashboard with four visual representations
To perform data analysis, aggregation and transformation, we used R with several libraries, as mentioned above (scripts available here). The analysis is computationally intensive, involving Markov chain Monte Carlo sampling. Hence, to reduce loading time, we ran the analysis in parallel before creating the visualization and used the processed results directly.
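The parallel fan-out can be sketched as below (a Python analogue of the submission's R doFuture setup; `fit_one` is a hypothetical stand-in for a single per-neighbourhood, per-category model fit):

```python
from concurrent.futures import ThreadPoolExecutor

def fit_one(series):
    # hypothetical stand-in for one MCMC model fit on one series
    return sum(series) / len(series)

def fit_all(series_by_key, workers=4):
    """Each (neighbourhood, category) series is modelled independently,
    so the fits parallelise trivially; results are gathered by key."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fit_one, series_by_key.values())
        return dict(zip(series_by_key.keys(), results))
```

Because the fits share no state, the speed-up scales with the number of workers until the machine's cores are saturated.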
All visual elements and interactions are rendered and handled by Vega [Satyanarayan2016] with our JSON specification files (available here). Vega performs data parsing, which may take a long time when initializing the visualization; streaming live data may shorten this initialization.
The top panel of the dashboard shows the damage reports by neighbourhood at a given time point. In an actual situation, this panel may show real-time updates.
Fig 1.2 Top panel of the dashboard at 8:45 AM on 8 April
When the first major quake struck the city, the Error Bar Chart and Map in Fig 1.2 show the shake intensity reported by citizens with degrees of uncertainty via VSUP. The reports were consistent with the shake map: the north-eastern neighbourhoods felt the shake more strongly than the western neighbourhoods.
Located close to the centre point of the quake, Old Town #3 and Safe Town #4 were hit the hardest. The mean shake intensity of Old Town #3 was more certain than that of Safe Town #4, as shown by the 95% CI and the different shades of red.
The reports from Wilson Forest #7 were the least certain, perhaps because there were only a few reports, as it is a lightly populated area.
Fig 1.3 Bottom panel of Old Town #3
The bottom panel shows the hourly aggregated MAPs on the Heat Map and the temporal progressions of MAPs along with the numbers of received reports. Fig 1.3 shows that Old Town #3 received large numbers of high-value reports in all categories immediately after the first quake. With such large numbers of reports, we can be confident about the MAPs and, hence, the intensity of damage. Therefore, responders should consider rescuing this area first.
Fig 1.3 also illustrates two periods of missing data from Old Town #3 after both quakes, perhaps due to the ongoing work on the electricity system (Fig 1.4). However, the outages lasted considerably longer than expected, so responders should investigate the situation further using other sources.
Fig 1.4 Additional information table
Fig 1.5 Multiple Heat Maps
Fig 1.5 shows that there were many hours of missing data in various neighbourhoods. Data were also missing for a considerable amount of time in neighbourhoods #7–#10 without any known ongoing works. While Scenic Vista #8 and Broadview #9 felt the shake only lightly, their damage reports in all categories were higher than in the surrounding neighbourhoods after both quakes. The damage reports had intensified before the first missing period. Since Scenic Vista #8 is home to the elite, these residents may report abnormally high damage ratings.
It should be noted that, after any period of missing data, the accumulated reports are recorded all at once, as in Fig 1.6. Such information therefore describes the past, not the present.
Fig 1.6 2,951 reports of power damage 10 at 12 PM on 10 April at Old Town #3
We can also find neighbourhoods with missing data on Map. For example, in Fig 1.7, Scenic Vista #8 and Broadview #9 had high building damage, but the data were over 1 hour old. Interestingly, East Parton #18 had fresh data with a higher value than its surroundings. The additional information (Fig 1.8) suggests that it has masonry facades, which might be damaged.
Fig 1.7 Map for Buildings at 9:35 PM on 8 April
Fig 1.8 Additional information table sorted by Buildings
Limit your response to 1000 words and 10 images. (1000/1000 & 8/10)
Crowdsourced data may vary markedly, especially when the reports are based on subjective measurements. Therefore, the emergency responders should be informed about the uncertainty in the reports.
We defined uncertainty by the 95% credible interval range (CIR). The 95% threshold was set arbitrarily, yet it produced reasonable representations. A CI lower bound starts at 0, but an upper bound can exceed 10, which is beyond the rating scale, so we capped it at 10 when calculating the CIR.
We had 3 and 4 tiers of CIR depending on the VSUP palette. The cut-off thresholds of CIR were set by the extent of each colour in each tier. For example, the most certain tier in the four-tier VSUP palette has 8 colours; each colour then spans a rating range of 1.25 (10 divided by 8), so the CIR for this tier also ranges from 0 to 1.25. The following tier has 4 colours, each spanning 2.5 ratings, so the next cut-off threshold is 2.5. The CIR thresholds of 1.25, 2.5 and 5 are close to the first, second and third quartiles of CIR (Fig 2.1).
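The tier assignment described above can be sketched as follows (a minimal illustration only; the submission computed these thresholds in R and encoded them in the Vega specifications, and the function name is hypothetical):

```python
def vsup_tier(cir, thresholds=(1.25, 2.5, 5.0)):
    """Map a 95% credible interval range (CIR, capped at 10) to an
    uncertainty tier of a four-tier VSUP: tier 0 is the most certain
    (8 colours), tier 3 the least certain."""
    cir = min(cir, 10.0)
    for tier, cutoff in enumerate(thresholds):
        if cir <= cutoff:
            return tier
    return len(thresholds)
```

A three-tier palette would use the same logic with two cut-offs instead of three.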
The main tasks include discovery and comparison.
Fig 2.1 Histogram of CIR coloured by threshold sets & Cumulative distribution of CIR
The CIR can be visually quantified from the horizontal width of the graded error bar, or the vertical width of the grey area in Line Charts.
Although the length channel presents quantitative information accurately, it has limited scalability. Instead, the colour channel is used extensively on this dashboard. The colours from the value-suppressing uncertainty palettes (VSUPs) encode both the values and uncertainties of the MAPs; that is, a VSUP compresses two data attributes into one visual channel. This has at least three main benefits for this dashboard.
Data aggregation was done in R. Vega could also perform such a task in the client browser, but it is too computationally intensive.
With the VSUPs, we can use the Multiple Heat Maps (Fig 1.5) to approximately quantify the uncertainties and compare them among neighbourhoods visually. In contrast, with a normal colour palette, as in Fig 2.2, such quantification and comparison of uncertainties are not possible.
Fig 2.2 The normal colour palette on Heat Map shows that every neighbourhood is in grave danger for close to the entire period of observation
To compare the reliability accurately, we transferred the CIR to Tableau. In Fig 2.3, we found that Wilson Forest #7 had the lowest median CIR, possibly because its reports, though scarce, were homogeneous.
Fig 2.3 Box plots of CIR by neighbourhood
As we are interested in the aftermath, we may set a temporal scope. For example, in Fig 2.4, the reports over 24 hours after the major quake show that Scenic Vista #8 had the highest certainty, while Downtown #6 varied the most.
Fig 2.4 Box plots of CIR by neighbourhood 24 hours after major quake
From Fig 2.4, we observed that shake intensity CIRs are on the low end, while medical and power CIRs are on the high end; Fig 2.5, Fig 2.7 and Fig 2.8 confirm these observations. In Fig 2.5, Cheddarford #13 had the highest median CIR. The MAP from this neighbourhood in the middle panel shows ratings ranging from low to mid values. The high CIR can be explained by Fig 2.6: Cheddarford #13 initially felt the hit from the first quake; three and a half hours later, its residents began reporting that they were no longer shaking, while some others continued to report the shake intensity of the past. This conflict increased the uncertainty of the results. Therefore, the delay in reporting should be taken into account.
Fig 2.5 Box plots of CIR and MAP of Shake Intensity by neighbourhood 24 hours after major quake
Fig 2.6 Bottom panel of Cheddarford #13
In Fig 2.7, the CIRs of power spread out considerably, especially compared with the shake intensity CIRs in Fig 2.5. Perhaps it is difficult for residents to quantify the level of damage to the power system.
Fig 2.7 Box plots of CIR and MAP of Power by neighbourhood 24 hours after major quake
Fig 2.8 shows that the CI of medical can spread widely, especially in the neighbourhoods with abundant reports. The reason could be that medical needs vary from person to person, and populous neighbourhoods had greater variations.
Fig 2.8 Box plots of CIR and MAP of Medical by neighbourhood 24 hours after major quake
While the CIR from BSTS explains the certainty and possibly the reliability of reports, we might need to seek further information to substantiate the reliability for the following reasons.
Damage reports are largely subjective. There should be benchmarks for calibration. For instance, given the same damage, we may record the intensity level each neighbourhood typically reports.
Sparsely populated neighbourhoods have sparse reports. Consequently, the CIR could be high and, hence, the certainty could be low. To mitigate this, we may set different thresholds for CIR for these thinly populated neighbourhoods.
That said, population size might not correlate with the number of reports, as demographics (age and gender), socio-economic status and many other factors might influence report frequency. The reports may also be spammed by users. Therefore, further investigation would be needed to make the visual analytics more reliable.
Limit your response to 500 words and 8 images. (500/500 & 5/8)
Situations and incoming reports are dynamic. Is the uncertainty dynamic too?
We still used BSTS. According to Bayes' theorem, the posterior distribution P(θ | y) is proportional to the prior P(θ) times the likelihood P(y | θ), as in Eq 2.

P(θ | y) ∝ P(y | θ) × P(θ)      (Eq 2)
We ran BSTS on the data once at least 5 data points were available for a category in a neighbourhood. The prior was simply a normal distribution with the mean of the first data point and an SD of 0.1 times the SD of all data points, which is the default setting. While this prior yielded reasonable output, we could have adjusted the priors with our beliefs, such as the known shake intensity and ongoing construction work that might affect the reports.
The likelihood here is the probability of the observed data, or the evidence, given the model. The more evidence we have, the more certain the posterior is. Once a posterior distribution is obtained, it is used recursively as the prior distribution at the next time point.
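The recursion above can be illustrated with a conjugate normal-normal update (a simplification of the BSTS filtering step; the variances are assumed known here, unlike in the actual model, and the function name is hypothetical):

```python
def update_normal(prior_mean, prior_var, obs, obs_var):
    """Combine a normal prior with one normal observation; the resulting
    posterior becomes the prior at the next time point. Each additional
    report shrinks the posterior variance, i.e. narrows the CI."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var
```

For example, a prior N(0, 1) combined with an observation of 6.0 (variance 1) gives a posterior N(3.0, 0.5); feeding that posterior back in as the prior for a further observation tightens the variance again.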
Same as Q1–2.
Same as Q1–2.
After each quake, we had more reports and hence more evidence supporting the posterior distribution. Fig 3.1 shows the progressions of MAP and CI at Safe Town #4 during the first major quake on 8 April. On the left end, before the quake, the CIs were wide due to sporadic reports. Then, in the moments immediately after the quake, ample reports came in and narrowed the CIs. As time passed, fewer reports arrived and the CIs widened again.
Fig 3.1 Line Charts from Safe Town #4
We then explored the distribution of CIR over time (Fig 3.2). We observed the same pattern: the reports became certain right after the quakes and gradually spread out.
Fig 3.2 Box plots of CIR by hour over time
When we inspected the CIR of each category, the pattern remained the same, as in Fig 3.3. However, the shake intensity reports showed the opposite trend (Fig 3.4): the CIR increased during the quake and dropped later. The elevated CIR could stem from the subjectivity of intensity measurement, compounded by the delay in reporting mentioned in Fig 2.6. (Other categories can be viewed on Tableau Public.)
Fig 3.3 Box plots of CIR in Buildings by hour over time
Fig 3.4 Box plots of CIR in Shake Intensity by hour over time
Fig 3.5 Heat Maps of Neighbourhood #6 - #9
We may explore how uncertainty changes over time by neighbourhood with the Heat Maps via VSUP, as in Fig 3.5. Interestingly, ratings in some categories from Downtown #6 and Broadview #9 were relatively certain even before the first quake. Downtown #6 houses the Trauma Hospital, which may explain the medical need. In both neighbourhoods, ongoing work was only on the road system, yet the reports suggested problems in other categories.
Limit your response to 200 words and 3 images. (200/200 & 3/3)
We analyzed the temporal stream of data dynamically. For simplification, we subsequently combined the dynamic results into static files. In reality, we may run the analysis script on a server and stream the results to clients.
We used BSTS with a local level state, which uses only preceding data to model upcoming data, so a dynamic stream and a static collection would, theoretically, produce similar results. However, methods that rely on smoothing or rolling averages, such as LOESS (Fig 4.2), would give different results.
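The difference can be demonstrated with a centred rolling average (a hedged sketch with a window of 5, as in Fig 4.2; the function name is hypothetical): each smoothed point depends on future values, so a truncated live stream and the full static dataset disagree near the stream's end, whereas a filter that uses only preceding data does not.

```python
import numpy as np

def centred_rolling_mean(x, window=5):
    """Centred rolling average: each point averages the surrounding
    window, so it needs future data and is undefined near the edges."""
    half = window // 2
    out = np.full(len(x), np.nan)
    for i in range(half, len(x) - half):
        out[i] = x[i - half:i + half + 1].mean()
    return out
```

On a stream truncated at time t, the last `half` points cannot be smoothed yet; their smoothed values change retroactively once more data arrive.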
If we were to analyze a static dataset once with BSTS, we would still have had to annotate the periods of missing reports. When data are missing in a time series, BSTS follows a pure random walk, which expands the uncertainty (Fig 4.3) and renders the situation impossible to assess. In reality, the missing data could be due to a power outage while the damage remains or worsens. Therefore, the visualization should flag missing data so that responders realize they must obtain information by other means. The latest data can be displayed to guide decisions but should be used cautiously (Fig 1.3).
Fig 4.1 Comparisons of BSTS models on a stream of dynamic data (blue) and a static dataset (green) of shake intensity at Old Town #3. The model with static data rose before the quakes, possibly due to interpolation in the model when data were missing.
Fig 4.2 Comparisons of plotting models on a static dataset of shake intensity at Old Town #3: LOESS (salmon), rolling average with window of 5 (turquoise), average (grey)
Fig 4.3 Random walks of BSTS when data is missing
Correll M, Moritz D, Heer J (2018), “Value-Suppressing Uncertainty Palettes,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems - CHI ’18, vol. 272, no. 7286, pp. 1–11, doi:10.1145/3173574.3174216.
Munzner T (2009), “A Nested Process Model for Visualization Design and Validation,” IEEE Trans. Vis. Comput. Graph., vol. 15, no. 6, pp. 921–928, doi:10.1109/TVCG.2009.111.
Munzner T (2014), Visualization Analysis & Design. Boca Raton, FL: CRC Press.
Satyanarayan A, Russell R, Hoffswell J, Heer J (2016), “Reactive Vega: A Streaming Dataflow Architecture for Declarative Interactive Visualization,” IEEE Trans. Vis. Comput. Graph., vol. 22, no. 1, pp. 659–668, doi:10.1109/TVCG.2015.2467091.
Scott S L, Varian H R (2013), “Predicting the Present with Bayesian Structural Time Series,” SSRN Electron. J., pp. 1–21, doi:10.2139/ssrn.2304426.
Shneiderman B (1996), “The eyes have it: a task by data type taxonomy for information visualizations,” in Proceedings 1996 IEEE Symposium on Visual Languages, pp. 336–343, doi:10.1109/VL.1996.545307.