IBM Research – STEM

VAST 2011 Challenge
Mini-Challenge 1 - Characterization of an Epidemic Spread

Authors and Affiliations:

Changhua Sun, IBM Research - China, [PRIMARY contact]

Weishan Dong, IBM Research - China,

Peter Bak, IBM Research - Haifa,

Harold-Jeffrey Ship, IBM Research - Haifa,

Lei Shi, IBM Research - China,

Heng Cao, IBM Research - China,

Zhong Su, IBM Research - China,


We use the ArcObjects SDK 10 for Java (ArcGIS Engine 10) to build a standalone application that visualizes the geospatial-temporal data. We use the SDK to create shapefiles, filter GIS features by spatial location or attributes, and process data based on spatial relationships. We also use ArcGIS Desktop 10 (ArcMap) to convert “Vastopolis_Map.png” into shapefiles for zones (polygons), the river and lakes (polygons), and hospitals, stadiums, and city administrations (points). ArcMap is also used to create the map document (.mxd) for our standalone application. In addition, the IBM Spatiotemporal Visual Analytics Workbench, developed by IBM Research - Haifa, Israel, is used for advanced color mapping and temporal filtering.


We use the VisWorks Peony visualization framework, developed by members of the Smart Visual Analytics team at IBM Research - China between 2007 and 2011, to draw line charts that reveal patterns in temporal distributions.


We use Mallet to preprocess the microblog text entries and extract relevant topics and keywords.


To provide quantitative analysis results and reasoning, we use the Generalized Spatial Association Rule (GSAR) mining tool, developed by the Spatial Analytics & Applications team at IBM Research - China between 2010 and 2011, to mine spatiotemporal association rules from the data.


Our video 


MC 1.1 Origin and Epidemic Spread: Identify approximately where the outbreak started on the map (ground zero location). If possible, outline the affected area. Explain how you arrived at your conclusion.

The epidemic outbreak started on 5/18 around 8:00 am at three locations: the Vastopolis Dome, north of Vastopolis City Hospital, and southwest of the Convention Center. The affected area covers all Vastopolis zones, with Downtown, Uptown, Eastside, and Smogtown hit hardest.


We arrive at this answer by:

•  Data preprocessing: apply text analytics to the microblogs to extract flu-like topics and keywords

•  Visualization:

    -  Visualize the temporal trend of the number of microblogs (Figure 1) and narrow down the outbreak range

    -  Visualize the spatiotemporal distribution of flu-like microblogs on the cartographic map. Render the zones by flu report rate normalized by population (Figure 2)

•  Visual Exploration:

    -  Compare the distributions of flu-like microblogs before and after the outbreak. Verify with the association rules

    -  Explore the temporal spread of the epidemic with the flu report rate




Figure 1 Spatiotemporal distribution of flu-like microblogs around the outbreak



Figure 2 Spatiotemporal visualization of flu-like microblogs with flu report rate


MC 1.2 Epidemic Spread: Present a hypothesis on how the infection is being transmitted. For example, is the method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy treatment resources outside the affected area? Explain your reasoning.

A.    Hypothesis of the transmission


The infection is being transmitted from person to person, by air, and by water. The epidemic was carried by the west wind to Eastside and by the Vast River to Smogtown and Plainville. In addition, it spread from person to person on public transportation and in crowded city centers.


On 5/20, many people with flu-like symptoms go to hospitals, but many others still do not. We therefore suggest that emergency management personnel deploy treatment resources, notify the city downstream on the Vast River, and warn people, especially in Cornertown, Villa, and Southville, to avoid public places such as the Vastopolis Dome and the Convention Center.


B.    Analytics Processes


Figure 3 schematically depicts the analytic pipeline, which consists of three closely related and highly iterative parts: preprocessing, automatic analysis and mining, and visualization. These are described in the following subsections.



Figure 3 Pipeline of our analytics process


B.1   Data Pre-Processing


We input the 1M+ microblogs into Mallet and use the Latent Dirichlet Allocation (LDA) method to train topics from the corpus. The number of topics is set to 10, from which we manually select 3 topics related to the epidemic. From the keywords of these 3 topics (50 keywords per topic), we further select 34 flu-like keywords, such as “flu”, “chill”, “pneumonia”, “stomach”, “diarrhea”, “fatigue”, “sweat”, and “nausea”. These two manual steps take approximately 30 minutes initially and another 30 minutes later to add or remove keywords based on the analytics results. Finally, we scan the microblogs and match the flu-like keywords against their contents. Each microblog is then associated with 34 keyword tags, one per keyword, each indicating whether that keyword is present in the microblog.
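The final keyword-matching step can be sketched as follows. This is a minimal illustration, not the scan we ran: the keyword list is only the subset quoted above, and the matching rule (exact word or simple plural) and the name `tag_microblog` are assumptions made for the example.

```python
import re

# Only the keywords quoted in the text; the full list of 34 is not reproduced here.
FLU_KEYWORDS = ["flu", "chill", "pneumonia", "stomach",
                "diarrhea", "fatigue", "sweat", "nausea"]


def tag_microblog(text, keywords=FLU_KEYWORDS):
    """Return {keyword: bool} marking which keywords occur in the microblog text.

    Assumption: a keyword matches a whole word or its simple plural
    ("chill" matches "chill" and "chills", but "flu" does not match "fluid").
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {kw: bool(tokens & {kw, kw + "s"}) for kw in keywords}


# Hypothetical microblog text, for illustration only.
example_tags = tag_microblog("Got the chills and a bad stomach ache")
```

Each microblog then carries one boolean per keyword, which is the form the later filtering and rule mining steps rely on.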


We use ArcMap to manually translate “Vastopolis_Map.png” to zones, river, lakes, hospitals, stadiums and city administration shapefiles. This takes 2 hours.


We use the ArcObjects SDK to create a microblog point shapefile with keyword attributes, which takes 30 minutes. We then use the SDK to identify the zone of each microblog point and compute the number of persons moving between each pair of zones (movements): two microblogs written by the same author on the same day, at different times and in two different zones, increase the movement count between those zones by one. We convert the movements to a line shapefile. This whole process takes 30 minutes.
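The movement count described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the ArcObjects code we used: the record layout and the name `count_movements` are hypothetical, and we assume that only consecutive microblogs of an author within a day are paired.

```python
from collections import Counter, defaultdict


def count_movements(records):
    """Count movements between zone pairs.

    records: iterable of (author_id, date, time, zone) tuples, where time is a
    zero-padded "HH:MM" string so that string order equals time order.
    Returns a Counter keyed by an alphabetically sorted (zone_a, zone_b) tuple.
    """
    posts_by_author_day = defaultdict(list)
    for author, date, time, zone in records:
        posts_by_author_day[(author, date)].append((time, zone))

    movements = Counter()
    for posts in posts_by_author_day.values():
        posts.sort()
        # Assumption: pair each microblog with the author's next one that day.
        for (t1, z1), (t2, z2) in zip(posts, posts[1:]):
            if t1 != t2 and z1 != z2:
                movements[tuple(sorted((z1, z2)))] += 1
    return movements


# Hypothetical records, for illustration only.
example = count_movements([
    ("a1", "5/18", "08:00", "Downtown"),
    ("a1", "5/18", "12:00", "Uptown"),
    ("a1", "5/18", "18:00", "Downtown"),
])
```

The zone pair is stored unordered because the movement lines are drawn between zones without a direction.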



B.2   Association rule mining


The GSAR mining tool computes two spatial relationships, “close to” and “within”, between each microblog record and all other spatial objects, including public buildings, the river, lakes, and zones. The “within” relationship holds when the microblog record lies topologically within the spatial object. The “close to” relationship holds when the geographical distance between a microblog record and a spatial object is less than 1 km.
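The two relationships can be illustrated with the sketch below. It is not the GSAR implementation: we assume here that microblog locations are (latitude, longitude) pairs, that a landmark is represented by a single point, and that a zone is a simple polygon given as a vertex list; the function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def close_to(point, landmark, threshold_km=1.0):
    """True if the distance between two (lat, lon) points is below the threshold."""
    return haversine_km(point[0], point[1], landmark[0], landmark[1]) < threshold_km


def within(point, polygon):
    """Ray-casting point-in-polygon test on planar (x, y) coordinates."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that cross the horizontal ray going right from the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

For line or polygon objects such as the river and lakes, the distance would be taken to the nearest point of the geometry, which a GIS library handles directly.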


The rules take the form A => B (s, c), where A and B are combinations of temporal attributes, spatial attributes, and keywords; s is the support of the rule, i.e., the number of records that satisfy both A and B; and c is the confidence of the rule, i.e., the conditional probability P(B|A) = P(AB)/P(A). A rule can be read as: if A happened, then B happened, with support s and confidence c.
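As a small worked example of these two measures, the sketch below computes s and c for one candidate rule. The item encoding (strings such as "kw=diarrhea") and the toy records are hypothetical and serve only to show the arithmetic; they are not output of the GSAR tool.

```python
def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    records: list of sets of items; antecedent and consequent: sets of items.
    support is the number of records containing both A and B;
    confidence is P(B|A) = count(A and B) / count(A).
    """
    count_a = sum(1 for r in records if antecedent <= r)
    count_ab = sum(1 for r in records if (antecedent | consequent) <= r)
    confidence = count_ab / count_a if count_a else 0.0
    return count_ab, confidence


# Toy records, for illustration only.
toy_records = [
    {"kw=diarrhea", "date=5/20", "close_to=Vast River"},
    {"kw=diarrhea", "date=5/20", "close_to=Vast River"},
    {"kw=diarrhea", "date=5/20", "within=Smogtown"},
    {"kw=diarrhea", "date=5/19", "within=Downtown"},
    {"kw=flu", "date=5/18", "within=Downtown"},
]

support, confidence = rule_stats(toy_records, {"kw=diarrhea"}, {"date=5/20"})
```

In the toy data, four records contain "kw=diarrhea" and three of them also contain "date=5/20", so the rule has support 3 and confidence 3/4.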




B.3   Visualization


We use the ArcObjects SDK to develop a standalone application with flexible time range control and flu-like keyword filtering. We also render each zone by its flu report rate, which is the number of authors reporting flu in the selected time range divided by the daytime population of the zone.
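The flu report rate can be sketched as below. The record layout, the function name, and the population figure in the example are assumptions for illustration; the year in the timestamps is likewise assumed.

```python
from collections import defaultdict
from datetime import datetime


def flu_report_rate(records, daytime_population, start, end):
    """Distinct flu-reporting authors per zone in [start, end), divided by
    the zone's daytime population (assumed to be positive).

    records: iterable of (author_id, zone, timestamp, has_flu_keyword).
    daytime_population: dict mapping zone name to daytime population.
    """
    authors_by_zone = defaultdict(set)
    for author, zone, timestamp, has_flu_keyword in records:
        if has_flu_keyword and start <= timestamp < end:
            # A set counts each author once, however many microblogs they post.
            authors_by_zone[zone].add(author)
    return {
        zone: len(authors_by_zone.get(zone, ())) / population
        for zone, population in daytime_population.items()
    }


# Hypothetical data, for illustration only.
rates = flu_report_rate(
    [("a1", "Downtown", datetime(2011, 5, 18, 9, 0), True),
     ("a1", "Downtown", datetime(2011, 5, 18, 10, 0), True),
     ("a2", "Downtown", datetime(2011, 5, 18, 11, 0), True)],
    {"Downtown": 1000},
    datetime(2011, 5, 18), datetime(2011, 5, 19),
)
```

Counting authors rather than microblogs keeps a single prolific author from inflating the rate of a zone.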


In addition, we use spatial relationship queries to identify the authors who go to hospitals on 5/20, and then tag every microblog with whether its author goes to a hospital.
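A minimal sketch of this tagging step follows. It assumes planar coordinates in kilometers, hospitals represented as points, and a hypothetical proximity radius of 0.5 km; the actual spatial query in our application may differ.

```python
import math


def tag_hospital_visitors(microblogs, hospitals, day="5/20", radius_km=0.5):
    """Tag every microblog with whether its author was near a hospital on `day`.

    microblogs: list of dicts with keys "author", "date", and "xy"
    (planar coordinates in km). hospitals: list of (x, y) points.
    Adds a boolean "went_to_hospital" key to each microblog and returns
    the set of visiting authors.
    """
    visitors = {
        m["author"]
        for m in microblogs
        if m["date"] == day
        and any(math.dist(m["xy"], h) <= radius_km for h in hospitals)
    }
    # Propagate the author-level flag to all of that author's microblogs.
    for m in microblogs:
        m["went_to_hospital"] = m["author"] in visitors
    return visitors
```

Because the flag is attached at the author level, earlier microblogs of the same author can be filtered by whether that author later sought care.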


We integrate the Peony framework to draw line charts.


The visualization process takes us 3 hours.



B.4   Visual Reasoning


1.     Person-to-person


Figure 4 visualizes people’s movements. The size of each line represents the number of movements. We see that Uptown, Suburbia, Northville, Westside, Plainville, Lakeside, and Eastside have large numbers of movements to and from Downtown.


Figure 2 visualizes the flu report rate and microblog points over time. It shows that some of the zones with large movements to and from Downtown also have a high flu report rate. With the exception of Smogtown, the spread to other zones may be caused by people’s movements.



Figure 4 Visualizing people’s movements indicated by the microblogs



2.     Airborne


We visualize the flu report rate and the microblog points with flu keywords for 5/18-5/20, as shown in Figure 5. On 5/18, the epidemic spread from Downtown/Uptown to Eastside. This spread may have been caused by the west wind on 5/18.



3.     Waterborne


In Figure 5, on 5/19, the flu report rate for Smogtown is second only to that of Downtown. On 5/20, the flu report rate for Smogtown is the highest. The spread from Downtown/Uptown to Smogtown and Plainville may have been caused by the Vast River, which flows south.



Figure 5 Visualizing the flu report rate for each zone on three consecutive days


To confirm our hypothesis, we use association rule mining to discover correlations between flu symptoms, time, and landmarks.


As shown in Figure 6, over 76% of the microblogs mentioning “diarrhea” were reported on 5/20 and were located in the bottom-left zone area, close to the Vast River. This suggests that “diarrhea” spread along the Vast River. Similarly, rules indicating the outbreak on 5/18 within Downtown and Uptown can also be mined.



Figure 6 Visualizing association rule mining results. Some typical rules and the related microblog data points are highlighted on the map



4.     Forecasting


As shown in Figure 7, we divide the flu-like keywords into two groups. For the first group, the number of microblogs decreases greatly or drops close to zero on 5/20, while the second group shows the opposite trend. The epidemic with symptoms in the first group can be considered contained. Although many people with symptoms in the second group go to hospitals on 5/20, many others still do not, especially in Smogtown, as with “diarrhea” in Figure 6.


Therefore, we suggest that emergency management personnel deploy treatment resources.



(a)   Flu symptoms contained



(b)   Flu symptoms not contained

Figure 7 Spatiotemporal distribution of microblogs for the two keyword groups