Entry Name:  SMU-HS-MC2

VAST Challenge 2018
Mini-Challenge 2

 

 

Team Members:

Harisingh Khedar, Singapore Management University, harisinghk.2017@mitb.smu.edu.sg PRIMARY

Saurav Jhajharia, Singapore Management University, sauravj.2017@mitb.smu.edu.sg

 

Student Team: YES

 

Tools Used:

Tableau

Microsoft Excel

R

 

Approximately how many hours were spent working on this submission in total?

127 hours

 

May we post your submission in the Visual Analytics Benchmark Repository after VAST Challenge 2018 is complete? YES

 

Video

https://www.youtube.com/watch?v=FcKNoJ3dSQk&feature=youtu.be

 

 

 

Questions

  1. Characterize the past and most recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. Do you see any trends of possible interest in this investigation?  Your submission for this question should contain no more than 10 images and 1000 words.

 

Comparing the past and most recent readings revealed several trends of possible interest, which we describe in order below.

a.    Overall trend: We first looked at the macro-level view of the number of readings taken across all locations over the years. This showed that the number of samples collected each year was not constant.

The number of readings rose until 2007 and has dropped in recent years. This irregularity suggests that the data collection system used for this sampling was manual.
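A minimal sketch of the yearly count behind this macro-level view, assuming each reading is a tuple of (location, chemical, sample date, value). The tuple layout and the sample rows are assumptions made for illustration and are not taken from the dataset.

```python
from collections import Counter
from datetime import date

# Invented sample rows: (location, chemical, sample date, value).
sample_readings = [
    ("Site A", "Chlorides", date(2006, 3, 1), 12.0),
    ("Site A", "Chlorides", date(2007, 4, 2), 11.5),
    ("Site B", "Ammonium", date(2007, 4, 2), 0.4),
    ("Site B", "Ammonium", date(2007, 9, 9), 0.6),
    ("Site A", "Chlorides", date(2015, 1, 5), 13.1),
]


def readings_per_year(records):
    """Return {year: number of readings}, ordered by year."""
    counts = Counter(rec[2].year for rec in records)
    return dict(sorted(counts.items()))


print(readings_per_year(sample_readings))
```

Plotting these counts against the year gives the overall trend described above.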

b.   Location-wise trend: Looking at the pattern of data collection by location, the number of readings collected each month and year clearly differed from location to location. However, the locations fell into visible groups according to how their data was collected.

i)              Achara, Decha, Tansanee: These are locations where data was collected from 2009 onwards only. (marked in red)

ii)             Busarakhan, Kohsoom, Somchair: These are locations where data is spread over the years, but the range of values recorded for the chemicals is very small.

iii)            Sakda, Kannika, Chai, Boonsari: These are locations where data is spread out with a very high range of values for all chemicals.

This trend also indicates that data will be missing for certain years (and months) in multiple locations. We therefore took a deeper look at the data to remove sparse series, identify the chemicals of possible interest, and eliminate those that are not important.

c.    Sparsity filtration: Here, we looked at each chemical’s total number of records individually across all locations over the entire period of 17 years. We found many chemicals across many locations whose data was insufficient or not relevant enough for our analysis. Hence, we filtered out the chemicals that had the following characteristics:

i)             No data collected in recent years (2014 onwards); data exists only for the timeline before that.

ii)           An insufficient number of records across the years, with gaps between years.

A few examples of such chemicals are attached herewith for reference.

 

 

As visible in the chart below (marked in red), values for “Total hardness” are missing from 2010 to 2014 at all locations except Boonsari and Kohsoom. Thus, the “Total hardness” records were filtered out of the dataset for every location other than these two.

 

After filtering out such chemicals, we were left with chemicals that had data collected over the years (at least from 2014 onwards) with no gaps in the middle. This dataset was then used for the analysis in Q2, where we had to find anomalies in chemical values at different locations across the map.
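The two filtering criteria can be sketched as follows. This is an illustration only, not the workflow we ran in Tableau and Excel: the record layout (location, chemical, year, value), the `min_per_year` parameter, and the sample rows are all assumptions.

```python
from collections import defaultdict


def filter_sparse(records, recent_from=2014, min_per_year=1):
    """Keep only (location, chemical) series that reach `recent_from`
    or later, have no missing year between their first and last year,
    and have at least `min_per_year` records in every year."""
    counts = defaultdict(lambda: defaultdict(int))
    for location, chemical, year, _value in records:
        counts[(location, chemical)][year] += 1

    kept = set()
    for key, by_year in counts.items():
        years = sorted(by_year)
        has_recent = years[-1] >= recent_from
        no_gaps = years == list(range(years[0], years[-1] + 1))
        enough = all(n >= min_per_year for n in by_year.values())
        if has_recent and no_gaps and enough:
            kept.add(key)
    return [r for r in records if (r[0], r[1]) in kept]


# Invented rows: one complete series, one with a gap, one that ends early.
sample_records = (
    [("Site A", "Chem X", y, 1.0) for y in range(2012, 2017)]
    + [("Site B", "Chem X", y, 1.0) for y in (2009, 2010, 2015, 2016)]
    + [("Site A", "Chem Y", y, 1.0) for y in range(2005, 2010)]
)

cleaned = filter_sparse(sample_records)
print(sorted({(r[0], r[1]) for r in cleaned}))
```

Raising `min_per_year` tightens the "sufficient number of records" criterion; the value that counts as sufficient is a judgment call.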

 

d.   Irregularities: While doing the analysis above, we came across certain irregularities that we thought would be worth mentioning here for the knowledge of our investigators. There were certain chemicals that showed a strange similarity in the way their data was collected over the years.

As shown below, Ammonium, Nitrates, Nitrites, and Orthophosphate-phosphorus, as well as Chemical oxygen demand (Cr and Mn) and Chlorides, showed similar data collection patterns when their numbers of records were aggregated across all locations.

 

The following box plot indicates two pairs of locations in which the trend in the number of samples collected was strangely identical: Busarakhan and Somchair, and Kannika and Sakda.
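One way to quantify how alike two locations' sampling patterns are is the Pearson correlation of their yearly sample counts. We identified the pairs visually, so this is only a hedged sketch of a possible check, and the counts below are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences of counts."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented yearly sample counts for three hypothetical locations.
site_a = [10, 20, 30, 25]
site_b = [20, 40, 60, 50]   # same shape as site_a, twice the volume
site_c = [30, 20, 10, 15]   # mirror image of site_a

print(pearson(site_a, site_b))
print(pearson(site_a, site_c))
```

A value close to 1 means the two locations were sampled in the same rhythm even if the absolute counts differ.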

  2. What anomalies do you find in the waterway samples dataset?  How do these affect your analysis of potential problems to the environment? Is the Hydrology Department collecting sufficient data to understand the comprehensive situation across the Preserve? What changes would you propose to make in the sampling approach to best understand the situation? Your submission for this question should contain no more than 6 images and 500 words.

The cleaned dataset from the analysis above was imported to find anomalies in the waterway samples dataset. The basic criterion for finding anomalies is that the data should be sufficient and free of sparsity, so the data obtained from Q1 is used.

 

Anomalies in Chemicals: For showing anomalies, we have picked out several chemicals to indicate some of their visible peaks across years and months.

 

Note: Each chemical’s peak is defined separately. We treat as outliers all values greater than that chemical’s average plus 3 standard deviations. For example, if the average value of Ammonium across all locations is 0.5 and the average plus 3 standard deviations comes to 0.8, then every Ammonium value above 0.8, at any location, is considered an outlier (or peak value).
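The outlier rule can be sketched as below. The record layout (location, chemical, value) and the sample values are assumptions, and the sketch uses the sample standard deviation; the note above does not say which variant was used.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_outliers(records, k=3.0):
    """Return records whose value exceeds mean + k * SD, where the mean
    and SD are computed per chemical across all locations."""
    by_chemical = defaultdict(list)
    for _location, chemical, value in records:
        by_chemical[chemical].append(value)

    thresholds = {
        chemical: mean(values) + k * stdev(values)
        for chemical, values in by_chemical.items()
        if len(values) > 1  # stdev needs at least two values
    }
    return [
        r for r in records
        if r[1] in thresholds and r[2] > thresholds[r[1]]
    ]


# Invented values: 19 ordinary readings and one extreme reading.
ordinary = [0.4, 0.5, 0.6] * 6 + [0.5]
sample_values = [("Site A", "Ammonium", v) for v in ordinary]
sample_values.append(("Site B", "Ammonium", 5.0))

print(find_outliers(sample_values))
```

With very few readings a single extreme value inflates the standard deviation so much that it can never exceed the threshold, which is one more reason the sparse series were removed first.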

 

We used a dot plot because the data is not collected on a regular daily basis, so a line chart did not seem the right way to represent it. If data had been collected every day, a line chart would be a more appropriate visual to depict the rise, fall, or constant behaviour of the recorded average values of chemicals.

 

To understand how chemical values change by location, we grouped the locations into 4 basins according to the flow of water indicated on the map. This helps us show the changes across basins and might also assist in locating the Kasios Furniture Company’s factories, as they might be dumping chemicals directly into rivers.

 

Chemical 1: Methylosmoline

Chemical 2: Chlorides

 

As indicated below, Chlorides shows a sharp cluster of outliers in Tansanee. It might be valuable for the investigator to compare these values with the results of the soil analysis to check for any similarity in the peak values of this chemical.

 

Chemical(s) 3: Total Phosphorus & Orthophosphate-phosphorus; Location: Kohsoom

 

If we look at the two chemicals above, we can see that Kohsoom has a substantial number of outliers for both chemicals in a similar period, with a nearly identical trend in the values over the years. Although the absolute values differ, the match in their trends is a very strange coincidence.

 

The analysis above is visible for all chemicals in each location using the Tableau public sheet present here.

 

Irregularities: During our analysis, we found some strange coincidences across both chemicals and locations. We highlight them below for your reference.

 

Chemical: Magnesium; Locations: Kannika & Chai

 

 

It is interesting to note that both these locations fall in the same basin. There is a high chance that the peak in their values was caused by a dumping of Magnesium in March 2011.

 

A similar pattern was visible between Sakda and Kannika for Macrozoobenthos, Total dissolved phosphorus, and Total extractable matter.

 

As indicated earlier, all the visualizations shown above can be accessed via Tableau public here.

 

To help our investigators toggle between different locations and the behaviour of each chemical every year, we created a dashboard on Tableau Public to explain our analysis and provide a better visualization. A glimpse of it is given below, while the access to the dashboard is present here.

 

 

This dashboard is interactive, and the values in the sheets change depending on the selections made by the user. For example, to select a group of outliers and see their average values for the selected years and months in a tabular format, you can perform the following actions in Tableau Desktop.

 

 

  3. After reviewing the data, do any of your findings cause particular concern for the Pipit or other wildlife? Would you suggest any changes in the sampling strategy to better understand the waterways situation in the Preserve? Your submission for this question should contain no more than 6 images and 500 words.

Our understanding is that chemicals whose average values have risen over time will be the major cause of concern for all wildlife going forward. Thus, we did a location-wise analysis of each chemical and filtered out the ones that do not meet this criterion.

As seen previously, the chemical Methylosmoline has shown a sharp rise in its average value over the last 3 years in Somchair and Kohsoom. Other chemicals such as Arsenic have also followed a similar trend in some locations.

The Tableau public sheet accessible here contains location wise data of all chemicals which have shown a rise in their average values over the years.
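The "rising average" criterion can be approximated by fitting a straight line to the yearly averages of each location and chemical and keeping the series with a positive slope. We applied the criterion visually in Tableau, so the slope test, the record layout (location, chemical, year, value), and the sample rows below are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def yearly_means(records):
    """Return {(location, chemical): {year: average value}}."""
    grouped = defaultdict(list)
    for location, chemical, year, value in records:
        grouped[(location, chemical, year)].append(value)
    series = defaultdict(dict)
    for (location, chemical, year), values in grouped.items():
        series[(location, chemical)][year] = mean(values)
    return series


def slope(points):
    """Least-squares slope of average value against year."""
    years = sorted(points)
    values = [points[y] for y in years]
    mean_year, mean_value = mean(years), mean(values)
    denom = sum((y - mean_year) ** 2 for y in years)
    if denom == 0:  # a single year gives no trend
        return 0.0
    num = sum((y - mean_year) * (v - mean_value)
              for y, v in zip(years, values))
    return num / denom


def rising_series(records):
    """Return the (location, chemical) pairs whose yearly average rises."""
    return sorted(
        key for key, points in yearly_means(records).items()
        if slope(points) > 0
    )


# Invented rows: one rising series and one falling series.
sample_rows = [
    ("Site A", "Chem X", 2014, 1.0),
    ("Site A", "Chem X", 2015, 2.0),
    ("Site A", "Chem X", 2016, 4.0),
    ("Site B", "Chem X", 2014, 3.0),
    ("Site B", "Chem X", 2015, 2.0),
    ("Site B", "Chem X", 2016, 1.0),
]

print(rising_series(sample_rows))
```

A slope test treats all years equally; restricting the input to the most recent years would instead highlight recent rises such as the one described for Methylosmoline.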

 

Changes in sampling strategies:

 

As seen in Q1 and Q2, there is a clear indication that data collection is manual rather than automated through sensors that record readings at a constant interval every day, every month, and every year.

·         To get a better picture of the data, we suggest a systematic sampling method going forward, in which data is picked at a fixed interval of time for visualization for all locations and all chemicals. Weather patterns change with the seasons throughout the year, and with them the amount of water in lakes and its chemical composition. A systematic sampling method is therefore preferable to avoid biases in chemical values caused by weather or other natural causes.

·         We suggest daily sampling over 3 consecutive months to get data over a considerable timeline without any gaps. This would help us visualize the anomalies better and make our case for peak values stronger.

·         Data collection should be automated and done at the same time of day every day during these 3 months, for consistency and to avoid manual errors.

·         The direction of water flow (the downstream direction) and any changes in it should be indicated to allow more in-depth analysis of the expected concentration of each chemical at different places and at different points in time.

·         Data collection sensors should be located near the factories at each of these locations to capture the concentration of chemicals right at the source. This avoids dilution of the chemicals by water and makes our readings more robust.

·         Information about the water speed at different points in time can be a valuable variable to know as it will guide us to where the source of chemical contamination might exactly be.

·         The depth of the river basin at the time of data collection is an important factor, as the concentration recorded for a chemical might differ depending on the depth at that point. For example, during the rainy season, when the flow and speed of water are both high, a substantial amount of chemical released into the water might go undetected by the sensor because it is quickly diluted, whereas in the summer season even a small amount of contamination might show peak values. To remove this inconsistency and get a more standardized result, information on the depth of the river basin should be considered.