VISUAL ASSIGNMENT MINI-CHALLENGE 2: LIKE A DUCK TO WATER

Introduction:

We are given several years of water sensor readings from rivers and streams in the Boonsong Lekagul Wildlife Reserve from Professors of the Mistford College Hydrology Department. These samples were taken from different locations throughout the reserve, which is not necessary near any of the factories from Kasios. I am now given the task of looking at this data and see if I can find any evidence that can replace or supplement the soil evidence that was destroyed. Let’s go!

History:

In 2017, the Kasios Furniture Company was implicated in environmental damage to the Boonsong Lekagul Wildlife Preserve for both dumping toxic waste and polluting the air with chemicals from its manufacturing process. Kasios denied any accusation of industrial waste dumping and stated that there isn’t any ground contamination.

With the assistance from angry ornithology Mistford College Professors, it was reported that there was a new range station being built at the site. However, soil samples taken from the site were inconclusive in detecting Methylosmolene, the major toxic culprit of the suspected dumping.

Since soil samples are not reliable sources anymore, it is time to look at the hydrology data from the reserve. I am given a map of the Preserve, with named sampling sites indicated on the map. I am also provided with readings from each sampling station over time for several different chemicals and water properties.

Some Information to Start with:

Methylosmolene is the worst volatile organic compound (VOC) being monitored. It makes sense to focus on its readings.

AGOC-3A is a replacement that is more environmentally friendly.

 

 

The Questions:

Question 1: Characterize the past and most recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. Do you see any trends of possible interest in this investigation?

Question 2: What anomalies do you find in the waterway samples dataset? How do these affect your analysis of potential problems to the environment? Is the Hydrology Department collecting sufficient data to understand the comprehensive situation across the Preserve? What changes would you propose to make in the sampling approach to best understand the situation?

Question 3: After reviewing the data, do any of your findings cause particular concern for the Pipit or other wildlife? Would you suggest any changes in the sampling strategy to better understand the waterways situation in the Preserve?


Tableau Public Links:

You can find the Data Visualization here –

https://public.tableau.com/profile/lim.wee.kiong#!/vizhome/VASTMini-Case2InteractiveDashboardLWK/Q1

https://public.tableau.com/profile/lim.wee.kiong#!/vizhome/VASTMini-Case2MapLWK/Dashboard1

 

 

 

 

 

 

 

 

 

Data Preparation:

Understanding the Raw Data – Samples Readings and Measures

The data given to us comes in 2 main files: Boonsong Lekagul waterways readings.csv and chemical units of measure.csv.

Data cleaning is done in Excel and data visualization in Tableau.  

Descriptions of the data fields for Boonsong Lekagul waterways readings are as follow:

Field

Description

ID

Identification number for the record (only for bookkeeping)

Value

Measured value for the chemical or property in this record

Location

Name of the location sample was taken from.  See the map for geo-location of the sampling site.

Sample Date

Date sample was taken from the location

Measure

Chemicals (e.g., Sodium) or water properties (e.g., Water temperature) measured in the record

A sample of the data is shown here:

There are a total of 136,825 sample data points across 104 different measures.

The chemical units of measure csv file is basically the measures with an additional field for the units of measurement. The sample data is as shown below:

At this moment, there does not seem to be any need to clean the data as it looks usable. However, an initial scan of the csv file shows that there could potentially be missing data for several, if not all the measures.

 

Deriving Auxiliary Data – The Map

Another data given to us is the Waterways Final.jpg, which is a low-res map of the preserve and it shows the location of the various sampling points. I believed there is value in knowing the exact coordinates of each point and hence I have created a tableau version of the map.

Step 1: A new location.csv is created with the coordinates of the preserve locations and the 4 corners of the map:

X

Y

UL

0

249

LL

0

0

UR

249

249

LR

249

0

Achara

106.5

161.18

Boonsri

134.88

196.48

Busarakhan

184.7

141.8

Chai

153.6

126.6

Decha

38

101

Kannika

165.3

70.6

Kohsoom

185.4

166

Sakda

133.5

34.6

Somchair

85.1

132.1

Tansanee

84.4

78.9

Step 2: Location.csv is loaded into Tableau and X is plotted to [Columns] and Y to [Rows]. Location is mapped to [Details].

Step 3: The Waterways jpg is loaded via [Map] > [Background Images] > [Add Images] > [Waterways Final] to obtain the final output.

 

The points are annotated as well so that when the cursor is at each location, we can see the exact coordinates of each station:

This map is important as it will be used as part of the auxiliary data for our analysis, as we try to determine whether water flow contributes to the readings. 

 

Obtaining Knowledge on Hydrology

While this is not a requirement, but it seems useful to learn more about hydrology and water pollution as we embarked on this task.

We have established earlier that Methylosmolene is the main toxic compound in question. But what other chemicals or measures would be useful in knowing its impact to the fauna in the preserve, especially the birds?

Image result for water pollution

Leading Water Contaminants and Measures: By looking at the leading contaminants, I can focus my attention on them as the measures given are extensive and it is not useful to look at all of them.

Categories

Measures in the Category

Heavy Metals and Industrial Deposits: metals are easily washed into streams and groundwater. Copper and mercury are also found in fungicides. These heavy metals are toxic to biological life including the people who may have to drink from the polluted rivers. Crops that have been irrigated with polluted water can also be dangerous. Heavy metals can also build up in the body causing symptoms of poisoning.

Aluminium

Arsenic

Barium

Copper

Lead

Mercury

Selenium

Silver

Zinc

 

Chlorine and Detergents: Paper and pulp mills use up large amounts of water and produce a lot of polluted wastewater. The wastewater contains strong chemicals such as chlorine, which is used to make paper white and soft. Textile factories also release strong chemicals like caustic soda, acids, dyes and detergents into water. These strong poisons also cause bird and fish kills

Chlorine

Chloramine

Chromium

 

Fertilisers and Nitrates: Some chemicals like fertilisers are made of substances that do occur naturally in the environment, but only in small amounts. Phosphates and nitrates are found in fertilisers, sewage and soaps. The normal low phosphate level in water inhibits the growth of plants but a small increase of phosphates can result in a rapid increase in plant growth such as blue-green algae and water hyacinth, especially in dams. The water plants become overcrowded and die. When they die, the decomposing bacteria uses up more oxygen and affects other forms of life badly, e.g. fish suffocate. This process is eutrophication.

Nitrogen in the form of ammonia and nitrates form part of the plant nutrients that can lead to eutrophication. Nitrogen normally occurs in a form that plants cannot use (i.e. nitrogen gas), however, it may be used in the decomposition of dead water plants and by blue-green algae which can convert nitrogen in the air into ammonia and nitrates that plants can use.

Ammonia

Nitrates

Nitrites

Total Dissolved Phosphorus

Total Phosphorus

Bacteria and Virus in Water: Some of the more dangerous microbial contaminants, such as E. coli, Giardia, and Cryptosporidium, can cause gastrointestinal problems and flu-like symptoms commonly attributed to undercooked or improperly stored food. 

Total Coliforms

Fecal Coliforms

Salt in Water: When you consume too much sodium, the body holds extra water. The kidneys which filter out waste from the blood, maintain a special ratio of electrolytes, such as sodium to potassium, to water. More salt in the diet means the kidneys keep more water in the system. That can have lots of undesirable effects, such as edema (swelling in places like the hands, arms, feet, ankles, and legs); more fluid in general means more blood coursing through veins and arteries. Over time, that causes them to stiffen, which could lead to high blood pressure.

Total Dissolved Salts (at most 600mg/l)

The above table will form the basis in which data filtering will be done as we focus more on these measures than the rest, due to their undesirable impacts to life.

References: The references for the above-mentioned information came from:

https://www.wqa.org/learn-about-water/common-contaminants

https://www.conserve-energy-future.com/sources-and-causes-of-water-pollution.php

http://www.waterwise.co.za/site/water/environment/substances.html

 

 

 

 

 

 

 

 

 

 

 

 

 

Dashboard Design

Joining the Data Sources

The first step is to join the 3 data sources into one and do some Exploratory Data Analysis:

Location and the Waterway data points will be joined as an inner join while the Waterway data points are joined with the units of measurements as a right outer join.

The sample joined data is shown here:

It contained:

-          ID: ID of the Sample Reading

-          Value: Value of the Sample Reading

-          Location1: Location of the Reading

-          Sample Date: Date of the Reading

-          Measure: The measure of the Sample Reading

-          Unit: Unit of the Sample Reading

-          X / Y: X and Y coordinates of the Location

Image result for confused

The challenge for this challenge is to find patterns or insights from 100+ measures which are of different natures and exhibit different behaviour. I designed a couple of interactive dashboards to sieve through the data. Here are brief introductions of how they are designed and what are their potential usage. The insights gathered are done through variations of the following 4 interactive dashboards:

 

Dashboard 1: Calendar View of Sample Counts

This is used to have an over-arching view of the number of sample readings taken for each measure and each location. We can then understand whether the data were obtained regularly or if there are some gaps in data-gathering.

Step 1. For Sample Dates, break down to Month in [Columns] and Year in [Rows]. This enables me to have a clear view of the data across the years (1998 to 2016) and over the months (Jan to Dec)

Step 2. Drag Id to [Details]. Change the [Measure] to [Count]. This is because I am interested in the number of readings taken for a period for individual locations.

Step 3. Measure and Location1 are added to the [Filters] as [Multiple Values(List)]. I can then filter the data based on location and measure

Step 4. Location1 and Measure are added to the [Tooltip] to provide more information.

The default view with zero filtering shows that the readings were taken mostly in the 2005 -2009 periods.

But when we filter the location and measure, e.g. in this instance Water Temperature readings in Kannika, we will find that there is time where no reading was taken. Also, there was a lot of readings taken in 2011.

Dashboard 2: Multi-Facet View of Measures using Row / Column Dividers

The Calendar view helps to see where the missing data are. But that is useful only to the number of records taken. To study the actual results and compare the measures across different location, or to get a view of all the measures in one location, I created Dashboard 2, the Multi-Facet view.

To create this view, I need to create 2 calculated fields. Both fields help to divide the entire view into equal portions based on the number of locations or measures I have filtered:

(i)                  Column Divider (formula):

 

(ii)                Row Divider (formula):

 

 

For 2a, Multi-Facet View of One Measure (or More) across Different Location

Step 1. Drag Column Divider to [Columns] and Row Divider to [Rows]. Click on the inverted triangle on the right of the pill and select [Compute Using] > [Location1]

Step 2. Drag Sample Date (Year) to [Columns] and Sum(Value) to [Rows]

Step 3. Measure and Location1 are added to the [Filters] as [Multiple Values(List)]. I can then filter the data based on location and measure. For this view, we only try to filter based on Measure.

Step 4. In the [Marks pane], Location1 is dragged to [Details] and Measure is dragged to [Labels].

Step 5. I added an [Average Line] which is dotted and show the average value of the measure shown.

Step 6. Lastly, I added an [Annotation] by area for each location so that we can see the graph for each location clearly.

Dashboard 2a allows us to compare the measures reading across different location. In this example, we can see the difference of Total Hardness across all 10 locations.

 

For 2b, Multi-Facet View of Multiple Measures in one location

Step 1. Drag Column Divider to [Columns] and Row Divider to [Rows]. Click on the inverted triangle on the right of the pill and select [Compute Using] > [Measures]

Step 2. Drag Sample Date (Year) to [Columns] and Sum(Value) to [Rows]

Step 3. Measure and Location1 are added to the [Filters] as [Multiple Values(List)]. I can then filter the data based on location and measure. For this view, we only try to filter based on Measure. The location should be fixed at one.  

Step 4. In the [Marks pane], Location1 is dragged to [Details] and Measure is dragged to [Labels].

Dashboard 2b allows us to compare multiple measure reading in one location. In this example, we see multiple measures that is related to the earlier pollution study in one location: Kohsoom.

With the knowledge gathered from Dashboard 1, 2a and 2b, we know that there are quite a lot of missing data and it shows that probably line graph is not the best way to represent the data.

Therefore, there is a 3rd variation, Dashboard 2c, Boxplot version.

The steps to obtain this is like the previous 2 dashboards except that I did not use Row and Column Dividers, and everything is converted to Boxplot instead of line graph. Years of the Sample Date is also embedded in the details.

 

Dashboard 2c allows us to compare multiple measure across multiple locations by looking at the boxplot, which indicates the average values through the years, the quartiles and if there are any outliers. For instance, for Dissolved Oxygen, the value is relatively constant except that is it higher in Decha through the years. It is also clear that for some measures such as Fecal Coliforms and Total Dissolved Phosphorus, Achara, Decha and Tansanee has not obtained any relevant readings.

Dashboard 2c is an improvement over 2a and 2b as it takes away the inaccuracy of using line graphs, but there is still value in keeping 2a and 2b, as it gives us a good way to isolate data and filter items out.

Dashboard 3: Dot-plots of Individual Measures for Seasonality

Dot-plots are useful when we want to zoom in to certain measures and explore the seasonality of the data. Not all data exhibit seasonality, but this is useful when we want to explore data that has regular records. One extremely useful measure for this will be Water Temperature. To obtain this dashboard:

Step 1. Drag Measure and Sample Date (break down to days) to [Columns]

Step 2. Drag Location1 and Value(Sum) to [Rows]

Step 3. Under [Marks], drag Value(Sum) to [Colors] and Id to [Details]. Choose a suitable color scheme to show the up and down of Sample Value.

Step 4. Measure and Location1 are added to the [Filters] as [Multiple Values(List)]. I can then filter the data based on location and measure. For this view, we only try to filter based on Measure and Location. The measure should be fixed at one. 

Dashboard 3 shows the seasonality of water temperature across all the locations. Water temperature peaks in Aug (summer) and go lowest around January (winter)

This dashboard plots the seasonality beautifully and can help to identify unusual readings easily as dots that are exhibiting unusual patterns will show up. 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Insights and Findings

Question 1: Characterize the past and most recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. Do you see any trends of possible interest in this investigation?

Insight from Data Visualization

The amount of readings taken over the years are uneven. There seem to be less readings in the distant past and the recent times, while more readings were taken of the middle years [2005 – 2008].

Taking an example of arsenic, data collection has also been sporadic. There are no data taken between 1999 to 2003, and 2005 to 2007. Extracting patterns from missing data will be inaccurate.

 

 

In an overview chart of all the measures taken for all locations, we can see that some values had huge peaks at certain times, e.g. iron in 2003 and increased deposits of dissolved salts from 2005. This data is inconclusive unless we delve deeper into each location and examine their seasonality.

For instance, heavy metals such as iron and zinc had interesting observations by comparing the past and present. These metals are potentially dangerous when found in high quantity in water.

 

Water Quality Measures are important as well, as dissolved minerals in high quantity could result in bacteria and viruses permeating through water, where animals and people can drink and be harmed.

When I looked at the seasonality of each measure over all the locations, we can find different trends between the past and present. For example, in selected locations, e.g. Achara, I found elevated levels of mercury from 2010 – 2011.

 

 

 

 

 

 

 

 

Or high levels of zinc in Kannika and Boonsri:

I also saw increased volatility in one location: Tansanee. The volatility seems to be apparent in a couple of measures: bicarbonates, chlorides, nitrates and total nitrogen.

The last observations for seasonality is that some data exhibit unstable readings in the early years and stable readings recently, e.g. chromium, and the reverse behaviour for some, e.g. total hardness. Water conditions differ through the years and over the different regions constantly.

Summary from Insights

The readings do not show consistent behaviour from the past to present. There are times when measures stabilized and in other times, the reverse happen. Some areas have missing data from the past but exhibit curious behaviour in the present, e.g. Tansanee. It is definitely worthwhile to investigate all these phenomena in greater details. Which leads us to the next question… can we improve the sampling process?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Question 2: What anomalies do you find in the waterway samples dataset? How do these affect your analysis of potential problems to the environment? Is the Hydrology Department collecting sufficient data to understand the comprehensive situation across the Preserve? What changes would you propose to make in the sampling approach to best understand the situation? 

Insight from Data Visualization

As discovered from the last question, the data collected are uneven and irregular. From the Calendar plot, we can see that there are times when more data was collected and time when the collection is very low.

It is also curious that for three sites, there were no data collected before 2009. It is probably due to water sensors being installed at a later stage.

Another major problem with the data is the lack of timestamping. It is found that for certain measures for specific locations, there could have been multiple readings taken. But as there were no time stamp, we would not know the exact change in reading throughout the day for that measure. In this example, there were 0 – 3 readings taken for arsenic in Sakda over the period of observation.

Another observation made is with regards to the placement of water sensors. Water readings are highly influenced by water tributaries and some of the sensors are connected to one another. The ten locations can be separated into four clusters (the red boxes below denote the 4 clusters). For instance, one cluster had 5 sensors and their readings may be interdependent.

 

 

Summary from Insights

The anomalies arise from the lack of data and irregular reading of the data. This is due to two factors:

-          Lack of resources to constantly read the data

-          Uneven distribution of water sensors over the four river streams

Also, if we have the avenue to overlay the Kasios factory location over the data, we can probably find out more correlations between the factory discharge and the water measure readings.

My recommended changes to the sampling approach include:

1.       Additional water sensors for each main river streams and at least one sensor at each major tributary

2.       Due to resource issues, to focus the reading on critical water contaminants (as I have outlined in the data preparation portion)

3.       Reduced but focused reading of data. Since the water data are largely seasonal, one way to reduce effort is to focus the reading for the data over a focused period. We can do 3 – 5 measures for all locations from Jan – Mar, and switch to another 3 – 5 measures. This will help to alleviate the stress of limited resources yet provide a better apple-to-apple comparison.

Now that we understand the shortcoming of this data set, our primary problem has not been solved yet. Are the water data showing anything that is related to the beautiful birds?

 

 

 

 

 

 

 

 

 

 

 

 

Question 3: After reviewing the data, do any of your findings cause particular concern for the Pipit or other wildlife? Would you suggest any changes in the sampling strategy to better understand the waterways situation in the Preserve? 

Insight from Data Visualization

The culprit in question: Methylosmoline.

The data does not stretch a long time into the past, and it is fairly recent. There is a sharp increase in the data for both Kohsoom and Somchair, with Chai having a higher than average reading as well.

Surprisingly, the increase in Methylosmoline readings in the two areas did not affect the mineral deposits as well. So, there are inconclusive evidence of correlation between Methylosomoline with other measures.

The behaviour of Methylosmoline at Chai, Kohsoom and Somchair are not consistent as well.

There are two curious observations. In Kohsoom, there was an extremely high reading of 145 µg/l on Aug 16, and the whole period of 2016 had very high values. It tapered down in 2017, but there seems to be suspicious activities in 2016.

For Somchair, the high value of 130 µg/l is sustained from mid-2016 to now, which means that there are potentially dumping activities in that region happening even now, and the activities are continuing.

These 2 readings posed a serious threat to the wellbeing of the Pipit population (as well as the wildlife).

 

The total dissolved salt in water is cause for concern as well, as the Pipit will drink from the water and too much salt could lead to death. Tansanee, Busarakhan and Somchair has been experiencing high levels of dissolved salt in recent years.

 

The Coliforms level is elevated in Kohsoom as well, which could cause harmful bacteria leading to more death to the Pipit population.

 

Summary from Insights

The findings on methylosmoline, dissolved salt and coliforms caused particular concern to Pipit and other wildlife in the waterways. There seems to be elevated readings of these measures and all could cause potential harm as it could:

-          Affect the water they consume

-          Cause illnesses such as skin disease with constant contact

-          Cause harm to neighbouring flora that uses the water for irrigation, which in turn causes harm as the animals eat the flora

However, I cannot find direct connection between the various measures, e.g. increased level of methylosmoline does not impact any other water pollutants in the same area directly.

This lack of correlation is due to the observations in Question 2:

-          Lack of resources to constantly read the data

-          Uneven distribution of water sensors over the four river streams

The change in sampling strategies are the same:

1.       Additional water sensors for each main river streams and at least one sensor at each major tributary

2.       Reduced but focused reading of critical water contaminants data

 

 

 

 

Conclusion

In summary, we can reasonably conclude that:

1.       Irregular Recording Intervals. The readings do not show consistent behaviour from the past to present. There are times when measures stabilized and in other times, the reverse happen. Some areas have missing data from the past but exhibit curious behaviour in the present, e.g. Tansanee. It is worthwhile to investigate all these phenomena in greater details.

2.       Reasons for Missing Data. It would be from the lack of resources and the uneven distribution of water sensors over the four river streams. Also, if we have the avenue to overlay the Kasios factory location over the data, we can probably find out more correlations between the factory discharge and the water measure readings.

3.       Changes to Sampling Strategies. I suggest a few ways to improve sampling, (1) Additional water sensors for each main river streams and at least one sensor at each major tributary, (2) focus the reading on critical water contaminants and (3) reduced but focused reading of data. Since the water data are largely seasonal, one way to reduce effort is to focus the reading for the data over a focused period. We can do 3 – 5 measures for all locations from Jan – Mar, and switch to another 3 – 5 measures. This will help to alleviate the stress of limited resources yet provide a better apple-to-apple comparison.

4.       Environmental Impact on the Pipit and Wildlife. The findings on methylosmoline, dissolved salt and coliforms caused concern to Pipit and other wildlife in the waterways. There seems to be elevated readings of these measures and all could cause potential harm as it could (1) affect the water they consume, (2) cause illnesses such as skin disease with constant contact and (3) cause harm to neighbouring flora that uses the water for irrigation, which in turn causes harm as the animals eat the flora. However, I cannot find direct connection between the various measures, e.g. increased level of methylosmoline does not impact any other water pollutants in the same area directly. Once sampling strategies are improved, more meaningful conclusion can be drawn.