VISUAL ASSIGNMENT MINI-CHALLENGE 2: LIKE A DUCK
TO WATER

Introduction:
We are given several years of water sensor readings
from rivers and streams in the Boonsong Lekagul Wildlife Reserve from Professors of the Mistford College Hydrology Department. These samples were
taken from different locations throughout the reserve, which is not necessary
near any of the factories from Kasios. I am now given
the task of looking at this data and see if I can find any evidence that can
replace or supplement the soil evidence that was destroyed. Let’s go!
History:
In
2017, the Kasios Furniture Company was implicated in
environmental damage to the Boonsong Lekagul Wildlife Preserve for both dumping toxic waste and
polluting the air with chemicals from its manufacturing process. Kasios denied any accusation of industrial waste dumping
and stated that there isn’t any ground contamination.
With
the assistance from angry ornithology Mistford
College Professors, it was reported that there was a new range station being
built at the site. However, soil samples taken from the site were inconclusive
in detecting Methylosmolene, the major toxic culprit
of the suspected dumping.
Since
soil samples are not reliable sources anymore, it is time to look at the
hydrology data from the reserve. I am given a map of the Preserve, with named
sampling sites indicated on the map. I am also provided with readings from each
sampling station over time for several different chemicals and water
properties.
Some Information to Start with:
Methylosmolene is the worst volatile
organic compound (VOC) being monitored. It makes sense to focus on its
readings.
AGOC-3A is a replacement
that is more environmentally friendly.
The Questions:
Question 1: Characterize the past and most
recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. Do
you see any trends of possible interest in this investigation?
Question 2: What anomalies do you find in the
waterway samples dataset? How do these affect your analysis of potential
problems to the environment? Is the Hydrology Department collecting sufficient
data to understand the comprehensive situation across the Preserve? What
changes would you propose to make in the sampling approach to best understand
the situation?
Question 3: After reviewing the data, do any
of your findings cause particular concern for the
Pipit or other wildlife? Would you suggest any changes in the sampling strategy
to better understand the waterways situation in the Preserve?
Tableau
Public Links:
You can find the Data Visualization here –
https://public.tableau.com/profile/lim.wee.kiong#!/vizhome/VASTMini-Case2InteractiveDashboardLWK/Q1
https://public.tableau.com/profile/lim.wee.kiong#!/vizhome/VASTMini-Case2MapLWK/Dashboard1
Data Preparation:
Understanding the Raw
Data – Samples Readings and Measures
The data given to us comes in 2 main files: Boonsong Lekagul waterways
readings.csv and chemical units of measure.csv.
Data cleaning is done in Excel and data
visualization in Tableau.
Descriptions of the data fields for Boonsong Lekagul waterways
readings are as follow:
Field
|
Description
|
ID
|
Identification number for the record (only for
bookkeeping)
|
Value
|
Measured value for the chemical or property in
this record
|
Location
|
Name of the location sample was taken from. See the map for geo-location of the
sampling site.
|
Sample Date
|
Date sample was taken from the location
|
Measure
|
Chemicals (e.g., Sodium) or water properties
(e.g., Water temperature) measured in the record
|
A sample of the data is shown here:

There
are a total of 136,825 sample data points
across 104 different measures.
The
chemical units of measure csv file is basically the
measures with an additional field for the units of measurement. The sample data
is as shown below:

At
this moment, there does not seem to be any need to clean the data as it looks
usable. However, an initial scan of the csv file shows that there could
potentially be missing data for several, if not all the measures.
Deriving Auxiliary Data –
The Map
Another data given to us is the Waterways Final.jpg,
which is a low-res map of the preserve and it shows the location of the various
sampling points. I believed there is value in knowing the exact coordinates of
each point and hence I have created a tableau version of the map.
Step 1: A new location.csv is created with the
coordinates of the preserve locations and the 4 corners of the map:
|
X
|
Y
|
|
UL
|
0
|
249
|
|
LL
|
0
|
0
|
|
UR
|
249
|
249
|
|
LR
|
249
|
0
|
|
Achara
|
106.5
|
161.18
|
|
Boonsri
|
134.88
|
196.48
|
|
Busarakhan
|
184.7
|
141.8
|
|
Chai
|
153.6
|
126.6
|
|
Decha
|
38
|
101
|
|
Kannika
|
165.3
|
70.6
|
|
Kohsoom
|
185.4
|
166
|
|
Sakda
|
133.5
|
34.6
|
|
Somchair
|
85.1
|
132.1
|
|
Tansanee
|
84.4
|
78.9
|
Step 2: Location.csv is loaded into Tableau and X is
plotted to [Columns] and Y to [Rows]. Location is mapped to [Details].
Step 3: The Waterways jpg is loaded via [Map] >
[Background Images] > [Add Images] > [Waterways Final] to obtain the
final output.

The
points are annotated as well so that when the cursor is at each location, we
can see the exact coordinates of each station:

This
map is important as it will be used as part of the auxiliary data for our
analysis, as we try to determine whether water flow contributes to the
readings.
Obtaining Knowledge on
Hydrology
While this is not a requirement, but it seems useful
to learn more about hydrology and water pollution as we embarked on this task.
We have established earlier that Methylosmolene
is the main toxic compound in question. But what other chemicals or measures
would be useful in knowing its impact to the fauna in the preserve, especially
the birds?

Leading Water
Contaminants and Measures: By
looking at the leading contaminants, I can focus my attention on them as the
measures given are extensive and it is not useful to look at all of them.
Categories
|
Measures in the
Category
|
|
Heavy
Metals and Industrial Deposits: metals are easily washed into streams and groundwater. Copper and
mercury are also found in fungicides. These heavy metals are toxic to biological
life including the people who may have to drink from the polluted rivers.
Crops that have been irrigated with polluted water can also be dangerous.
Heavy metals can also build up in the body causing symptoms of poisoning.
|
Aluminium
Arsenic
Barium
Copper
Lead
Mercury
Selenium
Silver
Zinc
|
|
Chlorine
and Detergents: Paper and pulp mills use
up large amounts of water and produce a lot of polluted wastewater. The
wastewater contains strong chemicals such as chlorine, which is used to make
paper white and soft. Textile factories also release strong chemicals like
caustic soda, acids, dyes and detergents into water. These strong poisons
also cause bird and fish kills
|
Chlorine
Chloramine
Chromium
|
|
Fertilisers
and Nitrates: Some chemicals like
fertilisers are made of substances that do occur naturally in the
environment, but only in small amounts. Phosphates and nitrates are found in
fertilisers, sewage and soaps. The normal low phosphate level in water
inhibits the growth of plants but a small increase of phosphates can result
in a rapid increase in plant growth such as blue-green algae and water
hyacinth, especially in dams. The water plants become overcrowded and die.
When they die, the decomposing bacteria uses up more oxygen and affects other
forms of life badly, e.g. fish suffocate. This process is eutrophication.
Nitrogen in the form of ammonia and nitrates form
part of the plant nutrients that can lead to eutrophication. Nitrogen
normally occurs in a form that plants cannot use (i.e. nitrogen gas),
however, it may be used in the decomposition of dead water plants and by
blue-green algae which can convert nitrogen in the air into ammonia and
nitrates that plants can use.
|
Ammonia
Nitrates
Nitrites
Total
Dissolved Phosphorus
Total
Phosphorus
|
|
Bacteria
and Virus in Water: Some of the more dangerous microbial contaminants,
such as E. coli, Giardia, and Cryptosporidium, can cause
gastrointestinal problems and flu-like symptoms commonly attributed to
undercooked or improperly stored food.
|
Total
Coliforms
Fecal Coliforms
|
|
Salt
in Water: When you consume too
much sodium, the body holds extra water. The kidneys which filter out waste
from the blood, maintain a special ratio of electrolytes, such as sodium to
potassium, to water. More salt in the diet means the kidneys keep more water
in the system. That can have lots of undesirable effects, such as edema (swelling in places like the hands, arms, feet,
ankles, and legs); more fluid in general means more blood coursing through
veins and arteries. Over time, that causes them to stiffen, which could lead
to high blood pressure.
|
Total
Dissolved Salts (at most 600mg/l)
|
The above table will form the basis in which data
filtering will be done as we focus more on these measures than the rest, due to
their undesirable impacts to life.
References: The references for the above-mentioned
information came from:
https://www.wqa.org/learn-about-water/common-contaminants
https://www.conserve-energy-future.com/sources-and-causes-of-water-pollution.php
http://www.waterwise.co.za/site/water/environment/substances.html
Dashboard Design
Joining the Data Sources
The first step is to join the 3 data sources into
one and do some Exploratory Data Analysis:

Location and the Waterway data points will be joined
as an inner join while the Waterway data points are joined with the units of
measurements as a right outer join.
The sample joined data is shown here:

It contained:
-
ID: ID of the Sample Reading
-
Value: Value of the Sample Reading
-
Location1: Location of the Reading
-
Sample Date: Date of the Reading
-
Measure: The measure of the Sample Reading
-
Unit: Unit of the Sample Reading
-
X / Y: X and Y coordinates of the Location

The challenge for this challenge is to find
patterns or insights from 100+ measures which are of different natures and
exhibit different behaviour. I designed a couple of interactive dashboards to
sieve through the data. Here are brief introductions of how they are designed
and what are their potential usage. The insights gathered are done through
variations of the following 4 interactive dashboards:
Dashboard 1: Calendar
View of Sample Counts
This is used to have an over-arching view of the
number of sample readings taken for each measure and each location. We can then
understand whether the data were obtained regularly or if there are some gaps
in data-gathering.
Step 1. For Sample Dates, break down to Month in
[Columns] and Year in [Rows]. This enables me to have a clear view of the data
across the years (1998 to 2016) and over the months (Jan to Dec)
Step 2. Drag Id to [Details]. Change the [Measure]
to [Count]. This is because I am interested in the number of readings taken for
a period for individual locations.
Step 3. Measure and Location1 are added to the
[Filters] as [Multiple Values(List)]. I can then filter the data based on
location and measure
Step 4. Location1 and Measure are added to the
[Tooltip] to provide more information.

The default view with zero filtering shows that the readings
were taken mostly in the 2005 -2009 periods.

But when we filter the location and measure, e.g. in
this instance Water Temperature readings in Kannika,
we will find that there is time where no reading was taken. Also, there was a
lot of readings taken in 2011.
Dashboard 2: Multi-Facet
View of Measures using Row / Column Dividers
The Calendar view helps to see where the missing
data are. But that is useful only to the number of records taken. To study the
actual results and compare the measures across different location, or to get a
view of all the measures in one location, I created Dashboard 2, the
Multi-Facet view.
To create this view, I need to create 2 calculated
fields. Both fields help to divide the entire view into equal portions based on
the number of locations or measures I have filtered:
(i)
Column
Divider (formula):

(ii)
Row Divider
(formula):

For 2a, Multi-Facet View of One Measure (or More)
across Different Location
Step 1. Drag Column Divider to [Columns] and Row
Divider to [Rows]. Click on the inverted triangle on the right of the pill and
select [Compute Using] > [Location1]
Step 2. Drag Sample Date (Year) to [Columns] and
Sum(Value) to [Rows]
Step 3. Measure and Location1 are added to the
[Filters] as [Multiple Values(List)]. I can then filter the data based on
location and measure. For this view, we only try to filter based on Measure.
Step 4. In the [Marks pane], Location1 is dragged to
[Details] and Measure is dragged to [Labels].
Step 5. I added an [Average Line] which is dotted
and show the average value of the measure shown.
Step 6. Lastly, I added an [Annotation] by area for
each location so that we can see the graph for each location clearly.

Dashboard 2a allows us to compare the measures
reading across different location. In this example, we can see the difference
of Total Hardness across all 10 locations.
For 2b, Multi-Facet View of Multiple Measures in one
location
Step 1. Drag Column Divider to [Columns] and Row
Divider to [Rows]. Click on the inverted triangle on the right of the pill and
select [Compute Using] > [Measures]
Step 2. Drag Sample Date (Year) to [Columns] and
Sum(Value) to [Rows]
Step 3. Measure and Location1 are added to the
[Filters] as [Multiple Values(List)]. I can then filter the data based on
location and measure. For this view, we only try to filter based on Measure. The
location should be fixed at one.
Step 4. In the [Marks pane], Location1 is dragged to
[Details] and Measure is dragged to [Labels].

Dashboard 2b allows us to compare multiple measure
reading in one location. In this example, we see multiple measures that is
related to the earlier pollution study in one location: Kohsoom.
With
the knowledge gathered from Dashboard 1, 2a and 2b, we know that there are
quite a lot of missing data and it shows that probably line graph is not the
best way to represent the data.
Therefore,
there is a 3rd variation, Dashboard 2c, Boxplot version.
The
steps to obtain this is like the previous 2 dashboards except that I did not
use Row and Column Dividers, and everything is converted to Boxplot instead of
line graph. Years of the Sample Date is also embedded in the details.

Dashboard
2c is an improvement over 2a and 2b as it takes away the inaccuracy of using
line graphs, but there is still value in keeping 2a and 2b, as it gives us a
good way to isolate data and filter items out.
Dashboard 3: Dot-plots of
Individual Measures for Seasonality
Dot-plots are useful when we want to zoom in to
certain measures and explore the seasonality of the data. Not all data exhibit
seasonality, but this is useful when we want to explore data that has regular
records. One extremely useful measure for this will be Water Temperature. To
obtain this dashboard:
Step 1. Drag Measure and Sample Date (break down to
days) to [Columns]
Step 2. Drag Location1 and Value(Sum) to [Rows]
Step 3. Under [Marks], drag Value(Sum) to [Colors] and Id to [Details]. Choose a suitable color scheme to show the up and down of Sample Value.
Step 4. Measure and Location1 are added to the
[Filters] as [Multiple Values(List)]. I can then filter the data based on
location and measure. For this view, we only try to filter based on Measure and
Location. The measure should be fixed at one.

Dashboard 3 shows the seasonality of water
temperature across all the locations. Water temperature peaks in Aug (summer)
and go lowest around January (winter)
This
dashboard plots the seasonality beautifully and can help to identify unusual
readings easily as dots that are exhibiting unusual patterns will show up.
Insights and Findings
Question 1: Characterize the past and most
recent situation with respect to chemical contamination in the Boonsong Lekagul waterways. Do
you see any trends of possible interest in this investigation?
Insight from Data
Visualization
The amount of readings taken over the years are
uneven. There seem to be less readings in the distant past and the recent
times, while more readings were taken of the middle years [2005 – 2008].

Taking an example of arsenic, data collection
has also been sporadic. There are no data taken between 1999 to 2003, and 2005
to 2007. Extracting patterns from missing data will be inaccurate.

In an overview chart of all the measures taken
for all locations, we can see that some values had huge peaks at certain times,
e.g. iron in 2003 and increased deposits of dissolved salts from 2005. This
data is inconclusive unless we delve deeper into each location and examine
their seasonality.

For instance, heavy metals such as iron and
zinc had interesting observations by comparing the past and present. These
metals are potentially dangerous when found in high quantity in water.


Water Quality Measures are important as well,
as dissolved minerals in high quantity could result in bacteria and viruses
permeating through water, where animals and people can drink and be harmed.

When I looked at the seasonality of each
measure over all the locations, we can find different trends between the past
and present. For example, in selected locations, e.g. Achara,
I found elevated levels of mercury from 2010 – 2011.

Or high levels of zinc in Kannika
and Boonsri:

I also saw increased volatility in one
location: Tansanee. The volatility seems to be
apparent in a couple of measures: bicarbonates, chlorides, nitrates and total
nitrogen.

The last observations for seasonality is that
some data exhibit unstable readings in the early years and stable readings
recently, e.g. chromium, and the reverse behaviour for some, e.g. total
hardness. Water conditions differ through the years and over the different
regions constantly.


Summary from Insights
The readings do not show consistent behaviour from
the past to present. There are times when measures stabilized and in other
times, the reverse happen. Some areas have missing data from the past but
exhibit curious behaviour in the present, e.g. Tansanee.
It is definitely worthwhile to investigate all these
phenomena in greater details. Which leads us to the next question… can we
improve the sampling process?
Question 2: What anomalies do you find in the
waterway samples dataset? How do these affect your analysis of potential
problems to the environment? Is the Hydrology Department collecting sufficient
data to understand the comprehensive situation across the Preserve? What
changes would you propose to make in the sampling approach to best understand
the situation?
Insight from Data
Visualization
As discovered from the last question, the data
collected are uneven and irregular. From the Calendar plot, we can see that
there are times when more data was collected and time when the collection is
very low.

It is also curious that for three sites, there
were no data collected before 2009. It is probably due to water sensors being
installed at a later stage.

Another major problem with the data is the
lack of timestamping. It is found that for certain measures for specific
locations, there could have been multiple readings taken. But as there were no
time stamp, we would not know the exact change in reading throughout the day
for that measure. In this example, there were 0 – 3 readings taken for arsenic
in Sakda over the period of observation.

Another observation made is with regards to
the placement of water sensors. Water readings are highly influenced by water
tributaries and some of the sensors are connected to one another. The ten
locations can be separated into four clusters (the red boxes below denote the 4
clusters). For instance, one cluster had 5 sensors and their readings may be
interdependent.


Summary from Insights
The anomalies arise from the lack of data and
irregular reading of the data. This is due to two factors:
-
Lack
of resources to constantly read the data
-
Uneven
distribution of water sensors over the four river streams
Also, if we have the avenue to overlay the Kasios factory location over the data, we can probably find
out more correlations between the factory discharge and the water measure
readings.
My recommended changes to the sampling approach
include:
1.
Additional
water sensors for each main river streams and at least one sensor at each major
tributary
2.
Due
to resource issues, to focus the reading on critical water contaminants (as I
have outlined in the data preparation portion)
3.
Reduced
but focused reading of data. Since the water data are largely seasonal, one way
to reduce effort is to focus the reading for the data over a focused period. We
can do 3 – 5 measures for all locations from Jan – Mar, and switch to another 3
– 5 measures. This will help to alleviate the stress of limited resources yet
provide a better apple-to-apple comparison.
Now that we understand the shortcoming of this data
set, our primary problem has not been solved yet. Are the water data showing
anything that is related to the beautiful birds?
Question 3: After reviewing the data, do any of your
findings cause particular concern for the Pipit or
other wildlife? Would you suggest any changes in the sampling strategy to
better understand the waterways situation in the Preserve?
Insight from Data
Visualization
The culprit in question: Methylosmoline.
The data does not stretch a long time into the past,
and it is fairly recent. There is a sharp increase in
the data for both Kohsoom and Somchair,
with Chai having a higher than average reading as well.

Surprisingly, the increase in Methylosmoline
readings in the two areas did not affect the mineral deposits as well. So,
there are inconclusive evidence of correlation between Methylosomoline
with other measures.

The behaviour of Methylosmoline
at Chai, Kohsoom and Somchair
are not consistent as well.
There are two curious observations. In Kohsoom, there was an extremely high reading of 145 µg/l on
Aug 16, and the whole period of 2016 had very high values. It tapered down in
2017, but there seems to be suspicious activities in 2016.
For Somchair, the
high value of 130 µg/l is sustained from mid-2016 to now, which means that
there are potentially dumping activities in that region happening even now, and
the activities are continuing.
These 2 readings posed a serious threat to the
wellbeing of the Pipit population (as well as the wildlife).

The total dissolved salt in water is cause for
concern as well, as the Pipit will drink from the water and too much salt could
lead to death. Tansanee, Busarakhan
and Somchair has been experiencing high levels of
dissolved salt in recent years.

The Coliforms level is elevated in Kohsoom as well, which could cause harmful bacteria leading
to more death to the Pipit population.

Summary from Insights
The findings on methylosmoline,
dissolved salt and coliforms caused particular concern
to Pipit and other wildlife in the waterways. There seems to be elevated
readings of these measures and all could cause potential harm as it could:
-
Affect
the water they consume
-
Cause
illnesses such as skin disease with constant contact
-
Cause
harm to neighbouring flora that uses the water for irrigation, which in turn
causes harm as the animals eat the flora
However, I cannot find direct connection between the
various measures, e.g. increased level of methylosmoline
does not impact any other water pollutants in the same area directly.
This lack of correlation is due to the observations
in Question 2:
-
Lack
of resources to constantly read the data
-
Uneven
distribution of water sensors over the four river streams
The change in sampling strategies are the same:
1.
Additional
water sensors for each main river streams and at least one sensor at each major
tributary
2.
Reduced
but focused reading of critical water contaminants data
Conclusion
In summary, we can reasonably conclude that:
1.
Irregular
Recording Intervals.
The readings do not show consistent behaviour from the past to present. There
are times when measures stabilized and in other times, the reverse happen. Some
areas have missing data from the past but exhibit curious behaviour in the present,
e.g. Tansanee. It is worthwhile to investigate all
these phenomena in greater details.
2.
Reasons
for Missing Data. It would be
from the lack of resources and the uneven distribution of water sensors over
the four river streams. Also, if we have the avenue to overlay the Kasios factory location over the data, we can probably find
out more correlations between the factory discharge and the water measure
readings.
3.
Changes
to Sampling Strategies. I
suggest a few ways to improve sampling, (1) Additional water sensors for each
main river streams and at least one sensor at each major tributary, (2) focus
the reading on critical water contaminants and (3) reduced but focused reading
of data. Since the water data are largely seasonal, one way to reduce effort is
to focus the reading for the data over a focused period. We can do 3 – 5
measures for all locations from Jan – Mar, and switch to another 3 – 5
measures. This will help to alleviate the stress of limited resources yet
provide a better apple-to-apple comparison.
4.
Environmental
Impact on the Pipit and Wildlife. The findings on methylosmoline, dissolved
salt and coliforms caused concern to Pipit and other wildlife in the waterways.
There seems to be elevated readings of these measures and all could cause
potential harm as it could (1) affect the water they consume, (2) cause
illnesses such as skin disease with constant contact and (3) cause harm to
neighbouring flora that uses the water for irrigation, which in turn causes
harm as the animals eat the flora. However, I cannot find direct connection
between the various measures, e.g. increased level of methylosmoline
does not impact any other water pollutants in the same area directly. Once
sampling strategies are improved, more meaningful conclusion can be drawn.