Exploring the UNICEF
World Socioeconomic Statistics
using HCE3.0
February, 28
1. Problem
This report deals with socioeconomic statistics of 195 countries of the world. For each country, 18 different features are measured. I try to find meaningful clusters among these countries based on these features, find features which are highly correlated, find countries which stand out as anomalies, etc. Meanwhile, I focus more on where my own country,
2. Dataset
The data I am using here comes from UNICEF. I chose this dataset because it is the most up-to-date one of its kind; more specifically, it was gathered and synthesized in 2006. The other similar data I looked at were mostly from 2005.
This data set can be found at http://www.unicef.org/statistics/index_24183.html. I used the statistical tool provided there to generate the required Excel files. To get the data from this database, I first chose the countries I was interested in (which in my case was all the countries). I had to do so in two rounds, as the system was unable to give the data for all countries in one round. The next task was to choose the desired dimensions for the countries. The statistics associated with each country are divided into 10 main subgroups: basic indicators, child protection, demographics, economics, education, health, HIV AIDS, nutrition, rate of progress and women. Under each subgroup, there are different dimensions that one can choose to include in the final table. I chose the following total of 18 dimensions from basic indicators, demographics, economics, education, health, rate of progress and women:
GNI (Gross National Income) per capita (US $), GDP (Gross Domestic Product) per capita average annual growth rate, Life expectancy at birth (years), Life expectancy: females as a percentage of males, Population annual growth rate, Percentage of population under 5, Percentage of population under 18, Percentage of population urbanized, Average annual growth rate of urban population, Maternal mortality ratio, Percentage of population using adequate sanitation facilities, Contraceptive prevalence, Antenatal care coverage rate, Average annual rate of reduction, Total fertility rate, Number per 100 population of internet users, Total adult literacy rate, Adult literacy rate: females as a percentage of males.
I had initially added more dimensions, but I deleted a couple of them, either because missing values made the dimension non-informative or because HCE3.0 has difficulty dealing with the data when the number of dimensions is really high (especially when there is a large number of data items).
3. Approach
I followed the GRID principle to explore the dataset. It turned out to be a very nice approach, and it helped me find many interesting features in the dataset.
3.1. Dendrogram Overview and Mosaic View
Loading the data into HCE3.0 leads to the following dendrogram for the world:

Figure 1. Dendrogram of the world statistics
Here is where the first nice pattern comes up. The bottom rows of the color mosaic correspond to the countries' rates of young population and death rates. As indicated in the figure, Western countries have very low values; African countries have high values and
The next step is extracting

Figure 2.
Looking at the above detailed view, the following can be inferred:
· The brightest red point in
· There are more green points in
· The other nice property of the produced color mosaic is that I was able to see which countries are clustered close to
The most interesting point in this detailed view is the following: It is widely thought that
· Misbelief: Iran has one of the world’s largest percentages of under 18 population!
3.2. Histogram
There are 18 dimensions available in the histogram tab. In the histogram ordering, the following criteria were those I found more meaningful in the context of my dataset. Also I tried to mark
· Number of outliers:
1. Adult literacy rate: females as a percentage of males:

Figure 3. Outliers when mapped on Adult literacy rate: females as a % of males
Not surprisingly, all of them are very poor African countries.
2. Total adult literacy rate:

Figure 4. Outliers when mapped on Total adult literacy rate
The fact that the two dimensions ranking highest in outlier identification both concern education, and both identify African countries, shows how much poor African countries need help from the rest of the world.
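The "number of outliers" ordering can be approximated outside HCE. The following is a minimal sketch, assuming the common box-plot rule (values more than 1.5 × IQR outside the quartiles); whether HCE3.0 uses exactly this rule is an assumption here, the function name `count_outliers` is hypothetical, and the literacy values are made-up placeholders rather than UNICEF data.

```python
import statistics

def count_outliers(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]; missing values (None) are skipped."""
    data = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in data if v < low or v > high)

# Hypothetical literacy rates (percent), for illustration only.
literacy = [99, 98, 97, 95, 93, 92, 90, 88, 87, 85, None, 20, 15]
print(count_outliers(literacy))
```

Applying such a function to every dimension and sorting by the result would give an ordering in the spirit of the one used above.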
· Biggest gap:

Figure 5. Biggest gap is produced by a little-known country!
In this case, I modified the data set and deleted

Figure 6.

Figure 7. Large gaps are produced by mapping on GDP growth rate.
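The "biggest gap" criterion can be illustrated with a small sketch. It assumes the gap is measured between adjacent values after sorting; this report does not state whether HCE measures it on raw values or on histogram bins, so the helper `biggest_gap` below is a hypothetical stand-in, and the growth rates are made-up numbers.

```python
def biggest_gap(values):
    """Return (gap, lower, upper) for the widest gap between adjacent sorted values."""
    data = sorted(v for v in values if v is not None)
    if len(data) < 2:
        raise ValueError("need at least two values")
    # Compare each value with its successor and keep the widest difference.
    return max((b - a, a, b) for a, b in zip(data, data[1:]))

# Hypothetical GDP growth rates (percent), for illustration only.
growth = [1.0, 2.0, 9.0, 2.5, None]
gap, lower, upper = biggest_gap(growth)
print(f"largest gap {gap} lies between {lower} and {upper}")
```

A single country far from the rest, as in the figures above, shows up as one large gap at either end of the sorted values.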
· Uniformity:
One interesting observation concerns the average annual rate of reduction. Millennium Development Goal 4 (MDG 4) calls on countries to reduce under-five mortality by two thirds between 1990 and 2015, which implies a target average annual rate of reduction of 4.4%. The statistics show the following:
· MINIMUM: Zimbabwe, -3.4
· MAXIMUM:
· MEAN: 3.5, which is less than the targeted 4.4.
· STDEV: 2.19
And
· World unable to achieve the aimed rate of reduction whereas
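The 4.4% target quoted in this section can be checked from the two-thirds reduction over 25 years (1990 to 2015). The sketch below assumes a continuously compounded rate, ln(start/end)/years; the report does not say which convention the UNICEF tables use, so a discrete (year-on-year) version is shown for comparison. Both function names are hypothetical.

```python
import math

def aarr_continuous(start, end, years):
    """Average annual rate of reduction (percent), continuous compounding."""
    return 100 * math.log(start / end) / years

def aarr_discrete(start, end, years):
    """Average annual rate of reduction (percent), year-on-year compounding."""
    return 100 * (1 - (end / start) ** (1 / years))

# A two-thirds reduction leaves one third of the 1990 level after 25 years.
print(round(aarr_continuous(3.0, 1.0, 25), 1))  # 4.4
print(round(aarr_discrete(3.0, 1.0, 25), 1))    # 4.3
```

Under either convention, a negative value such as the minimum listed above means the rate rose over the period instead of falling.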
3.3. Scatterplot
There were a total of 153 combinations available (all pairs of the 18 dimensions). In the scatterplot ordering, the following criteria (and their combination) were those I found more meaningful in the context of my dataset:
· Correlation Coefficient and
Before looking at some meaningful combinations, there are two points to mention:
First, in the ordered list of pairs of dimensions, the first few ranked combinations are the trivial ones (such as plotting rate of population under 18 versus rate of population under 5), so I skip the trivial ones and only mention some interesting combinations which are still highly correlated. Second, in all scatterplots

Figure 8. Plotting all countries over Urban population growth rate versus Total population growth rate
This makes sense, as GNI can be viewed as a measure of individual wealth in a country. As marked in the figure, Iran does not have that many internet users yet, and unfortunately its GNI per capita is also lower than the world average.

Figure 9. Plot over Number of Internet users versus GNI per capita

Figure 10. Plot over Fertility rate versus Percentage of population urbanized

Figure 11. Plot over Rate of population under 5 versus Contraceptive prevalence
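The ordering of dimension pairs by correlation coefficient used in this section can be sketched as follows. This is only an illustration, assuming a plain Pearson coefficient computed over the countries where both values are present; how HCE3.0 treats missing values in its ranking is not stated in this report. The helpers `pearson` and `rank_pairs` and the toy columns are hypothetical, not UNICEF data.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson r over the rows where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def rank_pairs(columns):
    """Rank all dimension pairs by |r|, strongest first."""
    scored = [(abs(pearson(columns[a], columns[b])), a, b)
              for a, b in combinations(sorted(columns), 2)]
    return sorted(scored, reverse=True)

# 18 dimensions give 18 * 17 / 2 = 153 pairs.
print(math.comb(18, 2))

# Made-up toy columns, for illustration only.
toy = {
    "under_5": [5.0, 9.0, 14.0, 17.0],
    "under_18": [20.0, 33.0, 45.0, 52.0],
    "urbanized": [80.0, 35.0, 60.0, None],
}
for score, a, b in rank_pairs(toy):
    print(f"{a} vs {b}: |r| = {score:.2f}")
```

In the toy data, the "trivial" pair (under 18 versus under 5) comes out on top, which mirrors the reason for skipping the first few ranked combinations.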
The last thing I did was exploring scatterplots to find anomalies. Here are some of them:
· Look at the scatterplot of Maternal mortality rate versus Antenatal care coverage rate. Expectedly, the maternal mortality rate becomes lower with more antenatal care coverage. Two obvious anomalies are

Figure 12. Anomaly: High Maternal mortality rate in spite of high Antenatal care coverage
· Look at the scatterplot of Life expectancy at birth versus GDP growth rate. The very obvious anomaly is

Figure 13. Anomaly: Very low Life expectancy and very high GDP growth rate
4. Comments
The first problem with HCE3.0 was that I could not choose which dimensions contribute to the value assigned to each cluster (which is then used to compute the similarities in the clustering algorithm): when I tried to uncheck some of the dimensions, I got an error message. The next problem was that I could not close a single file in the system, so each time I had to close the whole system and load it again.
One other problem is that the system is unable to deal with non-numeric values. This is not a problem if the non-numeric value is YES/NO, as I easily converted those to 0/1; but as a problematic instance, consider a field shown as a percentage (with the character % included in the dimension value). In such cases, I had to delete the % character manually to enable the system to deal with the numbers correctly.
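The manual cleanup described above could also be scripted before loading the file into HCE. The following is a minimal sketch, not the procedure actually used for this report; the function name `to_number` is hypothetical, and treating empty cells and "-" as missing values is an assumption about the exported file.

```python
def to_number(cell):
    """Convert one exported cell to a float, or None when it is missing."""
    text = str(cell).strip()
    if text == "" or text == "-":   # assumed markers for missing data
        return None
    upper = text.upper()
    if upper == "YES":
        return 1.0
    if upper == "NO":
        return 0.0
    # Drop a trailing percent sign and thousands separators, then parse.
    return float(text.rstrip("%").replace(",", ""))

row = ["85%", "Yes", "no", "", "1,250"]
print([to_number(cell) for cell in row])
```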
Another shortcoming was that with 29 variables, the system encountered problems while running the clustering algorithm: it generated color mosaics for each country but did not cluster them.
One other general problem is the lack of screen space: you have to keep scrolling left and right to view the scatterplot and the orderings. One option might be to add a feature that lets the user resize each of the four coordinated parts of the histogram/scatterplot overview.
Aside from the above problems which are more system problems, I found HCE very powerful in describing different patterns and finding interesting anomalies for such large datasets which are impossible to explore manually.
One nice thing about the tool is the way it represents missing data, which is quite frequent in my data set, as data from many countries is not available. HCE shows them as white mosaics in the color mosaic view, which looks very natural to me.
Another interesting thing I found while playing with different dimensions is that HCE gives much more meaningful information when the dimensions are rates rather than absolute values. For instance, a country with a very high population will also have a much higher absolute value for its under 5 population, and these very high absolute values lead to outliers which I was not looking for. Another instance of the same effect: if individual data points are added for the aggregate statistics of the world, industrial countries, developing countries and Middle Eastern countries (to be able to compare them directly), then including dimensions with absolute values such as population misleads the clustering algorithm into putting these aggregate points in a very different cluster because of their large values. On the other hand, the statistics concerning a portion of the population are really important characteristics, so I added them as rates (dividing them by the total population of the country).
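The effect of absolute values on clustering can be illustrated with a toy distance computation. This is a sketch under stated assumptions: it uses a plain Euclidean distance without any per-column normalization, which may differ from what HCE3.0 does internally, and the three countries and all function names are made up for illustration.

```python
import math

def as_rate(count, population):
    """Express a population count as a percentage of the total population."""
    return 100.0 * count / population

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def features(row, use_rates):
    """Build (under-5 measure, life expectancy) from a raw row."""
    under5, population, life_expectancy = row
    first = as_rate(under5, population) if use_rates else under5
    return (first, life_expectancy)

# Made-up rows: (under-5 population, total population, life expectancy)
small = (0.5e6, 5e6, 60.0)     # 10% under 5
large = (100e6, 1000e6, 61.0)  # 10% under 5, but 200 times the population
other = (0.4e6, 8e6, 78.0)     # 5% under 5

for use_rates in (False, True):
    d_large = euclidean(features(small, use_rates), features(large, use_rates))
    d_other = euclidean(features(small, use_rates), features(other, use_rates))
    label = "rates" if use_rates else "absolute"
    print(f"{label}: small-large = {d_large:.1f}, small-other = {d_other:.1f}")
```

With absolute counts, the two demographically similar countries end up far apart purely because of population size; with rates, they are the closest pair.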