Exploring the UNICEF
    World Socioeconomic Statistics
 using HCE3.0

 

Maryam Farboodi (farboodi@cs.umd.edu)

February, 28

 

1.  Problem

 

This report deals with socioeconomic statistics for 195 countries of the world. For each country, 18 different features are measured. I try to find meaningful clusters among these countries based on these features, find features that are highly correlated, find countries that stand out as anomalies, and so on. Throughout, I pay particular attention to where my own country, Iran, stands relative to the rest of the world, so I point out Iran in the results wherever possible.

 

2.   Dataset

 

The data I am using here comes from UNICEF. I chose this dataset because it is the most up-to-date one of its kind; more specifically, it was gathered and synthesized in 2006. The other similar datasets I looked at were mostly from 2005.

 

This dataset can be found at http://www.unicef.org/statistics/index_24183.html. I used the statistical tool provided there to generate the required Excel files. To get the data from this database, I first chose the countries I was interested in (in my case, all the countries). I had to do so in two rounds, as the system was unable to return the data for all countries at once. The next task was to choose the desired dimensions for the countries. The statistics associated with each country are divided into 10 main subgroups: basic indicators, child protection, demographics, economics, education, health, HIV/AIDS, nutrition, rate of progress and women. Under each subgroup, there are different dimensions that one can choose to include in the final table. I chose the following 18 dimensions from basic indicators, demographics, economics, education, health, rate of progress and women:

 

GNI (Gross National Income) per capita (US $), GDP (Gross Domestic Product) per capita average annual growth rate, Life expectancy at birth (years), Life expectancy: females as a percentage of males, Population annual growth rate, Percentage of population under 5, Percentage of population under 18, Percentage of population urbanized, Average annual growth rate of urban population, Maternal mortality ratio, Percentage of population using adequate sanitation facilities, Contraceptive prevalence, Antenatal care coverage rate, Average annual rate of reduction, Total fertility rate, Number per 100 population of internet users, Total adult literacy rate, Adult literacy rate: females as a percentage of males.

 

I had initially added more dimensions, but I deleted a few of them, either because missing values made the dimension uninformative or because HCE3.0 had difficulty dealing with the data when the number of dimensions was very high (especially when there is a large number of data items).

 

3.  Approach

 

I followed the GRID (Graphics, Ranking, and Interaction for Discovery) principles to explore the dataset. This turned out to be a very effective approach, and it helped me find many interesting features in the dataset.

 

3.1.   Dendrogram Overview and Mosaic View

 

Loading the data into HCE3.0 leads to the following dendrogram for the world:

 

Figure 1.  Dendrogram of the world statistics

Here is where the first nice pattern comes up. The bottom rows of the color mosaic correspond to the share of young population in each country and to death rates. As indicated in the figure, Western countries have very low values, African countries have high values and Middle Eastern countries have average values, and each group is very nicely clustered together. The fact that Middle Eastern and developing countries are situated quite close together and their tiles are mostly very dark suggests that the world aggregate statistics are affected much more by developing countries than by developed ones.
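
To make the idea behind the dendrogram concrete, here is a minimal sketch of agglomerative clustering. It assumes average linkage and Euclidean distance on already-normalized values; the report does not record which linkage or distance HCE3.0 was set to, so both are assumptions, and the four rows "A" to "D" are hypothetical rather than real countries.

```python
from itertools import combinations
from math import dist

def average_linkage(rows):
    """Agglomerative clustering; returns the merge history (a flat dendrogram)."""
    # Each cluster is a tuple of row names; start with one cluster per row.
    clusters = [(name,) for name in rows]
    merges = []

    def cluster_distance(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(rows[x], rows[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters and record the height of the merge.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical, already-normalized values for two dimensions.
toy = {
    "A": (-1.2, -1.0),
    "B": (-1.1, -0.9),
    "C": (1.0, 1.2),
    "D": (1.1, 1.0),
}
merges = average_linkage(toy)
for left, right, height in merges:
    print(left, "+", right, f"at height {height:.2f}")
```

Rows with similar values merge early (at low height), which is why countries with similar socioeconomic structure end up next to each other in the mosaic.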

 

 

The next step is extracting Iran out of the color mosaic:

 

Figure 2. Iran Statistics in detail

 

Looking at the above detailed view, the following can be inferred:

 

Iran is very much in the neutral range for many characteristics, except for a few:

 

·        The brightest red point in Iran’s record is contraceptive prevalence. This is quite interesting, and I think one of the reasons is that there is not much control imposed in this area. Interestingly, the next dimensions are percentage of population using adequate sanitation facilities and average annual rate of reduction, which are hopefully promising factors for the further development of the country. The other two dimensions deviating by more than 0.5 from the world mean (after normalization) are percentage of population urbanized and antenatal care coverage; the first of these is causing problems, especially around the capital.

·        There are more green points in Iran’s tile: the brightest one is life expectancy of females as a percentage of males. The next ones are maternal mortality rate, percentage of population under 5 and total fertility rate, which makes sense given the dimensions colored in red. The rest all deviate from the world mean by less than 0.5.

·        The other nice property of the color mosaic is that I was able to see which countries are clustered close to Iran and so have a similar socioeconomic structure.
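
The 0.5 threshold used in the points above applies to normalized values. A minimal sketch of that check follows, assuming normalization means subtracting the mean over all countries and dividing by the (population) standard deviation; the report does not state exactly how HCE3.0 normalizes. The dimension names and numbers are hypothetical.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize one dimension, skipping missing entries (None)."""
    present = [v for v in values.values() if v is not None]
    mu, sigma = mean(present), pstdev(present)
    return {k: None if v is None else (v - mu) / sigma for k, v in values.items()}

def standouts(columns, country, threshold=0.5):
    """Dimensions where `country` lies more than `threshold` std devs from the mean."""
    result = {}
    for dim, values in columns.items():
        z = zscores(values)[country]
        if z is not None and abs(z) > threshold:
            result[dim] = round(z, 2)
    return result

# Hypothetical numbers, for illustration only.
columns = {
    "dim_high": {"X": 10.0, "Y": 20.0, "Z": 60.0},
    "dim_flat": {"X": 48.0, "Y": 52.0, "Z": 50.0},
}
flagged = standouts(columns, "Z")
print(flagged)
```

Positive values would correspond to the red tiles and negative values to the green ones.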

 

The most interesting point in this detailed view is the following:  It is widely thought that Iran has a very large young population (under 18), which turns out not to be true compared to the rest of the world. Here is where my first headline comes up:

 

v     Misbelief: Iran has one of the world’s largest percentages of under 18 population!

 

3.2.        Histogram

 

There are 18 dimensions available in the histogram tab. Among the available ordering criteria, I found the following the most meaningful in the context of my dataset. I also tried to mark Iran’s position in all figures.

 

·        Number of outliers:

1.      Adult literacy rate: females as a percentage of males:

 

 Figure 3.  Outliers when mapped on Adult literacy rate: females as a % of males

 

Not surprisingly, all of them are very poor African countries.

 

2.      Total adult literacy rate:

 

Figure 4. Outliers when mapped on Total adult literacy rate

 

The fact that the two dimensions ranking highest in outlier identification both concern education and both identify African countries shows how much poor African countries need help from the rest of the world.

 

·        Biggest gap:

  1. Percentage of population under 18:

 

Figure 5. Biggest gap is produced by a little-known country!

In this case, I modified the data set and deleted Niue. As a result, this dimension does not cause the biggest gap anymore.

 

  2. Average annual growth of urban population:

 

Figure 6. Rwanda seems to be undergoing a rapid process of urbanization!

 

  3. GDP per capita average annual growth: Equatorial Guinea ranks first, then Bosnia and Herzegovina, and then China. I find this really amazing, as China is the only developed country with a really high GDP growth rate, and considering its large population it becomes obvious why its economy is growing so fast. My second headline states this fact:

 

v   China has the third highest GDP annual growth rate!

 

Figure 7. Large gaps are produced by mapping on GDP growth rate.

 

 

·        Uniformity:

  1. Percentage of population under 5
  2. Percentage of population under 18
  3. Percentage of population urbanized
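
The three ordering criteria above can be sketched as follows. The definitions are assumptions made for illustration (outliers as points beyond 1.5 times the interquartile range, gap as the largest difference between neighbouring sorted values, uniformity as the entropy of a binned histogram); the exact formulas HCE3.0 applies are not given in this report. The `sample` values are hypothetical.

```python
from math import log2
from statistics import quantiles

def count_outliers(xs):
    """Values beyond 1.5 * IQR from the quartiles (a common convention)."""
    q1, _, q3 = quantiles(xs, n=4)
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return sum(1 for x in xs if x < lo or x > hi)

def biggest_gap(xs):
    """Largest difference between neighbouring values after sorting."""
    s = sorted(xs)
    return max(b - a for a, b in zip(s, s[1:]))

def uniformity(xs, bins=4):
    """Entropy of the histogram in bits; higher means closer to uniform."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return 0.0  # all values identical: a single occupied bin
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[i] += 1
    return -sum(c / len(xs) * log2(c / len(xs)) for c in counts if c)

sample = [1, 2, 3, 4, 5, 6, 7, 30]  # hypothetical values with one extreme point
print(count_outliers(sample), biggest_gap(sample), round(uniformity(sample), 3))
```

A single extreme value, like Niue in the under-18 dimension, dominates both the outlier count and the gap size, which is why removing it changes the ranking.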

 

One interesting observation concerns the average annual rate of reduction. Millennium Development Goal 4 (MDG 4) calls on countries to reduce under-five mortality by two thirds between 1990 and 2015, which implies a target average annual rate of reduction of 4.4%. The statistics show the following:

·        MINIMUM: Zimbabwe,-3.4

·        MAXIMUM:  San Marino, 8.9

·        MEAN:  3.5, which is less than the target.

·        STDEV: 2.19

Iran has an annual rate of reduction of 4.6. This generates my third headline:

 

v   World falls short of the target rate of reduction, whereas Iran meets it!
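
The 4.4% figure can be checked with a short calculation. This sketch assumes the rate is defined with continuous compounding, so that a two-thirds reduction over 25 years requires ln(3)/25 per year; with discrete yearly compounding the figure would be about 4.3% instead. The function name is hypothetical.

```python
from math import log

def required_aarr(reduction_factor, years):
    """Constant annual rate (in %) that shrinks a value by `reduction_factor`
    over `years`, assuming continuous compounding."""
    return 100 * log(reduction_factor) / years

# Cutting by two thirds leaves one third, i.e. a factor of 3, over 2015 - 1990 years.
target = required_aarr(3, 2015 - 1990)
print(f"target: {target:.1f}%")

world_mean, iran = 3.5, 4.6  # values reported in this section
print("world mean meets target:", world_mean >= target)
print("Iran meets target:", iran >= target)
```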

 

 

3.3.      Scatterplot

 

There were a total of 153 combinations available (all pairs of the 18 dimensions). In the scatterplot ordering, I found the following (combination of) criteria the most meaningful in the context of my dataset:

 

·        Correlation Coefficient and Least Square Error-Linear:

Before looking at some meaningful combinations, there are two points to mention:

 

First, in the ordered list of pairs of dimensions, the first few ranked combinations are the trivial ones (such as percentage of population under 18 versus percentage of population under 5), so I skip these and only mention some interesting combinations that are still highly correlated. Second, in all scatterplots Iran is circled to show where it stands relative to the rest of the world.

 

  1. Rank 5th is average annual growth of urban population versus average growth of population (LSQ rank 4th). This shows that people are heading toward cities…

Figure 8. Plotting all countries over Urban population growth rate versus Total population growth rate

 

  2. Rank 7th is GNI per capita versus number per 100 internet users (LSQ rank 5th).

This makes sense as GNI can be viewed as a measure of individual wealth in a country. As marked in the figure, Iran does not have that many internet users yet, and unfortunately its GNI per capita is also lower than world average.

 

          Figure 9. Plot over Number of Internet users versus GNI per capita

 

 

 

 

 

  3. Rank 149 (5th in negative correlation) is Total fertility rate versus Percentage of population urbanized (LSQ rank 24th). This figure shows that as people migrate toward cities they start having fewer children. This can be a direct result of the lifestyle in big cities versus the rural lifestyle, which involves more physical jobs.

 

Figure 10.  Plot over Fertility rate versus Percentage of population urbanized

           

  4. Rank 143 (10th in negative correlation) is Contraceptive prevalence versus Percentage of population under 5 (LSQ rank 35th), which shows that many societies are actually moving toward population control.

 

Figure 11.  Plot over Rate of population under 5 versus Contraceptive prevalence
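
The correlation ranks quoted above come from ordering all pairs of dimensions. A minimal sketch of that ordering, covering only the correlation coefficient criterion and assuming Pearson correlation with no missing values; the columns "p", "q" and "r" are hypothetical, with one value per country.

```python
from itertools import combinations
from math import comb, sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def rank_pairs(columns):
    """All unordered pairs of dimensions, from most positive to most negative r."""
    scored = [((a, b), pearson(columns[a], columns[b]))
              for a, b in combinations(columns, 2)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

print(comb(18, 2))  # number of unordered pairs among 18 dimensions

# Hypothetical columns, for illustration only.
columns = {
    "p": [1.0, 2.0, 3.0, 4.0],
    "q": [2.1, 3.9, 6.2, 8.0],
    "r": [9.0, 7.0, 4.0, 1.0],
}
ranking = rank_pairs(columns)
for pair, r in ranking:
    print(pair, round(r, 3))
```

Strongly negative pairs sit at the bottom of such a list, which matches how rank 149 of 153 above is also described as 5th in negative correlation.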

           

 

The last thing I did was to explore the scatterplots to find anomalies. Here are some of them:

 

·        Look at the scatterplot of Maternal mortality rate versus Antenatal care coverage rate. As expected, the maternal mortality rate becomes lower with more antenatal care coverage. Two obvious anomalies are Rwanda and Malawi. The country with the highest maternal mortality rate is Sierra Leone, which also suffers from a low antenatal care coverage rate.

 

         Figure 12.  Anomaly: High Maternal mortality rate in spite of high Antenatal care coverage

 

·        Look at the scatterplot of Life expectancy at birth versus GDP per capita average annual growth rate. The very obvious anomaly is Equatorial Guinea, with a high GDP growth rate but very low life expectancy.

 

Figure 13.  Anomaly:  Very low Life expectancy and very high GDP growth rate

 

4. Comments

 

The first problem with the HCE3.0 system was that I could not choose which dimensions contribute to the value assigned to each cluster (which is then used to compute the similarities in the clustering algorithm): when I tried to uncheck some of the dimensions, I got an error message. The next problem was that I could not close a single file in the system, so I had to close the whole system each time and then load it again.

 

Another problem is that the system is unable to deal with non-numeric values. This is not a problem if the non-numeric value is YES/NO, as I easily converted those to 0/1; but as a problematic instance, consider a field shown as a percentage (with the character % included in the dimension value). In such cases, I had to delete the % character manually to enable the system to deal with the numbers correctly.
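
The manual clean-up described above could be scripted before loading the file. This is a minimal sketch, not what I actually ran; the function name is hypothetical, and the handling of thousands separators and of empty cells as missing values are assumptions about how the exported spreadsheet looks.

```python
def to_number(cell):
    """Convert one spreadsheet cell to a float, or None when it is missing."""
    text = str(cell).strip()
    if text == "":
        return None                      # missing value
    if text.upper() in ("YES", "NO"):
        return 1.0 if text.upper() == "YES" else 0.0
    try:
        # Drop a trailing % sign and thousands separators (assumed format).
        return float(text.rstrip("%").replace(",", ""))
    except ValueError:
        return None                      # any other non-numeric marker

cells = ["87%", "YES", "no", "1,250", "", "4.6"]
cleaned = [to_number(c) for c in cells]
print(cleaned)
```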

 

Another shortcoming was that with 29 variables, the system encountered some problems while running the clustering algorithm: it generated color mosaics for each country but did not cluster them.

 

One other general problem is the lack of screen space and the fact that you have to keep scrolling left and right to view the scatterplot and the orderings. One option might be to add a feature that lets the user resize each of the four coordinated parts of the histogram/scatterplot overview.

 

Aside from the above problems, which are mostly system problems, I found HCE very powerful in revealing different patterns and finding interesting anomalies in large datasets like this one, which are impossible to explore manually.

 

One nice thing about the tool is the way it represents missing data, which is quite frequent in my dataset, as data for many countries is not available. HCE shows missing values as white tiles in the color mosaic view, which looks very natural to me.

 

Another interesting thing I found while playing with different dimensions is the following: HCE gives much more meaningful information when dealing with dimensions that are rates rather than absolute values. For instance, a country with a very large population will also have a much higher absolute value for its under-5 population, and these very high absolute values lead to outliers that I was not looking for. Another manifestation of this is the following: if individual data points for the aggregate statistics of the world, industrial countries, developing countries and Middle Eastern countries are added (to be able to compare them directly), then including dimensions with absolute values such as population means that the large values of these aggregate data points mislead the clustering algorithm into putting them in a very different cluster. On the other hand, statistics concerning a portion of the population are really important characteristics, so I added them as rates (dividing them by the total population of the country).
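
The conversion from absolute values to rates is a one-line calculation. The sketch below uses two hypothetical countries of very different size but identical age structure to show why the rate is the better input for clustering.

```python
def as_percentage(count, total_population):
    """Express an absolute count as a percentage of total population."""
    return 100.0 * count / total_population

# Hypothetical countries: very different sizes, same share of under-5 population.
large = as_percentage(12_000_000, 100_000_000)
small = as_percentage(120_000, 1_000_000)
print(large, small)
```

As absolute counts the two differ by a factor of 100 and would land far apart; as rates they are identical.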