Application Examples of the Hierarchical Clustering Explorer


Netscan | Cereal | US counties | Application Reports from Information Visualization Class | Online Behavioral Data

Example 1 : NetScan data set  

Data source : http://netscan.research.microsoft.com/

Data files : 

The meaning of each column : 

  1. 1st column : name of newsgroup
  2. Posts : # of messages that were contributed to the newsgroup
  3. Posters: # of people who contributed at least one message to the newsgroup
  4. PPRatio: the ratio of posters to posts
  5. Returnees: # of people who contributed to the newsgroup in the current time period and also contributed a message in the previous time period
  6. Replies: # of people who contributed at least one message that was a reply to another message
  7. UnRMSGS: # of messages in the newsgroup that did not receive any reply in the newsgroup
  8. Avg.LineCT: average # of lines in each message
  9. XPosts: # of messages that were shared with at least one other newsgroup
  10. XPTgs: # of newsgroups that shared messages with the selected newsgroup

Screenshots:

Findings :


Example 2 : cereal data

Data source : Healthy Breakfast Story at StatLib ( http://lib.stat.cmu.edu/ )

Data files : 

The meaning of each column :

  1. 1st column : Name of cereal
  2. calories: calories per serving
  3. protein: grams of protein
  4. fat: grams of fat
  5. sodium: milligrams of sodium
  6. fiber: grams of dietary fiber
  7. carbo: grams of complex carbohydrates
  8. sugars: grams of sugars
  9. potass: milligrams of potassium
  10. vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended
  11. shelf: display shelf (1, 2, or 3, counting from the floor)
  12. rating: a rating of the cereals (calculated by Consumer Reports)

Screenshots:

 

Findings :


Example 3 : US counties

Data source :

County-by-County data:
http://users.erols.com/turboperl/dcmaps.html (I tried to access this page again on 11/19/2004, but it was not available.)

FIPS code - county names:
http://www.census.gov/population/cencounts/1900-90.txt
http://www.arfsys.com/indep.htm ( Denali Alaska 02068 )

Data files :

The meaning of each column :

  1. 1st column : County ID
  2. Name: County name
  3. HomeValue2000: median value of owner-occupied housing, 2000
  4. Income1999: per capita money income, 1999
  5. Poverty1999: percent below poverty level, 1999
  6. PopDensity2000: population, 2000
  7. PopChange: population percent change, 4/1/2000-7/1/2001
  8. Prcnt65+: percent of population 65 years old and over, 2000
  9. Below18: persons under 18 years old, 2000
  10. PrcntFemale2000: percent of female persons, 2000
  11. PrcntHSgrads2000: percent of high school graduates age 25+, 2000
  12. PrcntCollege2000: percent of college graduates or higher age 25+, 2000
  13. Unemployed: persons unemployed, 1999
  14. PrcntBelow18: percent under 18 years old, 2000
  15. LifeExpectancy: life expectancy, 1997
  16. FarmAcres: farm land (acres), 1997
  17. LungCancer: lung cancer mortality rate per 100,000, 1997
  18. ColonCancer: colon cancer rate per 100,000, 1997
  19. BreastCancer: breast cancer rate per 100,000 white females, 1970-1994

Screenshots & Findings:

Four selected histograms ranked by the size of their biggest gap. Gap detection was performed on standardized values (i.e. each dimension is transformed to a distribution whose mean is 0 and whose standard deviation is 1). The gap ranking depends on whether the original or the transformed values are used: ranking on the original values (the values before transformation) produces a different result, since the transformation changes the range of each dimension. The biggest gap is highlighted as a peach rectangle on each histogram. The bar to the right of the gap in (a) is Los Angeles, CA. The bar to the right of the gap in (b) is Coconino, AZ, which means that Coconino County has exceptionally broad farm lands.


(a) 21.0

(b) 5.77

(c) 0.38

(d) 0.24
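The effect of standardization on the gap ranking can be illustrated with a minimal sketch. It assumes the gap is measured between neighboring values after sorting; the exact definition HCE uses is not given on this page. The function names and the toy columns are hypothetical and are not taken from the county data set.

```python
from statistics import mean, pstdev


def standardize(values):
    """Transform values so that their mean is 0 and standard deviation is 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def biggest_gap(values):
    """Return (gap_size, low, high) for the widest gap between sorted neighbors."""
    ordered = sorted(values)
    if len(ordered) < 2:
        return (0.0, None, None)
    return max((hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:]))


def rank_by_gap(columns, use_standardized=True):
    """Return column names ordered by biggest gap, largest first."""
    scores = {}
    for name, values in columns.items():
        data = standardize(values) if use_standardized else list(values)
        scores[name] = biggest_gap(data)[0]
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy columns (assumption, for illustration only).
toy_columns = {
    "evenly_spread": [1000, 2000, 3000, 4000, 5000],
    "one_outlier": [1, 2, 3, 4, 50],
}

print(rank_by_gap(toy_columns, use_standardized=False))
print(rank_by_gap(toy_columns, use_standardized=True))
```

On the original values the wide-ranged column ranks first simply because its units are larger; after standardization the column with a single far-away value ranks first, which is the kind of outlier the histograms above highlight.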

Next, users can move on to the rank-by-feature framework for 2D projections and choose “Correlation coefficient” as the ranking criterion. Again they preattentively identify three very bright red cells and two very bright green cells in the score overview (b). The scatterplot for one of the high-scoring cells is shown in Figure 7a, where LA is highlighted with an orange triangle in a circle at the top right corner. Interestingly, the three bright cells are composed of the three dimensions that have very low scores in the 1D ranking by “Uniformity.” LA is also a distinctive outlier in all three high-scoring scatterplots. Users can confirm a trivial relationship between poverty and income, i.e. poor counties have less income (c). The scatterplot for one of the two bright green cells is shown in (d), revealing that counties with high percentages of high school graduates tend to have particularly low poverty.


(a) 0.96

(b) 0.77  

(c) -0.69

(d) -0.71
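Ranking 2D projections by correlation coefficient can be sketched as follows. This is an illustration under stated assumptions, not HCE's implementation: it scores every pair of columns with the Pearson correlation coefficient and orders the pairs by absolute value, so that strongly positive and strongly negative pairs both rise to the top. The column names and values are hypothetical toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A constant column has no defined correlation; treat it as 0.
        return 0.0
    return sxy / sqrt(sxx * syy)


def rank_pairs_by_correlation(columns):
    """Return [((name_a, name_b), r), ...] sorted by |r|, strongest first."""
    scored = [
        ((a, b), pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy columns (assumption, not the real county values).
toy_columns = {
    "income": [10, 20, 30, 40, 50],
    "poverty": [25, 20, 15, 10, 5],
    "hs_grads": [60, 70, 65, 85, 90],
}

ranked = rank_pairs_by_correlation(toy_columns)
for (a, b), r in ranked:
    print(f"{a} vs {b}: r = {r:.2f}")
```

The sign of r plays the role of the red and green cells in the score overview: the top-ranked toy pair is a perfectly negative relationship between income and poverty.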

Users can then run the ranking by quadracity to identify strong quadratic relationships, which produces four interesting scatterplots. The following four figures show relatively strong quadratic relationships. Interestingly, they also showed strong linear relationships according to the correlation coefficient ranking, but each pair of variables in (a) and (d) actually has a quadratic relationship, while (b) and (c) show weak quadratic relationships. Before confirming a relationship, users should check the fitting error by inspecting the regression curve and the distribution of points.


(a) 0.96

(b) 0.77

(c) -0.69

(d) -0.71
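As a rough sketch of what a quadratic-relationship check involves, the code below fits y = a*x^2 + b*x + c by least squares and reports the root-mean-square fitting error alongside the coefficients. How HCE actually computes its quadracity score is not stated on this page; treating the size of the x^2 coefficient as the strength of the quadratic term is an assumption made here for illustration, and the function names are hypothetical.

```python
from math import sqrt


def _det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    # Normal equations, solved with Cramer's rule.
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    det = _det3(matrix)
    if det == 0:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for row in range(3):
            replaced[row][col] = rhs[row]
        coeffs.append(_det3(replaced) / det)
    return tuple(coeffs)


def rms_error(xs, ys, coeffs):
    """Root-mean-square residual of the fitted curve."""
    a, b, c = coeffs
    residuals = [(y - (a * x * x + b * x + c)) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(residuals) / len(residuals))


# Hypothetical toy points lying exactly on y = 2x^2 + 3x + 1.
xs = [-2, -1, 0, 1, 2]
ys = [2 * x * x + 3 * x + 1 for x in xs]

a, b, c = fit_quadratic(xs, ys)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}, rms error = {rms_error(xs, ys, (a, b, c)):.2f}")
```

Reporting the fitting error next to the coefficient reflects the caution above: a large x^2 coefficient with a large residual is weaker evidence of a quadratic relationship than the same coefficient with a tight fit.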
