Application Examples of the Hierarchical Clustering Explorer
Online Behavioral Data
Example 1 : NetScan data set
Data source : http://netscan.research.microsoft.com/
Data files :
- netscan-08-2003.txt (activity log of newsgroups where name contains "windowsxp" for August 2003) : 91 x 10
- netscan-1year.txt (activity log of newsgroups where name contains "windowsxp" for a year) : 104 x 10
The meaning of each column :
- 1st column : name of newsgroup
- Posts : # of messages that were contributed to the newsgroup
- Posters: # of people who contributed at least one message to the newsgroup
- PPRatio: the ratio of posters to posts
- Returnees: # of people who contributed to the newsgroup in the
current time period and also contributed a message in the previous time
period
- Replies: # of people who contributed at least one message that was
a reply to another message
- UnRMSGS: # of messages in the newsgroup that did not receive any
reply in the newsgroup
- Avg.LineCT: average # of lines in each message
- XPosts: # of messages that were shared with at least one other newsgroup
- XPTgs: # of newsgroups that shared messages with the selected newsgroup
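To make the PPRatio column concrete, the following minimal sketch recomputes it from the Posts and Posters counts. The helper name `pp_ratio` and the sample rows are hypothetical and are not taken from the data files.

```python
def pp_ratio(posters: int, posts: int) -> float:
    """Ratio of posters to posts; returns 0.0 for a group with no posts."""
    if posts == 0:
        return 0.0
    return posters / posts

# Hypothetical rows: (newsgroup name, Posts, Posters)
rows = [
    ("group.a", 200, 50),  # few posters write many messages
    ("group.b", 40, 40),   # every poster wrote exactly one message
]

ratios = {name: pp_ratio(posters, posts) for name, posts, posters in rows}
```

A ratio near 1 means that most posters wrote a single message, while a low ratio means that a small number of posters wrote most of the messages.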
Screenshots:
Findings :
- The most active groups in terms of the number of people involved cluster together. Those groups (microsoft.public.windowsxp.perform_maintain, microsoft.public.windowsxp.network_web, microsoft.public.windowsxp.security_admin, and microsoft.public.windowsxp.hardware) are all advanced user groups.
- They look like very active communities (large numbers of posters, repliers, posts, etc.).
- However, there is a large number of isolated messages, which might be questions that have not been answered yet or questions that were ignored because they look uninteresting to the advanced users in those groups.
- microsoft.public.es.* groups cluster tightly together, except for the .windowsxp group. They share the following characteristics:
- Relatively large number of XPosts (cross-postings): they reference many postings in other groups.
- Low PPRatio: a small number of posters contribute a large number of posts.
Example 2 : Cereal data set
Data source : Healthy Breakfast Story at StatLib ( http://lib.stat.cmu.edu/ )
Data files :
The meaning of each column :
- 1st column : Name of cereal
- calories: calories per serving
- protein: grams of protein
- fat: grams of fat
- sodium: milligrams of sodium
- fiber: grams of dietary fiber
- carbo: grams of complex carbohydrates
- sugars: grams of sugars
- potass: milligrams of potassium
- vitamins: vitamins and minerals (0, 25, or 100, indicating the typical percentage of the FDA-recommended amount)
- shelf: display shelf (1, 2, or 3, counting from the floor)
- rating: a rating of the cereals (calculated by Consumer Reports)
Screenshots:
Findings :
- As the figure (the scatterplot ordering tab) shows, there is a strong correlation between dietary fiber and potassium.
- There are groups of cereals from which we can choose according to our preferences. These groups are easy to find by adjusting the minimum similarity bar.
- Healthy cereals with high dietary fiber, fewer calories, and less fat: 100% Bran, All-Bran with Extra Fiber, and All-Bran.
- Cereals for hungry people in need of energy: Muesli cereals.
- Purely sweet cereals can also be easily identified in the leftmost clusters.
- The cereal rating is largely negatively correlated with sugars and calories (using "cereal-updated.txt").
- It is easy to identify a group of cereals that are high in vitamins (using "cereal-updated.txt").
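The way groups appear and disappear as the minimum similarity bar moves can be illustrated with a small agglomerative clustering sketch. This is not the Hierarchical Clustering Explorer's implementation: it assumes single linkage and Euclidean distance, uses a distance cutoff (`max_distance`) in place of a similarity bar (a smaller distance corresponds to a higher similarity), and runs on hypothetical (fiber, sugars) values.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points, max_distance):
    """Repeatedly merge the two closest clusters (single linkage) until the
    closest pair is farther apart than max_distance.
    Returns the clusters as lists of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # remaining clusters are not similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical (fiber, sugars) rows: three bran-like and three sweet cereals
points = [(10, 5), (14, 0), (9, 6), (0, 13), (1, 14), (0, 15)]
groups = agglomerate(points, max_distance=7.0)
```

With this cutoff the six rows fall into a high-fiber group and a sweet group. Lowering `max_distance` splits them into smaller groups, which is analogous to raising the minimum similarity.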
Example 3 : US counties data set
Data source :
County-by-County data:
http://users.erols.com/turboperl/dcmaps.html (I tried to access this page again on 11/19/2004, but it was no longer available.)
FIPS code - county names:
http://www.census.gov/population/cencounts/1900-90.txt
http://www.arfsys.com/indep.htm
( Denali Alaska 02068 )
Data files :
The meaning of each column :
- 1st column : County ID
- Name: County name
- HomeValue2000: median value of owner-occupied housing, 2000
- Income1999: per capita money income, 1999
- Poverty1999: percent below poverty level, 1999
- PopDensity2000: population, 2000
- PopChange: population percent change, 4/1/2000-7/1/2001
- Prcnt65+: population 65 years old and over, 2000
- Below18: persons under 18 years old, 2000
- PrcntFemale2000: percent of female persons, 2000
- PrcntHSgrads2000: percent of high school graduates age 25+, 2000
- PrcntCollege2000: percent of college graduates or higher age 25+, 2000
- Unemployed: persons unemployed, 1999
- PrcntBelow18: percent under 18 years old, 2000
- LifeExpectancy: life expectancy, 1997
- FarmAcres: farm land (acres), 1997
- LungCancer: lung cancer mortality rate per 100,000, 1997
- ColonCancer: colon cancer rate per 100,000, 1997
- BreastCancer: breast cancer per 100,000 white females, 1970-1994
Screenshots & Findings:
Four selected histograms are ranked by the size of their biggest gap. Gap detection was performed on standardized values (i.e., each dimension is transformed to a distribution with a mean of 0 and a standard deviation of 1).
The gap ranking depends on whether the original or the transformed values are used. Ranking on the original values (before transformation) produces a different result, because the transformation changes the range of each dimension's values. The biggest gap is highlighted as a peach rectangle on each histogram. The bar to the right of the gap in (a) is Los Angeles, CA. The bar to the right of the gap in (b) is Coconino, AZ, which means that Coconino County has an exceptionally large area of farm land.

Scores for histograms (a) to (d): (a) 21.0, (b) 5.77, (c) 0.38, (d) 0.24
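The effect of standardization on the gap ranking described above can be sketched as follows. This is a simplified illustration, not the tool's code: it assumes the gap is the largest difference between consecutive sorted values and uses the population standard deviation. The column values are hypothetical.

```python
import math

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (population form).
    Assumes the column is not constant, so the deviation is non-zero."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def biggest_gap(values):
    """Largest difference between consecutive values in sorted order."""
    ordered = sorted(values)
    return max(b - a for a, b in zip(ordered, ordered[1:]))

# Hypothetical columns; only FarmAcres contains an outlier
columns = {
    "HomeValue2000": [100000.0, 150000.0, 200000.0, 250000.0, 300000.0],
    "FarmAcres": [1.0, 2.0, 3.0, 4.0, 50.0],
    "Income1999": [10.0, 12.0, 14.0, 16.0, 18.0],
}

# Ranking on the original values favors the column with the largest units
raw_ranking = sorted(columns, key=lambda c: biggest_gap(columns[c]),
                     reverse=True)
# Ranking on standardized values favors the column with a real outlier
std_ranking = sorted(columns,
                     key=lambda c: biggest_gap(standardize(columns[c])),
                     reverse=True)
```

On the original values the evenly spaced HomeValue2000 column ranks first only because its units are large. After standardization the FarmAcres column, which contains the single outlier, moves to the top.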
Next, if users move on to the rank-by-feature framework for 2D projections, they can choose “Correlation coefficient” as the ranking criterion. Again, they preattentively identify three very bright red cells and two very bright green cells in the score overview (b). The scatterplot for one of the high-scoring cells is shown in Figure 7a, where LA is highlighted with an orange triangle in a circle at the top right corner. Interestingly, the three bright cells are composed of the three dimensions that have very low scores in the 1D ranking by “Uniformity.” LA is also a distinctive outlier in all three high-scoring scatterplots. Users can confirm a trivial relationship between poverty and income, i.e., poor counties have lower income (c). The scatterplot for one of the two bright green cells is shown in (d), revealing that counties with high percentages of high school graduates are particularly free from poverty.

Correlation coefficients for scatterplots (a) to (d): (a) 0.96, (b) 0.77, (c) -0.69, (d) -0.71
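Ranking 2D projections by correlation coefficient amounts to scoring every pair of dimensions and sorting the pairs. The sketch below does this with the Pearson coefficient. The column values are hypothetical and were chosen only to mirror the direction of the relationships described above.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

# Hypothetical values for five counties
columns = {
    "Income1999": [15.0, 18.0, 22.0, 27.0, 35.0],
    "Poverty1999": [25.0, 20.0, 14.0, 11.0, 6.0],
    "PrcntHSgrads2000": [60.0, 70.0, 75.0, 82.0, 90.0],
}

# Score every pair of dimensions, then sort from most positive to most negative
scores = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Sorting by the signed score puts strongly positive pairs at one end and strongly negative pairs at the other, much like the bright red and bright green cells of a score overview.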
Users can then run the ranking by quadracity to identify strong quadratic relationships, which produces four interesting scatterplots. The following four figures show relatively strong quadratic relationships. Interestingly, the same pairs also showed strong linear relationships in the correlation coefficient ranking, but the pairs of variables in (a) and (d) actually have a quadratic relationship, while (b) and (c) show only weak quadratic relationships. Before confirming a relationship, the fitting errors should be checked by examining the regression curve and the distribution of points.

Scores for scatterplots (a) to (d): (a) 0.96, (b) 0.77, (c) -0.69, (d) -0.71
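Checking a quadratic relationship together with its fitting error can be sketched with an ordinary least-squares fit. This page does not define how the quadracity score is computed, so the sketch only assumes that a curve y = a*x^2 + b*x + c is fitted and that the residual error is inspected afterwards. The function names and sample points are hypothetical.

```python
import math

def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c using the normal equations."""
    # Power sums: s[k] = sum of x^k, t[k] = sum of y * x^k
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 matrix for the unknowns (a, b, c)
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    a, b, c = (m[i][3] / m[i][i] for i in range(3))
    return a, b, c

def rmse(xs, ys, coeffs):
    """Root-mean-square error of the fitted curve on the given points."""
    a, b, c = coeffs
    squared = sum((a * x * x + b * x + c - y) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(squared / len(xs))

# Hypothetical points that lie exactly on y = 2*x^2 + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 9.0, 19.0, 33.0]
coeffs = fit_quadratic(xs, ys)
error = rmse(xs, ys, coeffs)
```

A clearly non-zero coefficient `a` combined with a small error supports a quadratic relationship. A large error suggests that the fitted curve describes the points poorly, whatever the value of `a`.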