Visualizing Housing Conditions in Maryland
Chang Hu
2006-3-15
1. Introduction
What's the most common housing type for people in Maryland? Are newer houses bigger and more expensive than old ones? Housing is a concern for younger as well as older people, for landlords as well as tenants, for homeowners as well as dealers. While large amounts of data are kept and made available for free on government web sites, further analysis is still needed.
This report analyzes Maryland housing data from the Census 2000 [1] dataset. The task is to study housing conditions in relation to socioeconomic factors. I mainly used Treemap [2] for the analysis, since it gives a clear picture of the different groups in the dataset. An overview of the dataset is also provided by the Hierarchical Clustering Explorer (HCE, [3]).
2. Dataset and preprocessing
The data set is the Maryland housing data from Census 2000, which can be accessed via www.census.gov and downloaded in .CSV format. The data set contains 9675 samples, each of which is a housing unit with 101 attributes. The attributes include information about the physical condition of the housing unit (e.g. year built, structure type), the cost of maintaining it (e.g. heating or electricity fees) and information about its dwellers (e.g. the number of persons, the household income).
The first major preprocessing need is filling in missing values. According to the Census 2000 housing codebook, missing values represent a "not available" or "does not exist" case. Therefore, missing values are filled with zeros. Data preprocessing is done in Excel.
From the processed data, selected columns (attributes) are exported into text files for visualization in HCE and Treemap, respectively. Visualizing less data reduces the computational burden and gives faster response from both tools, which matters because the two tools are run with limited system resources.
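The zero-filling step was done in Excel for this report, but the same idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the column names and the tiny sample are hypothetical, not taken from the census file.

```python
import csv
import io

def fill_missing_with_zero(rows):
    """Return copies of the rows with empty or whitespace-only cells set to "0"."""
    return [
        {name: (value if value is not None and value.strip() else "0")
         for name, value in row.items()}
        for row in rows
    ]

# Tiny stand-in for the census file; the column names are hypothetical.
sample_csv = "YEAR_BUILT,ROOMS,HEATING_COST\n1975,6,\n1990,,120\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
filled = fill_missing_with_zero(rows)
```

One caveat with this choice: after filling, a zero that means "not available" cannot be told apart from a true zero, so averages over such columns should be read with care.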
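Selecting a few columns and writing them to a text file could look roughly like the sketch below. The column names are hypothetical, and the tab-delimited layout is an assumption; the exact input layout that HCE and Treemap each expect is not covered here.

```python
import csv
import io

def export_columns(rows, columns, out):
    """Write only the chosen columns to `out` as tab-delimited text."""
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Hypothetical rows after preprocessing (missing values already set to "0").
rows = [
    {"YEAR_BUILT": "1975", "ROOMS": "6", "HEATING_COST": "0"},
    {"YEAR_BUILT": "1990", "ROOMS": "0", "HEATING_COST": "120"},
]
buffer = io.StringIO()  # stands in for an output file
export_columns(rows, ["YEAR_BUILT", "ROOMS"], buffer)
exported = buffer.getvalue()
```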
3. Initial Analysis with HCE
Even with a reduced number of attributes, it is difficult to make discoveries by looking at the dataset directly. First, there are still too many attributes to spot relations among them. Second, the number of samples and their diversity make such discovery even harder.
A first attempt is made with HCE 3.5, which provides the GRID framework. The GRID framework gives a quick overview of the pairwise relationships between attributes, which may suggest directions for more refined comparison in Treemap.
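To make the idea of a pairwise overview concrete, the sketch below scores every attribute pair with the Pearson correlation coefficient and ranks the pairs by strength. This is not HCE's implementation; the choice of correlation as the score, the attribute names and the values are all assumptions for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; report 0
    return cov / sqrt(var_x * var_y)

def rank_pairs(columns):
    """Score every attribute pair and sort by absolute correlation, strongest first."""
    scores = [((a, b), pearson(columns[a], columns[b]))
              for a, b in combinations(columns, 2)]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical attribute columns, one value per housing unit.
columns = {
    "ROOMS": [4, 5, 6, 7, 8],
    "VALUE": [100, 120, 150, 170, 200],
    "YEAR_BUILT": [1990, 1950, 1975, 1960, 1985],
}
ranking = rank_pairs(columns)
```

The pairs at the top of such a ranking are natural candidates for a closer look in a second tool.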
Fig 1 Results from HCE
Number of rooms vs. Year first built
Left: scatter plot; Right: scatter plot with density
The result shows high uniformity over most of the attribute pairs. A closer look at the density of the scatter plot reveals some differences in frequency. To gain more detailed insight, further analysis is done in Treemap, focusing on comparing the number of samples in each group.
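A plain scatter plot can look uniform when many housing units fall on exactly the same point, which is likely with whole-number attributes such as the number of rooms. Counting how many units share each point is one simple way to recover the density. The sketch below assumes hypothetical values and is not how HCE computes its density view.

```python
from collections import Counter

def point_density(xs, ys):
    """Count how many samples fall on each (x, y) point."""
    return Counter(zip(xs, ys))

# Hypothetical values: number of rooms and year first built for six units.
rooms = [5, 5, 5, 6, 6, 7]
year_built = [1970, 1970, 1970, 1970, 1980, 1990]

density = point_density(rooms, year_built)
most_common_point, count = density.most_common(1)[0]
```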
4. Analysis with Treemap
4.1. What's the most popular housing type?
A treemap is drawn (Fig 2) based on the housing type, the number of persons in the unit and the value of the housing unit. In the treemap, samples are grouped into different housing types (houses, apartments, etc.). The nodes are color-coded by the value of the housing unit, where a lighter region indicates a higher value. The size of a node shows the number of persons in the unit.
Fig 2 Popular units
Group – type of housing unit (as shown in figure)
Color – value of house (lighter – higher value)
Size – number of person(s) in each unit (larger – more people)
The left half of the figure is a large group showing the most popular housing type, the detached one-family house. Its lighter color also shows that this type of housing unit usually has a higher value. The attached one-family house comes second, which confirms the popularity of houses as a whole.
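The numbers behind such a treemap can be approximated without drawing anything: the area of a group is the sum of its node sizes (persons), and its overall brightness follows the values of its units. The sketch below is only an illustration; the field names and sample units are hypothetical.

```python
from collections import defaultdict

def summarize_by_type(units):
    """Per housing type: unit count, total persons (group area), mean value (color)."""
    totals = defaultdict(lambda: {"units": 0, "persons": 0, "value": 0.0})
    for unit in units:
        group = totals[unit["type"]]
        group["units"] += 1
        group["persons"] += unit["persons"]
        group["value"] += unit["value"]
    return {
        name: {
            "units": g["units"],
            "persons": g["persons"],
            "mean_value": g["value"] / g["units"],
        }
        for name, g in totals.items()
    }

# Hypothetical housing units.
units = [
    {"type": "one-family detached", "persons": 4, "value": 200000},
    {"type": "one-family detached", "persons": 3, "value": 250000},
    {"type": "one-family attached", "persons": 2, "value": 150000},
    {"type": "apartment", "persons": 1, "value": 90000},
]
summary = summarize_by_type(units)
largest_group = max(summary, key=lambda name: summary[name]["persons"])
```

Note that with this encoding the group area reflects the number of people living in a housing type, not only the number of units.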
4.2. Are new houses bigger and more expensive?
In this treemap (Fig 3), the average value and size of housing units are shown, grouped by the year they were first built. The actual units are not shown here, because the number of housing units (as shown in Fig 2) may bias the area of each group in the treemap. For example, when the size of a housing unit is encoded as the node's size, it is not fair to say that a larger group contains larger houses, because one group could simply contain more nodes than another.
The treemap (Fig 3) shows that newer houses are not significantly larger, since the nodes are of similar sizes. Newer houses are indeed more expensive, as shown by their brighter color.
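The bias argument can be checked with a small example: a year group with more units has a larger total size even when its average size is the same. The sketch below uses hypothetical year groups and values, and only illustrates why averages are compared instead of totals.

```python
from collections import defaultdict
from statistics import mean

def summarize_by_year(units):
    """Per year group: unit count, total rooms, mean rooms and mean value."""
    groups = defaultdict(list)
    for unit in units:
        groups[unit["year_built"]].append(unit)
    return {
        year: {
            "count": len(members),
            "total_rooms": sum(u["rooms"] for u in members),
            "mean_rooms": mean(u["rooms"] for u in members),
            "mean_value": mean(u["value"] for u in members),
        }
        for year, members in groups.items()
    }

# Hypothetical units: three older houses and one newer house, all with 6 rooms.
units = [
    {"year_built": "1950-1959", "rooms": 6, "value": 100000},
    {"year_built": "1950-1959", "rooms": 6, "value": 120000},
    {"year_built": "1950-1959", "rooms": 6, "value": 110000},
    {"year_built": "1990-1999", "rooms": 6, "value": 250000},
]
summary = summarize_by_year(units)
```

Here the older group has three times the total rooms of the newer group, yet both groups have the same mean number of rooms; only the mean value differs.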
Fig 3 Area vs. year first built
Group – year first built (as shown in figure)
Color – value of house (lighter – higher value)
Size – number of rooms in each unit (larger – more rooms)
Here, further study of the different size groups within one year group would be helpful. The reason is that although the average sizes across year groups are similar, the deviation within each group can still differ. Unfortunately, the actual area of the housing units is not available in this data set (here the area is approximated by the number of rooms, on the assumption that room sizes are similar).
4.3. Maryland empty-nesters
When visualizing the relationship between house size, household income and number of dwellers, an interesting pattern was discovered (Fig 4). In groups representing larger houses (with a larger number in the title), there are several small dark nodes at the lower right corners. In this squarified treemap, the lower right corner is where the smaller nodes are placed. In short, there are large houses in which only a small number (one or two) of people live. What is more, those people have a very low income. One would expect larger families to live in larger houses, and larger houses to be more expensive to maintain. So who are the people living in large houses alone? How can they afford the house?
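The point about deviation can be illustrated with two groups that share the same average but differ in spread. The sketch below uses the population standard deviation on hypothetical room counts; it is not a result from the census data.

```python
from statistics import mean, pstdev

# Hypothetical room counts per year group; both groups average 6 rooms.
rooms_by_year = {
    "1950-1959": [6, 6, 6],
    "1990-1999": [2, 6, 10],
}

spread = {
    year: {"mean": mean(rooms), "stdev": pstdev(rooms)}
    for year, rooms in rooms_by_year.items()
}
```

A treemap of averages would draw these two groups identically, although the second contains both very small and very large units.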
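The same pattern can be looked for with a simple filter: large houses, few dwellers, low income. The thresholds, field names and sample units below are assumptions chosen for illustration and do not come from the report's data.

```python
def find_small_households_in_large_houses(units, min_rooms=7,
                                          max_persons=2, max_income=20000):
    """Return units that are large but have few dwellers and a low income.

    All three thresholds are hypothetical defaults, not values from the census.
    """
    return [
        u for u in units
        if u["rooms"] >= min_rooms
        and u["persons"] <= max_persons
        and u["income"] <= max_income
    ]

# Hypothetical housing units.
units = [
    {"rooms": 9, "persons": 1, "income": 12000},
    {"rooms": 9, "persons": 5, "income": 95000},
    {"rooms": 3, "persons": 1, "income": 15000},
    {"rooms": 8, "persons": 2, "income": 18000},
    {"rooms": 8, "persons": 2, "income": 120000},
]
candidates = find_small_households_in_large_houses(units)
```

Unlike the treemap, a filter like this only finds a pattern that is already suspected; it was the visualization that suggested which thresholds to ask about.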
Fig 4 An interesting pattern
Upper: overview; Lower: zoom into group 9
Group – number of rooms (as shown in figure)
Color – household income (lighter – higher value)
Size – number of person(s) in each unit (larger – more people)
Further analysis is done by dividing each group into subgroups by the number of people over 65 in the house (Fig 5). The pattern is more obvious in subgroups with people older than 65. Studying those regions node by node reveals that these are elderly people living in large houses, which were perhaps once filled by their children. In other words, they are the empty-nesters of Maryland.
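The two-level grouping used here (number of rooms first, then number of people over 65) corresponds to a nested dictionary. The sketch below builds that hierarchy for a handful of hypothetical units; the field names are assumptions.

```python
from collections import defaultdict

def group_by_rooms_and_seniors(units):
    """Build a two-level hierarchy: rooms -> persons over 65 -> list of units."""
    tree = defaultdict(lambda: defaultdict(list))
    for unit in units:
        tree[unit["rooms"]][unit["persons_over_65"]].append(unit)
    return tree

# Hypothetical units, all in the 7-room group.
units = [
    {"rooms": 7, "persons": 1, "persons_over_65": 1, "income": 9000},
    {"rooms": 7, "persons": 2, "persons_over_65": 2, "income": 14000},
    {"rooms": 7, "persons": 5, "persons_over_65": 0, "income": 85000},
    {"rooms": 7, "persons": 1, "persons_over_65": 1, "income": 11000},
]
tree = group_by_rooms_and_seniors(units)

# Number of units in each over-65 subgroup of the 7-room group.
subgroup_sizes = {seniors: len(members)
                  for seniors, members in sorted(tree[7].items())}
```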
Fig 5 Empty-nesters
Upper: overview; Lower: zoom into group 7
Group – number of rooms (as shown in figure)
Color – household income (lighter – higher value)
Size – number of person(s) in each unit (larger – more people)
The above treemap also shows that more elderly people live alone than in couples. This is an interesting fact revealed by the treemap, and it is worth further study.
5. Comments
I found Treemap a very powerful tool when suitable data is fed in. Encoding different attributes so that a certain pattern reveals itself is also an art. So far, I have found the most effective way to show data in a treemap is to fit three attributes into the “group – color – size” triple.
However, there is a paradox. These visualization tools are most powerful when the user knows how to select data attributes and subgroups, as well as a visual encoding, so that a certain pattern stands out; yet it is hard to know how to do so before visualizing the data. Maybe this just reconfirms that visual analysis is an iterative process.
On my computer, whose memory and CPU power are limited, Treemap takes quite a while each time it draws the screen. In fact, it seems to take as long to redraw an existing treemap as to update one. That is one of the reasons why I preprocessed the data. Another reason is that HCE loads a text file much faster than an Excel file. HCE is also more stable when all missing values are filled in beforehand.
6. References
[1] Census 2000, U.S. Census Bureau, http://www.census.gov/main/www/cen2000.html
[2] Treemap, http://www.cs.umd.edu/hcil/treemap/
[3] HCE, http://www.cs.umd.edu/hcil/hce/