Introduction:

The Boston house price data was first published by Harrison, D. and Rubinfeld, D.L. in their 1978 paper ‘Hedonic housing prices and the demand for clean air.’ It is a very famous dataset in the field of statistical analysis; many researchers have used it to demonstrate the validity of alternative statistical techniques. The dataset used in this visualization is a corrected and expanded version that includes latitude and longitude information.

‘Hedonic’ in the title of the paper refers to hedonic analysis, a multivariate statistical technique for dealing with commodity heterogeneity, e.g. how to compare individual items when the items are sold only in packages. The question asked in the paper was: what is the demand for clean air in the region? Hedonic analysis can answer this question with a numerical estimate. Information visualization can’t give us precise numbers, but can it show that people are willing to pay a premium for cleaner air?

Description of Data:

Each record of the data set represents observations of housing prices in a particular census tract. The following is a list of the data items used in the data set and their descriptions:

CRIM	per capita crime rate by town
ZN	proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS	proportion of non-retail business acres per town
CHAS	Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX	nitric oxides concentration (parts per 10 million) (referred to as pollution in document)
RM	average number of rooms per dwelling
AGE	proportion of owner-occupied units built prior to 1940
DIS	weighted distances to five Boston employment centers
RAD	index of accessibility to radial highways
TAX	full-value property-tax rate per $10,000
PTRATIO	pupil-teacher ratio by town
B	1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT	% lower status of the population
MEDV	Median value of owner-occupied homes in $1000's (referred to as housing price in document)
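
The B entry in the list above is the only item defined by a formula, so a small worked example may help. The sketch below is illustrative only: the function name b_transform and the sample proportions are assumptions, not part of the data set. It shows that the transform is zero at Bk = 0.63 and that two different proportions can map to the same B value, so B cannot be inverted back to Bk.

```python
def b_transform(bk):
    """Compute B = 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    return 1000.0 * (bk - 0.63) ** 2

# Illustrative proportions only (not taken from the data set).
b_at_reference = b_transform(0.63)  # the transform is zero at Bk = 0.63
b_at_zero = b_transform(0.0)        # largest value reachable for Bk <= 0.63
b_below = b_transform(0.53)         # 0.10 below the reference point
b_above = b_transform(0.73)         # 0.10 above the reference point

# b_below and b_above agree up to floating-point rounding, which is why a
# given B value does not identify a unique Bk.
print(b_at_reference, b_at_zero, b_below, b_above)
```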

Visualization:

The following pictures show the geographical location of the data samples, color coded by town, with a corresponding map of Boston for reference.

Figure 1

Figure 2

Pollution levels tend to be continuous on a geographic level. The picture below shows the geographical location of the samples, with the pollution level coded by dot size (a large dot represents high pollution) and the price coded by color (blue represents high prices).

Figure 3

As you can see, areas of high pollution are clustered in Boston. There are also two extra clusters of pollution, though not as polluted, northeast of Boston. The sources of the pollutants are the main industrial centers of Boston. If we plot the distance of the houses to the industrial centers against the pollution level, one can see a clear exponential fall-off, which is consistent with how pollution diffuses in the atmosphere.

Figure 4
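
The exponential fall-off described above could also be checked numerically rather than by eye. The following is a minimal sketch, assuming the relationship has the form pollution = a * exp(-k * distance); it fits a straight line to the logarithm of the pollution values. The function name fit_exponential and the sample points are hypothetical; the points are generated from a known curve purely to show the mechanics and are not values from the Boston data.

```python
import math

def fit_exponential(xs, ys):
    """Least-squares fit of y = a * exp(-k * x) via a line fit to ln(y).

    Returns (a, k). All y values must be positive.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    slope = sxl / sxx
    intercept = mean_log - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points from a known curve (a = 0.8, k = 0.1), NOT real data.
distances = [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
pollution = [0.8 * math.exp(-0.1 * d) for d in distances]

a, k = fit_exponential(distances, pollution)
print(f"a = {a:.3f}, k = {k:.3f}")
```

With real samples, DIS would supply the distances and NOX the pollution values; a roughly straight line on a log scale would support the exponential reading of the graph.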

One can get a sense of how Boston has grown over the years by plotting the age of the houses against the distance to the employment centers.

Figure 5

The X-axis represents the percentage of houses built before 1940. The Y-axis represents the distance to the industrial center, i.e. the center of Boston. The graph shows the growth of the suburbs. The highlighted anomalous samples are located in Salem and Lynn, neither of which can be considered a suburb of Boston. The spread of the suburbs is a consequence of the rise of the middle class after World War II. The following graph supports this interpretation.

Figure 6

The color of the sample represents the housing price (blue is expensive); the size of the dots is the pollution level; the X-axis is the percentage of houses built before 1940; the Y-axis is the percentage of lower-status population. The share of lower-status residents falls as the houses get newer, and because the price range of the houses isn’t very high, one can assume that it is the middle class that is increasingly buying the newer houses.

If we examine the housing prices on the map, there seems to be no correlation between pollution levels and housing prices, which suggests that air quality is not a major factor in housing prices. There seems to be a mix of high- and low-priced houses even in the polluted areas. The following graph highlights just the most expensive houses.

Figure 7

Pollution levels vary widely even among the expensive houses. However, a direct comparison of the level of nitric oxides vs. median housing prices shows that air quality does matter.

Figure 8

There are two interesting features of the graph, in which each color represents a town. One is the conspicuous absence of high-valued houses in areas of high pollution (above 0.67 parts per 10 million). The other is the absence of low-valued houses in areas of low pollution. We can’t conclude anything from this graph alone, because we already know that newer houses tend to have less pollution and are priced higher (see Figure 6). The graph below shows pollution vs. housing prices for the town of Cambridge. The dichotomy of the housing prices is striking. Because many of the independent variables are the same for those data points, one can conclude with certainty that quality of air does affect housing prices.

Figure 9
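
The idea of looking at one town at a time, so that town-level variables are held roughly constant, can be sketched in code as a per-town correlation between pollution and price. This is only an illustration of the approach, not the method used to produce the graphs: the record layout (keys town, nox, medv), the function names, and the two made-up towns are all assumptions, and the numbers are synthetic rather than taken from the data set.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def correlation_by_town(records, min_samples=3):
    """Group records by town and correlate NOX with MEDV within each town."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["town"]].append((rec["nox"], rec["medv"]))
    result = {}
    for town, pairs in groups.items():
        if len(pairs) >= min_samples:  # skip towns with too few tracts
            nox, medv = zip(*pairs)
            result[town] = pearson(nox, medv)
    return result

# Synthetic records for two hypothetical towns, NOT real data.
records = [
    {"town": "Town A", "nox": 0.40, "medv": 30.0},
    {"town": "Town A", "nox": 0.50, "medv": 25.0},
    {"town": "Town A", "nox": 0.60, "medv": 20.0},
    {"town": "Town B", "nox": 0.45, "medv": 22.0},
    {"town": "Town B", "nox": 0.55, "medv": 24.0},
    {"town": "Town B", "nox": 0.65, "medv": 21.0},
]

per_town = correlation_by_town(records)
for town, r in per_town.items():
    print(town, round(r, 3))
```

A strongly negative coefficient within a single town would point the same way as the Cambridge graph, although a correlation on a handful of tracts is only suggestive.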

Critique of SpotFire:

Because SpotFire is such a general tool, one can show geographical locations of data. However, this is of limited use without a map overlay. A picture can be used as a background, but the tools for scaling and positioning the background are too primitive to align it well with the data.

Another problem that came up while exploring the data, especially when changing the axes of a graph, is that there is no way of saving a particular graph setting. You can open another graph window, but there is no way of copying the same view over automatically, which would be useful for a multi-window view of the same data. The developers could add save slots that store the current settings of the active window, accessible by shortcut keys. These saved settings could then be applied to any window. Another very nice feature would be undo/redo, akin to the back and forward buttons of a web browser.
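
To make the suggestion concrete, here is a rough sketch of how save slots and browser-style undo/redo might fit together. It is purely hypothetical and says nothing about how SpotFire is actually built: the class names ViewSettings and ViewHistory, the slot key "F1", and the chosen axis names are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViewSettings:
    """Hypothetical snapshot of one graph window's configuration."""
    x_axis: str
    y_axis: str
    color_by: str = ""
    size_by: str = ""

class ViewHistory:
    """Browser-style history: changing the view clears the redo stack."""

    def __init__(self, initial):
        self.current = initial
        self._back = []
        self._forward = []

    def change(self, settings):
        self._back.append(self.current)
        self._forward.clear()
        self.current = settings

    def undo(self):
        if self._back:
            self._forward.append(self.current)
            self.current = self._back.pop()
        return self.current

    def redo(self):
        if self._forward:
            self._back.append(self.current)
            self.current = self._forward.pop()
        return self.current

# Slots are shared between windows, so a view saved in one can be applied to another.
saved_slots = {}

window_a = ViewHistory(ViewSettings("AGE", "DIS"))
window_a.change(ViewSettings("NOX", "MEDV", color_by="TOWN"))
saved_slots["F1"] = window_a.current   # save the active window's view

window_b = ViewHistory(ViewSettings("AGE", "LSTAT"))
window_b.change(saved_slots["F1"])     # apply the saved view to another window
```

Because the settings object is immutable, the same snapshot can safely sit in a slot and in several windows' histories at once.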

Other than those problems, exploration of the data was easy and painless. The hard part was learning the data well enough to explain trends and outliers, not finding them.