Visualizing Missing Data: Classification and Empirical Study
Abstract. Most visualization tools fail to provide support for missing data. We identify sources of missing data and categorize data visualization techniques based on the impact missing data have on the display: region dependent, attribute dependent, and neighbor dependent. We then report on a user study with 30 participants that compared three design variants. A between-subject graph interpretation study provides strong evidence for the need to indicate the presence of missing information, and some direction for addressing the problem.
Information visualization provides an effective way for users to rapidly find trends in data and values of attributes of interest. The use of color, position, and shape helps users see patterns and outliers. Preserving the integrity of data exploration requires the use of visualization techniques that present data accurately without introducing misleading patterns or masking data properties. In particular, we believe that poor handling of missing and uncertain information can have a strong influence on users’ interpretation of the data (Fig. 1).
Fig. 1. In this figure the data seems to be stable, with a sharp increase starting in 88. Practically no data was collected until 89, so this interpretation is wrong.
When data is missing (e.g. there is an empty cell in a data table), many tools will simply crash. Others will politely ask users to “fix” the problem, which most users do by entering a value such as zero. As a result, it is often impossible for others to distinguish a value of zero from missing data. This paper categorizes possible reasons for data to be missing, differentiates three types of visualization techniques according to the impact missing data can have on the display and its interpretation, and reports on a user study comparing three implementations.
2. Sources of missing data
As part of our research on making government statistics more accessible to the public (see Govstat project http://ils.unc.edu/govstat/) we found five main reasons for data to be missing:
The most trivial reason for missing data is that data was simply not collected. Equipment or sensors can malfunction, a survey can be misprinted, and files can be lost.
Data Source Confidentiality
Privacy protection can affect how findings are presented when publishing results of human-centric surveys or experiments. When the publication of a value might provide clues to the identity of individuals, that data must be omitted or presented aggregated at a higher level. For instance, when an organization publishes the average salaries of employees based on position and gender, the actual salary of the only female Vice President will be revealed. Publishing an empty cell is a solution, but if the number of male Vice Presidents is known, the aggregated data by position will also indirectly reveal her salary and should be omitted as well.
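The indirect disclosure described above is simple arithmetic, and a short sketch makes it concrete. The function name and all salary figures below are hypothetical, chosen only for illustration: given the published average for a whole position and the published average and count for the non-suppressed group, the suppressed group's average can be recovered exactly.

```python
def recover_suppressed_mean(avg_all, n_all, avg_known, n_known):
    """Recover the mean of a suppressed group from two published aggregates.

    avg_all, n_all     : published mean and count for the whole position
    avg_known, n_known : published mean and count for the non-suppressed group
    """
    n_hidden = n_all - n_known
    if n_hidden <= 0:
        raise ValueError("no suppressed records to recover")
    # Total for the hidden group = overall total minus the known group's total.
    total_hidden = avg_all * n_all - avg_known * n_known
    return total_hidden / n_hidden


# Hypothetical figures: 5 Vice Presidents overall, 4 of them male.
her_salary = recover_suppressed_mean(
    avg_all=200_000, n_all=5, avg_known=195_000, n_known=4
)
print(her_salary)  # 220000.0
```

Because the suppressed group contains a single person, the recovered mean is her actual salary, which is why the aggregate by position has to be withheld along with the empty cell.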
Redefined Data Categories
In statistical and demographic computation, data is often aggregated into classes or ranges. Although aggregation is often necessary for efficient data presentation, problems arise when a class or range is redefined after data has been compiled.
Mutually Exclusive Multivariate Combinations
There are instances when combinations of data variables are impossible or highly improbable. Consider a dataset whose two variables are age and cause of death by firearm. Since it is not realistic to record that a child of less than five years of age committed suicide, such a category of data can be described as non-existing instead of having a value of zero.
Uncertainty Deemed Excessive
In some cases problems with small sample size, flawed methodology, and lack of data to use for estimation can contribute to high uncertainty for certain data values. The authors of a study or report might decide to publish a simplified version of the dataset that does not include data with uncertainty over a certain threshold.
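One way to keep these five sources distinguishable from a genuine zero is to store a reason alongside every empty cell. The following is a minimal sketch under our own assumptions, not a description of any existing tool: the names `MissingReason`, `Cell`, and `suppress_uncertain`, the label `NOT_COLLECTED` for the first source, and the 0.3 relative-error threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MissingReason(Enum):
    NOT_COLLECTED = "data not collected"  # label chosen here for the first source
    CONFIDENTIAL = "data source confidentiality"
    REDEFINED_CATEGORY = "redefined data categories"
    IMPOSSIBLE_COMBINATION = "mutually exclusive multivariate combination"
    EXCESSIVE_UNCERTAINTY = "uncertainty deemed excessive"


@dataclass(frozen=True)
class Cell:
    """A table cell holding either a value or the reason it is missing."""
    value: Optional[float] = None
    reason: Optional[MissingReason] = None

    def __post_init__(self):
        # Exactly one of value / reason must be set.
        if (self.value is None) == (self.reason is None):
            raise ValueError("a cell needs either a value or a reason")

    @property
    def is_missing(self):
        return self.value is None


def suppress_uncertain(value, relative_error, threshold=0.3):
    """Withhold a value whose relative error exceeds a (hypothetical) threshold."""
    if relative_error > threshold:
        return Cell(reason=MissingReason.EXCESSIVE_UNCERTAINTY)
    return Cell(value=value)


zero = Cell(value=0.0)                      # a real zero
hidden = suppress_uncertain(12.0, 0.5)      # withheld, with its reason kept
print(zero.is_missing, hidden.is_missing)   # False True
```

With this representation a zero and a missing value can never be confused, and a display can later use the stored reason to explain the gap.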
3. Classification of Visualizations
All visualizations use graphic elements to represent data, and we found that there are three categories of techniques (with respect to how much impact missing data has on the display) depending on how the position of the graphic elements is computed. The position of the graphic elements can be: 1) dedicated to the data item independently of the attribute values, 2) entirely a function of attribute values, or 3) a function of the attribute values and the values of neighboring items.
An example of the first category (“dedicated”) is a line graph in which the graphic object representing a data value is a dot with a dedicated X location. The values of other data items have no influence on the position of the graphic object. At most, the minimum and maximum values impact axis calibration. Choropleth maps and techniques relying on ordering can fall in this category. For this type of visualization, if the data is missing then no object is displayed at the corresponding location, and the absence of data should be easily detected since users will be expecting to see a value there (Figure 2).
Fig. 2: Voids can be easily detected when there is a dedicated location for each data object
An example of the second category (“attribute dependent”) is a scatter plot. In a scatter plot the position, color, and size of a graphical object are entirely based on the data item attribute values. If a data item is missing, there is nothing in the basic scatter plot display that indicates the existence of a missing data value (Fig. 3).
Examples of the third category (“neighbor dependent”) are pie charts and treemaps. Here, the size and placement of a wedge or box representing the data item is a function of both the data item attribute values and neighboring items. If a data item is missing, simply omitting it from the display will not only go unnoticed but it will also bias the appearance of other items (Fig. 4). This is a characteristic of all the space-filling techniques.
Fig. 3: Attribute dependent example: In a scatter plot missing data is not noticeable.
Fig. 4: Neighbor dependent example: In a pie chart, not only is missing data not noticeable but it also biases the other data items (by making the other wedges larger than they should really be).
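The bias in the neighbor dependent case can be checked numerically. The sketch below uses made-up values and a hypothetical helper, `wedge_shares`, that mimics a naive pie chart by silently dropping missing items before computing each wedge's share.

```python
def wedge_shares(values):
    """Share of the pie for each item, silently dropping missing (None) items."""
    present = {k: v for k, v in values.items() if v is not None}
    total = sum(present.values())
    if total <= 0:
        raise ValueError("nothing to draw")
    return {k: v / total for k, v in present.items()}


# Hypothetical data: the same four items, once complete and once with C missing.
complete = {"A": 40, "B": 30, "C": 20, "D": 10}
with_gap = {"A": 40, "B": 30, "C": None, "D": 10}

shares_full = wedge_shares(complete)
shares_gap = wedge_shares(with_gap)

print(shares_gap["A"])  # 0.5 instead of 0.4 with complete data
```

Item C leaves no trace in the second chart, and every remaining wedge is drawn larger than it should be, which is exactly the combination of effects that distinguishes this category from the attribute dependent one.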
Cases of hybrid techniques exist as well. For example, with parallel coordinates, an omitted data item will go unnoticed because the position of the line is entirely a function of attribute values; but a missing attribute value might be noticed as the location for that attribute is dedicated and the line can be rendered broken or connected to a separate location for missing values.
We found three data visualization enhancements that could be used to provide effective indication of missing data and confidence intervals: dedicated visual attributes, annotation, and animation.
Dedicating visual attributes essentially involves associating color, texture, shape, or any combination of these with data point appearance in order to indicate missing values or specify confidence ranges. Annotation, on the other hand, allows users to gain further insight into missing and unreliable data through text or graphic information presented outside the graphic elements themselves. Lastly, animation can provide a series of data display transitions that allow users to view several different perspectives in a short period of time. Animation can be helpful in temporarily highlighting missing data and then adding estimated values, based on the preference and/or exploration goals of the user.
4. Related Work
Researchers in scientific visualization have given more attention than those in information visualization to missing data as well as uncertain data. In addition to specifically identifying sources of uncertainty, Pang et al. [12] discuss a classification of methods and present an overview of visual attributes that can be modified to indicate uncertainty. Pham and Brown [13] propose a list of relevant visual features that can be used to indicate data value imprecision (including hue, luminance, size, transparency, depth, texture, and blur) and present examples of “fuzzy” data. Cedilnik and Rheingans [2] also provide clues and annotations such as grid lines. Restorer [16] uses grayscale to indicate missing (and therefore estimated) data on a color map. Djurcilov and Pang [4] discuss visualization techniques they used to analyze a sparsely populated meteorological dataset. Here a missing value is not an error but an indication that no phenomena were observable at a given point. They argue that missing data points should not be estimated (as is usually the case), but presented in a way that alerts the user to “non-observation”. In contrast, Dybowski and Weller [5] address the problem of displaying missing information to users by computing estimates and ranges.
MANET [17] and XGobi [15] attempt to make users aware of missing data and uncertainty. They use a complementary display that indicates the proportion of missing data. For example, in XGobi a scatterplot is shown in two windows. One contains the data; the other displays a shadow plot that indicates the data values that are complete, or missing the x, the y, or both attributes. Our exploration of the existing techniques highlights the diversity of techniques and the challenge of providing visualization techniques that alert, yet do not distract. A common problem with the existing techniques is that missing or uncertain data often ends up catching users’ eye more than the “good” data.
Empirical studies reporting on how users deal with missing or uncertain data are rare. Other studies involving graph interpretation (e.g. [18, 19, 20]) assume a complete data set. The following section discusses the pilot study we conducted to better understand how users interpret simple graphs that include missing data.
5. Empirical study
Our goal was to study users’ ability to compare data values and draw accurate conclusions about trends when data is missing, using three different displays. We wanted to be able to observe users dealing with missing data without making it obvious that missing data was the focus of our study, so each group of participants used only one of the three interfaces (i.e. we used a between-subject design) and we asked them to answer some questions that involved missing data as well as some questions for which all the data was available.
Thirty people from the
Microsoft Excel was used to create four separate time-sequence graphs. The graphs were then modified in a graphic presentation tool to transform them as necessary into one of the model variants. A tool was developed in C# to automate the presentation of the questions and displays, and collect time and preferences.
Figures 5-7 show three displays of the same data. In the Misleading display (Fig. 5), missing data values are encoded as 0. In the Absent display (Fig. 6), missing data is completely omitted from the display, and the line graph appears broken where no data exists. The Coded display (Fig. 7) also omits missing data points, but it adds an icon on the next present data point in the series that indicates why the prior data points are missing from the data set.
Fig. 5: Misleading Display - Missing data points are replaced by a default value (0).
Fig. 6: Absent Display - Missing data points are omitted.
Fig. 7: Coded Display - Missing data points are omitted, and the next valid point in the series has a mark which provides the reason why prior points are missing.
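The three variants can be viewed as three transformations of the same series. The sketch below is only an illustration of that idea, not the procedure used to build the study graphs (those were drawn in Excel and edited by hand); the function names, the sample series, and the reason strings are all hypothetical. Missing points are represented as `None`.

```python
def misleading(series, default=0):
    """Misleading display: replace every missing point with a default value."""
    return [default if v is None else v for v in series]


def absent(series):
    """Absent display: runs of present (index, value) points; gaps break the line."""
    segments, current = [], []
    for i, v in enumerate(series):
        if v is None:
            if current:
                segments.append(current)
                current = []
        else:
            current.append((i, v))
    if current:
        segments.append(current)
    return segments


def coded(series, reasons):
    """Coded display: the Absent segments plus a marker {index: reason}
    on the first present point that follows a run of missing points."""
    markers, in_gap, gap_reason = {}, False, None
    for i, v in enumerate(series):
        if v is None:
            in_gap = True
            gap_reason = reasons.get(i, gap_reason)
        elif in_gap:
            markers[i] = gap_reason or "unknown"
            in_gap, gap_reason = False, None
    return absent(series), markers


series = [None, None, 3, 4, None, 6]
reasons = {0: "not collected", 4: "confidential"}

print(misleading(series))         # [0, 0, 3, 4, 0, 6]
print(absent(series))             # [[(2, 3), (3, 4)], [(5, 6)]]
print(coded(series, reasons)[1])  # {2: 'not collected', 5: 'confidential'}
```

Only the first transformation destroys information: once the default has been substituted, the resulting list cannot be told apart from a series that really contains zeros.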
We hypothesized that participants using Coded or Absent displays would be more accurate than participants using the Misleading display. We predicted that participants with Absent displays would have a shorter response time because they would have less information to digest, and that confidence and accuracy would be similar for users of both Absent and Coded displays and higher than Misleading. We thought that users would prefer the Coded version because it provides explanations.
Participants signed the Informed Consent form and watched a brief slide show which explained a sample graph of the type they would be using during the study. Instructions for answering comparison-based questions were provided. More specifically, to ensure uniformity in responses, participants were advised to answer questions of the form “Compare the value of X to Y at time t” in the form “X is greater/lower than Y”. Next, each participant was given a brief overview of how the study would be executed.
They answered 13 questions. For each question the procedure was the same. The written question appeared on the screen. Once they had read the question and felt that they were ready to continue, they would click a button and a graph was displayed for five seconds, then hidden. The question reappeared along with a set of multiple-choice responses. For every question users could reply that they didn’t have enough information to answer. After they had selected an answer (based on recall) and provided a confidence rating from 1 to 10, the graph reappeared and they were given a second opportunity to answer the same question while viewing the graph. The first answer measured the accuracy attained after a rapid glance at the graph, while for the final answer users had time to study the graph more carefully. After completing the study (using only one type of display: Misleading, Absent or Coded), users were shown examples of the other two displays and asked to choose the display they would prefer to use to answer the type of questions they had been given.
During the entire 20-minute session, the experimenter was seated beside the participants. She answered questions before the start of the experiment, observed participants and then asked clarifying questions after the experiment. There were four types of questions (the parentheses contain the notation used in the result charts):
- Value Comparisons where both points were Present (CP)
- Trend-related questions concerning only Present data (TP)
- Value Comparisons where one of the two points was Missing (CM)
- Trend-related questions involving Missing data (TM)
The data was made up but realistic, carefully chosen so that users could not draw conclusions from their knowledge of the world, but only from the graph data they saw. For example, data was about preferences of people from other planets, or imaginary illnesses. A complete list of sample graphs and questions used can be found at: www.cs.umd.edu/hcil/govstat/cyntricadata.html.
Fig. 8 shows the average number of correct answers based on recall after a 5-second glance at the data. For each display there were 10 participants, so a value of 10 means that all participants answered the question correctly every time, and a value of 0 means that none of the participants were able to answer the question correctly. For questions where all the data was present (CP and TP) users made a few mistakes, but the striking result is that none of the users were able to correctly answer any of the questions involving missing data (CM and TM) using the Misleading display (remember that this is a commonly used way to present missing data). In each instance, participants indicated a definite trend or made a comparison between values as opposed to indicating that there was not enough information to answer the question. Even after being given more time to look at the display, they rarely changed their answers (Fig. 9). Users performed better with the Absent and Coded displays, but trends were still a problem, with great variability among users.
Fig. 8: The average number of correct responses based on recall after a 5 sec. glance at the data. The right 2 sets of bars show that users using the misleading display could not answer any of the questions correctly when missing data was involved (CM and TM).
Fig. 9: The average number of correct final responses given while viewing the graph directly on the screen. Overall, users didn’t change their answers when given more time.
Our hypothesis that participants with Coded and Absent displays would be more accurate than their counterparts using the Misleading displays was supported. The differences were significant when users compared a missing value with a present data point (p < 0.05 for CM questions) but less so when users had to describe a trend that incorporated missing values (p < 0.10 for TM questions). A closer look at the results showed that none of the participants using the Absent display answered two questions correctly. Both of these questions involved trend lines in which data was missing from the display. In both cases, the majority of users seemed to have constructed a confident opinion about the trend in the data based only on a few points of data shown in the display, as opposed to concluding that they did not have enough information to decide.
This supports our initial claim that poor indication of missing values can have a negative impact on data interpretation, but also suggests that even when missing data is indicated clearly users may not resist the temptation to find trends in partial data.
No significant differences between displays were found for confidence (Fig. 10 and 11). Users were confident in their answers. The average confidence value was nearly 8 for each of the models and for all of the questions, after 5 seconds and also when given more time.
Fig. 10: Users were very confident after viewing the graphs for only 5 seconds, even in treatments where they made lots of errors (in CM and TM).
Concerning the time to answer, no significant differences were found either, contradicting our hypothesis (Fig. 12). For six of the thirteen questions answered, users with Coded displays had longer average response times. For four questions Absent displays had the longest response times, while only two questions required more time to answer with the Misleading displays. Users of the Misleading displays seemed to behave as if the display was relatively straightforward and did not feel that they needed an extended period of time to ponder a response, while some users of the other displays seemed to hesitate more, but not all of them did so.
Eight users never changed their mind between the first answer and the final answers, while seventeen users made one or two changes, and five users made three or more changes. On a category-by-category breakdown, of the eight participants who changed answers with the Misleading display, an average of two answers were modified with an average of one answer actually being changed to the correct answer. Users with Absent displays changed an average of three questions, with an average of two modified to the correct response. Finally, users with Coded displays modified an average of two responses with an average of two actually being modified to the correct reply. Of the ten participants using the Misleading display, only one (a math major) commented at the end of the test that he was starting to suspect that missing data might have been an issue.
Fig. 11: The final confidence level remains very high.
Fig. 12: The average time to give the final answer (while directly viewing the graphs). There were no significant differences.
When asked about their preference at the end of the test, 27 users out of 30 selected the Coded display. They commented that they liked the idea of having more information available. Surprisingly, two participants favored the Absent display over the other two because they felt the Coded display was confusing. In the Coded display, the first present data point to appear after a series of missing points is encoded to convey the reason why previous data values were not available, and this was found confusing. Finally, one user preferred the Misleading display because he liked the continuity of the graphs.
Accurately displaying missing and uncertain data presents an interesting challenge for information visualization. We hope that our general classification of visualization techniques will provide a useful basis for building and comparing techniques that represent missing data. Our study looked at how users interpret graphs with missing data. It suggests that users may not realize that data is missing when it is replaced by a default value. In real situations, the rate of error might be reduced because users can take advantage of their world knowledge to spot unlikely values. Furthermore, the study revealed that even if the missing data is noticeable, users are compelled to make general conclusions with partial data.
Participants preferred the Coded display that provided additional information on the reason for the data to be missing. Some subjects voiced concern about the actual design of the Coded display, suggesting that improvements could be made. Further studies of the impact of missing data on the more difficult cases of attribute dependent visualizations and neighbor dependent visualizations are needed as well.
This research was supported in part by the
1 Babad, Y.M., Hoffer, J.A. 1984. Even No Data Has a Value. Communications of the ACM 27(8), 748-756
2 Cedilnik, A., Rheingans, P. 2000. Procedural Annotation of Uncertain Information. IEEE Visualization, 77-83
3 Chi, E. H. 2000. A Taxonomy of Visualization Techniques Using the
4 Djurcilov, S., Pang, A. 1999. Visualizing Gridded Datasets with Large Numbers of Missing Values. IEEE Visualization, 405-408
5 Dybowski, R., Weller, P. 2001. Prediction Regions for the Visualization of Incomplete Datasets. Computational Statistics 16(1), 25-41
6 Ehlschlaeger, C. 1998. Exploring Temporal Effects in Animations Depicting Spatial Data Uncertainty. Available at: http://www.geography.hunter.cuny.edu/~chuck/aag98/
7 Gershon, N. 1999. Knowing What We Don’t Know; How to Visualize an Imperfect World. ACM SIGGRAPH Computer Graphics 33(3), 39-41
8 Healey, C. G., Booth, K. S., and Enns, J. T. 1996. High Speed Visual Estimation Using Preattentive Processing. ACM Transactions on Computer-Human Interaction 3(2), 107-135
9 Howard, D., MacEachren, A. 1996. Interface Design for Geographic Visualization: Tools for Representing Reliability. Available at: http://www.geovista.psu.edu/publications/others/howard/howmac96.html
10 MacEachren, A. M., Brewer, C. A., and Pickle, L. 1998. Visualizing Georeferenced Data: Representing Reliability of Health Statistics. Environment and Planning A 30, 1547-1561
11 Olston, C., and Mackinlay, J. 2002. Visualizing Data with Bounded Uncertainty. In Proceedings of the IEEE Symposium on Information Visualization, 37-40
12 Pang, A. T., Wittenbrink, C. M., Lodha, S. K. 1996. Approaches to Uncertainty Visualization. Technical Report
13 Pham, B., Brown, R. 2003. Visualization: An Analysis of Visualization Requirements for Fuzzy Systems. First International Conference on Computer Graphics and Interactive Techniques, 181-187
14 Shneiderman, B. 1996. The Eyes Have It: A Task by Data Type Taxonomy of Information Visualizations. IEEE Visual Languages, 336-343
15 Swayne, D. F., Buja, A. 1998. Missing Data in Interactive High-Dimensional Data Visualization. Computational Statistics 13(1), 15-26
16 Twiddy, R., Cavallo, J., and Shiri, S. 1994. Restorer: A visualization technique for handling missing data. In IEEE Visualization 94, 212-216
17 Unwin, A., Hawkins, G., Hofmann, H., Siegl, B. 1996. Interactive Graphics for Data Sets with Missing Values – MANET. Journal of Computational and Graphical Statistics 5(2), 113-122
18 Beichner, R. 1994. Testing Student Interpretation of Kinematics Graphs. American Journal of Physics 62, 750-762
19 Roth, W., Gervase, M.B. 2003. When Are Graphs Worth Ten Thousand Words? An Expert-Expert Study. Cognition and Instruction 21(4), 429-473
20 Brasseur, L. 1999. The Role of Experience and Culture in Computer Graphing and Graph Interpretive Processes. Proceedings of the 17th Annual International Conference on Computer Documentation, 9-15
C., Plaisant, C., Drizd, T. 2003. The Challenge of Missing and Uncertain Data. Poster in the Visualization 2003 Conference compendium, IEEE, 40-41
[*] At the time this research was conducted, Terry Drizd was working at the