An Exploratory Study of Video Browsing User Interface Designs and Research Methodologies: Effectiveness in Information Seeking Tasks

Tony Tse, Sandor Vegh, Gary Marchionini*, Ben Shneiderman
University of Maryland, College Park, Maryland
*Current Address: University of North Carolina, Chapel Hill, North Carolina


The purpose of this exploratory study is to develop research methods to compare the effectiveness of two video browsing interface designs, or surrogates—one static (storyboard) and one dynamic (slide show)—on two distinct information seeking tasks (gist determination and object recognition). Although video data is multimodal, potentially consisting of images, speech, sound, and text, the surrogates tested depend on image data only and use key frames or stills extracted from source video. A test system was developed to determine the effects of different key frame displays on user performance in specified information seeking tasks. The independent variables were interface display and task type. The dependent variables were task accuracy and subjective satisfaction. Covariates included spatial visual ability and time-to-completion. The study used a repeated block factorial 2x2 design; each of 20 participants interacted with all four interface-task combinations. No statistically significant results for task accuracy were found. Statistically significant differences were found, however, for user satisfaction with the display types: users assessed the static display to be "easier" to use than the dynamic display for both task types, even though there were no performance differences. This methodological approach provides a useful way to learn about the relationship between surrogate types and user tasks during video browsing.


Digitized video is becoming commonplace as both network bandwidth and processing power increase while costs decrease. Applications include digital libraries, video conferencing, and video-on-demand in areas such as medicine, education, and entertainment. Consequently, efficient video retrieval and management tools for end-users are needed. For small collections of videos with well-known attributes (e.g., genre or title), existing text-based information retrieval techniques are sufficient. However, as the number of video records increases and their attributes become less clearly defined (e.g., sonograms, video conferencing, and videotaped lectures), finding videos relevant to users' needs becomes more problematic. Clearly, effective mechanisms for searching video data are required.

One approach is to use physical features such as color, motion, shapes, or brightness data in the indexing and retrieval processes. Algorithms that detect changes in these properties have been used to automate video indexing; end-users search on properties and retrieve a set of video documents that match the criteria. For example, IBM's Query by Image Content (QBIC) system indexes physical attributes and allows users to create visual queries including drawing shapes of target objects or identifying colors known to be in desired scenes (Flickner et al., 1995). Another approach is to combine a variety of data channels for "higher resolution" indexing. For example, Carnegie Mellon University's Informedia Project takes advantage of non-visual features such as speech recognition of dialog or closed caption text in addition to the shot detection algorithms to automate the indexing of video segments (Wactlar, Kanade, Smith, and Stevens, 1996).

However, once video data has been indexed using such algorithms, efficient mechanisms for users to select the most relevant video documents or segments for their specific needs are required. Relevance criteria are likely to vary among users and task-specific needs. In addition, many of these criteria are not easily expressed explicitly; unlike text-based indexes, a standardized grammar has not yet been developed for images. Thus, a video retrieval interface based solely on formal analytical strategies is not likely to satisfy user needs.

The approach used in this study provides users with still images representing short segments extracted directly from a video document (i.e., key frames) for direct visual inspection. Users can browse the data set for visual nuances that may be of interest and importance in making relevance judgements about whether to explore a video further. The question becomes how best to display the surrogates in an interface to optimize browsing, "…an approach to information seeking that is informal and opportunistic and depends heavily on the information environment" (Marchionini, 1995, p. 100). In this study, two "information environments" or interface designs that support rapid browsing of key frames (static, or storyboard, and dynamic, or slide show) are tested for their effectiveness in different types of information seeking tasks (gist determination and object recognition).

Humans excel at making judgments and planning complex actions, whereas machines are good at repetitive tasks (Shneiderman, 1998). In visual searching, humans are much better at rapidly finding patterns, recognizing objects, generalizing or inferring information from limited data, and making relevance decisions (Helander, 1988). Machines are much more efficient at measuring and detecting discrete changes in physical properties, organizing and storing large amounts of data, and creating large numbers of video representations.

The framework used in this study leverages differences between humans and machines: Machines are used to organize and manipulate large amounts of digital video and filter the number of potentially relevant documents in response to a user query (the analytical approach). When presented with video surrogates in some organized manner (e.g., rank-ordered), users browse surrogates created by the system and decide which ones are most relevant to their needs. In this manner, the speed and accuracy of computer systems in large-scale, repetitive actions complement the power of the human visual and decision-making systems.

Review of the Literature: Video Surrogates

A number of different video surrogates have been proposed in the literature. O'Connor (1991) described using "contour maps" or individual frames extracted directly from a video document that are representative of the most important events. Key frames are the fundamental units of the video browsing interface designs in this study. Yow and Yeung (1997) created "video posters" to abstract highlights of video segments. Salient Stills (Teodosio and Bender, 1993) used optical flow computations for creating surrogates that preserve motion data by selectively representing objects in motion while keeping the background constant.

Other types of video surrogates have employed "higher order" structures to emphasize temporal relationships between key frames. The Video Streamer (Elliot, 1993), a stack of still video images, formed a three-dimensional video "block." Patterns along the edges could be used to identify scene changes and motion. Zhang, Low, and Smoliar (1995) used a hierarchy of key frames and provided users with control over the level of resolution for viewing the surrogate. At the root level, a single key frame represented the entire video. At lower levels, greater numbers of key frames could be revealed. All key frames were presented "filmstrip style" at the lowest level. By using the key frames themselves as indices, this surrogate allowed viewers to zoom in, conceptually, on a specific portion of the video while minimizing screen space required by showing a limited number of stills at any given time. Yeung, Yeo, Wolf, and Liu (1995) used a hierarchical scene transition model with key frames as nodes and connections between edges to represent motion among the nodes. Wactlar et al. (1996) devised a video skimming technique that preserves the motion of the original video. In contrast to the surrogates that use static mechanisms to represent motion, the video skim is dynamic in that it is, itself, a short video segment from a previously identified video event, used to represent a longer segment of video. Whereas the static surrogates are analogous to movie posters, dynamic surrogates, such as the skim, are similar, conceptually, to movie previews or coming attractions.

Many different and innovative ways to abstract or represent video have been proposed and devised. Since each surrogate type requires only a fraction of the time to view, as compared to full-motion video, many more videos can be considered within the same unit of time. For a discussion of time vs. accuracy trade-off and compaction measurements using various video surrogates, see Tse, Marchionini, Ding, Slaughter, and Komlodi (1998). In theory, each of these techniques saves users time and effort by providing data in highly compact and abbreviated formats while maintaining the "essence" of the video data. But how well do they work in supporting user information seeking?

Empirical data from several studies conducted at the University of Maryland have addressed some usability and effectiveness issues for video browsing surrogates. Ding, Marchionini, and Tse (1997) investigated the effect of keyframe display rate in the slide show interface, a dynamic surrogate, on human perception and task performance. Participants completed two tasks, object identification and gist determination, at various display rates, measured as keyframes per second (kfps). In the former task, users indicated whether specific objects were present in any of the key frames they browsed, while the latter task required users to identify the thematic or narrative content of the segment. Preliminary data showed that accuracy for the object identification task decreased as display rate increased, with the biggest performance degradation between 8 and 12 kfps. In addition, the participants perceived that, at a given display rate, gist determination was "easier" than the object identification task. Slaughter, Shneiderman, and Marchionini (1997) explored the effects of multiple simultaneous slide show displays, an alternative type of dynamic video surrogate, on object recognition and gist determination. Participants completed two tasks, object recognition and gist comprehension, after viewing up to four slide shows at a time, each presenting different video segments at 1 kfps each. The data showed that effectiveness decreased as the number of simultaneous displays increased, with the largest drop in performance at four simultaneous displays. In comparing the effectiveness of the slide show display, a dynamic surrogate, with a static storyboard display, Komlodi and Marchionini (1998) concluded that static displays were better than slide shows for object identification but there was no overall difference between display types for gist determination. Furthermore, subjective satisfaction slightly favored the static display. 
This exploratory study builds on and extends the previous research on video browsing surrogates.

Statement of the Problem

Overall, the goal was to design a more systematic methodology for conducting user studies on video browsing surrogates using well-defined user information seeking tasks under controlled conditions. Specifically, this exploratory study investigated the effectiveness of two types of video surrogates (storyboard and slide show) on user performance in completing two task types (gist determination and object recognition). Although Komlodi and Marchionini (1998) found a performance tradeoff between dynamic and static video browsing displays, the nature of this tradeoff was not clear. For example, do particular surrogates support the performance of particular tasks better than others—is there an interaction effect? New approaches and methodologies (e.g., gist determination task) were used in this study to address such questions raised by previous work. Such methodologies, once refined, could then be used to collect empirical data on the effectiveness of any video browsing surrogate for a battery of user needs. The results would allow for a direct comparison among different surrogates for a particular task.

Surrogate Types. The display types, as shown in Figure 1, represent distinct categories of video surrogates. The storyboard (SB) surrogate is a static display. All key frames are displayed in an array and users must scan them left to right and top to bottom, like viewing a contact sheet or reading a comic strip: each subsequent frame provides an image representing the next major event. Viewers must mentally fill in the events between frames. The slide show (SS) surrogate, on the other hand, is dynamic and requires less visual scanning. Each key frame is "flashed" on the screen sequentially for a limited amount of time, allowing users to fix their eyes on the single location where the images are displayed. Conceptually, the SS design is more similar to video, as it preserves the temporal dimension through motion.

Figure 1. Schematic of the storyboard (SB) video surrogate, a static display, and the slide show (SS) video surrogate, a dynamic display.

Task Types. The selected tasks represent two different types of user information seeking needs. Gist determination (GD) represents tasks in which users are trying to learn what the video is about, a goal-oriented task. For example, a biology teacher may wish to select some footage illustrating different types of social behavior among lions. In contrast, object recognition (OR) is task-oriented. The user needs to be able to recognize whether a particular object or relationship exists in any of the key frames (e.g., an astronomy teacher looking for footage of a solar eclipse).


A 2x2 repeated block factorial (RBF-22) design was used. Because each participant received all four interface-task treatments (i.e., slide show/gist determination; slide show/object recognition; storyboard/gist determination; storyboard/object recognition), individual differences among participants were addressed by the design. Randomizing the order of the four interface-task treatments limited possible learning and fatigue effects.

Hypotheses/Experimental Questions

Hypothesis 1: There will be statistically significant differences at the .05 level in performance between display type and user task.

The objective of the GD task is to comprehend the basic theme presented in the video represented by the surrogate. Because video is a dynamic medium (i.e., moving images), the temporal dimension and, in particular, the relationships between objects (e.g., cause and effect) conveyed by the perception of motion are a vital part of the narrative structure. As the SS display preserves the temporal dimension, it was predicted that this feature would support narrative comprehension, improving performance on the GD task. In contrast, since the OR task requires detailed examination of the images and is not dependent on temporal flow, it was predicted that the static SB display would better support OR performance. Furthermore, the earlier finding that users consistently rated GD as "easier" than OR after using the SS display at various rates (Ding, Marchionini, and Tse, 1997) supported this prediction.

Hypothesis 2: Subjective satisfaction will be higher for the storyboard (SB) design than for the slide show (SS) interface overall. However, satisfaction with the slide show (SS) interface will be higher for the gist determination (GD) task than for object recognition (OR).

Previous studies (e.g., Komlodi and Marchionini, 1998) have shown that users rated the SS design as less satisfactory than the SB interface. However, it was predicted that users would find the SS design more satisfying for the GD task than for OR because of the "better fit" conceptually, as described for the first hypothesis.

Experimental Design

Independent Variables. Interface display (storyboard or slide show) and task type (gist determination or object recognition).

Dependent Variables. Task accuracy and subjective satisfaction.

Possible Covariates. Spatial visual ability (SVA) and time-to-completion.

Participants. Sixteen females and 18 males participated in at least part of the study. All were students at the University of Maryland–College Park: 24 undergraduates, most of whom were taking introductory psychology, and 11 graduate students from a variety of disciplines. All participated on a voluntary basis, although the psychology students received credit for their participation. The mean age of the participants was 23.8 years, with a range from 18 to 50 years. All but one of the participants had some Web experience and all reported some experience with graphics.

Note: Only 20 transaction logs of the 34 participants were complete (i.e., results were available for all four task x interface treatments). Twelve data files were affected by a programming error; one data file was damaged; and one file had missing data (no answer was recorded for the object recognition task using the storyboard interface design).

Software. The test system was developed in MS Visual Basic 3.0. Participants progressed through the trials by pressing buttons marked "Continue" on the bottom of each screen. The storyboard interface displayed all 12 key frames for a clip on one screen in a 3x4 array. The first four key frames were placed in the top row, ordered from left to right. The next four were in the middle row and the last four in the bottom row. For the slide show interface, images were displayed at a rate of three key frames per second and set to play in a continuous loop. Answers to task-based questions and immediate feedback satisfaction surveys were completed online through selection of predefined answers. The only input device required was a mouse. The software also included a module to randomize the interface-task treatments: different participants would receive each of the four experimental treatments in a random order (to control for learning effects). Text file transaction logs automatically recorded the image set used, the interface-task combination tested, time spent using the video browser in seconds, and answers to the task and satisfaction questions for each of the four experimental trials.
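The randomization module and the storyboard layout described above can be illustrated with a minimal sketch. This is not the original Visual Basic code: the function names and the use of a seeded generator are assumptions made for illustration, and only the four treatments and the 3x4, left-to-right, top-to-bottom layout come from the description.

```python
import random

INTERFACES = ("slide show", "storyboard")
TASKS = ("gist determination", "object recognition")


def treatment_order(rng):
    """Return all four interface-task treatments in a random order.

    Every participant still receives each treatment exactly once; only
    the order varies, which is what limits learning effects.
    """
    treatments = [(interface, task) for interface in INTERFACES for task in TASKS]
    rng.shuffle(treatments)
    return treatments


def storyboard_cell(index, columns=4):
    """Map a key frame's 0-based sequence index to its (row, column) cell."""
    return divmod(index, columns)


# Hypothetical usage: one generator per participant (the seed is an assumption).
order = treatment_order(random.Random(7))
cells = [storyboard_cell(i) for i in range(12)]  # 12 key frames in a 3x4 array
```

With this mapping the fifth key frame (index 4) lands in the first cell of the middle row, matching the layout described for the storyboard interface.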

Video Materials. Video clips were obtained from three Discovery Channel© documentary CD-ROMs: Aquatic Habitats, How the West Was Lost, and Wonders of Weather. Eight 1.5–3.0 minute video clips were selected for this study. Key frames were selected through a combination of methods. Key frames were first selected algorithmically, based on scene changes, using MERIT, a program developed at the Center for Automation Research (CfAR) at the University of Maryland at College Park (Kobla, Doermann, and Rosenfeld, 1996). Then, the 12 key frames per clip used in the study were manually selected from those identified by MERIT. The image files were saved as bitmaps at a resolution of 120x120 (see Figure A.1 in the Appendix for sample key frames).

Experimental Setting and Hardware. Two sessions were arranged in University of Maryland teaching theaters so that multiple users could participate simultaneously. The computers used were IBM-compatible with Intel Pentium microprocessors, 15-in. monitors set at 800x600, and Microsoft Windows 95 operating systems.

Paper-based Forms. VZ-2 by the Educational Testing Service (ETS) is a standard instrument for measuring spatial visualization ability (SVA). The subjective satisfaction questionnaire consisted of four parts and was adapted from the QUIS instrument developed by the Human-Computer Interaction Laboratory (HCIL), University of Maryland at College Park. All of the questions were either short answer, multiple choice, or based on a Likert scale (1–9).

Procedure



1. Participants were briefed on the goals of the study and asked to read and sign consent forms.

2. Both interface designs, SS and SB, were explained and demonstrated.

3. The assessment of spatial visual abilities (SVA), VZ-2, was administered.

4. Participants were given 30 seconds to view each surrogate type.

5. Two complete sample trials were administered to familiarize participants with the experimental conditions.

6. Four experimental trials, one for each treatment combination, were administered.

  1. Each trial began with an onscreen instruction page.
  2. A 20-second preview of the task was presented: the list of 10 concepts for the GD task and the list of 20 objects for the OR task. (See Figure A.2 in the Appendix for a screen shot.)
  3. A selected video surrogate interface was shown. Participants could spend an unlimited amount of time viewing the key frames. Once they were ready to proceed, they were required to click an onscreen button.
  4. Participants were then asked to complete the appropriate task within a 30-second period (see Data Collection/Scoring Protocol section for details).
  5. Immediate subjective satisfaction responses were collected (unlimited amount of time).
8. Participants were debriefed and given an opportunity to ask questions.

9. Participants completed an overall subjective satisfaction questionnaire.

Data Collection/Scoring Protocol. For the Gist Determination (GD) task, participants were asked to select, within a 30-second period, the three phrases from a list of 10 that best described the theme represented. Users previewed the same phrases for 20 seconds prior to the trial. The phrases were based on concepts identified by eight people not otherwise involved with the study. These people were asked to watch the six full-motion video clips, with audio, selected for use in the study and to write down phrases or sentences describing key themes for each video. There was no limit on the number of times each clip could be viewed. Content analysis was used to identify the most common concepts. The top 10 words and phrases aggregated from the responses of all eight viewers were used to create a list for each video. To obtain a scale for scoring, each phrase was assigned a numerical value corresponding to the number of times the word or concept appeared in the viewers' responses. A performance accuracy score for the GD task was calculated as the ratio of the sum of the values for the three phrases selected to the maximum sum that could be obtained. The ratio was later converted into a percentage for statistical analysis. For example, suppose that for a particular GD task and video, a participant selected three concepts from the list of 10 choices with values of 5, 5, and 4. The highest possible score for responses to that video would be 16 (6, 5, and 5). The performance accuracy for that participant for that task would be 14/16 or 87.5%.
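The GD scoring rule can be written out as a short sketch. The function name is an assumption, and the ten phrase values below are hypothetical except for the top three (6, 5, and 5) and the selected values (5, 5, and 4), which come from the worked example above.

```python
def gist_accuracy(selected_values, all_values, picks=3):
    """Percentage score: sum of the chosen phrase values over the best attainable sum."""
    if len(selected_values) != picks:
        raise ValueError("exactly %d phrases must be selected" % picks)
    # The maximum is reached by choosing the `picks` highest-valued phrases.
    best = sum(sorted(all_values, reverse=True)[:picks])
    return 100.0 * sum(selected_values) / best


# Hypothetical value list for one video; only 6, 5, 5 are taken from the text.
phrase_values = [6, 5, 5, 4, 3, 3, 2, 2, 1, 1]
score = gist_accuracy([5, 5, 4], phrase_values)
print(score)  # 87.5
```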

For the Object Recognition (OR) task, a list of 20 items, consisting of 10 target objects and 10 distractor objects, was presented to each participant. The authors selected the target and distractor objects incorporated into the lists. Criteria for object selection included visibility and how well the objects reflected the theme of the video clip. Distractors were chosen to fit the general theme of the video clip, but were not present in any of the key frames. Participants were asked to select the objects they recognized from viewing the surrogate within a 30-second period. The scoring protocol was as follows: one point was given for (1) each correctly identified target object and (2) each distractor object not marked. Thus, a participant who identified all 10 target objects and did not mark any of the 10 distractors received a score of 20 points or 100%. A participant who identified seven target objects and marked four distractors received a score of 13 or 65% (7 points for targets and 6 points for non-marked distractors).
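The OR scoring rule can be sketched in the same way. The function name and the placeholder object labels are assumptions; the point rule and the 7-targets, 4-distractors example come from the protocol above.

```python
def object_recognition_score(marked, targets, distractors):
    """Return (points, percentage) under the one-point-per-correct-decision rule."""
    hits = len(marked & targets)                    # targets correctly identified
    correct_rejections = len(distractors - marked)  # distractors left unmarked
    points = hits + correct_rejections
    return points, 100.0 * points / (len(targets) + len(distractors))


# Placeholder labels standing in for the real object names.
targets = {"target%d" % i for i in range(10)}
distractors = {"distractor%d" % i for i in range(10)}

# Seven targets and four distractors marked, as in the worked example.
marked = {"target%d" % i for i in range(7)} | {"distractor%d" % i for i in range(4)}
print(object_recognition_score(marked, targets, distractors))  # (13, 65.0)
```

Note that under this rule a participant who marks nothing still earns 10 points (50%) for the unmarked distractors, so scores are not anchored at zero.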


Task and Interface Design -- Performance Measures

For task performance, a 2x2 repeated measures ANOVA (n = 20) resulted in no statistically significant main effects or interaction at the 0.05 level (see Figure 2). ANCOVAs (n = 20) were run to control for variability accounted for by time-to-completion and SVA, respectively. However, no statistically significant effects were found.
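For readers unfamiliar with the analysis, the following sketch shows one way the three F tests of a 2x2 fully within-subject design can be computed. It relies on the standard result that each single-degree-of-freedom within-subject effect equals the squared one-sample t statistic on per-participant contrast scores, giving F(1, n - 1). The function names are assumptions, the data rows are invented purely for illustration and are not the study's data, and this is not necessarily how the original analysis was run.

```python
from statistics import mean, variance


def contrast_f(contrast_scores):
    """Return (F, error df) for one within-subject contrast; F = t squared."""
    n = len(contrast_scores)
    # variance() is the sample variance; it must be nonzero for F to be defined.
    f_value = mean(contrast_scores) ** 2 / (variance(contrast_scores) / n)
    return f_value, n - 1


def rm_anova_2x2(rows):
    """rows: one (ss_gd, ss_or, sb_gd, sb_or) accuracy tuple per participant."""
    interface = [(ss_gd + ss_or) - (sb_gd + sb_or) for ss_gd, ss_or, sb_gd, sb_or in rows]
    task = [(ss_gd + sb_gd) - (ss_or + sb_or) for ss_gd, ss_or, sb_gd, sb_or in rows]
    interaction = [(ss_gd - ss_or) - (sb_gd - sb_or) for ss_gd, ss_or, sb_gd, sb_or in rows]
    return {
        "interface": contrast_f(interface),
        "task": contrast_f(task),
        "interaction": contrast_f(interaction),
    }


# Invented accuracy percentages for five imaginary participants.
hypothetical_rows = [
    (80, 70, 75, 85),
    (65, 75, 80, 70),
    (90, 60, 70, 80),
    (75, 80, 85, 65),
    (70, 65, 60, 90),
]
for effect, (f_value, df) in rm_anova_2x2(hypothetical_rows).items():
    print("%s: F(1, %d) = %.2f" % (effect, df, f_value))
```

Each F value would then be compared against the critical value of the F distribution with 1 and n - 1 degrees of freedom at the chosen alpha level.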

Figure 2. Visual browsing performance interaction diagram (n = 20).

Immediate Subjective Satisfaction

For each of the four subjective satisfaction responses elicited immediately after each of the treatments, 2x2 repeated measures ANOVAs were run and the results summarized below. The questions and descriptors are listed in Table 1.

Table 1. Immediate subjective satisfaction questions with descriptors (* indicates statistically significant results).

#  | Question | Descriptors (Likert scale)
1* | Completing the task was… | Easy (1) to Difficult (9)
2  | My familiarity with the topic… | Unfamiliar (1) to Expert (9)
3* | The display technique for the given task was… | Hard to use (1) to Easy to use (9)
4  | The usefulness of the display technique was… | Useless (1) to Useful (9)

For question 1 (n = 18; two participants did not respond), the effect of interface design (slide show vs. storyboard) was statistically significant at p < .05, F(1, 17) = 6.65, p = .019 (see Figure 3). For question 3 (n = 18; two participants did not respond), the effects of both interface design (slide show vs. storyboard), F(1, 17) = 10.95, p = .004, and task type (gist determination vs. object recognition), F(1, 17) = 6.46, p = .021, were statistically significant at p < .05 (see Figure 4). None of the other immediate subjective satisfaction questions yielded statistically significant results.

Figure 3. Immediate subjective satisfaction interaction diagram (n = 18) for question 1, "Completing the task was…" using a Likert scale (y-axis).

Figure 4. Immediate subjective satisfaction interaction diagram (n = 18) for question 3, "The display technique for the given task was…" using a Likert scale (y-axis).


Overall User Satisfaction (Post-Test)

Participants answered six subjective satisfaction questions (Table 2) for each interface design type (storyboard and slide show). Responses to each pair of questions were compared using paired-sample t-tests (n = 34 for the first four and n = 33 for the last two). All six differences were found to be statistically significant at the 0.01 level (see Figure 5).

Table 2. Overall subjective satisfaction questions with descriptors (* indicates statistically significant results).

#  | Question | Descriptors (Likert scale)
1* | Overall reactions to the system | terrible (1) to wonderful (9)
2* | Overall reactions to the system | frustrating (1) to satisfying (9)
3* | Overall reactions to the system | difficult (1) to easy (9)
4* | Overall reactions to the system | rigid (1) to flexible (9)
5* | Learning how to operate the system | difficult (1) to easy (9)
6* | Can the task be performed in a straightforward manner? | never (1) to always (9)


Figure 5. Overall subjective satisfaction bar chart for six questions with standard deviation bars (n = 34 for #1-4; n = 33 for #5, 6).


The goal of this exploratory study was to determine whether two video browsing designs, storyboard (SB) and slide show (SS), affected performance and subjective satisfaction on two information seeking tasks, gist determination (GD) and object recognition (OR). It was hypothesized that performance with the SS interface would be better than SB for the GD task because SS retained the temporal component of the original video, a potentially important factor in understanding gist. It was also hypothesized that the SB interface would boost performance for the OR task over SS because users could rescan each of the stills for target objects. Furthermore, based on previous studies, it was hypothesized that users would derive greater satisfaction from SB over SS overall, although satisfaction with SS would be greater for GD than OR.

Task and Interface Design -- Performance Measures

User performance resulted in no statistically significant main or interaction effects between the interface design-task type variables. Mean performance accuracy for each treatment was in the mid-70% range. ANCOVAs were conducted to test whether time-to-completion or spatial visual ability might be masking the effects as covariates. Because there was no upper limit to the amount of time that could be spent by users in carrying out the assigned task (i.e., in using a video browsing interface), time-to-completion was considered a covariate. For example, it would be expected that participants who spent a greater amount of time viewing the video browser would have a better score. Spatial visual ability (SVA) is a measure of a person's ability to form mental models of images in three-dimensional space and may also influence understanding narrative or "action" created images in the temporal dimension. For example, participants with higher SVA might perform better with an interface or task requiring "mental manipulation of time" than those with lower SVA. However, controlling for time-to-completion and SVA did not explain any additional variability.

One likely reason for the lack of statistically significant differences is the small sample size available for the data analysis. Unfortunately, a bug in the test system, detected after some trials had been conducted, limited the analyzable data to 20 of the 34 participants. Another potential problem was the level of difficulty of the tasks. Ideally the tasks should have produced a wide range of scores to help reveal any true differences between interface designs. However, as implemented, accuracy scores in the mid-70% range seem to indicate that the tasks were too simplistic and not truly representative of the variable to be measured. For example, only eight people were consulted in creating "concept statements" for the GD task. A greater number of people in the "control group" would likely have resulted in more "representative" concept statements.

Immediate Subjective Satisfaction

An advantage of capturing subjective satisfaction immediately after each trial is that the experience is fresh in the participant's mind and more closely reflects initial impressions. Three statistically significant differences were found.

The first, for the question "Completing this task was..." showed that users felt that the slide show (SS) design (overall mean = 6.1) was more difficult than the storyboard (SB) interface (overall mean = 5.0) across both tasks. [Note: the overall scale was 1 (easy) to 9 (difficult), with 5 being the midpoint.] This result is similar to that reported previously by Ding et al. (1997): user satisfaction drops considerably as key frame rate increases in spite of a smaller decrease in user performance. Many users in this study perceived the display rate (3 kfps) to be "too fast".

The other two statistically significant differences were in response to the question "The display technique for the given task was..." For the display designs, SS (overall mean = 6.35) was perceived to be easier to use than SB (overall mean = 4.15). This result contradicts the result found earlier in question 1, where SS was perceived to be more difficult than SB. The most likely explanation is that the results are anomalous, due to the way the question was structured: the lower numbers in the Likert scale corresponded to "hard to use." In question 1, which respondents most likely answered first, the scale was in the opposite direction -- "difficult" corresponded to the higher numbers. Thus, users were probably influenced by question 1 and answered question 3 intending for the higher numbers to indicate greater difficulty. This explanation is consistent with the results from the overall user satisfaction (post-test) questionnaire, where SS was rated "more difficult" than SB in all six of the questions.

Overall User Satisfaction Analysis (Post-Test)

A questionnaire with general demographics information and subjective satisfaction with the different interface types was given at the end of the study to capture participants' overall reactions after experiencing all four treatment conditions. These subjective satisfaction results differ from those mentioned previously in that they reflect user satisfaction after both types of surrogates have been used for both types of tasks, rather than after any single surrogate and task. For each of the six questions, participants consistently found the SB interface statistically significantly "better" (e.g., wonderful, satisfying, easy, flexible, easy-to-learn, and straightforward) than the SS design. Comments that were elicited support these results:

One participant did point out differences in the usefulness of the designs for the different tasks: "Knowing that I had to answer specific questions made the storyboard option more appealing; whereas, simply just browsing around I would prefer the slideshow interface."


In this exploratory study, the storyboard (SB) display was consistently perceived by participants to be more useful and less confusing than the slide show (SS) interface, in spite of the lack of statistically significant differences in task performance. Users found the rapid flipping of images to be distracting and disorienting, despite similar accuracy scores as with the static display. Thus, subjective satisfaction was not only dependent on successful task performance, but also on "comfort" with a particular surrogate type.

One factor accounting for the strong subjective reaction may be that users perceived only glimpses of images in the dynamic display. At a display rate of 3 kfps, each image was on the screen for a third of a second (about 333 ms). Since recognizing an object under controlled conditions requires at least 100 ms on average (Potter, 1976), no more than three objects could be recognized in a single key frame before it was replaced with a new one. Moreover, 100 ms is only enough time to store information in preattentive memory, just under the threshold of consciousness. As soon as several objects were perceived, viewers would need to reorient themselves to a new set of objects in a new key frame. Thus, even though individual objects could be perceived and recalled, the attentive workload required for constant reorientation was likely to be large and unsatisfying.
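The arithmetic behind this bound can be written out as a minimal sketch. The 100 ms recognition time is the figure cited from Potter (1976); the function names and the comparison rates of 1 and 8 key frames per second are hypothetical and were not tested in this study.

```python
import math

def frame_duration_ms(rate_per_second: float) -> float:
    """Time each key frame stays on screen, in milliseconds."""
    return 1000.0 / rate_per_second

def max_objects_per_frame(rate_per_second: float,
                          recognition_ms: float = 100.0) -> int:
    """Upper bound on objects recognized one after another in one key frame."""
    return math.floor(frame_duration_ms(rate_per_second) / recognition_ms)

# 3 is the rate used in the study; 1 and 8 are hypothetical comparison rates.
for rate in (1, 3, 8):
    print(f"{rate} key frames/s: "
          f"{frame_duration_ms(rate):.0f} ms per frame, "
          f"at most {max_objects_per_frame(rate)} objects")
```

The bound is optimistic, since it assumes that recognition proceeds back to back with no time lost to reorientation between key frames.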

Another factor explaining the significant user dissatisfaction with the SS interface could be the lack of user control over key frame rate and/or direction of play. Unlike the SB design, which was static and permitted users to view and review the images under their direction at their own pace, the SS interface "blasted" images at a predefined rate (3 kfps) and direction of play (forward). In addition, participants had to wait for an image to loop around in order to view a particular key frame again. This design clearly violated one of the primary rules of good interface design, "Support internal locus of control" (Shneiderman, 1998, p. 75).

Thus, although there were no deleterious performance effects in using the SS interface for either of the tasks, such dynamic interfaces would not likely be a good surrogate design for the video browsing tasks tested in this exploratory study due to the user satisfaction results. Further studies are needed to learn how other types of video browsing surrogates affect different user information seeking tasks and how to optimize interface design to satisfy user needs.

Methodological Improvements

Due to technical problems, only 20 participants completed the tasks. Increasing sample size (e.g., n = 60) would provide greater power and be more representative of the population tested. In addition, user tasks need to be improved so that even small effects (i.e., greater task accuracy) could be detected. Finally, although providing user control over the interface (e.g., VCR-like buttons for frame rate speed and direction of play) would increase the complexity of the study, understanding the "efficiency-satisfaction" trade-off would inform future surrogate designs.
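How sample size relates to power can be illustrated with a rough sketch. It uses a normal approximation for a two-sided test on paired differences, which simplifies the repeated block factorial design used here, and the standardized effect size of 0.3 is a hypothetical value, not an estimate from this study.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n: int, effect_size: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided paired test (normal approximation).

    effect_size is the mean paired difference divided by its standard
    deviation. The small contribution of the opposite tail is ignored.
    """
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(effect_size * sqrt(n) - critical)

# Hypothetical effect size; n = 20 and n = 60 are the sample sizes discussed.
for n in (20, 60):
    print(f"n = {n}: approximate power = {approx_power(n, 0.3):.2f}")
```

Under these assumptions, tripling the sample raises the chance of detecting a small effect considerably, which is the motivation for the larger sample suggested above.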

Future Research Questions


The authors would like to thank Laura Slaughter and Dr. Kent Norman for their helpful comments. We also thank Ellen Yu Borkowski for help in arranging the use of the teaching theaters, Discovery Channel, Inc. for the use of their video material, and the study participants for their time and effort.


Ding, W., Marchionini, G., & Tse, T. (1997). Previewing video data: Browsing key frames at high rates using a video slide show interface. In Proceedings of the International Symposium on Research, Development, and Practice in Digital Libraries (ISDL '97) (pp. 151–158). Tsukuba, Japan: University of Library and Information Science.

Elliot, E. (1993). Watch, grab, arrange, see: Thinking with motion images via streams and collages. MSVS thesis, Media Lab, Massachusetts Institute of Technology, Cambridge, MA.

Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., & Yanker, P. (1995). Query by image and video content: The QBIC system. IEEE Computer, 28(9), 23–32.

Helander, M. (Ed.). (1988). Handbook of human-computer interaction. Amsterdam, Netherlands: North-Holland.

Kobla, V., Doermann, D., & Rosenfeld, A. (1996). Compressed domain video segmentation. (Tech. Rep. CAR-TR-839; CS-TR-3688). College Park: University of Maryland, Center for Automation Research.

Komlodi, A., & Marchionini, G. (1998). Key frame preview techniques for video browsing. In Proceedings of the ACM Digital Libraries (DL '98) (pp. 118–125). New York: ACM Press.

Marchionini, G. (1995). Information seeking in electronic environments. Cambridge, UK: Cambridge University Press.

O'Connor, B.C. (1991). Selecting key frames of moving image documents: A digital environment for analysis and navigation. Microcomputers for Information Management, 8(2), 119–133.

Potter, M.C. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2(5), 509–522.

Shneiderman, B. (1998). Designing the user interface: Strategies for effective human-computer interaction. (3rd ed.) Reading, MA: Addison Wesley Longman.

Slaughter, L., Shneiderman, B., & Marchionini, G. (1997). Comprehension and object recognition capabilities for presentations of simultaneous video key frame surrogates. In C. Peters & C. Thanos (Eds.), Research and advanced technology for digital libraries: Proceedings of the first European conference (ECDL '97) (pp. 41–54). Berlin: Springer-Verlag.

Teodosio, L., & Bender, W. (1993). Salient stills from video. In Proceedings of the ACM Multimedia (MM '93) (pp. 39–46). New York: ACM Press.

Tse, T., Marchionini, G., Ding, W., Slaughter, L., & Komlodi, A. (1998). Dynamic key frame presentation techniques for augmenting video browsing. In Advanced Visual Interfaces (AVI '98) Conference (pp. 185–194). New York: ACM Press.

Wactlar, H.D., Kanade, T., Smith, M.A., & Stevens, S.M. (1996). Intelligent access to digital video: Informedia project. IEEE Computer, 29(5), 46–52.

Yeo, B.L., & Yeung, M.M. (1997). Retrieving and visualizing video. Communications of the ACM, 40(12), 43–52.

Yeung, M.M., Yeo, B.L., Wolf, W., & Liu, B. (1995). Video browsing using clustering and scene transition on compressed sequences. In A.A. Rodriguez & J. Maitan (Eds.), Proceedings of the SPIE multimedia computing and networking, vol. 2417 (pp. 399–413). Bellingham, WA: SPIE Press.

Zhang, H.J., Low, C.Y., & Smoliar, S.W. (1995). Video parsing and browsing using compressed data. Multimedia Tools and Applications, 1, 89–111.


Figure A1. Sample key frames from a video segment.

Figure A2. Screen shot of a task preview screen from test system.
