Automatic Thumbnail Cropping and its Effectiveness


Bongwon Suh*, Haibin Ling, Benjamin B. Bederson*, David W. Jacobs

Department of Computer Science

*Human-Computer Interaction Laboratory
University of Maryland
College Park, MD 20742 USA

+1 301-405-2764
{sbw, hbling, bederson, djacobs}



Thumbnail images provide users of image retrieval and browsing systems with a method for quickly scanning large numbers of images. Recognizing the objects in an image is important in many retrieval tasks, but thumbnails generated by shrinking the original image often render objects illegible. We study the ability of computer vision systems to detect key components of images so that intelligent cropping, prior to shrinking, can render objects more recognizable. We evaluate automatic cropping techniques 1) based on a method that detects salient portions of general images, and 2) based on automatic face detection. Our user study shows that these methods result in small thumbnails that are substantially more recognizable and easier to find in the context of visual search.



Saliency map, thumbnail, image cropping, face detection, usability study, visual search, zoomable user interfaces



Thumbnail images are now a widely used technique for visualizing large numbers of images given limited screen real estate. The QBIC system developed by Flickner et al. [10] is a notable image database example. A zoomable image browser, PhotoMesa [3], lays out thumbnails in a zoomable space and lets users move through the space of images with a simple set of navigation functions. PhotoFinder applied thumbnails as a visualization method for personal photo collections [14]. Popular commercial products such as Adobe Photoshop Album [1] and ACDSee [2] also use thumbnails to represent image files in their interfaces.

Current systems generate thumbnails by shrinking the original image. This method is simple. However, thumbnails generated this way can be difficult to recognize, especially when the thumbnails are very small. This phenomenon is not unexpected, since shrinking an image causes detailed information to be lost. An intuitive solution is to keep the more informative part of the image and cut less informative regions before shrinking. Some commercial products allow users to manually crop and shrink images [19]. Burton et al. [4] proposed and compared several image simplification methods to enhance the full-size images before subsampling. They chose edge-detecting smoothing, lossy image compression, and self-organizing feature maps as three different techniques in their work.

In quite a different context, DeCarlo and Santella [8] tracked a user's eye movements to determine interesting portions of images, and generated non-photorealistic, painterly images that enhanced the most salient parts of the image. Chen et al. [5] use a visual attention model as a cue to conduct image adaptation for small displays.

In this paper, we study the effectiveness of saliency based cropping methods for preserving the recognizability of important objects in thumbnails. Our first method is a general cropping method based on the saliency map of Itti and Koch, which is built on a model of human visual attention [12][13]. A saliency map of a given image describes the importance of each position in the image. In our method, we use the saliency map directly as an indication of how much information each position in the image contains. The merit of this method is that the saliency map is built up from low-level features only, so it can be applied to general images. We may then select the portion of the image of maximal informativeness.

Although this saliency based method is useful, it does not consider semantic information in images. We show that semantic information can be used to further improve thumbnail cropping, using automatic face detection. We choose this domain because a great many pictures of interest show human faces, and also because face detection methods have begun to achieve high accuracy and efficiency [20].

In this paper we first discuss related work from the field of visual attention, and then describe the saliency based cropping algorithm and the face detection based cropping we developed. We then explain the design of a user study that evaluates the thumbnail methods. The paper concludes with a discussion of our findings and future work.



Visual attention is the ability of biological visual systems to detect interesting parts of the visual input [12][13][16][17]. The saliency map is a matrix corresponding to the input image that describes the degree of saliency of each position in the input image.

Itti and Koch [12][13] provided an approach to compute a saliency map for images. Their method first uses pyramid technology to compute three feature maps for three low level features: color, intensity, and orientation. For each feature, saliency is detected when a portion of an image differs in that feature from neighboring regions.  Then these feature maps are combined together to form a single saliency map. After this, in a series of iterations, salient pixels suppress the saliency of their neighbors, to concentrate saliency in a few key points.
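
As a rough illustration of the center-surround idea only, the following Python sketch scores each pixel of a single intensity channel by how much it differs from the mean of its neighbors. It is not the Itti and Koch model: there are no pyramids, no color or orientation channels, and no iterative suppression. The function name `center_surround_saliency` and the `radius` parameter are assumptions made for this example.

```python
def center_surround_saliency(intensity, radius=1):
    """Toy saliency: |pixel - mean of its neighbors|, scaled to [0, 1]."""
    h, w = len(intensity), len(intensity[0])
    sal = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Surround = neighbors within `radius`, clipped at the border,
            # excluding the center pixel itself.
            neigh = [
                intensity[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
                if (j, i) != (y, x)
            ]
            if neigh:  # a 1x1 image has no surround
                sal[y][x] = abs(intensity[y][x] - sum(neigh) / len(neigh))
    peak = max(max(row) for row in sal)
    if peak > 0:
        sal = [[v / peak for v in row] for row in sal]
    return sal


# Hypothetical 5x5 image: a single bright pixel on a dark background.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 200
saliency = center_surround_saliency(image)
```

In this toy input the bright pixel receives the highest score, and pixels whose neighborhoods are uniform receive zero.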

Chen et al. [5] proposed using semantic models together with the saliency model of Itti and Koch to identify important portions of an image prior to cropping. Their method is based on an attention model that uses attention objects as the basic elements. The overall attention value of each attention object is calculated by combining attention values from different models. For semantic attention models they use a face detection technique [15] and a text detection technique [6] to compute two different attention values. The method provides a way to combine semantic information with low-level features. However, when combining the different values, their method uses heuristic weights that differ across five predefined image types, so images need to be manually categorized into these five categories before the method is applied. Furthermore, it relies heavily on the semantic extraction techniques. When the corresponding semantic technique is not available or fails to provide a good result (e.g., no face is found in the image), it is hard to expect a good result from the method.



Problem Definition

We define the thumbnail cropping problem as follows: given an image I, the goal of thumbnail cropping is to find a rectangle RC, containing a subset of the image IC, so that the main objects in the image are visible in the subimage. We then shrink IC to a thumbnail. In the rest of this paper, we use the word "cropping" to indicate thumbnail cropping.

In the next subsection, we propose a general cropping method, which is based on the saliency map and can be applied to general images. Next, a face detection based cropping method is introduced for images with faces.

A General Cropping Method Based on the  Saliency Map

In this method, we use the saliency value to evaluate the degree of informativeness of different positions in the image I. The cropping rectangle RC should satisfy two conditions: having a small size and containing most of the salient parts of the image. These two conditions generally conflict with each other. Our goal is to find the optimal rectangle to balance these two conditions.

An example saliency map is given in Figure 1:


Figure 1: left: original image, right: saliency map of the image shown left

Find Cropping Rectangle with Fixed Threshold using Brute Force Algorithm

We use Itti and Koch's saliency algorithm because their method is based on low-level features and hence independent of semantic information in images.

Once the saliency map SI is ready, our goal is to find the crop rectangle RC, which is expected to contain the most informative part of the image. Since the saliency map is used as the criterion of importance, RC should contain most of the saliency value in SI. Based on this idea, we can find RC as the smallest rectangle containing a fixed fraction of saliency. To illustrate this formally, we define the candidate set for RC and the fraction threshold as

Then RC is given by

RC denotes the minimum rectangle that satisfies the threshold defined above. A brute force algorithm was developed to compute RC.
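
One formalization consistent with the description above can be written as follows. The symbol \lambda for the fraction threshold, the notation \mathcal{R}(\lambda) for the candidate set, and the use of "greater than or equal" are assumptions made here for illustration.

```latex
\mathcal{R}(\lambda) = \left\{ r \;:\;
  \frac{\sum_{(x,y) \in r} S_I(x,y)}{\sum_{(x,y)} S_I(x,y)} \ge \lambda \right\},
\qquad
R_C = \arg\min_{r \in \mathcal{R}(\lambda)} \operatorname{area}(r)
```

Here r ranges over axis-aligned rectangles inside the image, the numerator sums the saliency inside r, and the denominator sums the saliency over the whole map.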


Find Cropping Rectangle with Fixed Threshold using Greedy Algorithm

The brute force method works, but it is not time efficient. Two main factors slow down the computation. First, the algorithm to compute the saliency map involves several series of iterations. Some of the iterations involve convolutions using very large filter templates (on the order of the size of the saliency map). These convolutions make the computation very time consuming.

Second, the brute force algorithm searches all sub-rectangles exhaustively. While techniques exist to speed up this exhaustive search, it still takes a lot of time.
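
One common speed-up for such an exhaustive search is a summed-area table, which gives the saliency sum of any rectangle in constant time. The Python sketch below is a minimal version under that assumption. The function names and the inclusive (x0, y0, x1, y1) rectangle convention are illustrative choices, not the implementation used in this study.

```python
def integral_image(sal):
    """Summed-area table with an extra zero row and column."""
    h, w = len(sal), len(sal[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = sal[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii


def brute_force_crop(sal, fraction):
    """Smallest-area rectangle holding at least `fraction` of total saliency."""
    h, w = len(sal), len(sal[0])
    ii = integral_image(sal)
    target = fraction * ii[h][w]
    best, best_area = (0, 0, w - 1, h - 1), w * h  # fall back to the full image
    for y0 in range(h):
        for y1 in range(y0, h):
            for x0 in range(w):
                for x1 in range(x0, w):
                    area = (x1 - x0 + 1) * (y1 - y0 + 1)
                    if area >= best_area:
                        break  # widening further only increases the area
                    s = (ii[y1 + 1][x1 + 1] - ii[y0][x1 + 1]
                         - ii[y1 + 1][x0] + ii[y0][x0])
                    if s >= target:
                        best, best_area = (x0, y0, x1, y1), area
                        break  # narrowest qualifying width for this x0
    return best
```

Even with constant-time sums, the search still visits on the order of w^2 * h^2 rectangles, which is why a cheaper strategy is attractive for large saliency maps.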

We found that we can achieve basically the same results much more efficiently by: 1) using fewer iterations and smaller filter templates during the saliency map calculation; 2) squaring the saliency to enhance it; 3) using a greedy search instead of brute force method by only considering rectangles that include the peaks of the saliency.

Figure 2 shows the algorithm GREEDY_CROPPING, which finds the cropping rectangle for a fixed saliency threshold. The greedy algorithm calculates RC by incrementally including the next most salient peak point P. When including a salient point P in RC, we union RC with a small rectangle centered at P, because if P is within the foreground object, a small region surrounding P is expected to contain the object as well. We initialize RC to contain the center of the input saliency map, on the assumption that the center always falls in RC. This is reasonable: even when the most salient part does not contain the center (which rarely happens), including the center does little harm to our purpose of thumbnail generation.


thresholdSum ← threshold * Total saliency value in S
RC ← the center of S
currentSaliencySum ← saliency value of RC
WHILE currentSaliencySum < thresholdSum DO
    P ← Maximum saliency point outside RC
    R' ← Small rectangle centered at P
    RC ← UNION(RC, R')
    UPDATE currentSaliencySum with new region RC



Figure 2: Algorithm to find the cropping rectangle with a fixed saliency threshold. S is the input saliency map and threshold is the fraction of the total saliency that the rectangle must contain.
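
The greedy procedure in Figure 2 can be sketched in Python roughly as follows. This is a simplified illustration, not the code used in the study: the saliency map is a plain list of rows, rectangles are inclusive (x0, y0, x1, y1) tuples, the saliency is not squared, and the `half` parameter (half-width of the small rectangle around each peak) is an assumed name and value.

```python
def greedy_crop(sal, fraction, half=1):
    """Grow a rectangle from the map center until it holds at least
    `fraction` of the total saliency."""
    h, w = len(sal), len(sal[0])
    target = fraction * sum(sum(row) for row in sal)

    def rect_sum(r):
        x0, y0, x1, y1 = r
        return sum(sal[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1))

    cx, cy = w // 2, h // 2
    rc = (cx, cy, cx, cy)  # start from the center pixel
    while rect_sum(rc) < target:
        x0, y0, x1, y1 = rc
        # Candidate peaks: every point outside the current rectangle.
        outside = [
            (sal[y][x], x, y)
            for y in range(h) for x in range(w)
            if not (x0 <= x <= x1 and y0 <= y <= y1)
        ]
        if not outside:
            break  # the rectangle already covers the whole map
        _, px, py = max(outside)  # most salient remaining point
        # Union with a small rectangle centered at the peak, clipped to the map.
        rc = (
            min(x0, max(0, px - half)),
            min(y0, max(0, py - half)),
            max(x1, min(w - 1, px + half)),
            max(y1, min(h - 1, py + half)),
        )
    return rc
```

Each pass adds at least one new point to the rectangle, so the loop terminates once the target is reached or the rectangle covers the whole map. A practical version would keep a running saliency sum instead of recomputing it on every pass.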


Find Cropping Rectangle with Dynamic Threshold

Experience shows that the most effective threshold varies from image to image. We have therefore developed a method for adaptively determining the threshold.

Intuitively, we want to choose a threshold at a point of diminishing returns, where adding small amounts of additional saliency requires a large increase in the rectangle. We use an area-threshold graph to visualize this. The X axis indicates the threshold (fraction of saliency) while the Y axis shows the normalized area of the cropping rectangle produced by the greedy algorithm described above. The normalized area has a value between 0 and 1. The solid curve in Figure 3 gives an example of an area-threshold graph.

A natural solution is to use the threshold with maximum gradient in the area-threshold graph. We approximate this using a binary search method to find the threshold in three steps: First, we calculate the area-threshold graph for the given image. Second, we use a binary search method to find the threshold, where the graph goes up quickly. Third, the threshold is tuned back to the position where a local maximum gradient exists. The dotted lines in Figure 3 demonstrate the process of finding the threshold for the image given in Figure 1.

Figure 3: The solid line represents the area-threshold graph. The dotted lines show the process of searching for the best threshold. The numbers indicate the sequence of searching
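
To make the maximum-gradient idea concrete, the sketch below scans a sampled area-threshold graph and returns the threshold at which the steepest rise begins. It is only an illustration: it uses a linear scan rather than the binary search described above, the function name `pick_threshold` is an assumption, and the sample values are hypothetical rather than measured.

```python
def pick_threshold(thresholds, areas):
    """Return the sampled threshold at the start of the steepest rise
    of the area-threshold graph."""
    best_i, best_slope = 0, float("-inf")
    for i in range(len(thresholds) - 1):
        slope = (areas[i + 1] - areas[i]) / (thresholds[i + 1] - thresholds[i])
        if slope > best_slope:
            best_i, best_slope = i, slope
    return thresholds[best_i]


# Hypothetical samples: normalized crop area for each saliency fraction.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
areas = [0.10, 0.12, 0.15, 0.45, 0.60]
chosen = pick_threshold(thresholds, areas)
```

With these sample values the area jumps sharply between 0.7 and 0.8, so the scan selects 0.7 as the last threshold before the rectangle has to grow substantially.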


Examples of Saliency Map Based Cropping

After getting RC, we can directly crop from the input image I. Thumbnails of the image given in Figure 1 are shown in Figure 4. It is clear from Figure 4 that the cropped thumbnail can be more easily recognized than the thumbnail without cropping.


Figure 4: (left): the image cropped based on the saliency map; (middle): the cropping rectangle, which contains most of the salient parts; (right top): a thumbnail subsampled from the original image; (right bottom): a thumbnail subsampled from the cropped image (left part of this figure).

Figure 5 shows the result of an image whose salient parts are more scattered. Photos focusing primarily on the subject and without much background information often have this property. A merit of our algorithm is that it is not sensitive to this.


Figure 5: (left top): the original image (courtesy of Corbis [7]); (right top): the saliency map; (left bottom): the cropped image; (right bottom): the cropped saliency map, which contains most of the salient parts.

Face Detection Based Cropping

In the above section, we proposed a general method for thumbnail cropping. The method relies only on low-level features. However, if our goal is to make the objects of interest in an image more recognizable, we can clearly do this more effectively when we are able to automatically detect the position of these objects.

Images of people are essential in many research and application areas, such as vision-based human computer interaction. At the same time, face processing is a rapidly expanding area that has attracted a lot of research effort in recent years. Face detection is one of the most important problems in the area. Numerous methods have been proposed for face detection; see [20] for a survey.

For human image thumbnails, we claim that recognizability will increase if we crop the image to contain only the face region. Based on this claim, we designed a thumbnail cropping approach based on face detection. First, we identify faces by applying CMU's on-line face detection [9][18] to the given images. Then, the cropping rectangle RC is computed as the rectangle containing all the detected faces. Finally, the thumbnail is generated from the image cropped from the original image by RC.
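
The step that turns detected faces into a cropping rectangle can be sketched as a bounding-box union. The example assumes the detector reports each face as an (x0, y0, x1, y1) box, which is a convention chosen here for illustration. It does not call the CMU detector, and the function name `crop_rect_from_faces` and the sample boxes are hypothetical.

```python
def crop_rect_from_faces(faces):
    """Smallest rectangle (x0, y0, x1, y1) containing every face box.
    Returns None when no face was detected."""
    if not faces:
        return None
    return (
        min(f[0] for f in faces),
        min(f[1] for f in faces),
        max(f[2] for f in faces),
        max(f[3] for f in faces),
    )


# Two hypothetical detections in one photo.
detections = [(10, 20, 50, 60), (40, 10, 90, 55)]
rc = crop_rect_from_faces(detections)
```

Because the union covers every reported box, an erroneous detection far from the real faces enlarges the crop, which is one way detector errors show up in the resulting thumbnail.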

Figure 6: (left): the original image; (middle): the face detection result from CMU's online face detection [9]; (right): the cropped image based on the face detection result.

Figure 6 shows an example image, its face detection result, and the cropped image. Figure 7 shows the three thumbnails generated via three different methods. In this example, we can see that the face detection based cropping method is a very effective way to create thumbnails, while saliency based cropping produces little improvement because the original image has few non-salient regions to cut.


Figure 7: Thumbnails generated by the three different methods. (left): without cropping; (middle): saliency based cropping; (right): face detection based cropping.


We ran a controlled empirical study to examine the effect of different thumbnail generation methods on the ability of users to recognize objects in images. The experiment was divided into two parts. First, we measured how recognition rates change depending on thumbnail size and thumbnail generation technique. Participants were asked to recognize objects in small thumbnails (Recognition Task). Second, we measured how the thumbnail generation technique affects visual search performance (Visual Search Task). Participants were asked to find images that match given descriptions. The total duration of each experiment was about 45 minutes.

Design of Study

The recognition tasks were designed to measure the successful recognition rate of thumbnail images under three conditions: image set, thumbnail technique, and thumbnail size. We measured correctness as the dependent variable.

The visual search task conditions were designed to measure the effectiveness of image search with thumbnails generated with different techniques. The experiment employed a 3x3 within-subjects factorial design, with image set and thumbnail technique as independent variables. We measured search time as the dependent variable. However, since face detection based cropping is not applicable to the Animal Set and the Corbis Set, we omitted the visual search tasks with those conditions, as shown in Figure 8.

Thumbnail Technique           | Animal Set | Corbis Set | Face Set
Plain shrunken thumbnail      | ✓          | ✓          | ✓
Saliency based cropping       | ✓          | ✓          | ✓
Face detection based cropping |            |            | ✓

Figure 8: Visual search task design. Checkmarks (✓) show which image sets were tested with which image cropping techniques.


There were 20 participants in this study. Participants were college or graduate students at the University of Maryland at College Park, recruited on campus. All participants were familiar with computers. Before the tasks began, all participants were asked to pick ten familiar persons out of fifteen candidates. Two participants had difficulty choosing them. Since participants must recognize the people whose images are used for identification, the results from those two participants were excluded from the analysis.

Image Sets

We used three image sets for the experiment. We also used filler images as distracters to minimize the duplicate exposure of images in the visual search tasks. There were 500 filler images, and images were randomly chosen from this set as needed. These images were carefully chosen so that none of them were similar to images in the three test image sets.

Animal Set (AS)

The "Animal Set" includes images of ten different animals, with five images per animal. All images were gathered from various sources on the Web. We chose animals as target images to test recognition and visual search performance with familiar objects. The basic criteria for choosing animals were 1) that the animals should be very familiar so that participants can recognize them without prior learning; and 2) that they should be easily distinguishable from each other. For example, donkeys and horses are too similar to each other. To prevent confusion, we only used horses.

Corbis Set (CS)

Corbis is a well known source for digital images and provides various types of tailored digital photos [7]. Its images are professionally taken and manually cropped. The goal of this set is to represent images already in the best possible shape. We randomly selected 100 images out of 10,000 images. We used only 10 images as search targets for the visual search tasks to reduce experimental errors. During the experiment, however, we found that one task was problematic: there were very similar images in the fillers, and participants sometimes picked unintended images as an answer, which we could not count as wrong. Therefore we discarded the results from that task. A total of five observations were discarded due to this condition.

Face Set (FS)

This set includes images of fifteen well known people who are either politicians or entertainers. Five images per person were used for this experiment. All images were gathered from the Web. We used this set to test the effectiveness of the face detection based cropping technique and to see how the participants' recognition rate varies with different types of images.

Some images in this set contained more than one face. In this case, we cropped the image so that the resulting image contains all the faces in the original image. Out of 75 images, multiple faces were detected in 25 images. We found that 13 of them contained erroneous detections. All erroneously detected faces were also included in the cropped thumbnail sets, since we intended to test our cropping method with available face detection techniques, which are not perfect.

Thumbnail Techniques

Plain shrinking without cropping

The images were scaled down to smaller dimensions. We prepared ten levels of thumbnails from 32 to 68 pixels in the larger dimension. The thumbnail size was increased by four pixels per level. For the Face Set images, however, we increased the number of levels to twelve because we found that some faces are not identifiable even in a 68 pixel thumbnail.
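
The size levels follow directly from the four pixel step, as the short Python sketch below shows. The helper names `thumbnail_levels` and `shrink_to` are assumptions for this example, and the rounding rule in `shrink_to` is one reasonable choice rather than the one used in the study.

```python
def thumbnail_levels(smallest, largest, step=4):
    """Thumbnail sizes (larger dimension, in pixels) from smallest to largest."""
    return list(range(smallest, largest + 1, step))


def shrink_to(width, height, larger_dim):
    """Scale (width, height) so the larger side equals larger_dim,
    keeping the aspect ratio and never going below one pixel."""
    scale = larger_dim / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


levels = thumbnail_levels(32, 68)      # 32, 36, ..., 68: ten levels
small = shrink_to(640, 480, levels[0])  # dimensions of the smallest thumbnail
```

Stepping from 32 to 68 in increments of four gives exactly ten sizes, matching the ten levels described above.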

Saliency based cropping

Using the saliency based cropping algorithm described above, we cropped out the background of the images. Then we shrank the cropped images to ten sizes of thumbnails. Figure 9 shows how much area was cropped out for each technique.

Cropping Technique and Image Set
Saliency based cropping: Corbis Set
Saliency based cropping: Animal Set
Saliency based cropping: Face Set
Face detection based cropping: Face Set

Figure 9: Ratio of cropped to original image sizes.


Face detection based cropping

Faces were detected by CMU's algorithm as described above. If multiple faces were detected, we chose the bounding region that contains all detected faces. Then twelve levels of thumbnails from 36 to 80 pixels were prepared for the experiment.


Recognition Task

We used the "Animal Set" and the "Face Set" images to measure how accurately participants could recognize objects in small thumbnails. First, users were asked to identify animals in thumbnails. The thumbnails in this task were chosen randomly from all levels of the Animal Set images. This task was repeated 50 times.



Figure 10: Recognition task interfaces. Participants were asked to click what they saw or the "I'm not sure" button. Left: Face Set recognition interface. Right: Animal Set recognition interface.

When the user clicked the "Next" button, a thumbnail was shown as in Figure 10 for two seconds. Since we intended to measure the pure recognizability of thumbnails, we limited the time thumbnails were shown. In our pilot study, users tended to guess answers when they could look at a thumbnail for a long time, even though they could not clearly identify the objects in it. To discourage participants from guessing, the interface was designed to make thumbnails disappear after two seconds. For the same reason, we introduced more animals in the answer list. Although we used only ten animals in this experiment, we listed 30 animals as possible answers, as seen in Figure 10, to limit the participants' ability to guess identity based on crude cues. In this way, participants were prevented from choosing similarly shaped animals by guessing. For example, participants who think they saw a bird-like animal would select swan if it were the only bird in the list. By having multiple birds in the candidate list, we could prevent this behavior.

After the Animal Set recognition task, users were asked to identify a person in the same way. This Face Set recognition task was repeated 75 times. In this session, the candidates were shown as portraits in addition to names as seen in Figure 10.


Visual Search Task

For each testing condition in Figure 8, participants were given two tasks. Thus, for each visual search session, fourteen search tasks were assigned per participant. The order of tasks was randomized to reduce learning effects.

As shown in Figure 11, participants were asked to find one image among 100 images. For the visual search task, it was important to provide equal search conditions for each task and participant. To ensure fairness, we designed the search conditions carefully. We suppressed duplicate occurrences of images and controlled the locations of the target images.

For the Animal Set search tasks, we randomly chose one target image out of 50 Animal Set images. Then we carefully selected 25 non-similar looking animal images. After that we mixed them with 49 more images randomly chosen from the filler set as distracters. For the Face Set and Corbis Set tasks, we prepared the task image sets in the same way.

The tasks were given as verbal descriptions for the Animal Set and Corbis Set tasks. For the Face Set tasks, a portrait of the target person was given as well as the person's name. The portraits were chosen from an independent collection so that they did not duplicate images used for the tasks.

Figure 11: Visual search task interface. Participants were asked to find an image that matches a given task description. Users can zoom in, zoom out, and pan freely until they find the right image.

We used a custom-made image browser based on PhotoMesa [3] as our visual search interface. PhotoMesa provides a zooming environment for image navigation with a simple set of control functions. Users can click the left mouse button to zoom into a group of images (as indicated by a red rectangle) to see the images in detail, and click the right mouse button to zoom out for an overview of more images. Panning is supported either by mouse dragging or arrow keys. PhotoMesa can display a large number of thumbnails in groups on the screen at the same time. Since this user study was intended to test pure visual search, all images were presented in a single cluster as in Figure 11.

Participants were allowed to zoom in, zoom out and pan freely for navigation. When users identified the target image, they were asked to zoom into the full scale of the image and click the "Found it" button located in the upper left corner of the interface to finish the task. Before the visual search session, they were given as much time as they wanted to become comfortable with the zoomable interface. Most participants found it very easy to navigate and reported no problems with navigation during the session.



Figure 12 shows the results from the recognition tasks. The horizontal axis represents the size of thumbnails and the vertical axis denotes the recognition accuracy. Each data point in the graph denotes the successful recognition rate of the thumbnails at that level. As shown, the bigger the thumbnails are, the more accurately participants recognize objects in the thumbnails. This fits well with our intuition. The interesting point here is that the automatic cropping techniques perform significantly better than the original thumbnails.

Figure 12: Recognition Task Results. Dashed lines are interpolated from jagged data points.

There were clear correlations in the results. Participants recognized objects in bigger thumbnails more accurately regardless of the thumbnail technique. Therefore, we used a paired t-test (two-tailed) to analyze the results. The results are shown in Figure 13.
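
For readers unfamiliar with the test, the sketch below computes the paired t statistic and its degrees of freedom using only the Python standard library. It assumes that recognition rates are paired by thumbnail size level, the function name `paired_t` is an assumption, and the sample rates are hypothetical values chosen for illustration, not data from this study. Obtaining the two-tailed p value additionally requires the t distribution, which is left to a statistics package.

```python
import math


def paired_t(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    d = [x - y for x, y in zip(a, b)]  # per-pair differences
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of d
    if var == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1


# Hypothetical recognition rates at five thumbnail sizes (not study data).
cropped = [0.40, 0.55, 0.70, 0.80, 0.90]
plain = [0.30, 0.40, 0.60, 0.75, 0.80]
t, df = paired_t(cropped, plain)
```

Pairing by size level removes the strong effect of thumbnail size from the comparison, so the test looks only at the per-level difference between two techniques.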

The first graph shows the results from the "Animal Set" with two different thumbnail techniques, no cropping and saliency based cropping. As clearly shown, users were able to recognize objects more accurately with saliency based cropped thumbnails than with plain thumbnails with no cropping. One of the major reasons for the difference is that the effective portion of an image is drawn relatively larger in saliency based cropped images. If the main object region were cropped out, this would not be true. In that case, users would see more of the non-core parts of images, and the recognition rate of the cropped thumbnails would be lower than that of plain thumbnails. The goal of this test is to measure whether saliency based cropping cuts out the right part of images. The recognition test results show that participants recognize objects better with saliency based thumbnails than with plain thumbnails. Therefore, we can say that saliency based cropping cuts out the right part of images.




Comparison | P value
No cropping vs. Saliency based cropping on Animal Set |
No cropping vs. Saliency based cropping on Face Set |
No cropping vs. Face Detection based cropping on Face Set | < 0.001
Saliency based cropping vs. Face detection based cropping on Face Set | < 0.001
Animal Set vs. Face Set with no cropping |
Animal Set vs. Face Set with saliency based cropping |

Figure 13: Analysis results of Recognition Task (Paired T-Test). All curves in Figure 12 are significantly different from each other.

During the experiment, participants mentioned that the background sometimes helped with recognition. For example, when they saw a blue background, they immediately suspected that the image would be of a sea animal. Similarly, the camel was well identified with every thumbnail technique, even in very small thumbnails, because the images have distinctive desert backgrounds (4 out of 5 images).

Since saliency based cropping cuts out a large portion of the background (42.4%), we suspected that this might harm recognition. But the results show that this is not the case. Users performed better with cropped images. Even when background was cut out, users could still see some of the background, and they got enough help from that information. This implies that saliency based cropping is well balanced: the cropped image shows the main objects bigger while giving enough background information.

The second graph shows results similar to the first. It represents the results from the "Face Set" with three different thumbnail techniques: no cropping, saliency based cropping, and face detection based cropping. As seen in the graph, participants perform much better with face detection based thumbnails. It is not surprising that users can identify a person more easily in images with bigger faces.

Compared to the Animal Set result, the Face Set images are less accurately identified. This is because humans have similar visual characteristics while animals have more distinguishing features. In other words, animals can be identified with overall shapes and colors but humans cannot be distinguished easily with those features.  The main feature that distinguishes humans is the face. The experimental results clearly show that participants recognized persons better with face detection based thumbnails.

The results also show that saliency cropped thumbnails are useful for recognizing humans as well as animals. We found that saliency based cropped images include the persons in the photos, so that persons can be presented larger in the cropped images. The test results show that saliency based cropping does increase the recognition rate.

In this study, we used two types of image sets and three different thumbnail techniques. To achieve a higher recognition rate, it is important to show major distinguishing features. If well cropped, a small thumbnail would be sufficient to represent the whole image. Face detection based cropping shows benefits when this type of feature extraction is possible. But in a real image browsing task, it is not always possible to know users' searching intentions. For the same image, users' focus might be different for browsing purposes. For example, users might want to find a person at some point, but the next time, they would like to focus on costumes only. We believe that the saliency based cropping technique can be applied in most cases when semantic object detection is not available or users' search behavior is not known.

The cropped images usually contain the main objects in the images as well as sufficient background information.

In addition, the recognition rate is not the same for different types of images. This implies that the minimum recognizable size differs depending on image type.



Figure 13 shows the result of the visual search tasks. Most participants were able to finish the tasks within the 120 second timeout (15 timeouts out of 231 tasks) and also chose the desired answer (5 wrong answers out of 231 tasks). Wrong answers and timed out tasks were excluded from the analysis.

A two way analysis of variance (ANOVA) was conducted on the search time for two factors, thumbnail technique and image set. As shown, participants found the answer images faster with cropped thumbnails. Overall, there was a strong difference in visual search performance depending on thumbnail technique, F(2, 219) = 5.58, p = 0.004.

Since there were no results from face detection cropping for the Animal Set and the Corbis Set, we did another analysis with only the two thumbnail techniques (plain thumbnail, saliency based cropped thumbnail) to see if the saliency based algorithm is better. The result shows a significant improvement in visual search with saliency based cropping, F(1, 190) = 3.823, p = 0.05. We therefore believe that the proposed saliency based cropping algorithm makes a significant contribution to visual search.

Figure 14: Visual search task results.



Comparison                                                      F value              P value
Thumbnail techniques on three sets                              F(2, 219) = 5.58     0.004
Thumbnail techniques on Face Set                                F(2, 87) = 4.56      0.013
No cropping vs. saliency based thumbnail on three image sets    F(1, 190) = 3.823    0.05
Three image sets regardless of thumbnail techniques             F(2, 219) = 2.44     0.089

Figure 15: List of ANOVA results from the visual search task.

When the results from the Face Set alone were analyzed by one way ANOVA with three thumbnail technique conditions, there was also a significant effect, F(2, 87) = 4.56, p = 0.013. But for the Animal Set and the Corbis Set, the effect was only borderline and did not reach significance. We think that the weak result for each subset is due to the small number of observations. We believe those results would also be significant if there were more participants, because there was a clear trend showing an improvement of 18% on the Animal Set and 24% on the Corbis Set. Lack of significance can also be attributed to the fact that the search task itself has large variance by its nature. We found that the location of answer images affects visual search performance. Users can begin looking for images from anywhere in the image space (Figure 11). Participants scanned the image space from the upper-left corner, from the lower-right corner, or sometimes randomly. If the answer image is located at the initial position of a user's attention, it is found much earlier. Since we could not control users' behavior, we randomized the location of the answer images. As a result, large variance was inevitably introduced into the result.
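To make the F statistics quoted above easier to read, the following is a minimal sketch of how a one way ANOVA F value and its degrees of freedom are computed. It is a textbook formulation and not the analysis script used in the study; the function name `one_way_anova` and its input layout (one list of search times per thumbnail technique) are assumptions made for illustration.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples.

    `groups` is a list of lists; each inner list holds the measurements
    (e.g., search times in seconds) for one condition. Hypothetical helper.
    """
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation explained by the condition (between-group sum of squares).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Residual variation inside each condition (within-group sum of squares).
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

With three conditions and 90 retained observations in total, this formulation yields degrees of freedom of (2, 87), which is the form in which the Face Set result is reported.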

Before the experiment, we were afraid that the cropped thumbnails of the Corbis Set images would affect the search result negatively, since the images in the Corbis Set are already in good shape and we were concerned that cutting off their background would harm participants' visual search. But according to our result, saliency based cropped thumbnails do not harm users' visual search. Rather, they showed a tendency to increase participants' search performance. We think that this is because the saliency based cropping algorithm cuts the right amount of information without removing core information in the images. At the least, we can conclude that using the cropped thumbnails did not make visual search worse.

Another interesting finding is that the visual search task with the Animal Set tends to take less time than with the Corbis Set and the Face Set, F(2, 219) = 2.44, p = 0.089. This might be because the given Corbis Set and Face Set tasks were harder than the Animal Set tasks. But we think there is another interesting factor. During the experiment, on finding the answer image after a while, one participant said, "Oh... This is not what I expected. I expected blue background when I'm supposed to find an airplane." One of the authors, who was observing the experiment session, saw that the participant had passed over the correct answer image during the search even though he saw the image at a reasonably big scale. Since all of the visual search tasks except finding faces were given as verbal descriptions, users did not have any information about what the answer images would look like. We think that this verbal description was one of the factors in the performance differences between image sets. We found that animals are easier to find by guessing the background than the contents of the other image sets.



We developed and evaluated two automatic cropping methods. A general thumbnail cropping method based on a saliency model finds the informative portion of images and cuts out the non-core part of images. Thumbnail images generated from the cropped part of images increase users' recognition and help users in visual search. This technique is general and can be used without any prior assumption about images since it uses only low level features. Furthermore, it can also be used for images already in good shape. Since it dynamically decides the proportion of the region to cut away, it can avoid cutting out too much.
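The idea of deciding dynamically how much to cut away can be illustrated with a small sketch. The code below is not the authors' implementation; it assumes a saliency map is already available as a 2D grid of non-negative numbers and uses a brute-force search, practical only for small grids, to find the smallest rectangle that retains a chosen fraction of the total saliency. The function name `smallest_salient_crop` and the `fraction` parameter are hypothetical.

```python
def smallest_salient_crop(saliency, fraction=0.7):
    """Return (top, left, bottom, right) of the smallest rectangle whose
    saliency sum is at least `fraction` of the total. Bounds are half-open:
    rows top..bottom-1 and columns left..right-1 are kept."""
    rows, cols = len(saliency), len(saliency[0])

    # Summed-area table so any rectangle sum costs four lookups.
    integral = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            integral[r + 1][c + 1] = (saliency[r][c] + integral[r][c + 1]
                                      + integral[r + 1][c] - integral[r][c])

    target = fraction * integral[rows][cols]
    best, best_area = (0, 0, rows, cols), rows * cols  # default: no cropping

    for top in range(rows):
        for bottom in range(top + 1, rows + 1):
            for left in range(cols):
                for right in range(left + 1, cols + 1):
                    area = (bottom - top) * (right - left)
                    if area >= best_area:
                        continue  # cannot improve on the current best
                    total = (integral[bottom][right] - integral[top][right]
                             - integral[bottom][left] + integral[top][left])
                    if total >= target:
                        best, best_area = (top, left, bottom, right), area
    return best
```

Because the crop size follows from where the saliency is concentrated rather than from a fixed ratio, an image whose saliency is spread widely keeps most of its area, while an image with one compact salient object is cropped tightly.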

The face detection based cropping technique shows an example of how semantic information inside images can be used to enhance thumbnail cropping. With a face detection technique, we created more effective thumbnails, which significantly increased users' recognition and search performance.

Our study provides strong empirical evidence that supports our hypotheses. We assumed that the more salient a portion of an image, the more informative it is. We also presumed that using more recognizable thumbnails would increase visual search performance. We believe that further research would reveal the intrinsic nature of these relationships and their limitations.

Another finding of interest is that users tend to have a mental model of a search target. As stated above, users tend to develop a model of what a target will look like by guessing its color and shape. We noticed that many participants began to guess what the target would look like when they were asked to browse images. We observed that they spent a long time searching, or even skipped the correct answer, when their guesses were wrong or they were unable to guess. It is known that humans have an "attentional control setting", a mental setting about what they are (and are not) looking for while performing a given task. Interestingly, it is also known that humans have difficulty switching their attentional control setting instantaneously [11]. This theory explains our observations well. We think that this phenomenon should be considered in designing image browsing interfaces, especially in situations where users need to skim a large number of images.

There are several interesting directions for future research. One direction involves determining how to apply these techniques to more general browsing environments. In our study, we used a zoomable interface for visual search. We believe that the image cropping techniques presented in this paper can benefit other types of interfaces that deal with a large number of images as well. Another direction of interest would be to combine image adaptation techniques (i.e., saliency based smoothing) with the image cropping techniques. This would allow faster thumbnail processing and delivery for thumbnail-based retrieval systems.



We would like to acknowledge the face group at Carnegie Mellon University for providing resources for face detection processing.



1. ACDSee, ACD Systems,

2. Adobe Photoshop Album, Adobe Systems Inc.,

3. Bederson, B. B. PhotoMesa: A Zoomable Image Browser Using Quantum Treemaps and Bubblemaps. UIST 2001, ACM Symposium on User Interface Software and Technology, CHI Letters, 3(2), pp. 71-80, 2001.

4. Burton, C., Johnston, L., and Sonenberg, E. Case Study: An Empirical Investigation of Thumbnail Image Recognition, Proceedings on Information Visualization, Atlanta, Georgia, pp. 115-121, 1995.

5. Chen, L., Xie, X., Fan, X., Ma, W., Zhang, H., and Zhou, H. A Visual Attention Model for Adapting Images on Small Displays, MSR-TR-2002-125, Microsoft Research, Redmond, Washington, 2002.

6. Chen, X., and Zhang, H. Text Area Detection from Video Frames. In Proc. 2nd IEEE Pacific-Rim Conference on Multimedia (PCM2001), October 2001, Beijing, China, pp. 222-228.

7. Corbis,

8. DeCarlo, D., and Santella, A. Stylization and Abstraction of Photographs, In ACM SIGGRAPH 2002, pp. 769-776.

9. Face Detection Demonstration. Robotics Institute, Carnegie Mellon University

10. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. Query by Image and Video Content: The QBIC System, IEEE Computer, 28(9), pp. 23-32, Sept. 1995.

11. Folk, C.L., Remington, R.W., and Johnston, J.C. Involuntary covert orienting is contingent on attentional control settings. Journal of Experimental Psychology: HP&P, 18:1030-44, 1992.

12. Itti, L., and Koch, C. A Comparison of Feature Combination Strategies for Saliency-Based Visual Attention Systems, SPIE Human Vision and Electronic Imaging IV (HVEI'99), San Jose, CA, pp. 473-482.

13. Itti, L., Koch, C., and Niebur, E. A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), pp. 1254-9, 1998.

14. Kang, H., and Shneiderman, B. Visualization Methods for Personal Photo Collections: Browsing and Searching in the PhotoFinder, In Proc. of IEEE International Conference on Multimedia and Expo (ICME2000), New York: IEEE, pp. 1539-1542.

15. Li, S., Zhu, L., Zhang, Z., Blake, A., Zhang, H., and Shum, H. Statistical Learning of Multi-view Face Detection. In European Conference on Computer Vision (ECCV) (4) 2002: 67-81.

16. Milanese, R., Wechsler, H., Gil, S., Bost, J., and Pun, T. Integration of Bottom-Up and Top-Down Cues for Visual Attention Using Non-Linear Relaxation, Proc. of Computer Vision and Pattern Recognition (CVPR), IEEE, 1994, pp. 781-785.

17. Milanese, R. Detecting Salient Regions in an Image: from Biological Evidence to Computer Implementation, Ph.D. thesis, Univ. of Geneva, 1993.

18. Schneiderman, H., and Kanade, T. A Statistical Model for 3D Object Detection Applied to Faces and Cars. IEEE Conference on Computer Vision and Pattern Recognition, IEEE, June, 2000.

19. Vimas Technologies.

20. Yang, M., Kriegman, D., and Ahuja, N. Detecting Faces in Images: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1), pp. 34-58, 2002.



