Disentangling Visual Embeddings for Attributes and Objects

Nirat Saini
Khoi Pham
Abhinav Shrivastava

CVPR 2022 (Oral)

We disentangle the visual embeddings for peeled and for apple from an image of a peeled apple

For an image I of a peeled apple, two other images are used to extract visual similarity and dissimilarity features for the attribute and the object.

After disentangling the visual features for attributes and objects individually, we can hallucinate seen and unseen pairs (peeled apple and sliced orange, respectively) for better generalization.

*Note that this is a visualization of embedding-space composition; we do not generate images.
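The pairwise extraction above can be sketched in code. This is a simplified illustration and not the paper's exact architecture: `similarity_feature` and `dissimilarity_feature` are hypothetical helpers that pool the spatial regions of one image by how well they match the other image, standing in for the learned similarity and dissimilarity modules.

```python
import numpy as np

def _softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def similarity_feature(feat_a, feat_b):
    """Pool the regions of feat_a that best match some region of feat_b.

    feat_a, feat_b: (N, D) arrays of N spatial region features with D channels
    (e.g. flattened CNN grids for 'peeled apple' and another 'peeled' image).
    Returns a (D,) vector emphasising the concept the pair shares.
    """
    affinity = feat_a @ feat_b.T      # (N, N) region-to-region affinities
    best = affinity.max(axis=1)       # best match in feat_b per region of feat_a
    return _softmax(best) @ feat_a    # similarity-weighted pooling

def dissimilarity_feature(feat_a, feat_b):
    """Pool the regions of feat_a that match feat_b worst (the non-shared concept)."""
    affinity = feat_a @ feat_b.T
    best = affinity.max(axis=1)
    return _softmax(-best) @ feat_a
```

With an attribute-sharing pair, the similarity feature approximates the attribute embedding and the dissimilarity feature approximates the object embedding (and vice versa for an object-sharing pair).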



Object Attribute Disentanglement (OADis)


Click here for details

Using similarity and dissimilarity between pairs of images, we disentangle the visual embeddings for attributes and objects, and hallucinate compositions in the embedding space using linguistic losses for regularization:

  • Lcls: classifies the image into the correct attribute-object pair using the Label Embedder
  • Lattr: separates the attribute's visual feature from the images (peeled)
  • Lobj: separates the object's visual feature from the images (apple)
  • Lseen: regularizes the hallucinated visual feature composition for the seen pair (peeled apple)
  • Lunseen: regularizes the hallucinated visual feature composition for the unseen pair (sliced orange)
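One way the five objectives above could be combined is a weighted sum, with the classification term computed as cross-entropy over compatibility scores between the image feature and composed label embeddings. This is a hedged sketch: the `cross_entropy` and `oadis_loss` helpers and the weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return -log_probs[target]

def oadis_loss(img_feat, pair_embeds, target,
               l_attr, l_obj, l_seen, l_unseen,
               weights=(1.0, 0.5, 0.5, 0.5, 0.05)):
    """Weighted sum of the five objectives; the weights are illustrative.

    img_feat: (D,) image feature; pair_embeds: (P, D) composed label
    embeddings from the Label Embedder; target: index of the correct pair.
    """
    l_cls = cross_entropy(pair_embeds @ img_feat, target)  # Lcls
    terms = (l_cls, l_attr, l_obj, l_seen, l_unseen)
    return sum(w * t for w, t in zip(weights, terms))
```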

Qualitative Results

Retrieval of nearest 5 images using hallucinated compositions of unseen attributes and objects

MIT-States Results

Using the hallucinated compositions of unseen attribute and object embeddings, we retrieve the 5 nearest neighbors. For example, the first row shows images retrieved for sliced fruit. Images with incorrect labels are in red.

UT-Zappos Results

Using the hallucinated compositions of unseen attribute and object embeddings, we retrieve the 5 nearest images. As with the MIT-States results, images with incorrect labels are in red.
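The retrieval step itself is plain nearest-neighbor search in the embedding space. A minimal sketch, assuming L2-normalized features and cosine similarity (the metric is our assumption, not stated here):

```python
import numpy as np

def retrieve_nearest(query, gallery, k=5):
    """Indices of the k gallery features closest to query by cosine similarity.

    query: (D,) hallucinated composition embedding (e.g. for 'sliced fruit');
    gallery: (M, D) image features of the test set.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]
```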

Visualizing Attribute Masks

Visualization Results

(a) Failure Cases: For a given image pair, it is not always straightforward to divide the attribute and the object with a mask. For some MIT-States cases, the examples are too vague or mislabeled to capture the attribute and object concepts separately. For instance, in clear lake and clear sky, it is very difficult to distinguish the lake from the sky, so the similarity and dissimilarity maps do not perform well.

(b) Correct Examples: These are cases where the similarity and dissimilarity maps correctly capture the attributeness and objectness in MIT-States image pairs. You can try and see more examples of attribute masks and object masks below.
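The masks shown here can be thought of as per-region similarity scores rendered spatially rather than pooled into a single vector. A sketch under the same affinity-based assumption as before; `attribute_mask` is an illustrative helper, not the paper's exact visualization code.

```python
import numpy as np

def attribute_mask(feat_a, feat_b, grid=(7, 7)):
    """Per-region similarity of feat_a to feat_b as a spatial mask in [0, 1].

    feat_a, feat_b: (N, D) flattened spatial features of an image pair;
    high values mark regions of feat_a that the pair likely shares
    (the attribute when the pair shares an attribute).
    """
    affinity = feat_a @ feat_b.T                           # (N, N)
    scores = affinity.max(axis=1)                          # best match per region
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return scores.reshape(grid)
```

For cases like clear lake vs. clear sky, the affinity is high almost everywhere, which is one way to see why the resulting masks are uninformative.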

Paper, Code, and Supplementary Material

N. Saini, K. Pham, A. Shrivastava
Disentangling Visual Embeddings for Attributes and Objects.
CVPR 2022 (Oral).


Paper / Supplementary / Code / Bibtex




Acknowledgements

This work was supported by the Air Force (STTR awards FA865019P6014, FA864920C0010), the DARPA SAILON program (W911NF2020009), and gifts from the Adobe collaboration support fund.