Disentangling Visual Embeddings for Attributes and Objects

Nirat Saini
Khoi Pham
Abhinav Shrivastava

CVPR 2022 (Oral)

We disentangle the visual embeddings for peeled and for apple from an image of a peeled apple

For an image I of a peeled apple, two other images are used to extract visual similarity and dissimilarity features for the attribute and the object.

After disentangling the visual features for attributes and objects individually, we can hallucinate seen and unseen pairs (peeled apple and sliced orange, respectively) for better generalization.

*Note that this is a visualization of embedding-space composition; we do not generate images.
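The pairwise extraction above can be sketched in code. This is a simplified illustration and not the paper's exact architecture: `similarity_feature` and `dissimilarity_feature` are hypothetical helpers that pool the spatial regions of one image by how well they match the other image, standing in for the learned similarity and dissimilarity modules.

```python
import numpy as np

def _softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def similarity_feature(feat_a, feat_b):
    """Pool the regions of feat_a that best match some region of feat_b.

    feat_a, feat_b: (N, D) arrays of N spatial region features with D channels
    (e.g. flattened CNN grids for 'peeled apple' and another 'peeled' image).
    Returns a (D,) vector emphasising the concept the pair shares.
    """
    affinity = feat_a @ feat_b.T      # (N, N) region-to-region affinities
    best = affinity.max(axis=1)       # best match in feat_b per region of feat_a
    return _softmax(best) @ feat_a    # similarity-weighted pooling

def dissimilarity_feature(feat_a, feat_b):
    """Pool the regions of feat_a that match feat_b worst (the non-shared concept)."""
    affinity = feat_a @ feat_b.T
    best = affinity.max(axis=1)
    return _softmax(-best) @ feat_a
```

With an attribute-sharing pair, the similarity feature approximates the attribute embedding and the dissimilarity feature approximates the object embedding (and vice versa for an object-sharing pair).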



Object Attribute Disentanglement (OADis)


Click here for details

Using similarity and dissimilarity between pairs of images, we disentangle the visual embeddings for attributes and objects, and hallucinate compositions in the embedding space using linguistic losses for regularization:

  • Lcls: classifies the image into the correct attribute-object pair using the Label Embedder
  • Lattr: separates the attribute's visual feature from the images (peeled)
  • Lobj: separates the object's visual feature from the images (apple)
  • Lseen: regularizes the hallucinated visual feature composition for the seen pair (peeled apple)
  • Lunseen: regularizes the hallucinated visual feature composition for the unseen pair (sliced orange)
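One way the five objectives above could be combined is a weighted sum, with the classification term computed as cross-entropy over compatibility scores between the image feature and composed label embeddings. This is a hedged sketch: the `cross_entropy` and `oadis_loss` helpers and the weights are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single example."""
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    return -log_probs[target]

def oadis_loss(img_feat, pair_embeds, target,
               l_attr, l_obj, l_seen, l_unseen,
               weights=(1.0, 0.5, 0.5, 0.5, 0.05)):
    """Weighted sum of the five objectives; the weights are illustrative.

    img_feat: (D,) image feature; pair_embeds: (P, D) composed label
    embeddings from the Label Embedder; target: index of the correct pair.
    """
    l_cls = cross_entropy(pair_embeds @ img_feat, target)  # Lcls
    terms = (l_cls, l_attr, l_obj, l_seen, l_unseen)
    return sum(w * t for w, t in zip(weights, terms))
```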

Qualitative Results

Retrieval of nearest 5 images using hallucinated compositions of unseen attributes and objects

MIT-States Results

Using the hallucinated compositions of unseen attribute and object embeddings, we retrieve the 5 nearest neighbors. For example, the first row shows images retrieved for sliced fruit. Images with incorrect labels are in red.

UT-Zappos Results

Using the hallucinated compositions of unseen attribute and object embeddings, we retrieve the 5 nearest images. As with the MIT-States results, images with incorrect labels are in red.
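The retrieval step itself is plain nearest-neighbor search in the embedding space. A minimal sketch, assuming L2-normalized features and cosine similarity (the metric is our assumption, not stated here):

```python
import numpy as np

def retrieve_nearest(query, gallery, k=5):
    """Indices of the k gallery features closest to query by cosine similarity.

    query: (D,) hallucinated composition embedding (e.g. for 'sliced fruit');
    gallery: (M, D) image features of the test set.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]
```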

Visualizing Attribute Masks

Visualization Results

(a) Failure Cases: For a given image pair, it is not always straightforward to divide the attribute and the object with a mask. For some MIT-States cases, the examples are too vague or mislabeled to capture the attribute and object concepts separately. For instance, in clear lake and clear sky, it is very difficult to distinguish the lake from the sky, so the similarity and dissimilarity maps do not perform well.

(b) Correct Examples: These are cases where the similarity and dissimilarity maps correctly capture the attributeness and objectness in MIT-States image pairs. You can try and see more examples of attribute masks and object masks below.
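The masks shown here can be thought of as per-region similarity scores rendered spatially rather than pooled into a single vector. A sketch under the same affinity-based assumption as before; `attribute_mask` is an illustrative helper, not the paper's exact visualization code.

```python
import numpy as np

def attribute_mask(feat_a, feat_b, grid=(7, 7)):
    """Per-region similarity of feat_a to feat_b as a spatial mask in [0, 1].

    feat_a, feat_b: (N, D) flattened spatial features of an image pair;
    high values mark regions of feat_a that the pair likely shares
    (the attribute when the pair shares an attribute).
    """
    affinity = feat_a @ feat_b.T                           # (N, N)
    scores = affinity.max(axis=1)                          # best match per region
    scores = (scores - scores.min()) / (np.ptp(scores) + 1e-8)
    return scores.reshape(grid)
```

For cases like clear lake vs. clear sky, the affinity is high almost everywhere, which is one way to see why the resulting masks are uninformative.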

Paper, Code, and Supplementary Material

N. Saini, K. Pham, A. Shrivastava
Disentangling Visual Embeddings for Attributes and Objects.
CVPR 2022 (Oral).


Paper / Supplementary / Code / Bibtex




Acknowledgements

This work was supported by the Air Force (STTR awards FA865019P6014, FA864920C0010), the DARPA SAILON program (W911NF2020009), and gifts from the Adobe collaboration support fund.