GRIT: GAN Residuals for Paired Image-to-Image Translation

Saksham Suri*
Moustafa Meshry*
Larry Davis
Abhinav Shrivastava

WACV 2024

[Paper]


We decouple the optimization of reconstruction and adversarial losses by synthesizing an image as a combination of its reconstruction (low-frequency) and GAN residual (high-frequency) components. The GAN residual adds realistic fine details while avoiding the pixel-wise penalty imposed by reconstruction losses.
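A minimal sketch of this composition and the decoupled losses is given below. This is not the authors' released code: the module names recon_branch, residual_branch, and discriminator, the weight lambda_rec, and the generic non-saturating adversarial loss are all illustrative assumptions.

import torch.nn.functional as F

def grit_style_losses(recon_branch, residual_branch, discriminator,
                      input_labels, target_image, lambda_rec=10.0):
    # Low-frequency reconstruction and high-frequency GAN residual.
    recon = recon_branch(input_labels)
    residual = residual_branch(input_labels)
    output = recon + residual  # final composed image

    # The L1 penalty sees only the reconstruction component, so the
    # residual (and hence the final output) escapes pixel-wise constraints.
    loss_rec = F.l1_loss(recon, target_image)

    # The adversarial loss supervises the composed, realistic output
    # (a conditional discriminator is assumed here).
    logits_fake = discriminator(output, input_labels)
    loss_adv = F.softplus(-logits_fake).mean()

    return lambda_rec * loss_rec + loss_adv, output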


Abstract

Current Image-to-Image translation (I2I) frameworks rely heavily on reconstruction losses, where the output needs to match a given ground-truth image. An adversarial loss is commonly utilized as a secondary loss term, mainly to add more realism to the output. Compared to unconditional GANs, I2I translation frameworks have more supervisory signals, yet their outputs still show more artifacts and do not reach the same level of realism achieved by unconditional GANs. We study the performance gap, in terms of photo-realism, between I2I translation and unconditional GAN frameworks. Based on our observations, we propose a modified architecture and training objective to address this realism gap. Our proposal relaxes the role of reconstruction losses to that of regularizers, instead of having them do all the heavy lifting as is common in current I2I frameworks. Furthermore, our proposed formulation decouples the optimization of the reconstruction and adversarial objectives and removes pixel-wise constraints on the final output. This allows for a set of stochastic but realistic variations of any target output image.


Approach Overview


Overview of GRIT. Left: our network generates the output as the composition of a reconstruction component and a GAN-residual component. An L1 reconstruction loss is applied only to the reconstruction component, while the GAN residual is supervised only through an adversarial loss. Right: the generator's upsampling block. We feed the encoded style latent through AdaIN layers, and also add random spatial noise maps, controlled by learnable weights W, to the feature maps.
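The upsampling block described above might be sketched as follows. This is a rough PyTorch approximation rather than the released implementation; the layer sizes, leaky-ReLU activation, and nearest-neighbor upsampling are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaIN(nn.Module):
    # Adaptive instance normalization: scale and shift the normalized
    # feature maps with parameters predicted from the style latent.
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * num_channels)

    def forward(self, x, style):
        gamma, beta = self.affine(style).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return (1 + gamma) * self.norm(x) + beta

class UpBlock(nn.Module):
    # Upsampling block sketch: upsample -> conv -> add weighted noise -> AdaIN.
    def __init__(self, in_ch, out_ch, style_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Learnable per-channel weights W controlling the injected noise.
        self.noise_weight = nn.Parameter(torch.zeros(1, out_ch, 1, 1))
        self.adain = AdaIN(style_dim, out_ch)

    def forward(self, x, style):
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        x = self.conv(x)
        # Random spatial noise map, shared across channels and scaled by W.
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        x = x + self.noise_weight * noise
        return F.leaky_relu(self.adain(x, style), 0.2)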



Qualitative Analysis


Qualitative comparison of GRIT with recent approaches on the CelebAMask-HQ dataset.



Examples of local stochastic variations. Rows, top to bottom: the input image, one sample output, the per-pixel standard deviation over 20 different outputs for the same input, and the ground-truth image.
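The per-pixel standard deviation maps can be produced as in the sketch below. The call signature generator(input_labels, style) is an assumption, and fresh spatial noise is assumed to be drawn inside the generator on every forward pass.

import torch

@torch.no_grad()
def pixel_std_map(generator, input_labels, style, num_samples=20):
    # Sample several stochastic outputs for the same input, then take the
    # standard deviation at every pixel to visualize local variation.
    samples = torch.stack([generator(input_labels, style) for _ in range(num_samples)])
    return samples.std(dim=0)  # same spatial size as a single output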



Examples of the different outputs of our method along with the input label map and ground truth image.



Examples of style transfer using input label maps and style images from 10 different subjects.



Paper and Supplementary Material

S. Suri*, M. Meshry*, L. Davis, A. Shrivastava.
GRIT: GAN Residuals for Paired Image-to-Image Translation.
In WACV, 2024.
Paper | Supplementary



