Benchmarking Counterfactual Image Generation

1National & Kapodistrian University of Athens, Greece, 2Archimedes/Athena RC, Greece
3The University of Edinburgh, UK, 4Imperial College London, UK
5The University of Essex, UK, 6Spotify

*Indicates Equal Contribution

Abstract

Generative AI has revolutionised visual content editing, empowering users to effortlessly modify images and videos. However, not all edits are equal. To perform realistic edits in domains such as natural or medical imaging, modifications must respect the causal relationships inherent to the data generation process. Such image editing falls into the counterfactual image generation regime. Evaluating counterfactual image generation is substantially complex: not only does it lack observable ground truths, but it also requires adherence to causal constraints. Although several counterfactual image generation methods and evaluation metrics exist, a comprehensive comparison within a unified setting is lacking. We present a comparison framework to thoroughly benchmark counterfactual image generation methods. We integrate all models that have been used for the task at hand and expand them to novel datasets and causal graphs, demonstrating the superiority of Hierarchical VAEs across most datasets and metrics. Our framework is implemented in a user-friendly Python package that can be extended to incorporate additional SCMs, causal methods, generative models, and datasets for the community to build on.

What is Counterfactual Image Generation?

Counterfactual image generation is an exciting frontier in generative AI that creates plausible alternative versions of images to explore "what if" scenarios. Unlike traditional image editing, which often alters images without considering the underlying causal relationships, counterfactual generation explicitly models these connections to produce realistic and coherent results.

Counterfactuals allow us to visualize potential consequences of changing specific attributes, such as age or gender, within an image. For instance, modifying a person's age might lead to an unrealistic outcome if not informed by causal relationships. By addressing these nuances, counterfactual image generation aims to ensure that the edited images reflect plausible realities rather than misleading artifacts.

Figure 1: (a) A plausible causal graph for human faces; (b) Factual images (no intervention); (c) Causal counterfactual images using the graph of (a) to perform the interventions do(Female) (upper panel) and do(Young) (lower panel); (d) Non-causal image editing.

Generate Counterfactuals

using the models we benchmark

[Interactive demo: choose a dataset preview and compare the factual image with the VAE, HVAE, and GAN counterfactuals.]

Key Contributions

  1. We introduce a comprehensive framework to evaluate image generation models within the Deep-SCM paradigm across various datasets, including synthetic, natural, and medical images.
  2. We expand existing models to handle previously untested datasets and causal graphs, specifically testing HVAE and GAN on a non-trivial causal graph for human faces and devising a GAN architecture that generates counterfactual brain MRIs given multiple variables.
  3. We extensively benchmark these models, adopting multiple metrics for assessing SCM-based counterfactual image generation.
  4. We offer a user-friendly Python package to accommodate and evaluate forthcoming causal mechanisms, datasets and causal graphs.

This work aspires to be the go-to resource for researchers and developers, paving the way for future advancements in counterfactual image generation.

How are counterfactuals obtained?

To obtain counterfactuals, we follow the Abduction-Action-Prediction paradigm. This involves three key steps: Abduction, where we infer the exogenous noise compatible with an observation; Action, where we alter the structural equations to reflect our intervention; and Prediction, where we compute the resulting outcomes from the modified model. Following previous work, we use Conditional Normalizing Flows for the mechanisms of the image attributes, while for the image mechanism we consider (i) Conditional Variational Autoencoders (VAEs), (ii) Conditional Hierarchical VAEs (HVAEs), and (iii) Conditional Generative Adversarial Networks (GANs).


Figure 2: Producing a counterfactual MorphoMNIST digit. Abduction: we infer the exogenous noise separately, using Normalising Flows f⁻¹ for the attributes and the encoder of the image mechanism (e.g. VAE, HVAE, or GAN), conditioned on the factual parents. Action: we intervene (do(t*)) only on thickness. Prediction: we employ the Normalising Flow f conditioned on the counterfactual thickness t* to obtain i* after the intervention. Note that this is not needed for t*, on which we intervene, nor for d*, which has no parents in the DAG. Finally, the decoder generates the counterfactual image, given the exogenous noise U_img and all counterfactual attributes.
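
To make the three steps concrete, here is a minimal Python sketch of the Abduction-Action-Prediction procedure from Figure 2 (thickness t → intensity i → image, with a parentless digit d). The handles flow_i, encoder, and decoder are hypothetical stand-ins for the trained mechanisms, not our package's actual API.

    def counterfactual(x, t, i, d, t_star, flow_i, encoder, decoder):
        """Hypothetical sketch: generate x* under the intervention do(t*)."""
        # Abduction: recover the exogenous noise of each mechanism.
        u_i = flow_i.inverse(i, cond=t)          # f^{-1}(i; t) for intensity
        u_img = encoder(x, parents=(t, i, d))    # U_img, conditioned on factual parents

        # Action: replace the thickness mechanism with the constant t*.
        # Prediction: propagate the abducted noise through the modified model.
        i_star = flow_i.forward(u_i, cond=t_star)  # i* = f(u_i; t*)
        # t* is fixed by the intervention; d has no parents, so d* = d.
        x_star = decoder(u_img, parents=(t_star, i_star, d))
        return x_star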

Key Metrics

  • Composition: Ensures the image remains unchanged under a "null-intervention" (when no change is applied). We measure this with distances such as L1 and LPIPS.
  • Effectiveness: Confirms the intervention worked as intended. We run attribute predictors on the generated images and evaluate with metrics such as accuracy and F1-score.
  • Realism: Measures how close the counterfactual is to real images, using the Fréchet Inception Distance (FID).
  • Minimality: Ensures changes are focused on the intended attribute, measured with Counterfactual Latent Divergence (CLD), which quantifies closeness to the original image.

In essence, evaluating counterfactual images requires balancing these metrics to ensure images are realistic, minimal in changes, and effective in showcasing intended modifications.
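
As an illustration, composition can be scored by repeatedly applying a null-intervention and comparing the result with the factual image. The sketch below assumes a hypothetical null_intervention callable that performs abduction and regeneration with unchanged parents; the lpips package is one common way to compute the perceptual distance.

    import lpips  # pip install lpips

    def composition_score(x, parents, null_intervention, n_cycles=10):
        """L1 and LPIPS distance between the factual image x (an NCHW tensor
        in [-1, 1]) and its n-th null-intervention cycle."""
        loss_fn = lpips.LPIPS(net='alex')
        x_cycled = x
        for _ in range(n_cycles):
            x_cycled = null_intervention(x_cycled, parents)  # abduct, then regenerate
        l1 = (x - x_cycled).abs().mean().item()
        perceptual = loss_fn(x, x_cycled).mean().item()
        return l1, perceptual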

Composition

Our findings indicate that HVAE consistently outperforms other models on composition. While both VAE and GAN show similar performance, they have distinct drawbacks: VAE reconstructions maintain image structure but appear too blurry, while GAN alters image structure, especially with complex datasets.

The advantage of HVAE over the other models becomes clearer as data complexity increases (e.g., on CelebA and ADNI). This improvement may be attributed to HVAE's multiple stochastic layers of latent variables, which allow it to learn the latent prior of the data distribution more effectively and to retain it across composition cycles.

Figure 4: Qualitative evaluation of composition across all datasets/graphs. From left to right across all datasets: (i) factual, (ii) null-intervention (reconstruction), (iii) 10 cycles of null-intervention.

Effectiveness

We attribute HVAE’s superior performance in generating counterfactuals to the high expressivity of its hierarchical latent variables. This hierarchical structure enables HVAE to align closely with the underlying causal graph, allowing it to retain the semantic information of the original image while implementing the intended causal modifications effectively.

The rich representation of latent variables in HVAE not only preserves the original content but also allows for more accurate and nuanced manipulation of causal factors. This results in counterfactuals that are both effective and plausible, making HVAE a powerful tool for counterfactual generation.
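
For intuition, effectiveness reduces to checking that a predictor recovers the intervened value from the counterfactual. A hedged sketch, assuming a hypothetical predictors dict that maps each attribute name to a pretrained anti-causal classifier returning logits:

    import torch

    @torch.no_grad()
    def effectiveness(x_star, target_attrs, predictors):
        """Per-attribute accuracy: fraction of counterfactuals x_star whose
        predicted attribute matches the intervened target value."""
        scores = {}
        for name, clf in predictors.items():
            preds = clf(x_star).argmax(dim=1)                    # predicted labels
            hits = (preds == target_attrs[name]).float().mean()  # match rate
            scores[name] = hits.item()
        return scores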

Figure 5: Qualitative evaluation of effectiveness for all datasets/graphs. From left to right across datasets: the leftmost image is the factual one; each subsequent column shows the causal counterfactual image after intervening on a single attribute. v: volume; vent: ventricle; eg: eyeglasses.

Realism & Minimality

We found that HVAE generates the most realistic counterfactuals across most metrics and datasets, i.e. those closest to the original data distribution. Conditioning the HVAE on more attributes (the complex graph on CelebA) hurt image realism, while the GAN maintained good performance. Conversely, although the GAN is capable of generating realistic images, it failed to achieve optimal minimality because it preserves factual details less faithfully than HVAE, as the composition metric also shows.
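
For reference, the FID side of this evaluation can be reproduced with off-the-shelf tooling; the sketch below uses torchmetrics' FrechetInceptionDistance on random stand-in batches (replace them with real test images and generated counterfactuals).

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance  # needs torch-fidelity installed

    # Random uint8 stand-ins of shape (N, 3, H, W); swap in real data in practice.
    real_images = torch.randint(0, 256, (128, 3, 64, 64), dtype=torch.uint8)
    counterfactuals = torch.randint(0, 256, (128, 3, 64, 64), dtype=torch.uint8)

    fid = FrechetInceptionDistance(feature=64)  # 2048 is standard; 64 keeps this toy run fast
    fid.update(real_images, real=True)
    fid.update(counterfactuals, real=False)
    print(f"FID: {fid.compute().item():.2f}")   # lower = closer to the real image distribution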

🎉 Summing it up! 🎉

Our work bridges a critical gap in counterfactual image generation by introducing a unified framework that rigorously benchmarks diverse models, datasets, and causal graphs. With this standardized evaluation, we uncover the exceptional expressivity of hierarchical structures in HVAE, which outperforms VAEs and GANs by capturing causal variables more accurately and preserving semantic details. Notably, our experiments with complex datasets—like human faces and brain MRIs—highlight HVAE’s capacity to generate realistic, causally sound counterfactuals. As we extend the reach of this benchmark, we look forward to seeing the community build on our foundation, exploring new methods, models, and causal graphs. To support these efforts, we’ve created an easy-to-use Python package that enables future innovations in counterfactual generation.

BibTeX

If you find this work helpful in your research, cite:
@misc{melistas2024benchmarking,
  title={Benchmarking Counterfactual Image Generation},
  author={Thomas Melistas and Nikos Spyrou and Nefeli Gkouti and Pedro Sanchez and Athanasios Vlontzos and Yannis Panagakis and Giorgos Papanastasiou and Sotirios A. Tsaftaris},
  year={2024},
  eprint={2403.20287},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}