Benchmarking Counterfactual Image Generation

1National & Kapodistrian University of Athens, Greece, 2Archimedes/Athena RC, Greece
3The University of Edinburgh, UK, 4Imperial College London, UK
5The University of Essex, UK, 6Spotify

*Indicates Equal Contribution

Abstract

Generative AI has revolutionised visual content editing, empowering users to effortlessly modify images and videos. However, not all edits are equal. To perform realistic edits in domains such as natural images or medical imaging, modifications must respect the causal relationships inherent to the data generation process. Such image editing falls into the counterfactual image generation regime. Evaluating counterfactual image generation is substantially complex: not only does it lack observable ground truths, but it also requires adherence to causal constraints. Although several counterfactual image generation methods and evaluation metrics exist, a comprehensive comparison within a unified setting is lacking. We present a comparison framework to thoroughly benchmark counterfactual image generation methods. We integrate all models that have been used for the task at hand and expand them to novel datasets and causal graphs, demonstrating the superiority of Hierarchical VAEs across most datasets and metrics. Our framework is implemented in a user-friendly Python package that can be extended to incorporate additional SCMs, causal methods, generative models, and datasets for the community to build on.

What is Counterfactual Image Generation?

Counterfactual image generation is an exciting frontier in generative AI that creates plausible alternative versions of images to explore "what if" scenarios. Unlike traditional image editing, which often alters images without considering the underlying causal relationships, it explicitly models these essential connections to produce realistic and coherent results.

Counterfactuals allow us to visualize potential consequences of changing specific attributes, such as age or gender, within an image. For instance, modifying a person's age might lead to an unrealistic outcome if not informed by causal relationships. By addressing these nuances, counterfactual image generation aims to ensure that the edited images reflect plausible realities rather than misleading artifacts.

Figure 1: (a) A plausible causal graph for human faces; (b) Factual images (no intervention); (c) Causal counterfactual images using the graph of (a) to perform the interventions do(Female) (upper panel) and do(Young) (lower panel); (d) Non-causal image editing.

Generate Counterfactuals

Interactive demo: generate counterfactuals using the models we benchmark, showing the factual image alongside the VAE, HVAE, and GAN counterfactuals for each dataset.

Key Contributions

  1. We introduce a comprehensive framework to evaluate image generation models within the Deep-SCM paradigm across various datasets, including synthetic, natural, and medical images.
  2. We expand existing models to handle previously untested datasets and causal graphs, specifically testing HVAE and GAN on a non-trivial causal graph for human faces and devising a GAN architecture that generates counterfactual brain MRIs given multiple variables.
  3. We extensively benchmark these models, adopting multiple metrics for assessing SCM-based counterfactual image generation.
  4. We offer a user-friendly Python package to accommodate and evaluate forthcoming causal mechanisms, datasets and causal graphs.

This work aspires to be the go-to resource for researchers and developers, paving the way for future advancements in counterfactual image generation.

How are counterfactuals obtained?

To obtain counterfactuals, we utilize the Abduction-Action-Prediction paradigm. This involves three key steps: Abduction, where we infer the noise compatible with an observation; Action, where we alter the structural equations to reflect our intervention; and Prediction, where we compute the resulting outcomes based on the modified model. Following previous work, we incorporate Conditional Normalizing Flows for the mechanisms of the attributes, while for the image mechanisms we consider (i) Conditional Variational Autoencoders (VAEs), (ii) Conditional Hierarchical VAEs (HVAEs) and (iii) Conditional Generative Adversarial Networks (GANs).
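
The following minimal sketch illustrates Abduction-Action-Prediction on a toy two-variable SCM (thickness → intensity, echoing the MorphoMNIST example below). The affine and exponential mechanisms and all numbers are hypothetical placeholders chosen to be invertible, not the conditional Normalizing Flows or deep image mechanisms used in the benchmark.

import numpy as np

# Toy structural equations (invertible, so abduction is exact):
#   thickness t = f_t(u_t)        (no parents)
#   intensity i = f_i(u_i; t)     (parent: thickness)
def f_t(u_t):
    return 1.0 + np.exp(u_t)

def f_t_inv(t):
    return np.log(t - 1.0)

def f_i(u_i, t):
    return 64.0 + 20.0 * t + u_i

def f_i_inv(i, t):
    return i - 64.0 - 20.0 * t

def counterfactual(t_obs, i_obs, t_star):
    # Abduction: recover the exogenous noise compatible with the observation.
    u_t = f_t_inv(t_obs)  # unused below, since we intervene on thickness
    u_i = f_i_inv(i_obs, t_obs)
    # Action: intervene do(thickness = t_star); its structural equation is replaced.
    t_cf = t_star
    # Prediction: push the abducted noise through the modified model.
    i_cf = f_i(u_i, t_cf)
    return t_cf, i_cf

t_cf, i_cf = counterfactual(t_obs=2.5, i_obs=120.0, t_star=4.0)
print(f"counterfactual thickness = {t_cf:.1f}, intensity = {i_cf:.1f}")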


Figure 2: Producing a counterfactual MorphoMNIST digit. Abduction: we infer the exogenous noise separately, using Normalising Flows f⁻¹ for the attributes and the encoder of the image mechanism (e.g. VAE, HVAE, or GAN), conditioned on the factual parents. Action: we intervene do(t*) only on thickness. Prediction: we employ the Normalising Flow f conditioned on the counterfactual thickness t* to obtain i* after the intervention. Note that this is not needed for t*, on which we intervene, or for d*, which has no parents in the DAG. Finally, the decoder generates the counterfactual image, given the exogenous noise U_img and all counterfactual attributes.
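
The image-mechanism half of this pipeline can be summarized by the schematic sketch below. The encoder/decoder interface is a hypothetical stand-in for the conditional VAE, HVAE, or GAN components evaluated in the benchmark; the actual models differ in how they infer and reuse the exogenous noise.

import torch

@torch.no_grad()
def image_counterfactual(encoder, decoder, x_factual, pa_factual, pa_cf):
    # Abduction: infer the exogenous noise U_img of the image, conditioned on
    # the factual parents (for a VAE this is the approximate posterior latent).
    u_img = encoder(x_factual, pa_factual)
    # Prediction: decode the same exogenous noise, now conditioned on the
    # counterfactual parents produced by the attribute mechanisms.
    return decoder(u_img, pa_cf)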

Key Metrics

  • Composition: Ensures the image remains unchanged under a "null intervention" (when no change is applied). We measure this with distances such as L1 and LPIPS.

  • Effectiveness: Confirms the intervention worked as intended. We use predictors on image features and metrics such as accuracy and F1-score.

  • Realism: Measures how similar the counterfactual is to real images, using the Fréchet Inception Distance (FID).

  • Minimality: Ensures changes are limited to the intended attribute, keeping the counterfactual close to the original; measured with Counterfactual Latent Divergence (CLD).

In essence, evaluating counterfactual images requires balancing these metrics to ensure images are realistic, minimal in changes, and effective in showcasing intended modifications.
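
As an illustration, the sketch below shows how the first two metrics can be computed, assuming a hypothetical model.counterfactual(x, parents, parents_cf) interface and a pre-trained attribute_classifier; LPIPS, FID, and CLD would be computed analogously with dedicated packages (e.g. lpips, torchmetrics). This is not the benchmark's exact implementation.

import torch

def composition_l1(model, x, parents, cycles=1):
    # Composition: apply a "null intervention" (keep all parents fixed) one or
    # more times and measure how far the result drifts from the original image.
    x_null = x
    for _ in range(cycles):
        x_null = model.counterfactual(x_null, parents, parents)
    return (x - x_null).abs().mean().item()

def effectiveness_accuracy(model, x, parents, parents_cf, attribute_classifier):
    # Effectiveness: check that a predictor trained on real images recovers the
    # intervened attribute from the generated counterfactual.
    x_cf = model.counterfactual(x, parents, parents_cf)
    pred = attribute_classifier(x_cf).argmax(dim=1)
    target = parents_cf["label"]  # hypothetical key for the intervened attribute
    return (pred == target).float().mean().item()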

Who is the winner?

Summary table showing the best performing models on each metric for all datasets

Our findings indicate that HVAE consistently outperforms all other models on composition. VAE and GAN have distinct drawbacks: VAE reconstructions maintain image structure but appear too blurry, while GAN alters image structure, especially on complex datasets. The advantage of HVAE over the other models becomes clearer as data complexity increases (e.g. CelebA, ADNI). This improvement may be attributed to HVAE's multiple stochastic layers of latent variables, which allow it to learn and retain the latent prior of the data distribution more effectively.

When measuring effectiveness, HVAE has the best score on most attributes, with VAE and GAN being competitive on the remaining attributes, depending on the dataset. We believe that HVAE's hierarchical structure enables it to align closely with the underlying causal graph, allowing for a more accurate and nuanced manipulation of causal factors. This results in counterfactuals that are both effective and plausible.

We found that HVAE generates the most realistic counterfactuals across most metrics and datasets. However, introducing more attributes to condition the HVAE on (as in CelebA (complex)) affected image realism, exhibiting a known issue of its counterfactual training step, termed attribute amplification. GANs, while capable of generating realistic images, fail to preserve factual details and therefore fall short on minimality, which is also evident from the composition metric.

🎉 Summing it up! 🎉

Our work bridges a critical gap in counterfactual image generation by introducing a unified framework that rigorously benchmarks diverse models, datasets, and causal graphs. With this standardized evaluation, we uncover the exceptional expressivity of hierarchical structures in HVAE, which outperforms VAEs and GANs by capturing causal variables more accurately and preserving semantic details. Notably, our experiments with complex datasets—like human faces and brain MRIs—highlight HVAE’s capacity to generate realistic, causally sound counterfactuals. As we extend the reach of this benchmark, we look forward to seeing the community build on our foundation, exploring new methods, models, and causal graphs. To support these efforts, we publish our open-source codebase to enable future contributions in counterfactual image generation.

BibTeX

If you find this work helpful in your research, cite:
        @inproceedings{melistas2024benchmarking,
          title={Benchmarking Counterfactual Image Generation},
          author={Thomas Melistas and Nikos Spyrou and Nefeli Gkouti and Pedro Sanchez and Athanasios Vlontzos and Yannis Panagakis and Giorgos Papanastasiou and Sotirios A. Tsaftaris},
          booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
          year={2024},
          url={https://openreview.net/forum?id=0T8xRFrScB}
        }