Composition
HVAE consistently outperforms the other models on composition. VAE and GAN perform similarly to each other but fail in different ways: VAE reconstructions preserve image structure but are overly blurry, whereas GAN reconstructions alter image structure, especially on complex datasets.
The advantage of HVAE becomes clearer as data complexity increases (e.g., on CelebA and ADNI). This improvement may be attributed to HVAE's multiple stochastic layers of latent variables, which allow it to learn the prior over the latents from the data distribution more effectively and to retain it across composition cycles.
Figure 4: Qualitative evaluation of composition across all datasets/graphs. From left to right for each dataset: (i) factual, (ii) null-intervention (reconstruction), (iii) 10 cycles of null-intervention.
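The repeated null-intervention used for composition can be illustrated with a minimal sketch. It assumes that composition is scored as the mean absolute (L1) pixel distance between the factual image and its reconstruction after each cycle; the distance choice, the function name composition_distance, and the two toy models are hypothetical stand-ins for illustration, not the evaluated HVAE, VAE, or GAN.

```python
import numpy as np


def composition_distance(x, null_intervene, cycles=10):
    """Apply a null-intervention repeatedly and track drift from the factual image.

    Returns one mean absolute (L1) distance per cycle. The L1 distance is an
    assumption made for this sketch; other image distances could be used.
    """
    current = x
    distances = []
    for _ in range(cycles):
        current = null_intervene(current)  # abduct, intervene with no change, predict
        distances.append(float(np.mean(np.abs(current - x))))
    return distances


def identity_model(img):
    # Hypothetical ideal model: a null-intervention returns the input unchanged.
    return img


def blurry_model(img):
    # Hypothetical lossy model: each pass averages every pixel with its
    # horizontal neighbours (circular), mimicking blurry reconstructions.
    return 0.5 * img + 0.25 * np.roll(img, 1, axis=-1) + 0.25 * np.roll(img, -1, axis=-1)


rng = np.random.default_rng(0)
x = rng.random((8, 8))  # toy "image"

ideal = composition_distance(x, identity_model)
drift = composition_distance(x, blurry_model)

print("ideal model, cycle 10:", ideal[-1])
print("blurry model, cycle 1 and cycle 10:", drift[0], drift[-1])
```

A model that satisfies composition perfectly keeps the distance at zero for every cycle, while a lossy model moves away from the factual image as cycles accumulate, which is the behaviour the 10-cycle column of Figure 4 is meant to expose.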
Effectiveness
We attribute HVAE’s superior performance in generating counterfactuals to the high expressivity of its hierarchical latent variables. This hierarchical structure enables HVAE to align closely with the underlying causal graph, allowing it to retain the semantic information of the original image while implementing the intended causal modifications effectively.
The rich representation of latent variables in HVAE not only preserves the original content but also allows for more accurate and nuanced manipulation of causal factors. This results in counterfactuals that are both effective and plausible, making HVAE a powerful tool for counterfactual generation.
Figure 5: Qualitative evaluation of effectiveness for all datasets/graphs. For each dataset, the leftmost image is the factual one; each subsequent column shows the counterfactual image after intervening on a single attribute.
v: volume; vent: ventricle; eg: eyeglasses.
Realism & Minimality
HVAE generates the most realistic counterfactuals, i.e., those closest to the original data distribution, across most metrics and datasets. Conditioning HVAE on more attributes (the complex graph on CelebA) reduced image realism, whereas GAN maintained good performance. Although GAN can generate realistic images, it failed to achieve optimal minimality because it preserves factual details less well than HVAE, as the composition metric also shows.