Artificial Intelligence

“Explainable AI” is the subfield of AI concerned with providing explanations for how AI systems arrive at their predictions, decisions or other output.

Browse research content

These are samples of research projects, papers or blogs written by our researchers.

More about Explainable AI

The term “Explainable AI” has become very popular in recent years, but it is not so easy to define. We might be tempted to describe it as the subfield of Artificial Intelligence concerned with providing explanations for how AI systems arrive at their predictions, decisions or other output. That’s a good start, as long as we realize that it is an extremely diverse research area (“subfield” is a bit of an overstatement) and that people can mean very different things by “explanations”.

The term (often abbreviated to XAI) owes much of its recent popularity to the success of deep learning in AI, because deep learning models are in many ways the opposite of explainable. Deep Learning (DL) models are said to give rise to the so-called “black-box problem”: the internal workings of the models remain unknown and inaccessible. DL models, which typically involve a network of billions of ‘neurons’ and connections between them (loosely modeled after networks in the brain), suffer from this problem more than other approaches to building AI systems, because it is inherently difficult to explain how a trained network of neurons solves a given task.

This lack of explainability is often problematic, because many such deep learning systems are already deployed in society. People whose lives are affected by these systems deserve and demand explanations and justifications, and companies and institutions deploying these systems need to be able to provide them and guarantee their systems’ safe usage.

Explainability by design

One approach to Explainable AI is therefore to build AI systems that are not based on deep learning, but instead opt for models that are explainable by design. An influential article arguing for this position is Cynthia Rudin’s (2019) ‘Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead’. Such explainable-by-design approaches encompass a huge variety of different modeling approaches, including rule-based models (learned or hand-designed), Bayesian models and variants of linear or polynomial regression.
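To see why such models count as explainable by design, consider the simplest case, linear regression: the fitted coefficients are themselves the explanation, since each weight states how much the prediction changes per unit of its feature. A minimal sketch on invented data:

```python
# Explainability by design with a linear model: the learned parameters can be
# read off directly as the explanation. Fit y = w*x + b by ordinary least
# squares on a toy dataset (the numbers are invented for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary-least-squares solution for a single feature.
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - w * mean_x

# The "explanation" is the model itself: each unit of x adds w to the prediction.
print(f"prediction = {w:.2f} * x + {b:.2f}")
```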

A classic example of a formalism that is considered explainable by design is the decision tree. Such trees may be hand-designed or learned automatically from large datasets (e.g., using classical search, Bayesian or evolutionary algorithms). As long as these trees are not too large, they can be inspected by humans to provide explanations for end users or for the developers themselves (for instance, to make sure the model is not using questionable information, such as the area codes in the example decision tree below).
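Written out as code, a hand-designed decision tree is just a nest of readable rules, and it can report not only a decision but also the rule path that produced it. The loan-approval task, features and thresholds below are invented for illustration.

```python
# A hand-designed decision tree for a hypothetical loan-approval task.
# Every branch is a human-readable rule, so the model can explain each
# decision by reporting the path that produced it.

def approve_loan(income, debt_ratio, years_employed):
    """Return (decision, explanation) for a loan application."""
    if debt_ratio > 0.5:
        return False, "rejected: debt ratio above 0.5"
    if income >= 40_000:
        return True, "approved: debt ratio <= 0.5 and income >= 40000"
    if years_employed >= 5:
        return True, "approved: lower income but >= 5 years employed"
    return False, "rejected: low income and short employment history"

decision, explanation = approve_loan(income=35_000, debt_ratio=0.3, years_employed=6)
print(decision, "-", explanation)  # True - approved: lower income but >= 5 years employed
```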



Post-hoc Interpretability

Another approach to Explainable AI, in contrast, embraces deep learning models and tries to “open the black box” by developing techniques that can provide explanations of how the models arrive at their output. This approach is often called post-hoc interpretability, where “post-hoc” indicates that the interpretation is only arrived at after the fact, and “interpretability” signals that the techniques often provide only a first step towards an interpretation by the researcher (and that more work is needed to provide a real explanation for the end user!).

Post-hoc interpretability techniques can typically only provide approximate explanations, and it is often very difficult to gauge how accurate the provided explanations are. In recent years, a great variety of techniques have been proposed, differing radically in what they consider a useful explanation (e.g., referring to components inside the DL models, assigning importance scores to parts of the input, or generating natural language descriptions of the DL model’s decision process).

The most well-known examples of post-hoc interpretability are attribution methods such as Integrated Gradients (Sundararajan et al. 2017). These methods assign importance scores to parts of the input. In the example below (from our online demo), we see the output of a “sentiment classifier” that can predict whether a given sentence expresses a positive or negative opinion. The classifier predicts the label “positive” for the example sentence; the graph shows the output of two attribution methods. The Integrated Gradients method attributes this prediction mainly to the word “movie” (which is an implausible explanation), while Gradient-weighted Rollout (Abnar & Zuidema, 2020; Chefer et al., 2021) distinguishes between evidence in favor and evidence against and puts the highest scores (very plausibly) on “best” and “ridiculous” respectively.
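To make the idea of attribution scores concrete, here is a minimal sketch of Integrated Gradients on a toy two-input function. The function, baseline and numerical gradients are all illustrative assumptions; real use applies the method to a trained network via automatic differentiation.

```python
# Minimal sketch of Integrated Gradients (Sundararajan et al., 2017) in pure
# Python: attribution_i = (x_i - baseline_i) times the average gradient of the
# model along the straight path from a baseline to the input. The toy model f
# and the finite-difference gradients are stand-ins for illustration.

def grad(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        g.append((f(up) - f(down)) / (2 * eps))
    return g

def integrated_gradients(f, x, baseline, steps=100):
    """Riemann-sum approximation of the path integral of the gradient."""
    avg_grad = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(f, point)
        for i in range(len(x)):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]

# Toy "model": f(x) = 3*x0 + x1**2. The attributions sum (approximately) to
# f(x) - f(baseline), the "completeness" axiom of Integrated Gradients.
f = lambda x: 3 * x[0] + x[1] ** 2
attrs = integrated_gradients(f, x=[1.0, 2.0], baseline=[0.0, 0.0])
print(attrs)  # approximately [3.0, 4.0], summing to f(x) - f(baseline) = 7
```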



Other approaches

There are also approaches to Explainable AI that combine “by design” and “post hoc”. Examples include deep learning models that, while being trained, are constrained in ways that aid later interpretability, or machine-learned Bayesian or symbolic models that become so complex that they need post-hoc techniques to become more explainable.


References

Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 4190–4197).

Chefer, H., Gur, S., & Wolf, L. (2021). Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 782–791).

Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5), 206–215. https://doi.org/10.1038/s42256-019-0048-x

Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning (pp. 3319–3328). PMLR.

Projects within this research theme

Explainable AI for Fraud Detection

In this project, we study and develop a new generation of AI systems for fraud detection that are not only accurate, but also explainable to different stakeholders and adaptive to certain business rules.

InDeep: Interpreting Deep Learning Models for Text and Sound

The goal of this project is to find ways to make popular Artificial Intelligence models for language, speech and music more explainable.

Explainability in Collective Decision Making

In this project, we develop methods for automatically generating explanations for why a given compromise (between disagreeing people) is the best available option.

Can NLP bias measures be trusted?

van der Wal, O., Bachmann, D., Leidinger, A., van Maanen, L., Zuidema, W., & Schulz, K.

Automating the Analysis of Matching Algorithms

Endriss, U.

Participatory budgeting

Improving Language Model bias measures

Explainability in Collective Decision Making

Papers within this research theme

Quantifying Context Mixing in Transformers