Project details


Inclusive Image and Video Captioning

Current AI systems that describe the content of images and videos in natural language often make unwarranted assumptions about the gender, nationality, or physical appearance of the people depicted. In this project, we aim to fix this unwanted behavior by developing more inclusive systems.

AI systems that generate descriptions or captions of visual content such as images and videos (also known as multimodal models or language-and-vision models) are typically trained on texts whose content is maximally aligned with the content of the image or video. As a consequence, they exhibit a bias toward descriptions or captions that mention more details of the image or video over ones that use less detailed (or less specified) language. While this behavior is intuitively sensible, it can also have unwanted and potentially harmful consequences.

For the image shown below, for example, a state-of-the-art model like CLIP (Radford et al. 2021) significantly prefers the description “The woman is standing above the two packed suitcases” over versions of the same sentence where we use more inclusive language, that is, “the person is” or “they are” instead of “the woman is” (example from Pezzelle, 2023; see also our post about it). We argue that this is a problem since the model arbitrarily assumes the gender of the person in the image, who—in this specific case—may or may not self-identify with the female gender. As this example reveals, the model’s bias is so strong that it emerges even though the person is not entirely visible in the image.
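To make the comparison concrete, here is a minimal sketch of how such a preference can be measured, assuming the Hugging Face transformers implementation of CLIP; the image path is a placeholder for the suitcase image discussed above, and the exact scores will depend on the checkpoint used.

```python
# Minimal sketch: comparing CLIP's relative preference for a gendered caption
# versus more inclusive rewordings of the same sentence.
# Assumes the Hugging Face `transformers` CLIP implementation; "suitcases.jpg"
# is a placeholder path, not part of the original project materials.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = [
    "The woman is standing above the two packed suitcases",   # gendered
    "The person is standing above the two packed suitcases",  # inclusive
    "They are standing above the two packed suitcases",       # inclusive
]
image = Image.open("suitcases.jpg")  # placeholder for the example image

# Encode the image together with all candidate captions.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Softmax over the image-text similarity logits gives the model's relative
# preference among the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze().tolist()
for caption, prob in zip(captions, probs):
    print(f"{prob:.3f}  {caption}")
```

In our experience, the gendered caption receives a markedly higher score than the inclusive variants, which is exactly the bias this project sets out to address.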

In this project, we aim to fix this crucial problem of current multimodal models and develop new systems that can describe images and videos in full detail while using more inclusive language.

Papers related to this project

Dealing with semantic underspecification in multimodal NLP