Dealing with semantic underspecification in multimodal NLP

In this paper, we explore how even AI systems that can access both visual information and language (e.g., a video and a description of its context) struggle, unlike humans, to interpret language that is not sufficiently specific.

To use language the way humans do, intelligent AI systems must correctly understand and interpret any type of language, including phrases and sentences that do not provide all the information needed for clear communication. For instance, the word “they” may refer to two or thousands of people, or even to a single person whose gender we do not know or do not wish to disclose. These and similar examples are not mistakes in language. On the contrary, they are a useful feature: they make language more efficient, and people can, in fact, still understand each other.

While current Large Language Models (LLMs) may not have access to all the extra information needed to interpret such sentences (for example, the visual context in which they are used), AI systems that combine language with images or videos have a better chance. In this work, however, we show that even these models struggle, which can be a problem if we want to use them in real-life situations. We argue that this limitation must be addressed if we want to develop language technology that can successfully interact with human users. We discuss some applications where this is the case and outline a few concrete directions toward achieving this goal.

Reference: Sandro Pezzelle. Dealing with semantic underspecification in multimodal NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12098–12112, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.675.
