How robust and reliable can we expect language models to be?

This paper investigates how robust our measurements of performance are for language models like ChatGPT.

If you were to ask a language model two very similar questions, e.g., ‘Do you like this movie review?’ vs. ‘Tell me if you like this movie review.’, you’d expect to get the same answer. But is this always the case?

Far from it! In this paper we show that responses from language models vary widely depending on how you phrase your instruction, making responses unreliable. Moreover, the simplest instructions don’t always work best. Sometimes an instruction that we find overly complicated, e.g., ‘Might this movie critique be found positive?’, can give you much better results than ‘Do you like this movie review?’.

Lastly, an instruction that works very well for one language model can give very poor results for another. This raises questions about how we measure progress in AI. If the performance of any one language model depends so much on the question asked (and the questions are rarely reported), it’s hard to compare two models fairly or to rely on existing results. We end the paper with actionable suggestions for more reliable, reproducible performance evaluations.
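To make the idea concrete, here is a minimal sketch of what reporting results per instruction could look like. It is an illustration only, not the code or the procedure used in the paper. The names `accuracy_per_prompt`, `summarize` and `toy_model` are hypothetical, the two reviews are made up, and `toy_model` is a hand-written stand-in that deliberately misbehaves on one phrasing so the effect is visible.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List, Tuple


def accuracy_per_prompt(
    model: Callable[[str], str],
    prompts: List[str],
    examples: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the accuracy of `model` separately for each instruction template."""
    scores = {}
    for template in prompts:
        correct = sum(
            model(template.format(text=text)) == label for text, label in examples
        )
        scores[template] = correct / len(examples)
    return scores


def summarize(scores: Dict[str, float]) -> Dict[str, float]:
    """Report the spread across instructions, not just a single number."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a language model (an assumption for this demo).

    It answers "positive" to every prompt that starts with "Tell me",
    mimicking a model that is sensitive to how the instruction is phrased.
    """
    if prompt.startswith("Tell me"):
        return "positive"
    return "positive" if "great" in prompt else "negative"


# Made-up examples and two paraphrased instructions.
examples = [("A great film.", "positive"), ("A dull film.", "negative")]
prompts = [
    "Do you like this movie review? {text}",
    "Tell me if you like this movie review. {text}",
]

scores = accuracy_per_prompt(toy_model, prompts, examples)
summary = summarize(scores)

for template, acc in scores.items():
    print(f"{acc:.2f}  {template}")
print(summary)
```

Listing every instruction next to its own score, together with the minimum, maximum and spread, lets a reader see how much of a reported result depends on the phrasing alone.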

Reference: Leidinger, Alina, Robert van Rooij, and Ekaterina Shutova. "The language of prompting: What linguistic properties make a prompt successful?" Findings of the Association for Computational Linguistics: EMNLP 2023. 2023.

Other papers

Are LLMs classical or nonmonotonic reasoners? Lessons from generics
Quantifying Context Mixing in Transformers
Reclaiming AI as a theoretical tool for cognitive science
Dealing with semantic underspecification in multimodal NLP
Which stereotypes do search engines come with?
Stepmothers are mean and academics are pretentious: What do pretrained language models learn about you?