06/09/2026 | Press release | Distributed by Public on 06/09/2026 08:17
Cornell researchers are investigating the potential for using artificial intelligence to give robots social intelligence - the ability to read facial cues, anticipate the needs of those around them and function within society.
A new study tested the ability of vision language models (VLMs) - AI systems that can interpret and generate both visual information and language - to predict whether a tense scenario in a short video would end well or badly, such as a toddler carrying an overly full mug of coffee. Researchers also asked the VLMs to make predictions based solely on the facial expressions of people watching those same scenarios. While the best VLMs surpassed the average human's ability to predict the ending, the models failed terribly at anticipating the ending based only on facial expressions.
"We emit social cues when we're interacting with the world," said Maria Teresa Parreira, a doctoral student in the field of information science and lead researcher on the project. "For a robot that's working in a shared human space, the ability to take in this information is key for the robot to operate in a satisfactory way."
Parreira presented the study, "Bad Idea or Good Prediction? Comparing VLM and Human Anticipatory Judgment," in March at the ACM/IEEE International Conference on Human-Robot Interaction.
"Humans are so good and so sensitive to other people's reactions," said senior author Wendy Ju, associate professor of information science and design tech at Cornell Tech, the Cornell Ann S. Bowers College of Computing and Information Science and the Jacobs Technion-Cornell Institute. "It makes us able to know things from other people that we don't know ourselves, and we're just trying to give robots that intelligence too."
The new work builds on a previous study by Parreira et al., in which a team successfully trained AI models to predict whether a scenario would end well or poorly using individuals' facial expressions in response to the videos.
AI models performed poorly when asked to predict the ending of a scenario, such as this video of a person doing a handstand on a hoverboard, based solely on people's facial reactions while watching the video. (Spoiler alert: This video does not end well.)
In the new study, researchers used the same set of scenarios to see if existing, off-the-shelf VLMs could predict outcomes based on context clues and human facial expressions. They tested three state-of-the-art, closed-source models, including OpenAI's GPT-4o and Google's Gemini 2.0 Flash, and three open-source, publicly available models, including DeepSeek. While the closed-source models are bigger, more powerful and trained on more data, open-source models are more likely to be deployed in robots, because they don't require cloud access to operate and offer better privacy for users.
The videos they asked the models to evaluate were action sequences, including a man riding a lawnmower at high speed and a humanoid robot trying to jump between blocks. The best open-source model predicted the endings with 70% accuracy, while the best closed-source model was about 63% accurate, performing about as well as the average human.
But when the researchers asked the models to make predictions based on videos or images of human reactions to the scenarios, their performance plummeted. The accuracy of the predictions fell to between 44.5% and 53.8%, with some models even giving the same response for every video.
These findings suggest that current VLMs have a serious deficit in anticipatory social intelligence - they can't interpret human facial expressions and use that information to anticipate outcomes - an important skill for successful human-robot interaction.
Now, Parreira and Ju are interested in exploring why these models are failing, and whether they can be prompted to perform better.
They hope this work will inspire other researchers to investigate how AI can be applied in robots to enable social intelligence. "It's a really big space to explore," Parreira said. "There is a lot of information expressed through social signals. Harnessing it will be important to integrate robots in human environments."
The new findings also emphasize the importance of developing robots alongside humans, the researchers said. "Too many people wait until they have built a robot that they think works perfectly," Ju said. "When they try it, they're always surprised to find out what the context demands and how people react."
Ju thinks the better approach is to deploy robots before they are "perfect" so researchers can see what mistakes the robots make and how humans interact with them - then adapt accordingly.
"Robots can learn on the job," she said.
Hongjin Quan, M.S. '25, and Adolfo Ramirez-Aristizabal of Accenture also contributed to the study.
This research was supported in part by the NVIDIA Academic Grant Program.
Patricia Waldron is a writer for the Cornell Ann S. Bowers College of Computing and Information Science.