New Research Helps Robots Grasp Situational Context

CS Ph.D. student Vishnu Dorbala and UMD researchers advance embodied AI with situational queries for smarter household robots.

A team of researchers from the Institute for Systems Research (ISR) has introduced a pioneering approach that helps robots understand the world more like humans do, by taking context into account. Their paper, “Is the House Ready For Sleeptime? Generating and Evaluating Situational Queries for Embodied Question Answering,” has been accepted to the prestigious 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), which will be held in Hangzhou, China.

The work began during an internship in Summer 2023, when UMD computer science Ph.D. student Vishnu Dorbala worked with Amazon’s Artificial General Intelligence (AGI) group. Over the past two years at the University of Maryland, Dorbala has significantly extended and refined the project in collaboration with ISR Professor and Executive Director of Innovations in AI Reza Ghanadan and Distinguished University Professor Dinesh Manocha. This sustained effort resulted in major advances in algorithm design and dataset development for Embodied AI.

The research addresses a cutting-edge challenge in Embodied AI: how to design intelligent agents that not only respond to simple commands but can also reason about complex, real-world situations in human environments.

"Mobile robots are expected to make life at home easier. This includes answering questions about everyday situations like 'Is the house ready for sleeptime?',” said Dorbala. “Doing so requires understanding the states of many things at once, like the doors being closed, the fireplace being off, and so on. Our work provides a novel solution for this problem using Large Language Models (LLMs), paving the way towards making household robots smarter and more useful!"

Using an LLM, the team generated S-EQA (Situational Embodied Question Answering) data in the VirtualHome simulator, with verified object states and relationships for each query. A large-scale Amazon Mechanical Turk study confirmed the authenticity of the data. Evaluating LLMs on S-EQA, the team found strong performance in generating situational queries but weaker alignment when answering them, indicating limits in the commonsense reasoning this task requires.
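As a rough illustration of what that evaluation step might look like, here is a minimal sketch, not the team’s actual pipeline: the `query_llm` stub, prompt format, and datapoint fields are all assumed for illustration.

```python
# Minimal sketch of evaluating an LLM on a situational datapoint.
# All names here (query_llm, the datapoint fields) are illustrative
# placeholders, not the authors' actual S-EQA pipeline.

def query_llm(prompt: str) -> str:
    """Stub for a call to any LLM API; returns the model's raw text."""
    raise NotImplementedError("plug in an LLM client here")

def evaluate_datapoint(datapoint: dict) -> bool:
    """Pose the situational query alongside the verified object states,
    then compare the model's yes/no answer to the consensus label."""
    states = "\n".join(
        f"- {obj}: {state}" for obj, state in datapoint["object_states"].items()
    )
    prompt = (
        f"Household object states:\n{states}\n\n"
        f"Question: {datapoint['query']}\n"
        "Answer strictly 'yes' or 'no'."
    )
    answer = query_llm(prompt).strip().lower()
    return answer.startswith("yes") == datapoint["consensus_answer"]

# A datapoint in the spirit of the paper's examples:
example = {
    "query": "Is the house ready for sleeptime?",
    "object_states": {"front door": "closed", "fireplace": "off", "lights": "off"},
    "consensus_answer": True,
}
```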

Teaching Robots to “Get It”

In the field of Embodied Question Answering (EQA), robots are trained to navigate their surroundings and answer questions based on their visual observations. However, until now, most EQA systems have been limited to responding to simple, object-specific prompts, such as identifying the color of a couch or locating a knife on the counter.

The ISR-led team is shifting that paradigm.

Instead of asking, “Where is the light switch?” their system tackles far more complex queries, like “I'm traveling out of town for a few days, is the house all set?” This type of question requires understanding multiple elements within a space and how they relate to one another. For example, are the lights off? Are the doors closed? Are the windows locked? Is the thermostat setting appropriate for travel if no one is home? Is there enough food for my cat in the food dispenser? Are the laundry machines empty and off? These are what the team calls situational queries and answering them requires a level of consensus-based reasoning that today’s AI systems typically lack.
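To make the idea concrete, here is a hypothetical sketch of what answering such a query involves: decomposing it into per-object state checks and requiring all of them to pass. The checklist and state names are invented for illustration and are not drawn from the paper.

```python
# Hypothetical illustration: a situational query ("is the house all set
# for travel?") decomposed into per-object state checks. The checklist
# and observed states below are made up for this example.

TRAVEL_CHECKLIST = {
    "lights": "off",
    "doors": "closed",
    "windows": "locked",
    "thermostat": "away mode",
    "cat food dispenser": "full",
    "laundry machines": "off",
}

def house_all_set(observed_states: dict) -> bool:
    """The situational query holds only when every relevant object
    is observed in its expected state."""
    return all(
        observed_states.get(obj) == expected
        for obj, expected in TRAVEL_CHECKLIST.items()
    )

observed = {
    "lights": "off", "doors": "closed", "windows": "locked",
    "thermostat": "away mode", "cat food dispenser": "full",
    "laundry machines": "off",
}
print(house_all_set(observed))  # True: every check passes
```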

The team used GPT-4 to generate situational datapoints, each consisting of a query, consensus object states, and object relationships. Data generation proceeded over multiple iterations, with prompts refined after evaluating each batch. BERT embeddings were used to compare new queries against those already in the Situational Query Database, ensuring novelty, and clustering methods identified representative queries to guide prompt feedback. This iterative process yielded high-quality, diverse situational questions.
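As a loose sketch of that novelty check and clustering step, the snippet below mean-pools `bert-base-uncased` embeddings, filters candidate queries by cosine similarity against the existing database, and picks cluster representatives with scikit-learn. The model variant, pooling scheme, and 0.9 similarity threshold are assumptions, not the paper’s reported settings.

```python
# Sketch of a BERT-embedding novelty filter for generated queries.
# Model choice, mean pooling, and the 0.9 similarity threshold are
# assumptions; the paper's exact setup may differ.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def embed(queries):
    """Mean-pooled BERT embeddings, one vector per query."""
    batch = tokenizer(queries, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (n, seq_len, 768)
    mask = batch["attention_mask"].unsqueeze(-1)       # (n, seq_len, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

database = ["Is the house ready for sleeptime?"]
candidates = ["Is the home set for bedtime?", "Is the kitchen ready for cooking?"]

# Reject candidates too similar to anything already in the database.
sims = cosine_similarity(embed(candidates), embed(database)).max(axis=1)
novel = [q for q, s in zip(candidates, sims) if s < 0.9]

# Cluster the accepted queries and keep the one nearest each centroid
# as a representative for prompt-refinement feedback.
if len(novel) > 1:
    vecs = embed(novel)
    km = KMeans(n_clusters=min(2, len(novel)), n_init=10).fit(vecs)
    reps = [novel[int(np.argmin(np.linalg.norm(vecs - c, axis=1)))]
            for c in km.cluster_centers_]
```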

Click HERE to read the full article
