Language and Gesture in Virtual Reality: Is a Gesture Worth 1000 Words?
Abstract
Robots increasingly incorporate multimodal information and human signals to resolve ambiguity in embodied human-robot interaction. Harnessing signals such as gesture may expedite robot exploration of large, outdoor urban environments in support of disaster recovery operations, where speech can be unclear due to noise or the challenges of a dynamic and dangerous environment. Despite this potential, capturing human gesture and properly grounding it in crowded, outdoor environments remains a challenge. In this work, we propose a method to model human gesture and ground it to the spoken language instructions given to a robot for execution in large spaces. We implement our method in virtual reality to develop a workflow for faster future data collection. We present a series of proposed experiments that compare a language-only baseline to our proposed approach of language supplemented by gesture, and discuss how our approach has the potential to reinforce the human’s intent and to detect discrepancies between gesture and spoken instructions in these large and crowded environments.