Human-Computer Interaction

Natural Multimodal Speech and Gesture Interfaces for Virtual Reality - Towards Practical Design Guidelines


This project is already assigned.



Instruction-based Multimodal Interfaces (MMIs) using speech and gesture input offer a promising alternative to traditional, mostly controller-based 3D User Interfaces (3DUIs). MMIs use combined speech and gesture commands to disambiguate referents in the discourse and thus control applications. For example, in the application of Wolf et al. [38], the command “Paint that [pointing gesture] object yellow” colors the pointed-at object yellow. The use of MMIs for graphical displays dates back to Bolt’s seminal ‘Put that there’ demonstration in 1980 [4]. Since then, a variety of applications using MMIs have been developed, e.g., [14, 15, 28–30]. Furthermore, MMIs have shown great potential in several Virtual Reality (VR) applications, e.g., [19–23]. Combining speech and gesture input allows users to draw on previously learned behavior by imitating ordinary human-to-human communication. In the context of Virtual Environments (VEs), an interaction is understood as natural when it reflects the behavior of humans in reality as much as possible [39]. When an MMI meets the user’s need for a natural way of interacting, the user has to learn few interaction rules or, ideally, none at all. The interface can then be “cognitively invisible” [3], so that the user interacts intuitively [26]. MMIs show great potential in terms of a lower cognitive load and fewer mental resources spent on the interaction paradigm itself [27], which greatly benefits intuitiveness and ease of use. The more natural an interface is, the more intuitive it is to use: its usability increases while the user’s cognitive load decreases [26]. MMIs also offer interaction advantages such as increased expressiveness, effectiveness, efficiency, flexibility, and reliability [19, 26, 27]. Zimmerer et al. [43] recently deployed an MMI that is on par with 3DUIs on creativity tasks in VR.
Although their approach is rather strict, with limited naturalness and flexibility, they showed that their MMI performed as well as a 3DUI in this use case. Given these advantages, the question remains why MMIs are still used mainly within the research community and not more widely. One reason is that designing such interfaces remains a challenge, as there are only few, rather theoretical guidelines. For example, the guidelines in [23] only recommend considering MMIs, which leaves developers with abstract advice. There is a lack of practical help for the development of user-oriented MMIs. This is in contrast to traditional 3DUIs, where toolkits like VRTK [33] and the XR Interaction Toolkit [35] support developers in implementing such interfaces. These toolkits provide ready-to-use templates for, e.g., Ray-casting, menu navigation, or teleportation. There is also extensive theoretical background on 3DUIs, e.g., a comprehensive set of concrete guidelines presented in [23]. The lack of practical design guidelines for MMIs contributes to the limited number of such interfaces.
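To make the referent disambiguation in the “Paint that [pointing gesture] object yellow” example more concrete, the following minimal Python sketch resolves a deictic word to the pointing gesture closest to it in time. It is an illustration only and not the method of Wolf et al. [38] or Zimmerer et al. [43]: the time-based matching, the `Word` and `PointingEvent` types, the `resolve_referent` function, and the `max_gap` threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    text: str
    time: float  # seconds since the start of the utterance (hypothetical)


@dataclass
class PointingEvent:
    target: str  # name of the object pointed at (hypothetical)
    time: float  # seconds since the start of the utterance


# Assumption: only these two deictic words are handled in this sketch.
DEICTIC_WORDS = {"this", "that"}


def resolve_referent(words: List[Word], gestures: List[PointingEvent],
                     max_gap: float = 1.0) -> Optional[str]:
    """Return the target of the gesture closest in time to the first
    deictic word, or None if no gesture lies within max_gap seconds."""
    deictic = next((w for w in words if w.text.lower() in DEICTIC_WORDS), None)
    if deictic is None or not gestures:
        return None
    nearest = min(gestures, key=lambda g: abs(g.time - deictic.time))
    if abs(nearest.time - deictic.time) > max_gap:
        return None
    return nearest.target


utterance = [Word("paint", 0.0), Word("that", 0.4),
             Word("object", 0.8), Word("yellow", 1.2)]
gestures = [PointingEvent("cube", 0.5), PointingEvent("sphere", 3.0)]
print(resolve_referent(utterance, gestures))  # prints "cube"
```

A real MMI would likely also consider speech recognition confidence, gesture accuracy, and the semantics of the command. The sketch only shows why neither modality alone identifies the object.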


Therefore, the aim of this thesis is to contribute to the body of research and to take a step towards practical design guidelines for MMIs. This is to be achieved by a literature review and an initial study to find out what an MMI has to look like in order to be as natural as possible. Based on these findings, an MMI will be derived and compared against state-of-the-art 3DUIs to determine its potential benefits, among others in terms of cognitive load and usability. This work focuses on the selection of objects beyond arm’s reach, since this is a fundamental task in a variety of 3D applications [5]. The 3DUIs used for comparison will be Ray-casting [5], as a typical representative of selection techniques implementing pointing metaphors, and Expand VR [13], since it has already shown great potential where Ray-casting reaches its limits [7, 42].
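As a rough illustration of the Ray-casting baseline for selecting objects beyond arm’s reach, the following Python sketch casts a ray from a pointing origin and selects the nearest object it hits. This is a generic textbook formulation, not the implementation from [5] or from any toolkit. Approximating objects by bounding spheres and the names `Sphere`, `ray_sphere_distance`, and `select` are assumptions made here, and the pointing direction is assumed to be non-zero.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Sphere:
    name: str
    center: Vec3
    radius: float


def ray_sphere_distance(origin: Vec3, direction: Vec3,
                        sphere: Sphere) -> Optional[float]:
    """Distance along the ray to the first hit, or None on a miss.
    `direction` must be a unit vector."""
    oc = tuple(o - c for o, c in zip(origin, sphere.center))
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - sphere.radius ** 2
    disc = b * b - c
    if disc < 0:
        return None  # the ray's line never touches the sphere
    root = math.sqrt(disc)
    t = -b - root
    if t < 0:
        t = -b + root  # ray origin lies inside the sphere
    return t if t >= 0 else None


def select(origin: Vec3, direction: Vec3,
           objects: List[Sphere]) -> Optional[str]:
    """Return the name of the nearest object hit by the ray, if any."""
    norm = math.sqrt(sum(d * d for d in direction))
    unit = tuple(d / norm for d in direction)
    best_name, best_t = None, math.inf
    for obj in objects:
        t = ray_sphere_distance(origin, unit, obj)
        if t is not None and t < best_t:
            best_name, best_t = obj.name, t
    return best_name


scene = [Sphere("lamp", (0.0, 0.0, 5.0), 1.0),
         Sphere("chair", (0.0, 0.0, 10.0), 1.0),
         Sphere("table", (5.0, 0.0, 5.0), 1.0)]
print(select((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene))  # prints "lamp"
```

In the example, the chair lies directly behind the lamp on the same ray, so it cannot be selected from this position. Occlusion of this kind is one commonly cited limitation of plain ray-casting.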


[4] R. A. Bolt. “Put-that-there”: Voice and gesture at the graphics interface. pp. 262–270, 1980. doi: 10.1145/800250.807503

[14] L. Hoste, B. Dumas, and B. Signer. Mudra: A unified multimodal interaction framework. 2011. doi: 10.1145/2070481.2070500

[15] R. D. Jacobson. Representing spatial information through multimodal interfaces. In Proceedings Sixth International Conference on Information Visualisation, pp. 730–734. IEEE, 2002.

[19] M. Latoschik, M. Frohlich, B. Jung, and I. Wachsmuth. Utilize speech and gestures to realize natural interaction in a virtual environment. vol. 4, pp. 2028–2033, 1998. doi: 10.1109/IECON.1998.724030

[20] M. E. Latoschik. A general framework for multimodal interaction in virtual reality systems: Prosa. In The Future of VR and AR Interfaces Multimodal, Humanoid, Adaptive and Intelligent. Proceedings of the Workshop at IEEE Virtual Reality, number 138, pp. 21–25, 2001.

[21] M. E. Latoschik. A user interface framework for multimodal vr interactions. In Proceedings of the 7th International Conference on Multimodal Interfaces, ICMI ’05, pp. 76–83. Association for Computing Machinery, New York, NY, USA, 2005. doi: 10.1145/1088463.1088479

[22] M. E. Latoschik, B. Jung, and I. Wachsmuth. Multimodale interaktion mit einem system zur virtuellen konstruktion. In Informatik’99, pp. 88–97. Springer, 1999.

[23] J. J. LaViola Jr, E. Kruijff, R. P. McMahan, D. Bowman, and I. P. Poupyrev. 3D User Interfaces: Theory and Practice (2nd Edition). Addison-Wesley Professional, 2017.

[26] S. Oviatt and P. Cohen. Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Commun. ACM, 43(3):45–53, Mar. 2000. doi: 10.1145/330534.330538

[27] S. Oviatt, R. Coulston, and R. Lunsford. When do we interact multimodally? cognitive load and multimodal communication patterns. pp. 129–136, 2004. doi: 10.1145/1027933.1027957

[28] F. Paterno, C. Santoro, J. Mantyjarvi, G. Mori, and S. Sansone. Authoring pervasive multimodal user interfaces. International Journal of Web Engineering and Technology, 4(2):235–261, 2008.

[29] D. Perzanowski, A. C. Schultz, W. Adams, E. Marsh, and M. Bugajska. Building a multimodal human-robot interface. IEEE intelligent systems, 16(1):16–21, 2001.

[30] I. Rauschert, P. Agrawal, R. Sharma, S. Fuhrmann, I. Brewer, A. MacEachren, H. Wang, and G. Cai. Designing a human-centered, multimodal gis interface to support emergency management. Proceedings of the ACM Workshop on Advances in Geographic Information Systems, pp. 119–124, 2002.

[33] E. Theston. Virtual reality toolkit

[35] Unity3D. Xr interaction toolkit.

[38] E. Wolf, S. Klüber, C. Zimmerer, J.-L. Lugrin, and M. E. Latoschik. “Paint that object yellow”: Multimodal interaction to enhance creativity during design tasks in vr. In Proceedings of the 21st ACM International Conference on Multimodal Interaction (ICMI ’19), pp. 195–204. New York, NY, USA, 2019. Best Paper Runner-Up.

[39] G. Zachmann and A. Rettig. Natural and robust interaction in virtual assembly simulation. 2001.

[43] C. Zimmerer, E. Wolf, S. Wolf, M. Fischbach, J.-L. Lugrin, and M. E. Latoschik. Finally on par?! multimodal and unimodal interaction for open creative design tasks in virtual reality. 2020.

Contact Persons at the University of Würzburg

Chris Zimmerer (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg

Dr. Martin Fischbach (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
