Human-Computer Interaction

Natural Pointing


This project is already assigned.



The use of Virtual Reality (VR) applications has become increasingly popular due to the wide variety of use cases, such as medicine (Hoffman et al., 2000), teaching (Oberdörfer & Latoschik, 2016) and entertainment (Pausch et al., 1996), and because VR offers a more cost-effective alternative to conventional therapeutic approaches (Rothbaum et al., 2000). Van Dam et al. (2000) define (immersive) Virtual Reality as technology that gives the user the perception of being surrounded by a virtual, i.e. computer-generated, environment (VE). Slater and Wilbur (1997) define immersion as the "extent to which the computer displays are capable of delivering an inclusive, extensive, surrounding and vivid illusion of reality to the senses of a human participant". A concept that results from immersion is presence. Witmer and Singer (1998) describe presence as the subjective experience of being in one place or environment even though one is physically located in another. The more immersive the VE, the stronger the experience of presence becomes (Witmer & Singer, 1998). Presence is a crucial factor, since the level of presence affects the probability that the user will behave in the VE as they would in the real world (Slater & Wilbur, 1997). Achieving presence is important because the benefits of VEs, for example in psychotherapy, can only be realized if the patient behaves in the VE similarly to how they behave in reality (Strickland, 1996). Nowadays, VR applications typically generate such an experience with a Head-Mounted Display (HMD), which presents the VE on small displays in the user's field of vision and is a low-cost alternative to other techniques. One of the main interactions in these VEs is the manipulation of visually perceptible objects (Pfeiffer et al., 2008).
In order to manipulate objects in VR, the user has to be able to select a desired object, which becomes the target of subsequent actions (Steed, 2006). The most common way to indicate, i.e. select, a target object in reality is to use gestures, especially pointing gestures, which are a fundamental primitive of interaction (Steed, 2006) and a uniquely human behaviour that distinguishes humans from other primates (Kita, 2003). Not only selection, but all interactions should be as natural as possible to make the interaction intuitive and usable, since the quality of the user interface is an essential component of experiencing presence (Slater & Wilbur, 1997). Because the quality of the user interface depends on its usability, usability must be guaranteed. Argelaguet and Andujar (2013) mention four basic aspects of usability that a good selection method must provide: fast selection of objects, precise and error-proof interaction, an application that is easy to understand and control, and the prevention of exhaustion. Supporting the interaction with, for example, additional menus interrupts the natural interaction and thus disturbs the experience of immersion (Mendes et al., 2017). A natural interaction is understood to be an interaction that reflects human behaviour in reality as closely as possible (Zachmann & Rettig, 2001). The advantage of natural interactions in the virtual environment is the ability to exploit the diversity of human interactions, allowing users to simultaneously control more degrees of freedom and to use familiar actions from the real world (Argelaguet & Andujar, 2013), thus avoiding potential learning effort. To increase the intuitiveness of these natural interactions, VR applications often make use of several input channels (Zachmann & Rettig, 2001) in the form of Multimodal Interfaces (MMIs). These MMIs combine at least two modalities that can function simultaneously.
Suitable modalities can support each other during interaction. For example, interactions that are impossible using gestures alone can easily be implemented by adding speech. The potential benefits of MMIs are increased expressiveness, flexibility, accessibility, reliability and efficiency (Sharma et al., 1998; Oviatt et al., 2004; Latoschik et al., 1998). In VEs, 3D gestures such as pointing at an object with the hands, head or eyes are used to select targets (Steed, 2006). To allow selecting an object in the VE by pointing at it, the ray casting method is often used. Ray casting exploits the properties of the pointing gesture in human communication and forms a vector that starts from a body part and indicates a certain direction, place or object (Kita, 2003). It is among the most popular selection methods due to its simplicity and intuitiveness as well as its ability to select distant targets (Lee et al., 2003), and it offers promising results (Pfeiffer et al., 2008).
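As a minimal sketch of this mechanism (assuming spherical proxy geometry for the objects and NumPy; the function and variable names are illustrative, not part of the project), classic ray casting reduces to a binary intersection test between the pointing ray and each candidate object:

```python
import numpy as np

def ray_hits_sphere(origin, direction, center, radius):
    """Binary ray-casting test: does the ray intersect the sphere?"""
    d = direction / np.linalg.norm(direction)
    oc = center - origin
    # Parameter of the closest point on the ray to the sphere's center.
    t = max(np.dot(oc, d), 0.0)
    closest = origin + t * d
    return np.linalg.norm(center - closest) <= radius

# Ray from the user's hand at shoulder height, pointing along +z.
origin = np.array([0.0, 1.5, 0.0])
direction = np.array([0.0, 0.0, 1.0])
print(ray_hits_sphere(origin, direction, np.array([0.0, 1.5, 5.0]), 0.2))  # True
print(ray_hits_sphere(origin, direction, np.array([0.5, 1.5, 5.0]), 0.2))  # False
```

Note that the result is strictly true or false: a ray that misses the target by a few centimetres yields no information at all, which is exactly the limitation this project addresses.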


The problem with current selection methods is that the decision whether an object has been selected is binary: either it was selected or it was not. This means that when selecting objects, the user must point precisely at the object to be selected; no intermediate values, in the form of a probability that the object has been selected, are allowed. This can lead to increased physical effort. Typically, these selection methods are implemented with a ray that is rendered in the VE. Advanced selection methods like the Expand technique (Cashion et al., 2012) or the SQUAD technique (Kopper et al., 2011) use additional menus to facilitate the selection. However, since the ray of the ray casting method must intersect the target object, the user is forced to keep the orientation of the input device rigid until confirmation (Argelaguet & Andujar, 2013). To do so, the user usually has to hold the hands in an unnatural position and must be able to see the ray emitted by the controller. This impairs both the naturalness of the interaction and immersion.


The goal of this HCI project is to develop a technique for selecting objects that enhances the naturalness of the interaction while keeping the effectiveness as close to ray casting as possible. For this purpose, it should be determined as precisely as possible which object the user refers to at the time of the selection action, using gesture input in the most natural way possible. To achieve this, as many attributes of human behaviour as possible, such as posture, eye fixation and arm orientation, must be taken into account at the time the pointing gesture is executed in order to analyze the selection behaviour of the users. To preserve the naturalness of the interaction and the associated immersion, the use of tracking hardware such as controllers, as well as additional levels of abstraction as in the SQUAD or Expand techniques, should be kept to a minimum or avoided. The developed technique can be used on its own, but it unfolds its full potential in combination with MMIs. In this context, speech would be an additional input channel and could be used as a confirmation trigger for selected objects, since, in combination with the gesture, it resolves ambiguity and would therefore make gesture recognition more robust and efficient (Latoschik et al., 1998).


The developed technique is intended to be an improved ray casting technique that uses multiple rays emanating from the user's body to determine the target object more accurately. Since the ray casting method, and especially the cone casting method, tends to hit more than one object in dense VEs but can only select a single object due to its binary nature, it is important to design a method that makes the user's selection unambiguous and provides continuous probability values between 0 and 1. The developed technique should therefore yield continuous rather than binary probability values indicating whether an object has been selected. This helps to avoid the requirement that the ray intersect the object exactly and thus prevents unnatural interactions. Argelaguet and Andujar (2013) distinguish three types of disambiguation methods: manual (the additional decision of which of the selected objects is the target object), heuristic (the use of predefined heuristics that predict the object selected by the user) and behavioral (the evaluation of the user's behavior before selection in order to predict the target object). As an additional confirmation of the target object causes cognitive workload and thus worsens the usability of the interaction, the improved ray casting method shall use a heuristic disambiguation method. In order to establish a heuristic that yields a confidence value, insight into the behaviour users exhibit while selecting objects is needed. Since there are approximate characteristics of human pointing behavior but, to our knowledge, no exact reference values, a neural network should learn them from training data using suitable features (e.g. the distance from the center of an object to the rays emanating from the user). For this purpose, training data will be generated by users who select target objects in the VE.
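The multi-ray scoring described above can be sketched as follows. This is a simplified hand-crafted stand-in for the heuristic the neural network is meant to learn (assuming NumPy; the ray origins, object positions and the sharpness parameter are hypothetical): each candidate object is scored by its mean perpendicular distance to several body-anchored rays, and a softmax turns the scores into continuous selection probabilities between 0 and 1:

```python
import numpy as np

def point_ray_distance(point, origin, direction):
    """Perpendicular distance from a point to a ray."""
    d = direction / np.linalg.norm(direction)
    t = max(np.dot(point - origin, d), 0.0)
    return np.linalg.norm(point - (origin + t * d))

def selection_probabilities(object_centers, rays, sharpness=4.0):
    """Score each object by its mean distance to all rays (e.g. head-gaze
    and index-finger rays), then normalize with a softmax so the scores
    form a continuous probability distribution between 0 and 1."""
    scores = []
    for c in object_centers:
        mean_dist = np.mean([point_ray_distance(c, o, d) for o, d in rays])
        scores.append(-sharpness * mean_dist)  # closer => higher score
    scores = np.array(scores)
    exp = np.exp(scores - scores.max())       # stable softmax
    return exp / exp.sum()

# Hypothetical setup: two rays (head and hand) and three candidate objects.
rays = [(np.array([0.0, 1.7, 0.0]), np.array([0.05, 0.0, 1.0])),
        (np.array([0.2, 1.3, 0.0]), np.array([0.0, 0.1, 1.0]))]
objects = [np.array([0.2, 1.6, 4.0]),
           np.array([1.5, 1.6, 4.0]),
           np.array([-1.0, 1.0, 4.0])]
print(selection_probabilities(objects, rays))
```

Unlike a binary intersection test, no ray has to hit any object exactly: the object closest to the bundle of rays simply receives the highest probability, and the neural network would replace the fixed distance heuristic with weights learned from the recorded training data.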
To ensure the naturalness of the interaction, the controllers are replaced by the user's actual hands, which are tracked by the HMD. For this purpose, hardware components are selected that provide as much information as possible despite fewer aids. For example, the Vive Pro Eye or the Oculus Quest could be considered, as they allow hand tracking without a controller.


In order to find out whether the developed technique is perceived as more natural than the conventional ray casting technique, it will be evaluated in a user study. Here, users select objects presented to them in different scenarios. The following hypotheses, derived from the literature, are tested:


To verify the above hypotheses, both methods are tested in several scenarios with the following measurements:

Time schedule


Argelaguet, F., & Andujar, C. (2013). A survey of 3D object selection techniques for virtual environments. Computers & Graphics, 37(3), 121-136.

Bacim, F., Kopper, R., & Bowman, D. A. (2013). Design and evaluation of 3D selection techniques based on progressive refinement. International Journal of Human-Computer Studies, 71(7-8), 785-802.

Bowman, D., Wingrave, C., Campbell, J., & Ly, V. (2001). Using pinch gloves (tm) for both natural and abstract interaction techniques in virtual environments.

Cashion, J., Wingrave, C., & LaViola Jr, J. J. (2012). Dense and dynamic 3d selection for game-based virtual environments. IEEE transactions on visualization and computer graphics, 18(4), 634-642.

Hoffman, H. G., Doctor, J. N., Patterson, D. R., Carrougher, G. J., & Furness III, T. A. (2000). Virtual reality as an adjunctive pain control during burn wound care in adolescent patients. Pain, 85(1-2), 305-309.

Kita, S. (2003). Pointing: A foundational building block of human communication. In Pointing (pp. 9-16). Psychology Press.

Kopper, R., Bacim, F., & Bowman, D. A. (2011, March). Rapid and accurate 3D selection by progressive refinement. In 2011 IEEE Symposium on 3D User Interfaces (3DUI) (pp. 67-74). IEEE.

Latoschik, M. E., Frohlich, M., Jung, B., & Wachsmuth, I. (1998). Utilize speech and gestures to realize natural interaction in a virtual environment. In IECON’98. Proceedings of the 24th Annual Conference of the IEEE Industrial Electronics Society (Cat. No. 98CH36200) (Vol. 4, pp. 2028-2033). IEEE.

Lee, S., Seo, J., Kim, G. J., & Park, C. M. (2003, April). Evaluation of pointing techniques for ray casting selection in virtual environments. In Third international conference on virtual reality and its application in industry (Vol. 4756, pp. 38-44). International Society for Optics and Photonics.

Liang, J., & Green, M. (1994). JDCAD: A highly interactive 3D modeling system. Computers & Graphics, 18(4), 499-506.

Mendes, D., Medeiros, D., Sousa, M., Cordeiro, E., Ferreira, A., & Jorge, J. A. (2017). Design and evaluation of a novel out-of-reach selection technique for VR using iterative refinement. Computers & Graphics, 67, 95-102.

Meyer, D. E., Abrams, R. A., Kornblum, S., Wright, C. E., & Keith Smith, J. E. (1988). Optimality in human motor performance: ideal control of rapid aimed movements. Psychological review, 95(3), 340.

Oberdörfer, S., & Latoschik, M. E. (2016, November). Interactive gamified 3D-training of affine transformations. In Proceedings of the 22nd ACM Conference on Virtual Reality Software and Technology (pp. 343-344). ACM.

Pausch, R., Snoddy, J., Taylor, R., Watson, S., & Haseltine, E. (1996, August). Disney’s Aladdin: first steps toward storytelling in virtual reality. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques (pp. 193-203). ACM.

Pfeiffer, T., Latoschik, M. E., & Wachsmuth, I. (2008, March). Conversational pointing gestures for virtual reality interaction: implications from an empirical study. In 2008 IEEE Virtual Reality Conference (pp. 281-282). IEEE.

Poupyrev, I., Ichikawa, T., Weghorst, S., & Billinghurst, M. (1998, August). Egocentric object manipulation in virtual environments: empirical evaluation of interaction techniques. In Computer graphics forum (Vol. 17, No. 3, pp. 41-52). Oxford, UK and Boston, USA: Blackwell Publishers Ltd.

Oviatt, S., Coulston, R., & Lunsford, R. (2004, October). When do we interact multimodally? Cognitive load and multimodal communication patterns. In Proceedings of the 6th International Conference on Multimodal Interfaces (pp. 129-136). ACM.

Rothbaum, B. O., Hodges, L., Smith, S., Lee, J. H., & Price, L. (2000). A controlled study of virtual reality exposure therapy for the fear of flying. Journal of consulting and Clinical Psychology, 68(6), 1020.

Sharma, R., Pavlovic, V. I., & Huang, T. S. (1998). Toward multimodal human-computer interface. Proceedings of the IEEE, 86(5), 853-869.

Slater, M., & Wilbur, S. (1997). A framework for immersive virtual environments (FIVE): Speculations on the role of presence in virtual environments. Presence: Teleoperators & Virtual Environments, 6(6), 603-616.

Steed, A., & Parker, C. (2004, May). 3D selection strategies for head tracked and non-head tracked operation of spatially immersive displays. In 8th International Immersive Projection Technology Workshop (pp. 13-14).

Steed, A. (2006, March). Towards a general model for selection in virtual environments. In 3D User Interfaces (3DUI’06) (pp. 103-110). IEEE.

Strickland, D. (1996). A virtual reality application with autistic children. Presence: Teleoperators and Virtual Environments, 5(3), 319-329.

Van Dam, A., Forsberg, A. S., Laidlaw, D. H., LaViola, J. J., & Simpson, R. M. (2000). Immersive VR for scientific visualization: A progress report. IEEE Computer Graphics and Applications, 20(6), 26-52.

Witmer, B. G., & Singer, M. J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence, 7(3), 225-240.

Zachmann, G., & Rettig, A. (2001, July). Natural and robust interaction in virtual assembly simulation. In Eighth ISPE International Conference on Concurrent Engineering: Research and Applications (ISPE/CE2001) (Vol. 1, pp. 425-434).

Zimmerer, C., Fischbach, M., & Latoschik, M. (2018). Semantic Fusion for Natural Multimodal Interfaces using Concurrent Augmented Transition Networks. Multimodal Technologies and Interaction, 2(4), 81.

Contact Persons at the University Würzburg

Chris Zimmerer (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg

Dr. Martin Fischbach (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
