Human-Computer Interaction

Integration of an Extended Augmented Transition Network into a Multimodal Processing Framework - Towards a Universal Solution

This project is already completed.

## Towards a Universal Solution

Human-computer interaction is highly context-sensitive. For example, traditional input devices, such as mouse and keyboard, are known to be well suited for two-dimensional WIMP-style graphical user interfaces. However, the usage of these input devices in immersive and highly interactive three-dimensional environments is often restricted or not possible at all.

In order to provide more natural means of interaction, the field of Multimodal Interaction (MMI) tries to place the user's natural behavior at the center of the human-computer interface [Oviatt & Cohen, 2000].

Especially Real-time Interactive Systems (RIS) in Augmented, Mixed, and Virtual Reality benefit from these alternative interaction techniques [M. E. Latoschik, 2005]. In RIS the user's interaction strongly relates to the dynamically changing content of the virtual environment, which makes a common ground between the application's state and the MMI module a necessity for resolving input from different modalities. This dependency and the already complex module interplay of RIS platforms are at risk of high coupling, leading to poor maintainability. This Coupling Dilemma [M. E. Latoschik & Tramberend, 2010] is a prominent issue in the RIS area. As a consequence, sophisticated demonstrations like Latoschik's virtuelle Werkstatt [M. E. Latoschik, 2005] have no running build or successor today. Ultimately, scientific progress is delayed due to limited repeatability and a limited ability to build on previous results. The research platform Simulator X [M. Latoschik & Tramberend, 2011] and its multimodal input processing framework miPro [Fischbach, 2015] aim to counter the Coupling Dilemma.

This thesis proposes an extension to miPro. An extended Augmented Transition Network (ATN) is integrated as a fusion engine into miPro's core multimodal processing pipeline, building towards a universal multimodal interaction framework. Special focus is placed on the non-functional requirement of maintainability, especially reusability and repeatability.


In the scope of this thesis, three categories of people using the implemented system are identified: application developers (aDev) utilize its functionality to enrich applications with multimodal interaction techniques, ATN component developers (cDev) adapt and extend the predefined multimodal processing capabilities, and users (user) interact with the application multimodally.

The goal of this thesis is to extend the miPro framework so that it facilitates research in the area of multimodal interaction. This primary objective is subdivided into the following concrete milestones:

  1. Integration of the existing ATN component into the miPro framework.
  2. Design of universal interfaces for
    1. Application developers (aDev) to connect a predefined set of multimodal interaction techniques to their applications.
    2. ATN component developers (cDev) to
      1. extend and adapt the predefined set of multimodal interaction techniques.
      2. add new modalities.
      3. connect respective input devices and recognizers.
  3. Design and implementation of a preconfigured ATN layout capable of recognizing a predefined set of multimodal interaction techniques.
  4. Development of two complex proof-of-concept applications to showcase the extension. Additionally, these applications serve as examples for both application developers (aDev) and ATN component developers (cDev).
  5. Maintainability, e.g. easy reusability for various applications, should be sustained.

Semantic-based software techniques, a unified access scheme to the application state, and miPro's unified processing concept allow the developed extension to be utilized in arbitrary SimX applications. This enables researchers to conduct experiments without having to implement the complete multimodal processing module beforehand or relying on a Wizard of Oz scenario. Ultimately, it aims to facilitate research in the area of multimodal interaction, specifically for virtual, augmented, and mixed reality.

## Planned Approach

The following Gantt diagram showcases the schedule for this master thesis. It arranges the milestones defined in section Objective in chronological order and gives a brief overview of the time management.

Figure 1: Gantt Diagram of the planned approach.

## Example Interaction

This thesis focuses on instruction-based multimodal interfaces in virtual environments, though the core implementation is applicable to other scenarios. In order to illustrate the application case and to derive requirements for the interface, an exemplary scenario similar to World Builder (Branit, 2009) is introduced. Precise actions and their respective multimodal commands are specified and analysed. A context-free grammar is established, which is used to develop the multimodal command parsing.

Figure 2: Exemplary scenario of a multimodal virtual reality application.

The exemplary scenario is as follows: A user is fully immersed in a three-dimensional virtual environment. The range of actions a user can perform consists of creating, modifying, and deleting simple geometric bodies (e.g. sphere, cuboid). Furthermore, bodies can be grouped together and denoted as a new body. Bodies can be copied and pasted at arbitrary locations in the environment.

The following actions and their respective user commands are identified:

1. Create Action
	* Creation of simple geometric bodies, e.g. sphere, cuboid & pyramid.
	* "Create a blue sphere"
	* "Create a cube [deictic gesture] there"
2. Select Action
	* Referencing of geometric bodies via a description of its properties.
	* "Select the blue sphere"
	* "Select [deictic gesture] that cube"
3. Delete Action
	* Deletion of geometric bodies.
	* "Remove the blue sphere"
	* "Remove [deictic gesture] that cube"
4. Modify Action
	* Modification of a body's properties, e.g. position, scale & texture.
	* "Make it red"
	* "Move [deictic gesture] that sphere [deictic gesture] there"
	* "Make [deictic gesture] that sphere [iconic gesture] this big"
5. Group Action
	* Grouping of an arbitrary number of bodies to a new body.
	* "Group [deictic gesture] that pyramid and [deictic gesture] that cuboid"
6. Denote Action
	* Declaration of a new label for bodies.
	* "Call [deictic gesture] that body House"
7. Copy Action
	* Copy and Paste arbitrary bodies.
	* "Copy [deictic gesture] that house"
	* "Paste it [deictic gesture] there"

## Use-Case Analysis

The showcased examples cover a broad variety of actions required in instruction-based interface scenarios. In order to design a fitting multimodal recognizer for the introduced commands, they have to be properly investigated. The following section analyses the occurring command sentences by defining a Context-Free Grammar and respective grammar parse trees (formalized by [Chomsky, 1964]).

A Context-Free Grammar is a commonly used mathematical system for modelling constituent structures in natural languages [Jurafsky and Martin, 2009]. It consists of terminal and nonterminal symbols. Terminal symbols correspond to words in the language, e.g. “move”, “cube”, while nonterminal symbols express clusters or generalizations of terminal symbols. The part-of-speech tags of the Penn Treebank tag set [Marcus, Marcinkiewicz, and Santorini, 1993] are used as nonterminal symbols.

A command sentence begins with a verb in the imperative mood indicating the type of action the user wants to invoke. Thereafter, the object(s) affected by the action are specified. In some cases additional information about how the action affects the object(s) is provided. Therefore, the actions are divided into two groups:

Actions 1, 2, 3, and 5 can be represented as a simple verbal phrase (VP) and are denoted as simple commands for further reference. The VP consists of a verb (V) and a noun phrase (NP). The NP can consist of several different sequences of nonterminal symbols, e.g. a determiner (DT), adjective (JJ), and noun (NN), or just a personal pronoun (PRP).

Grammar for simple commands:

S  -> VP
VP -> V NP
NP -> DT JJ NN  ("a blue cube")
	| DT NN     ("a sphere")
	| PRP       ("it")

Lexicon for simple commands:

V   -> create select remove group
DT  -> a the that those
JJ  -> small big blue yellow ...
NN  -> sphere cube pyramid
NNS -> spheres cubes pyramids
PRP -> it them

A simple parse tree for the command “create a blue cube” looks like the following:

Figure 3: Parse tree for the command 'create a blue cube'.
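The derivation of such a parse tree from the grammar and lexicon above can be illustrated with a minimal recursive-descent parser. This is only a sketch for clarity, not part of the thesis implementation (which uses an ATN): all function names are hypothetical, and the lexicon is the small excerpt given above.

```python
# Hypothetical sketch: recursive-descent parsing of the simple-command grammar
# S -> VP, VP -> V NP, NP -> DT JJ NN | DT NN | PRP. Illustrative names only.

LEXICON = {
    "V":   {"create", "select", "remove", "group"},
    "DT":  {"a", "the", "that", "those"},
    "JJ":  {"small", "big", "blue", "yellow"},
    "NN":  {"sphere", "cube", "pyramid"},
    "PRP": {"it", "them"},
}

def tag(word):
    """Return all part-of-speech tags the lexicon assigns to a word."""
    return {pos for pos, words in LEXICON.items() if word in words}

def parse_np(tokens):
    """NP -> DT JJ NN | DT NN | PRP; returns (subtree, remaining tokens) or None."""
    if tokens and "PRP" in tag(tokens[0]):
        return ("NP", ("PRP", tokens[0])), tokens[1:]
    if (len(tokens) >= 3 and "DT" in tag(tokens[0])
            and "JJ" in tag(tokens[1]) and "NN" in tag(tokens[2])):
        return ("NP", ("DT", tokens[0]), ("JJ", tokens[1]), ("NN", tokens[2])), tokens[3:]
    if len(tokens) >= 2 and "DT" in tag(tokens[0]) and "NN" in tag(tokens[1]):
        return ("NP", ("DT", tokens[0]), ("NN", tokens[1])), tokens[2:]
    return None

def parse_command(sentence):
    """S -> VP, VP -> V NP; returns the parse tree or None if ungrammatical."""
    tokens = sentence.lower().rstrip(".").split()
    if not tokens or "V" not in tag(tokens[0]):
        return None
    result = parse_np(tokens[1:])
    if result is None or result[1]:  # the NP must consume all remaining tokens
        return None
    return ("S", ("VP", ("V", tokens[0]), result[0]))

print(parse_command("Create a blue sphere"))
# ('S', ('VP', ('V', 'create'), ('NP', ('DT', 'a'), ('JJ', 'blue'), ('NN', 'sphere'))))
```

A real fusion engine additionally has to align gesture events with the speech stream, which is exactly what the ATN's extended transitions provide.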

Actions 4, 6, and 7 are a bit more complex and may contain additional information about the action, e.g. how big the cube should be made or where the sphere should be moved. Therefore, they are denoted as complex commands.

Grammar for complex commands:

S  -> VP
VP -> V NP JJ     ("make it yellow")
    | V NP DT JJ  ("make it this big")
    | V NP EX     ("move it there")
    | V NP NNP    ("Call it House")
NP -> DT JJ NN
    | DT NN
    | PRP


### Platform

Simulator X (SimX) is a platform for software technology research in the area of intelligent Realtime Interactive Systems, targeted at Virtual Reality, Augmented Reality, Mixed Reality, and computer games [M. Latoschik & Tramberend, 2011]. Its implementation is based on the Entity-Component-System (ECS) pattern: different aspects of a simulation, e.g. graphics, physics, or AI, are decoupled into individual components. To support this decoupling, SimX's core system utilizes the actor model, which supports fine-grained concurrency and parallelism. SimX defines an entity as a collection of properties that describe an application object. The totality of all entities contained in the application represents the world state of the application.

In order to enhance non-functional requirements, i.e., to increase reusability, portability, maintainability, and adaptability, SimX follows a semantic-based approach [Wiebusch and Latoschik, 2012]. The Web Ontology Language (OWL) is used to define a set of symbols and associated data types, which are automatically transformed into programming language code and are globally accessible in the application. This forms a semantically grounded database and is used as a common Knowledge Representation Layer (KRL) at the core level. Ultimately, this allows for easy exchangeability of involved components and provides a general semantic access scheme to the application.

Figure 4: Semantic values combine an actual state value with a type description provided by an online ontology.

Simulator X adopts an object-centered world view based on an entity model. An entity is defined as a collection of properties that describe an application object. A property is represented by a state variable (sVar) and contains so-called semantic values (sVal). Semantic values are formed by combining an actual state value with a semantic type. Semantic types associate a grounded symbol with a respective data type, both defined in OWL and automatically transformed into programming language code [Wiebusch and Latoschik, 2015]. For example, the grounded symbol color in combination with the data type java.awt.Color forms the semantic type Color, which together with the actual state value java.awt.Color.BLUE constitutes a semantic value. This sVal can then be set as the value of an entity's state variable.
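The semantic-value concept described above can be sketched as follows. In SimX the grounded symbols and data types are generated from the OWL ontology as Scala code; here they are plain Python objects purely for illustration, and all class names are hypothetical.

```python
# Illustrative sketch of semantic types and semantic values, not the SimX API.
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class SemanticType:
    """Associates a grounded symbol with a concrete data type."""
    symbol: str       # grounded symbol from the ontology, e.g. "color"
    data_type: type   # associated data type, e.g. str or a Color class

@dataclass(frozen=True)
class SemanticValue:
    """Combines an actual state value with its semantic type."""
    semantic_type: SemanticType
    value: Any

    def __post_init__(self):
        # Reject values that do not match the declared data type.
        if not isinstance(self.value, self.semantic_type.data_type):
            raise TypeError(
                f"expected {self.semantic_type.data_type.__name__}, "
                f"got {type(self.value).__name__}")

# Analogous to combining the grounded symbol "color" with java.awt.Color
# and the state value java.awt.Color.BLUE in the text above.
Color = SemanticType("color", str)
blue = SemanticValue(Color, "blue")
```

The type check in `__post_init__` mirrors the semantic grounding: a value is only meaningful in combination with a matching declared type.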

Figure 5: Entities merely collect references to state variables. The value management is delegated to a respective actor. The sVar __mass__ is managed by the physics actor, while the sVar __color__ is managed by the render actor.

SimX decouples storage control and access of property values. Entities merely collect references to sVars and delegate value management to the actor that controls the respective state variable. State variables are implemented using message passing of the underlying actor system, in order to provide a globally accessible world state and to create the illusion of consistently shared mutable state for the application programmer [M. Latoschik & Tramberend, 2011]. State variables can be read via special get() / observe() methods and written via the set() method.
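The access scheme above can be sketched in a few lines. This is a deliberately simplified, single-threaded stand-in: in SimX, get() / observe() / set() are message round-trips to the owning actor, whereas here the "actor" is just a label. All names are illustrative, not the SimX API.

```python
# Hedged sketch of sVars and entities: entities only collect references,
# value management belongs to the owning actor (here reduced to a label).

class StateVariable:
    def __init__(self, owner, initial):
        self._owner = owner        # the actor managing this value
        self._value = initial
        self._observers = []

    def get(self):
        # In SimX this is a message round-trip to the owning actor.
        return self._value

    def set(self, new_value):
        self._value = new_value
        for callback in self._observers:
            callback(new_value)    # notify registered observers of the change

    def observe(self, callback):
        self._observers.append(callback)

class Entity:
    """Collects references to sVars under property keys; stores no values itself."""
    def __init__(self):
        self.svars = {}

# Mirroring Figure 5: "mass" is managed by the physics actor,
# "color" by the render actor.
ball = Entity()
ball.svars["mass"] = StateVariable(owner="physics", initial=1.0)
ball.svars["color"] = StateVariable(owner="render", initial="blue")

seen = []
ball.svars["color"].observe(seen.append)
ball.svars["color"].set("red")     # observer receives the new value
```

The observer callback is the sketch's stand-in for the actor messages that create the illusion of consistently shared mutable state.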

Because components, applications, and their respective functionality are developed independently from each other, there has to be at least one mediating component. Therefore, SimX introduces the WorldInterface, which allows handshaking processes to be initialized at run-time [M. Latoschik & Tramberend, 2011]. It serves as a lookup service for state variables, entities, and actors using grounded symbols and semantic types.
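The mediating role of such a lookup service can be reduced to a registry keyed by grounded symbols. The class and the registered items below are hypothetical placeholders, not the actual WorldInterface API.

```python
# Illustrative sketch of a run-time lookup service: components register and
# resolve collaborators by symbol only, avoiding compile-time coupling.

class LookupService:
    def __init__(self):
        self._registry = {}

    def register(self, symbol, item):
        """A component announces an entity, sVar, or actor under a grounded symbol."""
        self._registry[symbol] = item

    def lookup(self, symbol):
        """Handshaking: another component asks for a collaborator by symbol."""
        return self._registry.get(symbol)

world = LookupService()
world.register("entity:user", {"position": (0, 0, 0)})
world.register("actor:physics", "physics-actor-ref")

assert world.lookup("actor:physics") == "physics-actor-ref"
```

Because neither side names the other's concrete type, components stay exchangeable, which is precisely the decoupling the Coupling Dilemma discussion calls for.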

## miPro Extension

Figure 6: Class-Diagram of the implemented miPro extension.

The extension of the miPro framework consists of three main parts. (1) It utilizes an Augmented Transition Network (ATN) as fusion engine. The ATN is already configured with a predefined layout capable of parsing multimodal speech and gesture input and recognizing the interactions showcased in section Example Interaction. The layout can easily be adapted, extended, or rewritten by ATN component developers (cDev) via a programming-language-native Domain-Specific Language (DSL). (2) Furthermore, the extension introduces a local entity storage, which provides a convenient mechanism for accessing the world state directly from the fusion engine. The local entity storage is an extension to the already implemented functionality of SimX's WorldInterface and adopts the semantic-based access scheme. (3) Lastly, the extension comprises action handling, which invokes the functionality corresponding to the recognized user command. The action handling utilizes Simulator X's planning component, which provides an interface for application developers (aDev) to define and register actions [Wiebusch, 2015]. These actions can be executed manually or by means of a planner. Actions are defined in a generalized manner and can be reused in various scenarios.
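The fusion-engine idea behind part (1) can be sketched as a greatly simplified ATN: states connected by transitions whose conditions test incoming multimodal tokens (speech words or gesture events), plus registers that accumulate the recognized command. The thesis implementation is DSL-configured and far richer; everything below is an illustrative toy.

```python
# Toy Augmented Transition Network used as a fusion engine (illustrative only).

class ATN:
    def __init__(self, start, transitions, accepting):
        self.state = start
        self.transitions = transitions   # {state: [(condition, action, next_state)]}
        self.accepting = accepting
        self.registers = {}              # accumulated command information

    def feed(self, token):
        """Advance on the first transition whose condition accepts the token."""
        for condition, action, next_state in self.transitions.get(self.state, []):
            if condition(token):
                action(self.registers, token)
                self.state = next_state
                return True
        return False                     # token did not match; stay in state

    def recognized(self):
        return self.registers if self.state in self.accepting else None

# Fuse the speech "select ... that cube" with a deictic pointing gesture.
atn = ATN(
    start="S0",
    transitions={
        "S0": [(lambda t: t == "select",
                lambda r, t: r.update(verb=t), "S1")],
        "S1": [(lambda t: isinstance(t, dict) and t.get("gesture") == "deictic",
                lambda r, t: r.update(target=t["target"]), "S2")],
        "S2": [(lambda t: t in {"cube", "sphere", "pyramid"},
                lambda r, t: r.update(noun=t), "S3")],
    },
    accepting={"S3"},
)

for token in ["select", {"gesture": "deictic", "target": "entity-42"}, "that", "cube"]:
    atn.feed(token)

print(atn.recognized())
# {'verb': 'select', 'target': 'entity-42', 'noun': 'cube'}
```

The registers make the network "augmented": transitions do not only recognize a token sequence but also collect the verb, the referenced entity, and the noun needed for subsequent action handling.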


Primarily, the extension helps application developers (aDev) to enrich their applications with multimodal interaction techniques. To do so, the developer has to start the UniversalInteraction actor and implement the functionality of the application with SimX's actions. The actor can directly execute actions and start a planner in response to a recognized user command via the common KRL (defined by the ontology). In the process, the verb of the command is used to identify and request the respective action from the WorldInterface. References to meaningful objects (entities) in the application, e.g. via pointing gestures or speech, can directly be resolved by the fusion engine due to its local world state representation.
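The verb-to-action dispatch described above can be sketched as a small registry. The decorator, the action functions, and the command format are hypothetical; in the thesis, actions are registered via Simulator X's planning component and looked up through the WorldInterface.

```python
# Sketch of action handling: the verb of a recognized command selects a
# registered, application-defined action. All names are illustrative.

actions = {}   # registry mapping command verbs to action functions

def register_action(verb):
    """Decorator that registers an action under its invoking verb."""
    def wrap(fn):
        actions[verb] = fn
        return fn
    return wrap

@register_action("select")
def select(target, **props):
    return f"selected {target}"

@register_action("remove")
def remove(target, **props):
    return f"removed {target}"

def handle(command):
    """Dispatch a recognized command dict, e.g. {'verb': ..., 'target': ...}."""
    action = actions.get(command["verb"])
    if action is None:
        raise KeyError(f"no action registered for verb {command['verb']!r}")
    params = {k: v for k, v in command.items() if k != "verb"}
    return action(**params)

print(handle({"verb": "select", "target": "entity-42"}))  # selected entity-42
```

Because actions are registered under verbs rather than hard-wired into the recognizer, the same ATN layout can drive different applications, which is the reusability goal stated in the milestones.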

Secondarily, an internal interface and a programming-language-native Domain-Specific Language (DSL) allow ATN component developers (cDev) to adapt and/or extend the predefined ATN layout.

### Proof of Concept Application

A specific elaboration of the proof of concept application is still pending. The application should implement the actions defined in section Example Interaction and provide a multimodal speech-gesture interface using the miPro extension, including a predefined Augmented Transition Network as fusion engine. In order to highlight non-functional requirements like maintainability, especially reusability, a second proof of concept application utilizing the implemented extension will be developed.

## Literature

Oviatt, S. & Cohen, P. (2000, March). Perceptual user interfaces: multimodal interfaces that process what comes naturally. Commun. ACM, 43(3), 45–53.

Latoschik, M. E. (2005). A User Interface Framework for Multimodal VR Interactions. In Proceedings of the 7th International Conference on Multimodal Interfaces (pp. 76–83). Trento, Italy: ACM.

Fischbach, M. (2015). Software techniques for multimodal input processing in realtime interactive systems. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction (pp. 623–627). ICMI '15. New York, NY, USA: ACM.

Latoschik, M. & Tramberend, H. (2011). Simulator X: A Scalable and Concurrent Architecture for Intelligent Realtime Interactive Systems. In Virtual Reality Conference (VR), 2011 IEEE (pp. 171–174).

Jurafsky, D. & Martin, J. H. (2009). Speech and language processing. An introduction to natural language processing, computational linguistics, and speech recognition. (2.). Pearson Education.

Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993, June). Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19(2), 313–330.

Chomsky, N. (1964). Aspects of the theory of syntax. DTIC Document.

Wiebusch, D. & Latoschik, M. E. (2012). Enhanced decoupling of components in intelligent realtime interactive systems using ontologies. In Software Engineering and Architectures for Realtime Interactive Systems (SEARIS), Proceedings of the IEEE Virtual Reality 2012 Workshop.

Wiebusch, D. & Latoschik, M. E. (2015). Decoupling the entity-component-system pattern using semantic traits for reusable realtime interactive systems. In IEEE VR workshop on software engineering and architectures for realtime interactive systems. IEEE VR.

Wiebusch, D. (2015). Reusability for Intelligent Realtime Interactive Systems (Doctoral dissertation, Universität Würzburg).

Branit, B. (2009, February 25). World builder (high quality). Accessed: 2016-06-29.

Contact Persons at the University of Würzburg

Martin Fischbach (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
