Comparison of multimodal fusion methods for real-time interactive systems - towards a general testbed

This project is already completed.

Abstract

For almost forty years, research has explored multimodal interaction with the goal of providing more natural concepts for human-computer interaction. During an exploratory phase, a variety of proof-of-concept applications and frameworks providing multimodal interaction concepts were developed. Since guidelines and standards were still missing, many of these applications lack modularity and modifiability. These conditions led to low comparability between different multimodal fusion approaches. Hence, research has focused on user-centered evaluation of frameworks as a whole, and component-based comparisons are still missing. In particular, comparative evaluations of different fusion engines, which constitute the core component of multimodal systems and are responsible for integrating multimodal commands, are still required. Therefore, this work proposes a testbed for comparative evaluations of fusion engines and compares two different fusion approaches as a proof of concept: 1. temporal augmented transition networks (tATN) and 2. unification.

Introduction

Research has explored multimodal interaction for almost forty years. Human-human and real-world interactions are mimicked with the goal of providing more natural concepts for human-computer interaction. Multimodal interaction provides a broader bandwidth of interaction with a computer system and leads to a higher efficiency of interaction [Cohen et al., 1998], faster task completion and fewer task-critical errors [Oviatt, 1997]. On the technical side, multimodal fusion leads to higher system reliability and a higher error recovery rate, since several modalities can compensate for each other’s errors [Oviatt, 1999]. Building on these advantages of multimodal interfaces, different approaches to multimodal fusion have already been developed or replicated, and several frameworks for applying multimodal fusion exist (see related work). These multimodal systems consist of a collection of cooperating components. The main components are: a context manager tracking the user profile and context of use, a fission module providing user feedback through output modalities, a fusion engine responsible for the multimodal interpretation, and a dialog manager which communicates results from the fusion engine to the application [Dumas, Ingold, & Lalanne, 2009; Dumas, Lalanne, & Oviatt, 2009]. Thus, a multitude of components need to access the application state. Since the majority of frameworks lack a unified access scheme to the application state, the result is idiosyncratic implementations with low reusability, modularity and modifiability [Fischbach, Wiebusch, & Latoschik, 2016]. To some degree, this has led to a situation where comparative evaluations of different multimodal fusion approaches are scarcely possible and still have to be carried out [Fischbach et al., 2016; Lalanne et al., 2009; Dumas, Ingold, & Lalanne, 2009].
Instead, research has so far focused on Wizard of Oz and user experience evaluations or abstract analyses of fusion frameworks as a whole [Lalanne et al., 2009; Dumas, Ingold, & Lalanne, 2009; Dumas, Lalanne, & Oviatt, 2009]. Nevertheless, speech and gesture recognition software and hardware are nowadays widely available, even in the consumer market. At the same time, the importance of multimodal interaction is rising. In particular, the emerging market of three-dimensional virtual environments with output devices like Samsung Gear VR and Oculus Rift, where classical input methods are not applicable, demands the multimodal use of the available technology [Burdea, Richard, & Coiffet, 1996]. For this purpose, the optimization of software frameworks that enable multimodal interaction comes into focus, especially of their core component, the fusion engine, which integrates multimodal input data. Component-based evaluations are therefore required more than ever to identify the strengths and limits of different multimodal fusion approaches, reveal error sources, indicate how fusion engines can be improved, and provide selection criteria for which fusion engine is best suited to a specific task [Fischbach et al., 2016; Lalanne et al., 2009; Dumas, Ingold, & Lalanne, 2009]. This work addresses this gap by providing a testbed focusing on the comparison of multimodal fusion engines. Different entry points for a comparative evaluation are proposed, and a modular testbed that allows fusion engine components to be interchanged is provided. Additionally, as a proof of concept, the following two approaches are compared: 1. tATN [Latoschik, 2002] and 2. unification [Johnston et al., 1997].
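
To give an intuition for the second approach, the following is a minimal sketch of feature-structure unification in general. It is not the implementation of Johnston et al. and not the one developed in this work. The function name `unify` and the example structures `speech` and `gesture` are assumptions made purely for illustration.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Return the unification of a and b, or None if they conflict.

    Sketch only: None doubles as the failure marker, so None cannot be
    used as a feature value here.
    """
    # Two feature structures (dicts) unify feature by feature.
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy, the inputs are not modified
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # conflicting values: unification fails
                result[key] = merged
            else:
                result[key] = b_value
        return result
    # Atomic values unify only if they are equal.
    return a if a == b else None


# Hypothetical partial interpretations of a spoken command and a pointing gesture.
speech = {"action": "move", "target": {"type": "location"}}
gesture = {"target": {"type": "location", "position": (2.0, 0.5)}}

fused = unify(speech, gesture)
conflict = unify(speech, {"target": {"type": "object"}})
```

Here `fused` combines the action from speech with the position from the gesture, while `conflict` is `None` because the two inputs disagree on the target type.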

Related Work

Bolt's Put-That-There [Bolt, 1980] was one of the first interfaces that allowed speech and pointing gestures to be combined to interact with a computer system. It laid the foundations for a phase of replication focusing on identifying issues of multimodal fusion. In the following phases, first guidelines for building systems capable of combining different modalities as well as novel or improved approaches for fusing modalities were developed, leading to a diversity of approaches towards multimodal fusion, frameworks and applications [Lalanne et al., 2009]. Most approaches to multimodal integration share the same basic architecture (see fig. 1). The fusion engine, as the core component, interprets the data of several input modalities and recognizers. Multimodal integration by the fusion engine can be classified into three levels: 1. data-level fusion, 2. feature-level fusion and 3. decision-level fusion. Decision-level fusion is used for loosely coupled modes like gesture and speech and is therefore highly relevant for HCI. Requirements for fusion engines have already been extensively investigated [Lalanne et al., 2009; Dumas, Lalanne, & Oviatt, 2009]. Fusion engines should be capable of handling probabilistic input and multiple possible modality combinations, respect temporal dependencies and the fact that synchronicity depends on context and user, and cope with errors, e.g. user and recognition errors [Lalanne et al., 2009]. They also need a common meaning representation of different modalities [Johnston et al., 1997], and a unified access scheme to the application state is beneficial [Fischbach et al., 2016]. Aiming at reusable, flexible frameworks for multimodal integration, several component-based platforms that allow for fast prototyping and partly interchangeable components have already been developed, e.g. HephaisTK [Dumas, Lalanne, & Ingold, 2009] and miPro [Fischbach et al., 2016].
Despite the existence of these frameworks, there is still a need for automated testbeds focusing on the fusion engine, in particular its performance, reliability, error handling ability and error categorization [Lalanne et al., 2009; Dumas, Ingold, & Lalanne, 2009]. Dumas et al., for example, propose a testbed based on metrics derived from the CARE properties, which provide a formal description of multimodal interaction from the user perspective [Dumas, Ingold, & Lalanne, 2009]. This work pursues the ideas of Dumas et al. [Dumas, Ingold, & Lalanne, 2009] and uses the framework miPro [Fischbach et al., 2016] as a starting point for the development of a testbed providing comparable testing conditions. Since miPro provides a modular structure, components are interchangeable and decoupled [Fischbach et al., 2016]. It is therefore possible to keep all other components of the architecture constant while changing only the fusion engine.
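
The idea of keeping everything constant except the fusion engine can be sketched as follows. This is an illustrative Python sketch, not miPro's actual API: the names `InputEvent`, `FusionEngine`, `run_comparison` and the two placeholder engines are hypothetical and only stand in for real fusion components.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass(frozen=True)
class InputEvent:
    """Hypothetical, simplified input token handed to a fusion engine."""
    modality: str      # e.g. "speech" or "gesture"
    value: str         # recognized token
    timestamp: float   # seconds since start of the utterance
    confidence: float  # recognizer confidence in [0, 1]


class FusionEngine(Protocol):
    """Common interface that every interchangeable engine has to satisfy."""
    def fuse(self, events: Sequence[InputEvent]) -> Optional[str]: ...


class FirstSpeechEngine:
    """Placeholder engine: returns the first speech token it is given."""
    def fuse(self, events: Sequence[InputEvent]) -> Optional[str]:
        for event in events:
            if event.modality == "speech":
                return event.value
        return None


class ConcatEngine:
    """Placeholder engine: joins all tokens in temporal order."""
    def fuse(self, events: Sequence[InputEvent]) -> Optional[str]:
        ordered = sorted(events, key=lambda e: e.timestamp)
        return " ".join(e.value for e in ordered) if ordered else None


def run_comparison(engines: dict[str, FusionEngine],
                   events: Sequence[InputEvent]) -> dict[str, Optional[str]]:
    """Feed the identical input to every engine; only the engine varies."""
    return {name: engine.fuse(events) for name, engine in engines.items()}


events = [
    InputEvent("gesture", "chair", 0.4, 0.8),
    InputEvent("speech", "select", 0.1, 0.9),
]
results = run_comparison(
    {"first_speech": FirstSpeechEngine(), "concat": ConcatEngine()}, events
)
```

Because both engines receive the same `events`, differences in `results` can be attributed to the engines rather than to the surrounding components.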

Figure 1: Basic multimodal system architecture [Dumas, Lalanne, & Oviatt, 2009] extended with an evaluation concept (see section elicitation for details)

Elicitation

A testbed that determines the strengths and limits of fusion engines relying on different multimodal fusion approaches has to fulfill the following conditions. In a multimodal system, a variety of error sources influence the result of a multimodal interpretation [Dumas, Ingold, & Lalanne, 2009]. A comparative evaluation of fusion engines, in contrast, has to investigate each engine independently of the conditions of the software environment. Therefore, to reduce confounding variables stemming from the environment, such a testbed has to provide highly modular and decoupled components [Fischbach et al., 2016] that allow components to be interchanged while the rest of the software environment is kept constant. Furthermore, the fusion engines should be highly modifiable so that comparable test cases and testing conditions can be set up. Besides error sources in the software environment, further main error sources are user input and recognizer errors. To trace misinterpretations of the fusion back to their respective source, a step-by-step approach is pursued following Dumas et al. [Dumas, Ingold, & Lalanne, 2009].

Analysis

The following continuum from controlled to realistic conditions is proposed. Five entry points for a comparative evaluation of the fusion engines are chosen (see fig. 1):

1. Modality Processor Output: To exclude the largest number of confounding variables, such as recognition errors of the sensors or user errors, during a performance evaluation, the input data from the sensors is simulated. The modality processors are bypassed by passing predefined output directly to the fusion engine. With simulated output, different use cases can be tested for both fusion engines on the basis of the same data sets.

2. Recognizer Output: Predefined or recorded recognizer output is passed directly to the fusion engine, still bypassing the error-prone input recognizers while including the modality processors' results in the evaluation process.

3. Recognizer: Speech and gesture input of users is recorded and played back to the recognizers. Including several recognizers in the evaluation process makes it possible to test the capability of a fusion engine to cope with recognition errors like delayed input or uncertainties.

4. User - controlled: The user is included in the evaluation process and follows a predefined protocol stating which multimodal commands should be uttered in which order and manner. As a consequence, a more realistic evaluation is possible while the procedure remains controlled, which preserves the possibility of automated evaluation of the results.

5. User - exploratory: The free interaction of a user with the system in real time provides the most realistic condition. It provides insights into the fusion engines' capability to cope with different interaction patterns or redundant input, e.g. data containing a conversation with another person while using the system, or unsupported and unexpected input.
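
Entry point one can be illustrated with a small sketch in which a predefined data set is delivered to a fusion engine in temporal order, without any sensor or recognizer involved. The names `EntryPoint`, `SimulatedOutput` and `replay` are hypothetical and not part of miPro or Simulator X; the `deliver` callback stands in for whatever input interface a concrete fusion engine offers.

```python
from enum import Enum
from typing import Callable, Iterable, NamedTuple


class EntryPoint(Enum):
    """The five proposed entry points, from controlled to realistic."""
    MODALITY_PROCESSOR_OUTPUT = 1
    RECOGNIZER_OUTPUT = 2
    RECOGNIZER = 3
    USER_CONTROLLED = 4
    USER_EXPLORATORY = 5


class SimulatedOutput(NamedTuple):
    """One predefined modality processor result (illustrative fields)."""
    timestamp_ms: int
    modality: str
    payload: str


def replay(data_set: Iterable[SimulatedOutput],
           deliver: Callable[[SimulatedOutput], None]) -> int:
    """Pass predefined output to a fusion engine in temporal order.

    Returns the number of delivered items. No real-time waiting is done;
    the timestamps are carried along for the engine to interpret.
    """
    count = 0
    for item in sorted(data_set, key=lambda o: o.timestamp_ms):
        deliver(item)
        count += 1
    return count


data_set = [
    SimulatedOutput(350, "gesture", "pointing at chair"),
    SimulatedOutput(100, "speech", "select"),
    SimulatedOutput(400, "speech", "that"),
]

# Two lists stand in for the input queues of two different fusion engines.
engine_a_input: list[SimulatedOutput] = []
engine_b_input: list[SimulatedOutput] = []
delivered = replay(data_set, engine_a_input.append)
replay(data_set, engine_b_input.append)
```

Both stand-in engines receive exactly the same sequence, which is the property entry point one relies on.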

This work currently focuses on entry point one and provides first results. In the future, the following performance metrics for fusion engines [Dumas, Ingold, & Lalanne, 2009] can be collected for each entry point. To measure the effectiveness of a fusion engine, each simulated data set has to be accompanied by a corresponding data set containing the correct multimodal interpretation, which serves as ground truth. Comparing the fusion engine's output with the predefined ground truth makes it possible to measure the number of correct interpretations. For entry point five, this ground truth has to be defined manually and retrospectively for each subject. Another performance measure is the response time of the system, i.e. the time the fusion engine needs to process the incoming data until it delivers a result. Furthermore, the confidence of the fusion engine's interpretations is an important reference for identifying performance problems of fusion engines. In addition to the above-mentioned performance metrics, CPU and memory usage are measured. This provides a higher comparability between the two fusion engines by not favoring the unification approach, which is expected to have a substantially higher memory usage.
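
How such metrics might be collected is sketched below using only the Python standard library. The function `evaluate` and the toy engine `toy_fuse` are hypothetical. Note that `tracemalloc` only tracks memory allocated by Python itself, so the figures are a rough proxy and not a full process-level measurement; the confidence metric is not covered by this sketch.

```python
import time
import tracemalloc
from typing import Callable, Optional, Sequence


def evaluate(fuse: Callable[[Sequence[str]], Optional[str]],
             data_sets: Sequence[Sequence[str]],
             ground_truth: Sequence[Optional[str]]) -> dict:
    """Run one engine over all data sets and collect basic metrics."""
    if len(data_sets) != len(ground_truth):
        raise ValueError("each data set needs exactly one ground-truth entry")

    correct = 0
    durations = []
    tracemalloc.start()
    cpu_start = time.process_time()
    for inputs, expected in zip(data_sets, ground_truth):
        start = time.perf_counter()
        result = fuse(inputs)
        durations.append(time.perf_counter() - start)  # response time
        if result == expected:
            correct += 1  # interpretation matches the ground truth
    cpu_time = time.process_time() - cpu_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    n = len(data_sets)
    return {
        "accuracy": correct / n if n else 0.0,
        "mean_response_s": sum(durations) / n if n else 0.0,
        "cpu_time_s": cpu_time,
        "peak_memory_bytes": peak_bytes,
    }


def toy_fuse(inputs: Sequence[str]) -> Optional[str]:
    """Toy engine for illustration: joins the tokens it receives."""
    return " ".join(inputs) if inputs else None


metrics = evaluate(
    toy_fuse,
    data_sets=[["put", "that"], ["move"]],
    ground_truth=["put that", "delete"],
)
```

In this toy run one of two interpretations matches the ground truth, so `metrics["accuracy"]` is 0.5; the timing and memory values depend on the machine.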

Specification

The overall objective is planned to be achieved through the following milestones. The corresponding timeframe is stated in the envisaged timetable (see fig. 2).

Provide a multimodal fusion approach based on unification that utilizes Simulator X concepts for the representation of properties
1. Concept Design
  M1 Analysis of required classes with a conceptual run-through of a simple example
2. Implementation
  M2 Provide a simple implementation of unification [not supporting conditions]
  M3 In code run through of M2 [predefined input, e.g. in one main]
  M4 Extension of the implementation to support conditions
  M5 In code run through of M4 [predefined input, e.g. in one main]
3. Integration into miPro
  M7 Run through M1, faking input as in M3
  M8 Run through with more general input
Provide a testbed for the comparison of multimodal fusion approaches
1. Requirements and workflow analysis
  M9 Analysis of required classes, creation of activity diagrams, identification of reusable features of Simulator X
2. Implementation of missing functionality
  M10 Conduct a comparison run-through using fake or unadapted fusion components [e.g. 2x ATN]
Conduct a comparative evaluation of the ATN and the unification approach
1. Definition of test cases
  M11 Creation of a document listing accepted utterances, including timings, modalities, etc.
2. Create test data set
  M12 Test data set and ground truth
3. Preparation of test cases for the ATN component
  M13 Simple multimodal interface [ATN]
4. Preparation of test cases for the Unification component
  M14 Creation of a simple multimodal interface [Unification] and adaptation of Unification definitions to replicate ATN structure / supported commands, timing conditions, probabilistic calculations
5. Pretest
  M15 Conduct software tests and run through
6. Evaluation
  M16 Run through with test data M12 defined in M11
7. Thesis
  M17 Description of the testbed and interpretation of the comparison of ATN and Unification

Figure 2: Planned schedule

References

Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphics interface (Vol. 14, No. 3). ACM.

Burdea, G., Richard, P., & Coiffet, P. (1996). Multimodal virtual reality: Input-output devices, system integration, and human factors. International Journal of Human-Computer Interaction, 8(1), 5–24.

Cohen, P. R., Johnston, M., McGee, D., Oviatt, S. L., Clow, J., & Smith, I. A. (1998). The efficiency of multimodal interaction: a case study. In ICSLP.

Dumas, B., Ingold, R., & Lalanne, D. (2009). Benchmarking fusion engines of multimodal interactive systems. In Proceedings of the 2009 international conference on multimodal interfaces (pp. 169–176).

Dumas, B., Lalanne, D., & Ingold, R. (2009). HephaisTK: a toolkit for rapid prototyping of multimodal interfaces. In Proceedings of the 2009 international conference on multimodal interfaces (pp. 231–232).

Dumas, B., Lalanne, D., & Oviatt, S. (2009). Multimodal interfaces: A survey of principles, models and frameworks. In Human machine interaction (pp. 3–26). Springer.

Fischbach, M., Wiebusch, D., & Latoschik, M. E. (2016). Semantics-based software techniques for maintainable multimodal input processing in real-time interactive systems. IEEE Computer Society.

Johnston, M., Cohen, P. R., McGee, D., Oviatt, S. L., Pittman, J. A., & Smith, I. (1997). Unification-based multimodal integration. In Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics (pp. 281–288).

Lalanne, D., Nigay, L., Robinson, P., Vanderdonckt, J., Ladry, J.-F., et al. (2009). Fusion engines for multimodal input: a survey. In Proceedings of the 2009 international conference on multimodal interfaces (pp. 153–160).

Latoschik, M. E. (2002). Designing transition networks for multimodal VR-interactions using a markup language. In Proceedings of the 4th IEEE international conference on multimodal interfaces (pp. 411–416). ACM.

Oviatt, S. (1997, March). Multimodal interactive maps: Designing for human performance. Hum.-Comput. Interact., 12(1), 93–

Oviatt, S. (1999). Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 576–583). New York, NY, USA: ACM.

Contact Persons at the University of Würzburg

Martin Fischbach, M.Sc. (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
martin.fischbach@uni-wuerzburg.de