Recording and Benchmarking Environment for Multimodal Fusion
This project is already assigned.
Motivation
Multimodal interfaces (MMIs) enable intuitive and natural interaction by combining complementary modalities such as speech, gesture, touch, or gaze (Oviatt & Cohen, 2000). In Virtual and Augmented Reality environments, multimodality has been shown to be especially effective for instruction-based input, as users can convey semantic and spatial information simultaneously.
The combination of speech and gesture is regarded as particularly effective due to the strong complementarity of the two modalities (McNeill & Duncan, 2000). However, the integration of these modalities, referred to as multimodal fusion, remains a central challenge in Human-Computer Interaction (HCI).
Current systems still rely mainly on declarative late fusion, where modality outputs are combined at a semantic level. These methods are transparent and suitable for rapid prototyping but remain limited in flexibility and scalability.
Machine-learning-based fusion approaches show increasing potential, particularly because of their flexibility in handling input, yet their broader adoption is hindered by data scarcity, annotation effort, and limited transferability across domains.
In addition, there is still no standardized toolkit or benchmark infrastructure for the systematic evaluation and reuse of fusion methods.
The systematic review by Heinrich et al. (2025) identifies three key research gaps:
- The continued dominance of rule-based late fusion methods
- The limited accessibility of machine learning-based fusion approaches
- The absence of reusable toolkits and benchmarking frameworks
To advance toward more reproducible, user-centered, and data-driven multimodal systems, a comprehensive benchmarking environment is required.
Such an environment must allow for consistent data collection, replay and quantitative evaluation of different fusion methods under comparable conditions.
The planned internship directly contributes to this need by developing a recording and playback environment as a core component of a benchmarking suite for multimodal fusion at the HCI Chair.
Scope of the Internship
The internship focuses on developing the recording and benchmarking components for reproducible testing and evaluation of multimodal fusion methods. The following work packages outline the development of the recording component and its integration into the broader benchmarking framework:
- Recording and Playback Environment: Development of a modular Unity system for recording and replaying multimodal input streams (e.g. speech transcriptions, gesture data, timestamps, and application states). The component enables synchronized recording, structured dataset export, and reproducible replay to provide consistent conditions for later benchmarking. This constitutes the main development focus of the internship. The recorder developed in this internship will be integrated into a recording environment created by fellow interns, who focus on the design of interaction and instruction elements. A minimal sketch of such a recorder is given after this list.
- Creation of Multimodal Datasets: The recording environment will serve as the basis for creating structured and reusable datasets. These datasets form the foundation for later benchmarking and for training data-driven fusion approaches.
- Identification of Performance Metrics: Definition and, if feasible, implementation of suitable quantitative and qualitative metrics to assess multimodal fusion performance (e.g. accuracy, latency, reusability, user-perceived coherence); a sketch of two such metrics is shown after this list.
- Conceptualization (and potential implementation) of Fusion Methods: Exploration of fusion strategies with an emphasis on late descriptive methods and data-driven alternatives (e.g. LLM-assisted fusion). This includes outlining conceptual architectures, defining their input-output mappings, and preparing them for comparative evaluation within the benchmarking framework; a simple rule-based late-fusion sketch follows this list.
- Documentation and Integration: Comprehensive documentation of the implemented components, including architecture, configuration, and usage guidelines, to ensure transparency, reusability, and accessibility for future research projects.
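To make the recording component more concrete, the following C# sketch shows one possible shape of the recorder described in the first work package. It is a minimal illustration, not part of any existing codebase: the type names (MultimodalEvent, SessionRecorder) and the JSON-lines export format are assumptions chosen for readability.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using UnityEngine;

    // Hypothetical record for one timestamped event from any modality.
    [Serializable]
    public class MultimodalEvent
    {
        public float timestamp;   // seconds since the recording started
        public string modality;   // e.g. "speech", "gesture", "appState"
        public string payload;    // serialized modality data (transcription, pose, state snapshot)
    }

    // Minimal recorder sketch: buffers events and exports them as JSON lines
    // so that a replay component can stream them back under identical timing.
    public class SessionRecorder : MonoBehaviour
    {
        private readonly List<MultimodalEvent> events = new List<MultimodalEvent>();
        private float startTime;

        public void StartRecording()
        {
            events.Clear();
            startTime = Time.time;
        }

        // Called by modality-specific sources (speech recognizer, gesture detector, app-state hooks).
        public void Log(string modality, string payload)
        {
            events.Add(new MultimodalEvent
            {
                timestamp = Time.time - startTime,
                modality = modality,
                payload = payload
            });
        }

        // Writes one JSON object per line; the resulting file is the structured dataset export.
        public void Export(string path)
        {
            using (var writer = new StreamWriter(path))
            {
                foreach (var e in events)
                    writer.WriteLine(JsonUtility.ToJson(e));
            }
        }
    }

A replay component could read the same file line by line and re-emit each event at its recorded offset, which is what provides the consistent conditions required for benchmarking.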
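For the performance-metrics work package, the sketch below computes two of the quantities named there, accuracy and mean latency, over a set of benchmark trials. The FusionTrialResult type and the definition of latency (time of the fused output minus the time of the last contributing unimodal event) are assumptions rather than a fixed specification.

    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical result of running one recorded instruction through a fusion method.
    public class FusionTrialResult
    {
        public bool correct;            // fused interpretation matches the annotated ground truth
        public double latencySeconds;   // fused output time minus the last contributing input event
    }

    public static class FusionMetrics
    {
        // Fraction of trials in which the fused interpretation was correct.
        public static double Accuracy(IReadOnlyList<FusionTrialResult> trials) =>
            trials.Count == 0 ? 0.0 : trials.Count(t => t.correct) / (double)trials.Count;

        // Mean end-to-end latency across all trials.
        public static double MeanLatency(IReadOnlyList<FusionTrialResult> trials) =>
            trials.Count == 0 ? 0.0 : trials.Average(t => t.latencySeconds);
    }

Qualitative criteria such as reusability or user-perceived coherence cannot be computed this way and would be assessed separately, for example through expert review or questionnaires.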
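Finally, to make the contrast between descriptive late fusion and data-driven alternatives tangible, here is a deliberately simple rule-based late-fusion sketch: a deictic speech slot ("that", "there") is resolved against a pointing gesture that occurred within a fixed temporal window. The types, the window size, and the string-based output are illustrative assumptions only; an LLM-assisted variant would replace the hand-written rule with a model call over the same inputs.

    using System;

    // Hypothetical unimodal recognizer outputs arriving at the fusion stage.
    public class SpeechResult  { public string Intent; public string Slot; public float Time; }
    public class GestureResult { public string TargetId; public float Time; }

    public static class LateFusionRule
    {
        // Declarative rule: a deictic slot is resolved by a pointing gesture
        // that occurred within a fixed temporal window around the utterance.
        public static string Resolve(SpeechResult speech, GestureResult gesture, float windowSeconds = 1.5f)
        {
            bool deictic  = speech.Slot == "that" || speech.Slot == "there";
            bool inWindow = Math.Abs(speech.Time - gesture.Time) <= windowSeconds;

            return deictic && inWindow
                ? $"{speech.Intent}({gesture.TargetId})"   // e.g. "select(chair_03)"
                : $"{speech.Intent}({speech.Slot})";       // fall back to the speech-only reading
        }
    }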
Insights into Work at the HCI Chair
The internship will provide in-depth insights into the research activities at the Chair for Human-Computer Interaction, including participation in weekly research meetings and potential involvement in organizational or technical support tasks.
Schedule
References
Heinrich, R., Zimmerer, C., Fischbach, M., & Latoschik, M. E. (2025). A Systematic Review of Fusion Methods for the User-Centered Design of Multimodal Interfaces. Proceedings of the 27th International Conference on Multimodal Interaction, 485–495.
McNeill, D., & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In D. McNeill (Ed.), Language and Gesture (pp. 141–161). Cambridge University Press.
Oviatt, S., & Cohen, P. (2000). Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Communications of the ACM, 43(3), 45–53.
Contact Persons at the University of Würzburg
Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de
Dr. Martin Fischbach (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de
Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de