Human-Computer Interaction

Interaction Design and Processing for Data Collection in a Multimodal Environment


This project is already assigned.

Motivation

In the field of Multimodal Interfaces (MMI), the combination of speech and gesture input offers a rich space of research questions. Current research predominantly focuses on fusing the modalities with descriptive models: valid input combinations spanning multiple modalities are defined programmatically and mapped to the corresponding action within the application. Because of this “fixed” implementation, descriptive models impose strong limitations on the flexibility of user input (e.g. they cannot handle minor variations in user utterances).

In light of these limitations, there has been a noticeable shift towards machine learning-based approaches [1]. Approaches leveraging Large Language Models (LLMs) in particular show promise and mitigate the downsides of descriptive models, as the contextual knowledge encoded in LLMs provides a strong foundation for interpreting multimodal input in dynamic environments.

This trend amplifies the need for a benchmarking suite to evaluate LLM-based fusion methods and their efficiency. Benchmarking such applications in a standardized and automated manner can enable the deployment of robust multimodal systems and accelerate innovation in this domain.

The planned internship expands on a project developed during the MMI course. Two students will build on this prototype and the knowledge gained from it to develop a user environment suitable for recording data for a benchmark dataset. Development of the user environment focuses in particular on the design of user interactions and instruction elements. The environment will then be combined with a data recorder developed by another intern at the HCI chair. This will ultimately allow the creation of a dataset that can be used to benchmark different fusion methods.

Scope of the Internship

The internship focuses on designing the user interactions and instruction elements of a VR application such that it can be used to record and collect consistent data for a benchmark dataset. The overarching goals are:

  1. Designing the User Environment:
    • Design a user environment, based on the prototype developed during the MMI course, that encompasses the most common uni- and multimodal speech and gesture interactions (translate, rotate, scale, create, destroy, and select [2]), focusing on easy-to-understand user guidance and intuitive interaction possibilities.
  2. Designing the Study Procedure:
    • Develop a concept for the user study to guide participants through interaction tasks while allowing for consistent data collection.
  3. Implementing the Environment:
    • Implement the designed VR environment and its user interactions in Unity. The processing of input data and the environment logic should be structured as encapsulated modules, such that different fusion methods can be integrated into the application via defined interfaces (see the sketch after this list). Test iteratively to improve the user experience, especially for first-time users.
  4. Integration of the Data Recorder:
    • Integrate the data recorder designed by the fellow intern into the application. Additionally, implement playback of the recorded data using the recorder’s replay functionality, and verify that the environment reaches a consistent state after replay.
  5. Study Data Collection:
    • Through user studies, record and collect data suitable for a dataset to benchmark different fusion methods.
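
To make the modular structure described in goal 3 concrete, below is a minimal C# sketch of how a pluggable fusion interface in Unity could look. All type and member names (IFusionMethod, FusedAction, and so on) are illustrative assumptions, not part of the existing prototype or the data recorder.

// Hypothetical sketch of a pluggable fusion interface; names are illustrative
// assumptions, not part of the MMI prototype or the data recorder.
using System.Collections.Generic;
using UnityEngine;

// The six target actions covered by the user environment.
public enum ActionType { Translate, Rotate, Scale, Create, Destroy, Select }

// Time-stamped unimodal input events produced by the environment's input processing.
public struct SpeechEvent  { public float Timestamp; public string Utterance; }
public struct GestureEvent { public float Timestamp; public string Label; public Vector3 Position; public Quaternion Orientation; }

// The result of fusing speech and gesture input into a single command.
public struct FusedAction
{
    public ActionType Type;     // which of the six actions was inferred
    public int TargetObjectId;  // scene object the action applies to
    public Vector3 Parameter;   // e.g. translation offset or scale factors
}

// Any fusion method (descriptive or LLM-based) implements this interface,
// so the environment logic never depends on a concrete implementation.
public interface IFusionMethod
{
    // Returns a fused action once the accumulated events form a complete
    // command, or null if more input is needed.
    FusedAction? Fuse(IReadOnlyList<SpeechEvent> speech, IReadOnlyList<GestureEvent> gestures);
}

Behind such an interface, the descriptive prototype from the MMI course and an LLM-based method could be exchanged without touching the environment logic, which is what later allows different fusion methods to be benchmarked against the same recorded dataset.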

Tasks and Planned Progression

  1. Literature review and research familiarization
  2. Design and Conceptualization
    • Together:
      • rough environment
      • user interactions and application logic + states
      • study procedure and additional control possibilities for study conductor
    • Annika:
      • user interaction visualization and feedback
      • interactive user guidance and affordances
    • Erik:
      • generalized data fusion and semantic integration
      • exemplary LLM-based fusion method
  3. Prototype implementation and thorough testing
    • responsibilities as listed above
  4. Creation of a benchmark dataset from data collected through user studies in the developed implementation
  5. Implementation of a supplementary fusion method (optional)
    • if time allows, conduct exemplary benchmarking of the two methods using the created dataset
  6. Documentation of progress and insights

Insights into the Work at the HCI Chair

In addition to contributing to ongoing research at the Human-Computer Interaction (HCI) Chair, the internship may include involvement in organizational or technical tasks that support the chair’s broader research infrastructure.

Schedule

References

[1] Ronja Heinrich, Chris Zimmerer, Martin Fischbach, and Marc Erich Latoschik. 2025. A Systematic Review of Fusion Methods for the User-Centered Design of Multimodal Interfaces. In Proceedings of the 27th International Conference on Multimodal Interaction (ICMI ’25). Association for Computing Machinery, New York, NY, USA, 485–495. https://doi.org/10.1145/3716553.3750790

[2] Adam S. Williams and Francisco R. Ortega. 2020. Understanding Gesture and Speech Multimodal Interactions for Manipulation Tasks in Augmented Reality Using Unconstrained Elicitation. Proc. ACM Hum.-Comput. Interact. 4, ISS, Article 202 (November 2020), 21 pages. https://doi.org/10.1145/3427330


Contact Persons at the University of Würzburg

Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de

Dr. Martin Fischbach (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de
