Human-AI Interaction:

Multimodal Information Processing




Bagus Tris Atmaja
Assistant Professor – Human-AI Interaction
Email: bagus.tris@naist.ac.jp

Overview

This lecture covers the computational side of multimodal AI:

  1. From human perception to machine processing
  2. Three foundational principles & six core challenges
  3. Representation, alignment, reasoning, generation, transference, quantification
  4. Designing multisensory experiences (Obrist & Velasco, 2025)
  5. Fusion methods and architectures in practice
  6. Benchmarking, evaluation, and open questions
Sources: Liang, Zadeh & Morency (2024), *Foundations & Trends in Multimodal Machine Learning*, ACM Computing Surveys; Obrist & Velasco (2025), *Multisensory Experiences*, CACM; MultiBench/MultiZoo (NeurIPS 2021, JMLR 2022).

What is a Modality?

A modality is a way in which a natural phenomenon is perceived or expressed.

| Modality | Sensor | Example |
|---|---|---|
| Speech / Audio | Microphone | Spoken words, tone of voice |
| Image / Video | Camera | Facial expressions, gestures |
| Text | Keyboard | Transcripts, captions |
| Touch / Haptics | Force sensor | Grip pressure, surface texture |
| Physiological | EEG, ECG | Heart rate, brain activity |

Modalities range from raw (sensor signal) to abstract (high-level semantic features).

Why Multimodal?

Humans naturally integrate multiple senses. Machines benefit too:

  • Affective computing: sentiment from speech + face + words
  • Healthcare: diagnosis from imaging + lab tests + clinical notes
  • Robotics: manipulation with vision + touch
  • HCI: UI understanding from screenshots + layout structure
  • Multimedia: movie genre from video + audio + subtitles

Single modalities are often noisy, ambiguous, or incomplete — combining them produces more robust and accurate systems.

From Human Perception to Machine Processing

The previous lecture (Multisensory Communicative AI) introduced the human side:

  • Sensation → Perception → Integration
  • Temporal, Spatial, and Inverse Effectiveness principles
  • Illusions as evidence of integration (McGurk, Ventriloquist, Rubber Hand)

This lecture asks: how do machines replicate — and extend — this capability?

The gap between human multisensory perception and machine multimodal learning is the research frontier.

Bridging Human Principles & Computation

| Human Principle | Machine Challenge |
|---|---|
| Temporal principle — stimuli integrated when co-occurring | Continuous alignment (dynamic time warping, segmentation) |
| Spatial principle — stimuli integrated when co-located | Discrete alignment (grounding, region-word matching) |
| Inverse effectiveness — weak unimodal → strong integration | Robustness and transference to low-resource modalities |
| McGurk effect — AV synergy creates new percept | Modality interactions (synergy, emergence) |
| Sensory dominance — one sense overrides others | Modality bias and quantification of task relevance |

Three Foundational Principles

(Liang et al., 2024)

Every multimodal problem is shaped by three properties of the data:

| Principle | Question |
|---|---|
| Heterogeneity | How different are the modalities from each other? |
| Connections | How are elements across modalities related? |
| Interactions | How do modalities jointly influence the task? |

These principles motivate all six core technical challenges.

Principle 1 — Modality Heterogeneity

Modalities differ across six dimensions:

| Dimension | Description | Example |
|---|---|---|
| Element | Basic unit of the modality | Pixel vs. word vs. audio sample |
| Distribution | Frequency of elements | Words follow Zipf's law |
| Structure | How elements compose | Spatial (image) vs. sequential (audio) |
| Information | Total content per modality | Dense video vs. sparse label |
| Noise | Type of corruption | Occlusion (vision) vs. background noise (audio) |
| Task relevance | How useful the modality is | Sarcasm: tone > literal words |

Heterogeneity makes multimodal processing fundamentally harder than unimodal processing.

Principle 2 — Modality Connections

How are elements in different modalities related?

  • Statistical association: co-occurrence (lip movement ↔ speech sound)
  • Statistical dependence: causal or confounding relationship (emotion → voice pitch AND facial expression)
  • Semantic correspondence: same concept, different surface form ("dog" ↔ image of dog)
  • Semantic relations: broader, relational meaning ("red" ↔ visual redness, danger, emotion)

Connections are the substrate on which alignment and reasoning operate.

Principle 3 — Modality Interactions

How do modalities jointly affect the prediction?

I(M_1; M_2; Y) = \underbrace{R}_{\text{redundancy}} + \underbrace{U_1 + U_2}_{\text{uniqueness}} + \underbrace{S}_{\text{synergy}}

| Type | Meaning | Example |
|---|---|---|
| Redundancy | Both modalities carry the same info | Face + voice both signal sadness |
| Uniqueness | One modality carries info the other lacks | Sarcasm: tone says "no", words say "yes" |
| Synergy | New info emerges only from the combination | McGurk effect: the AV percept differs from audio or video alone |
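A toy example makes synergy concrete. The sketch below (a minimal NumPy illustration with hypothetical binary "modalities") builds a label as the XOR of two random bits: each modality alone carries no information about the label, but the pair determines it exactly.

import numpy as np

rng = np.random.default_rng(0)
m1 = rng.integers(0, 2, 100_000)    # modality 1: random bits
m2 = rng.integers(0, 2, 100_000)    # modality 2: random bits
y = m1 ^ m2                         # label exists only in the combination

def mutual_info_bits(x, y):
    # Plug-in estimate of I(X; Y) in bits for small discrete arrays
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

print(mutual_info_bits(m1, y))           # ~0 bits: modality 1 alone is useless
print(mutual_info_bits(m2, y))           # ~0 bits: modality 2 alone is useless
print(mutual_info_bits(2 * m1 + m2, y))  # ~1 bit: together they determine y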

Interactions: Concrete Example

Task: predict sentiment from a video clip

"This is great"  (words = positive)
[flat tone]      (audio = neutral)
[frown]          (face = negative)
  • If words dominate → false positive
  • Integrating all three → recognises sarcasm

This is a case of unique information in the non-verbal channels and synergistic interaction across all three modalities.

Six Core Technical Challenges

(Liang et al., 2024 — taxonomy)

| # | Challenge | Core Question |
|---|---|---|
| 1 | Representation | How do we encode and combine multimodal data? |
| 2 | Alignment | How do we link elements across modalities? |
| 3 | Reasoning | How do we compose multimodal knowledge? |
| 4 | Generation | How do we produce coherent multimodal output? |
| 5 | Transference | How do we transfer knowledge across modalities? |
| 6 | Quantification | How do we measure heterogeneity and interactions? |

Challenge 1 — Representation

Three sub-challenges:

Fusion — integrate two or more modalities into a joint representation

z_{mm} = f(x_1, x_2, \ldots, x_M)

Coordination — keep modalities separate but align them in a shared space

\text{sim}(\phi_1(x_1), \phi_2(x_2)) \uparrow \text{ for matched pairs}

Fission — decompose into disjoint factors (modality-specific + shared)

z = [z_{\text{shared}},\ z_{\text{modal-1}},\ z_{\text{modal-2}}]

Representation Fusion: Additive & Multiplicative

Additive (late/ensemble fusion):

z_{mm} = w_0 + w_1 x_1 + w_2 x_2

Multiplicative Interactions (MI):

z_{mm} = w_0 + w_1 x_1 + w_2 x_2 + w_3 (x_1 \times x_2)

  • Additive = first-order polynomial; MI = second-order polynomial.

  • Cross-term w_3(x_1 \times x_2) captures moderation: modality 1 affects how modality 2 relates to the label (see the sketch below).
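A minimal PyTorch sketch of the two fusion orders, assuming toy feature sizes; nn.Bilinear supplies the learned cross-term.

import torch
import torch.nn as nn

d1, d2, out = 16, 16, 1
x1, x2 = torch.randn(8, d1), torch.randn(8, d2)   # batch of paired features

# Additive fusion: first-order, no cross-modal term
additive = nn.Linear(d1 + d2, out)
z_add = additive(torch.cat([x1, x2], dim=-1))     # w0 + w1 x1 + w2 x2

# Multiplicative interactions: add a learned bilinear cross-term w3 (x1 x x2)
cross = nn.Bilinear(d1, d2, out)
z_mi = z_add + cross(x1, x2)                      # second-order in (x1, x2)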

Repr. Coordination: Contrastive Learning

Goal: matched pairs → close in embedding space; unmatched pairs → far.

\mathcal{L}_{\text{contrastive}} = -\log \frac{e^{\text{sim}(z_1^+, z_2^+)/\tau}}{\sum_j e^{\text{sim}(z_1, z_{2,j})/\tau}}

Examples: CLIP (image ↔ text), CLAP (audio ↔ text).

  • Contrastive learning provably captures redundant information across views, but not unique or synergistic information — a known limitation.
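A minimal sketch of a symmetric CLIP-style InfoNCE loss, assuming a batch where row i of z1 and row i of z2 form the matched pair:

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.07):
    # Matched pairs sit on the diagonal of the similarity matrix
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                    # cosine similarities / temperature
    targets = torch.arange(z1.size(0))            # positive index for each row
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2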

Representation Fission — Disentanglement

Goal: separate modality-specific from shared information.

Input:  [audio, text, video]
Output: z_shared   ← emotion signal present in all
        z_audio    ← speaker-specific prosody
        z_text     ← linguistic content
        z_video    ← facial muscle movements

Fission enables interpretability and fine-grained control (e.g., swap only the style of a modality).

Challenge 2 — Alignment

How do we find which elements across modalities correspond?

| Type | Description | Method |
|---|---|---|
| Discrete | Given word tokens and image regions, find matches | Attention, optimal transport |
| Continuous | Align continuous signals (speech waveform ↔ motion capture) without segmentation boundaries | Dynamic time warping, clustering |
| Contextualized | Learn representations that incorporate cross-modal context | Multimodal transformers (MulT, CLIP, Flamingo) |

Alignment difficulty: long-range dependencies, ambiguous segmentation, many-to-many mappings.
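For continuous alignment, a minimal dynamic time warping sketch (plain NumPy, with Euclidean frame distance as an assumed local cost) shows the underlying dynamic program:

import numpy as np

def dtw_cost(a, b):
    # Minimal dynamic time warping between sequences a (T, d) and b (U, d)
    T, U = len(a), len(b)
    D = np.full((T + 1, U + 1), np.inf)
    D[0, 0] = 0.0
    for t in range(1, T + 1):
        for u in range(1, U + 1):
            cost = np.linalg.norm(a[t - 1] - b[u - 1])   # local frame distance
            D[t, u] = cost + min(D[t - 1, u], D[t, u - 1], D[t - 1, u - 1])
    return D[T, U]

# e.g., a 100-frame speech feature track vs. an 80-frame motion track
print(dtw_cost(np.random.randn(100, 12), np.random.randn(80, 12)))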

Challenge 3 — Reasoning

Composing multimodal knowledge through multiple inference steps.

  • Structure modeling: define the composition graph (hierarchical, temporal, interactive, discovered)
  • Intermediate concepts: attention maps, discrete symbols, natural language as reasoning medium
  • Inference paradigm: neural (soft), logical (rule-based), causal (counterfactual)
  • External knowledge: knowledge graphs, commonsense databases

Example: Visual Question Answering requires grounding the question in the image and reasoning over object relations.

Challenge 4 — Generation

Producing multimodal output that is coherent and consistent.

| Sub-task | Input → Output | Example |
|---|---|---|
| Summarization | Video → text | News video summary |
| Translation | Text → image | DALL-E, Stable Diffusion |
| Creation | Text prompt → video + audio | Synchronized multimedia generation |

Key challenge: evaluation — how do we measure cross-modal coherence? Ethical risks: deepfakes, bias amplification.

Challenge 5 — Transference

Transfer knowledge from a resource-rich modality to a resource-poor one.

  • Cross-modal transfer: pretrain on large image-text corpus → fine-tune for touch/radar
  • Multimodal co-learning: share representations; use pseudo-labels from one modality to supervise another
  • Model induction: distil behaviour from a pretrained unimodal model into a new model for a different modality

Motivation: text and images have huge pretraining corpora; audio, haptics, physiological signals do not.

Challenge 6 — Quantification

Empirically and theoretically understanding multimodal learning.

  • Heterogeneity metrics: measure how different modalities are (information content, noise topology)
  • Bias quantification: detect and mitigate modality-specific social biases
  • Interaction quantification: redundancy, uniqueness, synergy (Partial Information Decomposition)
  • Learning process: why does joint multimodal training sometimes underperform unimodal? (lazy modality, gradient conflicts)
  • Robustness metrics: relative robustness and effective robustness under noise

Designing Multisensory Experiences

(Obrist & Velasco, 2025 — Communications of the ACM)

Moving from the ML taxonomy to the design perspective:

"We call this fusion of carefully crafted sensory elements within specific events to form a desired impression a multisensory experience."

  • Bridges science (multisensory perception research) and technology (HCI, XR, AI)
  • Relevant for: affective computing, embodied AI, human-centered multimodal systems
  • Raises design and ethical questions that complement the computational challenges

What Makes a Multisensory Experience?

Definition (Obrist & Velasco, 2025):
A structured fusion of sensory elements across events, shaped for a specific receiver to produce a desired impression.

Four conceptual components:

| Component | Description |
|---|---|
| Impression | The desired overall effect or meaning to be conveyed |
| Event | A specific episode or moment within the experience journey |
| Sensory elements | The individual stimuli (visual, auditory, tactile, olfactory, gustatory) |
| Receiver | The person experiencing — shaped by context, background, and state |

The Experience Journey

A multisensory experience is not a single event — it is a journey:

Pre-encounter  →  Primary event   →  Post-encounter
(expectation)      (sensation  +       (memory +
                    perception +        reflection)
                    cognition  +
                    emotion)
  • Bottom-up processing: sensory properties attract attention
  • Top-down processing: goals and expectations shape perception
  • Individual events can be further decomposed via micro-phenomenology

A coffee brand experience starts before you open the package and continues long after the last sip.

Sensory Congruence and Key Concepts

When designing multisensory experiences, four perceptual concepts guide decisions:

| Concept | Meaning | Example |
|---|---|---|
| Sensory congruence | Stimuli feel consistent across modalities | Dubbed film feels "off" — audio ≠ lip movement |
| Crossmodal correspondences | Systematic associations between features across senses | Round shapes ↔ sweet taste; angular shapes ↔ bitter/sour |
| Sensory dominance | One sense overrides others at specific moments | Vision dominates spatial judgements (ventriloquist effect) |
| Sensory overload | Too much stimulation overwhelms processing | Loud music + strong smell + flashing lights |

These principles directly inform the alignment and interaction challenges in ML.

The Multisensory Ecosystem

Sensory elements do not simply add up — they form an interconnected network:

\text{Impression} \neq \sum_i \text{SensoryElement}_i

  • The taste of food is altered by its colour (crossmodal correspondence)
  • A sour taste synchronized with a new character on screen creates surprise (temporal + semantic congruence)
  • Touch and smell in a VR film create presence that audio-visual alone cannot

Analogous to synergy in ML interaction theory: new information only emerges from the combination.

The Receiver — Individual Differences

A multisensory experience is always relative to its receiver:

  • Sociocultural background: colour meanings differ across cultures
  • Previous experience: prior encounters shape expectations and appraisal
  • Personality and sensitivity: sensory processing sensitivity, supertasters
  • Neurodiversity and disability: sensory impairments change modality access

Machine learning systems must similarly account for user heterogeneity — personalization, fairness, and accessibility are not optional extras.

Where the Senses Meet Technology

Technology enables multisensory experiences along a reality–virtuality continuum:

| Mode | Description | Example |
|---|---|---|
| Physical | Real sensory stimuli in real environments | Live concert, food tasting |
| Augmented Reality | Digital overlays on physical world | AR flowers with information |
| Mixed Reality | Physical objects + digital environment | Eating real food in VR colour room |
| Virtual Reality | Fully digital, with haptic/olfactory devices | Dark matter dome experience |

Example 1 — Multisensory Eating in VR

(Obrist & Velasco, 2025)

Impression: alter taste perception without changing the food itself
Event: participants eat real food while wearing a VR headset
Sensory elements: real food (taste + smell) + VR ambient colour + VR food shape
Receiver: adults, no sensory impairments, UK residents

Crossmodal correspondences used:

  • Red / pink → perceived as sweeter
  • Round shapes → sweet; angular shapes → bitter/sour

Computational parallel: exploiting learned crossmodal associations (semantic correspondence) to influence perception — analogous to cross-modal transfer in ML.

Example 2 — Dark Matter Experience

(Obrist & Velasco, 2025)

Impression: make an invisible, abstract concept perceivable and emotionally engaging
Event: science museum dome installation, Great Exhibition Road Festival 2019
Sensory elements: scent + spatial audio + cosmic visuals + haptic vibrations — synchronized by a central computer
Receiver: museum visitors of varied scientific backgrounds

Key insight: sensory substitution can make the imperceptible perceivable.

Responsibilities — Three Laws

Obrist & Velasco (2025) propose three laws for multisensory experiences, inspired by Asimov's robotics laws:

  1. Non-harm: A multisensory experience must not harm the receiver, physically or psychologically.
  2. Fairness: It must treat all receivers equitably, regardless of background, ability, or identity.
  3. Transparency: It must be transparent about its creators, its sensory elements, and its intentions.

These map directly to AI ethics: safety, fairness, and explainability — now extended to the sensory domain. The AREA framework (Anticipate, Reflect, Engage, Act) is proposed for responsible innovation in multisensory design.

Inclusive Multisensory Design

Sensory-substitution devices enable people with sensory impairments to access multisensory experiences:

  • "Seeing through sound" — visual information encoded as auditory signals
  • Tactile storytelling ("Touch the Story") — narrative engagement through haptic feedback

Inclusive design considerations:

  • Design for diversity from the start, not as an afterthought
  • Richer experiences for all when designed with diverse receivers in mind
  • Addresses digital divide and equitable access to multisensory technologies

Connection to ML: fairness and robustness across user subgroups require the same inclusive mindset — consider representation in training data, not just architecture.

MultiBench: A Unified Benchmark

MultiBench (Liang et al., NeurIPS 2021) standardises multimodal research across a broad scope:

| Scope | # |
|---|---|
| Datasets | 15 |
| Modalities | 10 |
| Prediction tasks | 20 |
| Research areas | 6 |

It evaluates along three axes:

  1. Performance — accuracy, F1, AUPRC
  2. Complexity — parameters, training time, peak memory
  3. Robustness — performance under missing or noisy modalities

MultiBench Datasets

| Domain | Dataset | Modalities | Task |
|---|---|---|---|
| Affective | CMU-MOSI/MOSEI | Text + Audio + Video | Sentiment |
| Affective | MUStARD | Text + Audio + Video | Sarcasm |
| Healthcare | MIMIC | Static + Time-series | Mortality / ICD-9 |
| Robotics | Vision & Touch | Image + Force | Contact detection |
| Robotics | MuJoCo Push | Image + Proprioception | Object pose |
| Finance | Stocks | Multiple stock series | Price prediction |
| HCI | ENRICO | Screenshot + Layout | UI category |
| Multimedia | AV-MNIST | Image + Audio | Digit recognition |
| Multimedia | MM-IMDb | Image + Text | Genre classification |

MMDL: The Unified Architecture

MultiBench provides one composable architecture:

import torch.nn as nn

class MMDL(nn.Module):
    def __init__(self, encoders, fusion, head):
        super().__init__()                        # required to register submodules
        self.encoders = nn.ModuleList(encoders)  # one per modality
        self.fuse = fusion                        # fusion module
        self.head = head                          # task head

    def forward(self, inputs):
        reps = [enc(x) for enc, x in zip(self.encoders, inputs)]
        fused = self.fuse(reps)
        return self.head(fused)

Mix and match: swap any encoder, any fusion module, any task head.
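A toy instantiation under assumed feature sizes (the encoder and fusion modules here are illustrative, not MultiBench's own classes):

import torch
import torch.nn as nn

enc_audio = nn.Sequential(nn.Linear(74, 64), nn.ReLU())    # toy audio encoder
enc_text  = nn.Sequential(nn.Linear(300, 64), nn.ReLU())   # toy text encoder
fusion    = lambda reps: torch.cat(reps, dim=-1)           # simplest fusion: concat
head      = nn.Linear(128, 2)                              # binary task head

model = MMDL([enc_audio, enc_text], fusion, head)
logits = model([torch.randn(4, 74), torch.randn(4, 300)])  # batch of 4 → (4, 2)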

Fusion Methods: Early & Late

Early Fusion: concatenate raw inputs or features before learning:

z = W \cdot [x_1 \,\|\, x_2 \,\|\, \cdots \,\|\, x_M] + b

Simple and captures low-level interactions, but one modality can dominate.

Late Fusion: train unimodal predictors, aggregate predictions:

\hat{y} = \sum_m w_m f_m(x_m)

Each modality trains at its own pace, but cross-modal interactions are not modelled.
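A compact sketch of the two recipes side by side, with hypothetical dimensions and fixed late-fusion weights:

import torch
import torch.nn as nn

x1, x2 = torch.randn(8, 32), torch.randn(8, 48)   # two modality feature batches

# Early fusion: one predictor over concatenated features
early = nn.Linear(32 + 48, 2)
y_early = early(torch.cat([x1, x2], dim=-1))

# Late fusion: independent unimodal predictors, weighted vote on outputs
f1, f2 = nn.Linear(32, 2), nn.Linear(48, 2)
w1, w2 = 0.6, 0.4                                  # hypothetical modality weights
y_late = w1 * f1(x1) + w2 * f2(x2)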

Tensor Fusion

Tensor Fusion (Zadeh et al., 2017) captures all higher-order interactions via outer product:

Z_{TF} = (x_1 \oplus 1) \otimes (x_2 \oplus 1) \otimes (x_3 \oplus 1)

Appending 1 to each modality ensures lower-order (unimodal and bimodal) terms are included.

For 3 modalities of size d_1, d_2, d_3:

Z_{TF} \in \mathbb{R}^{(d_1+1)(d_2+1)(d_3+1)}

Expressive but expensive — dimensionality grows multiplicatively.
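The outer product is easy to write down explicitly; a small einsum sketch (toy dimensions) shows both the construction and the blow-up:

import torch

def tensor_fusion(x1, x2, x3):
    # Append 1 to each modality, then take the batched outer product
    pad = lambda x: torch.cat([x, torch.ones(x.size(0), 1)], dim=-1)
    a, b, c = pad(x1), pad(x2), pad(x3)
    return torch.einsum('bi,bj,bk->bijk', a, b, c)

z = tensor_fusion(torch.randn(4, 8), torch.randn(4, 16), torch.randn(4, 32))
print(z.shape)   # torch.Size([4, 9, 17, 33]): dimensionality grows multiplicatively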

Low-Rank Tensor Fusion

Low-Rank Tensor Fusion (Liu et al., 2018) factorises the fusion weight tensor:

W = \sum_{r=1}^{R} w_1^{(r)} \otimes w_2^{(r)} \otimes w_3^{(r)}

Rank R \ll d_1 d_2 d_3 → drastically fewer parameters.

class LowRankTensorFusion(nn.Module):
    # Projects each modality into rank-R factors
    # then combines via element-wise product and sum

Same expressivity target as full tensor fusion, but tractable at scale.
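One way to realise this (a sketch following Liu et al., 2018, with assumed dimensions): project each 1-appended modality through its rank-R factors, multiply elementwise across modalities, and sum over the rank, never materialising the full tensor.

import torch
import torch.nn as nn

class LMF(nn.Module):
    # Low-rank fusion sketch: the full (d1+1)(d2+1)(d3+1) tensor never exists
    def __init__(self, dims, rank, out_dim):
        super().__init__()
        self.factors = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(rank, d + 1, out_dim)) for d in dims])

    def forward(self, xs):
        fused = 1.0
        for x, f in zip(xs, self.factors):
            x1 = torch.cat([x, torch.ones(x.size(0), 1)], dim=-1)  # append 1
            fused = fused * torch.einsum('bd,rdo->bro', x1, f)     # rank-wise projection
        return fused.sum(dim=1)                                    # sum over rank R

z = LMF(dims=[8, 16, 32], rank=4, out_dim=64)(
    [torch.randn(5, 8), torch.randn(5, 16), torch.randn(5, 32)])  # → (5, 64)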

Multimodal Transformer (MulT)

MulT uses directional cross-modal attention:

For source modality β and target modality α:

\text{CM-Attn}_{\beta \to \alpha}: \quad Q = x_\alpha,\ K = V = x_\beta

z_{\alpha \leftarrow \beta} = \text{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V

Each token in α attends to all tokens in β at every time step — captures long-range cross-modal dependencies without explicit alignment.
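A minimal sketch with nn.MultiheadAttention, assuming both streams are already projected to a shared model dimension d:

import torch
import torch.nn as nn

d, heads = 64, 4
attn = nn.MultiheadAttention(d, heads, batch_first=True)

x_alpha = torch.randn(2, 50, d)    # target modality (e.g., text), 50 steps
x_beta  = torch.randn(2, 120, d)   # source modality (e.g., audio), 120 steps

# Q from alpha; K = V from beta: each alpha token attends over all beta tokens
z_alpha, _ = attn(query=x_alpha, key=x_beta, value=x_beta)
print(z_alpha.shape)               # torch.Size([2, 50, 64])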

MultiBench Evaluation

  • Complexity: Peak memory, number of parameters, training and inference time.

  • Robustness

    \text{Relative Robustness} = \frac{\text{perf under noise}}{\text{perf without noise}}

    \text{Effective Robustness} = \text{perf under noise} - \text{baseline (unimodal)}

Plotting both reveals the accuracy–robustness tradeoff: some fusion methods are accurate but brittle; others are robust but weaker.
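The two metrics in code, assuming scalar performance numbers measured on the same noisy test set:

def relative_robustness(perf_noisy, perf_clean):
    # Fraction of clean performance retained under noise
    return perf_noisy / perf_clean

def effective_robustness(perf_noisy, unimodal_baseline_noisy):
    # Gain over a unimodal baseline evaluated under the same noise
    return perf_noisy - unimodal_baseline_noisy

# e.g., a fusion model keeping 72 of 80 accuracy points vs. a 70-point baseline
print(relative_robustness(72, 80), effective_robustness(72, 70))   # 0.9 2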

Open Questions (Liang et al., 2024, §9.1)

  • Representation: How do we formally quantify heterogeneity and interactions? What theoretical guarantees can we give for fusion methods?
  • Alignment: Compositionality — can models align novel combinations of elements?
  • Generation: Synchronized audio-video-text creation remains unsolved; ethical risks (deepfakes, bias) need frameworks.
  • Transference: High-modality learning (>5 modalities) — how to handle non-parallel data?
  • Quantification: Explainability for non-visual modalities (audio, haptics, physiological) — how to interpret learned representations?

Summary

From Perception to Processing: Human principles (temporal, spatial, inverse effectiveness) ↔ alignment, robustness, transference in ML

Three Foundational Principles (Liang et al., 2024): Heterogeneity · Connections · Interactions

Six Core Challenges: Representation → Alignment → Reasoning → Generation → Transference → Quantification

Designing Multisensory Experiences (Obrist & Velasco, 2025):

  • Four components: Impression · Event · Sensory elements · Receiver
  • Senses meet technology on the reality–virtuality continuum
  • Responsibilities: non-harm, fairness, transparency (Three Laws)

Practice (MultiBench / MultiZoo):

  • MMDL: Encoders → Fusion → Head
  • Fusion spectrum:
    • Early & Late Fusion
    • Tensor Fusion & Low-Rank Tensor Fusion
    • Multiplicative Interactions (MI)
    • Cross-modal Attention
  • Evaluation: Performance, Complexity, Robustness
