Task: predict sentiment from a video clip
"This is great" (words = positive)
[flat tone] (audio = neutral)
[frown] (face = negative)
This is a case of unique information in the non-verbal channels and synergistic interaction across all three modalities.
(Liang et al., 2024 — taxonomy)
| # | Challenge | Core Question |
|---|---|---|
| 1 | Representation | How do we encode and combine multimodal data? |
| 2 | Alignment | How do we link elements across modalities? |
| 3 | Reasoning | How do we compose multimodal knowledge? |
| 4 | Generation | How do we produce coherent multimodal output? |
| 5 | Transference | How do we transfer knowledge across modalities? |
| 6 | Quantification | How do we measure heterogeneity and interactions? |
Three sub-challenges:
Fusion — integrate two or more modalities into a joint representation
Coordination — keep modalities separate but align them in a shared space
Fission — decompose into disjoint factors (modality-specific + shared)
Additive (late/ensemble fusion): $y = w_1 x_1 + w_2 x_2 + b$
Multiplicative Interactions (MI): $y = w_1 x_1 + w_2 x_2 + w_3 (x_1 \times x_2) + b$
Additive = first-order polynomial; MI = second-order polynomial.
Cross-term captures moderation: modality 1 affects how modality 2 relates to the label.
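A minimal sketch of the two forms (all tensor names, sizes, and weights below are illustrative assumptions, not from the source):

```python
import torch

# two illustrative modality feature batches
x1 = torch.randn(8, 16)   # e.g., text features
x2 = torch.randn(8, 16)   # e.g., audio features
W1, W2, W3 = (torch.randn(16, 1) for _ in range(3))
b = torch.zeros(1)

y_additive = x1 @ W1 + x2 @ W2 + b    # first-order terms only
y_mi = y_additive + (x1 * x2) @ W3    # second-order cross-term: x1 moderates x2
```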
Goal: matched pairs → close in embedding space; unmatched pairs → far.
Examples: CLIP (image ↔ text), wav2BerT (audio ↔ text).
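A CLIP-style contrastive objective makes this concrete; the following is a minimal sketch (function name and temperature value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched pairs (CLIP-style sketch).

    z_a, z_b: (batch, dim) embeddings of the two modalities; row i of z_a
    is matched with row i of z_b, and every other row acts as a negative.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature       # pairwise similarities
    targets = torch.arange(z_a.size(0))        # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```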
Goal: separate modality-specific from shared information.
Input: [audio, text, video]
Output: z_shared ← emotion signal present in all
z_audio ← speaker-specific prosody
z_text ← linguistic content
z_video ← facial muscle movements
Fission enables interpretability and fine-grained control (e.g., swap only the style of a modality).
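A minimal sketch of a fission-style encoder (module layout and dimensions are assumptions; real methods add auxiliary losses that push the shared and private factors apart):

```python
import torch.nn as nn

class FissionEncoder(nn.Module):
    """Split each modality into a shared factor and a modality-specific factor."""
    def __init__(self, in_dims, shared_dim, private_dim):
        super().__init__()
        self.shared_heads = nn.ModuleList(nn.Linear(d, shared_dim) for d in in_dims)
        self.private_heads = nn.ModuleList(nn.Linear(d, private_dim) for d in in_dims)

    def forward(self, feats):  # feats: list of (batch, d_m) tensors
        # average the per-modality shared projections into one z_shared
        z_shared = sum(h(x) for h, x in zip(self.shared_heads, feats)) / len(feats)
        z_private = [h(x) for h, x in zip(self.private_heads, feats)]
        return z_shared, z_private
```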
How do we find which elements across modalities correspond?
| Type | Description | Method |
|---|---|---|
| Discrete | Given word tokens and image regions, find matches | Attention, optimal transport |
| Continuous | Align continuous signals (e.g., speech waveform to transcript) | Dynamic time warping, clustering |
| Contextualized | Learn representations that incorporate cross-modal context | Multimodal transformers (MulT, CLIP, Flamingo) |
Alignment difficulty: long-range dependencies, ambiguous segmentation, many-to-many mappings.
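For the continuous case, a minimal dynamic time warping sketch (pure NumPy, quadratic time; illustrative only):

```python
import numpy as np

def dtw(seq_a, seq_b):
    """Cumulative DTW alignment cost between two feature sequences.

    seq_a: (n, d) array, seq_b: (m, d) array; Euclidean local distance.
    """
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local distance
            # extend the cheapest of the three allowed warping moves
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```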
Composing multimodal knowledge through multiple inference steps.
Example: Visual Question Answering requires grounding the question in the image and reasoning over object relations.
Producing multimodal output that is coherent and consistent.
| Sub-task | Input → Output | Example |
|---|---|---|
| Summarization | Video → text | News video summary |
| Translation | Text → image | DALL-E, Stable Diffusion |
| Creation | Text prompt → video + audio | Synchronized multimedia generation |
Key challenge: evaluation — how do we measure cross-modal coherence? Ethical risks: deepfakes, bias amplification.
Transfer knowledge from a resource-rich modality to a resource-poor one.
Motivation: text and images have huge pretraining corpora; audio, haptics, physiological signals do not.
Empirically and theoretically understanding multimodal learning.
(Obrist & Velasco, 2025 — Communications of the ACM)
Moving from the ML taxonomy to the design perspective:
"We call this fusion of carefully crafted sensory elements within specific events to form a desired impression a multisensory experience."
Definition (Obrist & Velasco, 2025):
A structured fusion of sensory elements across events, shaped for a specific receiver to produce a desired impression.
Four conceptual components:
| Component | Description |
|---|---|
| Impression | The desired overall effect or meaning to be conveyed |
| Event | A specific episode or moment within the experience journey |
| Sensory elements | The individual stimuli (visual, auditory, tactile, olfactory, gustatory) |
| Receiver | The person experiencing — shaped by context, background, and state |
A multisensory experience is not a single event — it is a journey:
Pre-encounter (expectation) → Primary event (sensation, perception, cognition, emotion) → Post-encounter (memory, reflection)
A coffee brand experience starts before you open the package and continues long after the last sip.
When designing multisensory experiences, four perceptual concepts guide decisions:
| Concept | Meaning | Example |
|---|---|---|
| Sensory congruence | Stimuli feel consistent across modalities | Dubbed film feels "off" — audio ≠ lip movement |
| Crossmodal correspondences | Systematic associations between features across senses | Round shapes pair with soft "bouba" sounds (bouba/kiki effect) |
| Sensory dominance | One sense overrides others at specific moments | Vision dominates spatial judgements (ventriloquist effect) |
| Sensory overload | Too much stimulation overwhelms processing | Loud music + strong smell + flashing lights |
These principles directly inform the alignment and interaction challenges in ML.
Sensory elements do not simply add up — they form an interconnected network.
Analogous to synergy in ML interaction theory: new information only emerges from the combination.
A multisensory experience is always relative to its receiver.
Machine learning systems must similarly account for user heterogeneity — personalization, fairness, and accessibility are not optional extras.
Technology enables multisensory experiences along a reality–virtuality continuum:
| Mode | Description | Example |
|---|---|---|
| Physical | Real sensory stimuli in real environments | Live concert, food tasting |
| Augmented Reality | Digital overlays on physical world | AR flowers with information |
| Mixed Reality | Physical objects + digital environment | Eating real food in VR colour room |
| Virtual Reality | Fully digital, with haptic/olfactory devices | Dark matter dome experience |
(Obrist & Velasco, 2025)
Impression: alter taste perception without changing the food itself
Event: participants eat real food while wearing a VR headset
Sensory elements: real food (taste + smell) + VR ambient colour + VR food shape
Receiver: adults, no sensory impairments, UK residents
Crossmodal correspondences used: ambient colour ↔ taste and food shape ↔ taste.
Computational parallel: exploiting learned crossmodal associations (semantic correspondence) to influence perception — analogous to cross-modal transfer in ML.
(Obrist & Velasco, 2025)
Impression: make an invisible, abstract concept perceivable and emotionally engaging
Event: science museum dome installation, Great Exhibition Road Festival 2019
Sensory elements: scent + spatial audio + cosmic visuals + haptic vibrations — synchronized by a central computer
Receiver: museum visitors of varied scientific backgrounds
Key insight: sensory substitution can make the imperceptible perceivable.
Obrist & Velasco (2025) propose three laws for multisensory experiences, inspired by Asimov's robotics laws.
These map directly to AI ethics: safety, fairness, and explainability — now extended to the sensory domain. The AREA framework (Anticipate, Reflect, Engage, Act) is proposed for responsible innovation in multisensory design.
Sensory-substitution devices enable people with sensory impairments to access multisensory experiences, making inclusive design a first-order consideration.
Connection to ML: fairness and robustness across user subgroups require the same inclusive mindset — consider representation in training data, not just architecture.
MultiBench (Liang et al., NeurIPS 2021) standardises multimodal research, evaluating along three axes: generalization, complexity, and robustness. Its scope:
| Scope | Count |
|---|---|
| Datasets | 15 |
| Modalities | 10 |
| Prediction tasks | 20 |
| Research areas | 6 |
| Domain | Dataset | Modalities | Task |
|---|---|---|---|
| Affective | CMU-MOSI/MOSEI | Text + Audio + Video | Sentiment |
| Affective | MUStARD | Text + Audio + Video | Sarcasm |
| Healthcare | MIMIC | Static + Time-series | Mortality / ICD9 |
| Robotics | Vision & Touch | Image + Force | Contact detection |
| Robotics | MuJoCo Push | Image + Proprioception | Object pose |
| Finance | Stocks | Multiple stock series | Price prediction |
| HCI | ENRICO | Screenshot + Layout | UI category |
| Multimedia | AV-MNIST | Image + Audio | Digit recognition |
| Multimedia | MM-IMDb | Image + Text | Genre classification |
MultiBench provides one composable architecture:
```python
import torch.nn as nn

class MMDL(nn.Module):
    def __init__(self, encoders, fusion, head):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)  # one per modality
        self.fuse = fusion                       # fusion module
        self.head = head                         # task head

    def forward(self, inputs):
        reps = [enc(x) for enc, x in zip(self.encoders, inputs)]
        fused = self.fuse(reps)
        return self.head(fused)
```
Mix and match: swap any encoder, any fusion module, any task head.
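A hypothetical instantiation (the encoder, fusion, and head choices below are illustrative, not MultiBench defaults):

```python
import torch
import torch.nn as nn

encoders = [nn.Linear(32, 128), nn.Linear(64, 128)]   # two toy modalities
fusion = lambda reps: torch.cat(reps, dim=-1)         # concatenation fusion
head = nn.Linear(256, 3)                              # e.g., 3-class sentiment

model = MMDL(encoders, fusion, head)
logits = model([torch.randn(8, 32), torch.randn(8, 64)])  # shape (8, 3)
```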
Early Fusion: concatenate raw inputs or low-level features before learning.
Simple and captures low-level interactions, but one modality can dominate.
Late Fusion: train unimodal predictors, aggregate predictions:
Each modality trains at its own pace, but no cross-modal interactions are captured; see the sketch below.
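A minimal sketch contrasting the two (feature sizes and the averaging rule are illustrative assumptions):

```python
import torch
import torch.nn as nn

x_text, x_audio = torch.randn(8, 300), torch.randn(8, 74)

# early fusion: concatenate features, learn one joint predictor
early = nn.Linear(300 + 74, 3)
y_early = early(torch.cat([x_text, x_audio], dim=-1))

# late fusion: independent unimodal predictors, aggregate outputs
text_clf, audio_clf = nn.Linear(300, 3), nn.Linear(74, 3)
y_late = (text_clf(x_text) + audio_clf(x_audio)) / 2   # simple ensemble average
```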
Tensor Fusion (Zadeh et al., 2017) captures all higher-order interactions via outer product:
Appending a constant 1 to each modality vector ensures the lower-order (unimodal and bimodal) terms are included.
For 3 modalities with representations $z_1, z_2, z_3$ of size $d$:

$$Z = \begin{bmatrix} z_1 \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_2 \\ 1 \end{bmatrix} \otimes \begin{bmatrix} z_3 \\ 1 \end{bmatrix} \in \mathbb{R}^{(d+1)^3}$$
Expressive but expensive — dimensionality grows multiplicatively.
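A batched sketch of the outer-product computation (the einsum layout is an implementation choice, not from the source):

```python
import torch

def tensor_fusion(z1, z2, z3):
    """Outer-product fusion of three modality vectors, with appended 1s."""
    ones = lambda z: torch.cat([z, torch.ones(z.size(0), 1)], dim=1)
    z1, z2, z3 = ones(z1), ones(z2), ones(z3)
    fused = torch.einsum('bi,bj,bk->bijk', z1, z2, z3)  # (B, d+1, d+1, d+1)
    return fused.flatten(start_dim=1)

# e.g., three 32-d modalities -> a 33**3 = 35,937-d fused vector per example
```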
Low-Rank Tensor Fusion (Liu et al., 2018) factorises the fusion weight tensor into modality-specific low-rank factors:
A small rank $R$ yields drastically fewer parameters (linear rather than multiplicative in the modality dimensions).
```python
class LowRankTensorFusion(nn.Module):
    # projects each modality into rank-R factors,
    # then combines via element-wise product and a sum over the rank
```
Same expressivity target as full tensor fusion, but tractable at scale.
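A minimal working sketch of the idea (initialisation scale and the dropped bias terms are simplifying assumptions; see Liu et al., 2018 for the full method):

```python
import torch
import torch.nn as nn

class LowRankTensorFusion(nn.Module):
    def __init__(self, input_dims, rank, output_dim):
        super().__init__()
        # one set of rank-R factors per modality; the +1 keeps lower-order terms
        self.factors = nn.ParameterList(
            nn.Parameter(torch.randn(rank, d + 1, output_dim) * 0.1)
            for d in input_dims)

    def forward(self, reps):  # reps: list of (batch, d_m) tensors
        fused = None
        for x, factor in zip(reps, self.factors):
            x1 = torch.cat([x, torch.ones(x.size(0), 1)], dim=1)  # append 1
            proj = torch.einsum('bd,rdo->bro', x1, factor)        # (B, R, out)
            fused = proj if fused is None else fused * proj       # implicit outer product
        return fused.sum(dim=1)                                    # sum over rank
```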
MulT uses directional cross-modal attention:
For source modality $\alpha$ and target modality $\beta$, queries come from the target while keys and values come from the source:

$$\mathrm{CM}_{\alpha \to \beta}(X_\alpha, X_\beta) = \mathrm{softmax}\!\left(\frac{Q_\beta K_\alpha^\top}{\sqrt{d_k}}\right) V_\alpha$$

Each token in $\beta$ attends to all tokens in $\alpha$ at every time step; this captures long-range cross-modal dependencies without explicit alignment.
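A sketch with PyTorch's built-in attention (dimensions and modality roles are illustrative assumptions):

```python
import torch
import torch.nn as nn

# directional cross-modal attention: text (target) attends to audio (source)
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x_text = torch.randn(8, 50, 64)    # target modality beta
x_audio = torch.randn(8, 200, 64)  # source modality alpha

# queries from the target; keys and values from the source
y, weights = attn(query=x_text, key=x_audio, value=x_audio)
# y: (8, 50, 64) -- text tokens enriched with audio context, no pre-alignment
```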
Complexity: Peak memory, number of parameters, training and inference time.
Robustness: performance degradation under noisy or missing modalities.
Plotting both reveals the accuracy–robustness tradeoff: some fusion methods are accurate but brittle; others are robust but weaker.
From Perception to Processing: human integration principles (temporal, spatial, inverse effectiveness) → alignment, robustness, and transference in ML
Three Foundational Principles (Liang et al., 2024): Heterogeneity · Connections · Interactions
Six Core Challenges: Representation → Alignment → Reasoning → Generation → Transference → Quantification
Designing Multisensory Experiences (Obrist & Velasco, 2025): impressions, events, sensory elements, and receivers across a reality–virtuality continuum
Practice (MultiBench / MultiZoo): standardized datasets, composable fusion architectures, and accuracy–robustness evaluation
<small>**Sources:** Liang, Zadeh & Morency (2024), *Foundations & Trends in Multimodal Machine Learning*, ACM Computing Surveys; Obrist & Velasco (2025), *Multisensory Experiences*, CACM; MultiBench/MultiZoo (NeurIPS 2021, JMLR 2022).</small>