Research Article

Multimodal Fusion Method Based on Self-Attention Mechanism

Figure 1

Overview of our multimodal fusion model based on the self-attention mechanism. The three unimodal inputs are passed through three modality-specific subnetworks, and the resulting unimodal representations are fed into MF (multimodal fusion). In MF, self-attention first transforms these representations into new unimodal representations. Low-rank multimodal fusion with modality-specific factors then combines the new representations into a single output. This output is the multimodal representation, which can be used for classification tasks.
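The data flow described in the caption can be illustrated with a minimal, dependency-free sketch. It rests on several assumptions that the caption does not state. First, self-attention is applied across the three modality vectors, treated as a three-token sequence with identity query/key/value projections. Second, all three representations share one dimension. Third, the fusion step follows the low-rank formulation of Liu et al. (2018), h = Σ_i ∏_m (W_m^(i) [z_m; 1]). The helper names (`self_attention`, `low_rank_fusion`, `random_matrix`) and all sizes are hypothetical illustrations, not the authors' code.

```python
import math
import random

Vector = list[float]
Matrix = list[list[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Return w @ x for a row-major matrix w."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: list[Vector]) -> list[Vector]:
    """Scaled dot-product self-attention across the modality vectors.

    ASSUMPTION: Q = K = V = the inputs (identity projections). A trained
    model would learn separate query/key/value matrices.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this modality to every modality, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        # New representation = attention-weighted sum of all modality vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


def low_rank_fusion(reps: list[Vector], factors: list[list[Matrix]]) -> Vector:
    """LMF-style fusion (assumed): h = sum_i prod_m (W_m^(i) [z_m; 1]).

    factors[m][i] is the rank-i factor of modality m, shape (d_h, d_m + 1).
    """
    rank = len(factors[0])
    d_h = len(factors[0][0])
    h = [0.0] * d_h
    for i in range(rank):
        prod = [1.0] * d_h
        for m, z in enumerate(reps):
            proj = matvec(factors[m][i], z + [1.0])  # append constant 1
            prod = [p * q for p, q in zip(prod, proj)]  # element-wise product
        h = [a + b for a, b in zip(h, prod)]  # sum over rank components
    return h


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Hypothetical stand-in for a learned factor matrix."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d, d_h, rank = 4, 3, 2  # hypothetical sizes
    # Stand-ins for the three subnetwork outputs.
    reps = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
    attended = self_attention(reps)
    factors = [[random_matrix(d_h, d + 1, rng) for _ in range(rank)] for _ in range(3)]
    h = low_rank_fusion(attended, factors)
    print(len(h))  # 3
```

Appending the constant 1 to each representation before projection keeps the unimodal and bimodal interaction terms inside the product. The rank-r decomposition avoids ever materializing the full outer-product tensor of the three modalities. In a complete model, `h` would then be passed to a classifier layer.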