Article overview
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

Authors: Zi-Qiang Zhang; Jie Zhang; Jian-Shu Zhang; Ming-Hui Wu; Xin Fang; Li-Rong Dai
Date: 15 Feb 2022
Source: arXiv, 2202.07428

Abstract: With the advance of self-supervised learning for the audio and visual modalities, it has become possible to learn robust audio-visual speech representations. This benefits audio-visual speech recognition (AVSR), since multi-modal inputs in principle carry richer information than either modality alone. In this paper, building on existing self-supervised representation learning methods for the audio modality, we propose an audio-visual representation learning approach. The approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. Without loss of generality, it can also be applied to single-modal tasks, e.g. audio or visual speech recognition, by simply masking out one modality in the fusion module. The pre-trained model is evaluated on speech recognition and lipreading tasks using one or both modalities, where it demonstrates superior performance.
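The fusion-and-masking idea in the abstract can be illustrated with a short sketch. The module below is a hypothetical reconstruction, not the authors' implementation: every name, dimension, and layer count is an assumption. It projects audio and visual feature sequences to a shared width, concatenates them along the time axis, and runs a standard transformer encoder; omitting one modality's tokens stands in for the paper's masking of a modality in the fusion module.

```python
import torch
import torch.nn as nn

class AVFusionSketch(nn.Module):
    """Hypothetical transformer-based audio-visual fusion module.

    Illustrative only: dimensions and layer counts are assumptions,
    not values from the paper (arXiv 2202.07428).
    """

    def __init__(self, audio_dim=512, visual_dim=512, d_model=768,
                 nhead=12, num_layers=6):
        super().__init__()
        # Project each modality's features to a shared model width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio=None, visual=None):
        # Either modality may be omitted ("masked out") at call time,
        # mimicking the single-modal use described in the abstract.
        tokens = []
        if audio is not None:
            tokens.append(self.audio_proj(audio))
        if visual is not None:
            tokens.append(self.visual_proj(visual))
        assert tokens, "at least one modality must be provided"
        fused_input = torch.cat(tokens, dim=1)  # (batch, T_a + T_v, d_model)
        return self.encoder(fused_input)

# Usage: fuse 100 audio frames with 25 video frames, then audio only.
model = AVFusionSketch()
a = torch.randn(2, 100, 512)
v = torch.randn(2, 25, 512)
fused = model(audio=a, visual=v)   # both modalities fused
audio_only = model(audio=a)        # visual modality masked out
```

Note that the paper applies its masking strategy during pre-training as well; this sketch only mirrors the inference-time behaviour the abstract describes, where one modality is masked out of the fusion module for single-modal tasks.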