Article overview
Title: CPSP: Learning Speech Concepts From Phoneme Supervision
Authors: Chunyu Qiang; Hao Li; Yixin Tian; Ruibo Fu; Tao Wang; Longbiao Wang; Jianwu Dang
Date: 1 Sep 2023
Source: arXiv, 2309.00424

Abstract: For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representation extracted from speech should contain information that lies between text coding and acoustic coding. The linguistic content should be salient, while paralinguistic information such as speaker identity and acoustic details should be removed. However, existing methods for extracting fine-grained intermediate representations from speech suffer from excessive redundancy and dimension explosion. Additionally, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR. To address these issues, we propose Contrastive Phoneme-Speech Pretraining (CPSP), which uses three encoders, one decoder, and contrastive learning to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CPSP model is trained on 210k speech and phoneme text pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CPSP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. We provide a website with audio samples.
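
The abstract describes a contrastive objective that connects phoneme and speech representations at the frame level. The sketch below illustrates one plausible form of such a loss: a symmetric, CLIP-style InfoNCE applied per frame. It is not the paper's implementation; the module name, tensor shapes, learnable temperature, and the assumption that phoneme and speech embeddings are already frame-aligned are all illustrative choices rather than details taken from the paper.

# Minimal sketch (assumptions noted in comments, not from the paper):
# both encoders are assumed to output frame-aligned embeddings of equal
# length T, and a learnable temperature scales the similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameContrastiveLoss(nn.Module):
    """Hypothetical frame-level phoneme-speech contrastive loss."""

    def __init__(self, init_temperature: float = 0.07):
        super().__init__()
        # Learnable log inverse-temperature, CLIP-style (assumption).
        self.log_scale = nn.Parameter(torch.log(torch.tensor(1.0 / init_temperature)))

    def forward(self, phoneme_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # phoneme_emb, speech_emb: (B, T, D), frame-aligned by assumption.
        p = F.normalize(phoneme_emb, dim=-1)
        s = F.normalize(speech_emb, dim=-1)
        # Per-utterance frame-by-frame similarity matrices: (B, T, T).
        logits = torch.matmul(p, s.transpose(1, 2)) * self.log_scale.exp()
        B, T, _ = logits.shape
        targets = torch.arange(T, device=logits.device).expand(B, T).reshape(B * T)
        # Symmetric InfoNCE: phoneme-to-speech and speech-to-phoneme.
        loss_p2s = F.cross_entropy(logits.reshape(B * T, T), targets)
        loss_s2p = F.cross_entropy(logits.transpose(1, 2).reshape(B * T, T), targets)
        return 0.5 * (loss_p2s + loss_s2p)

if __name__ == "__main__":
    # Toy usage: 2 utterances, 50 frames, 256-dim embeddings.
    loss_fn = FrameContrastiveLoss()
    phon = torch.randn(2, 50, 256)
    spch = torch.randn(2, 50, 256)
    print(loss_fn(phon, spch).item())

In practice, aligning phoneme sequences to speech frames would itself require forced alignment or a duration model, which this sketch glosses over by assuming equal-length, pre-aligned inputs.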