Article overview
CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding
Authors: Étienne Labbé; Thomas Pellegrini; Julien Pinquier
Date: 1 Sep 2023

Abstract: Automated Audio Captioning (AAC) involves generating natural language
descriptions of audio content, using encoder-decoder architectures. An audio
encoder produces audio embeddings fed to a decoder, usually a Transformer
decoder, for caption generation. In this work, we describe our model, whose
novelty, compared to existing models, lies in the use of a ConvNeXt
architecture as audio encoder, adapted from the vision domain to audio
classification. This model, called CNext-trans, achieved state-of-the-art
scores on the AudioCaps (AC) dataset and performed competitively on Clotho
(CL), while using four to forty times fewer parameters than existing models. We
examine potential biases in the AC dataset due to its origin from AudioSet by
investigating an unbiased encoder’s impact on performance. Using the well-known
PANN’s CNN14, for instance, as an unbiased encoder, we observed a 1.7% absolute
reduction in SPIDEr score (where higher scores indicate better performance). To
improve cross-dataset performance, we conducted experiments by combining
multiple AAC datasets (AC, CL, MACS, WavCaps) for training. Although this
strategy enhanced overall model performance across datasets, it still fell
short compared to models trained specifically on a single target dataset,
indicating the absence of a one-size-fits-all model. To mitigate performance
gaps between datasets, we introduced a Task Embedding (TE) token, allowing the
model to identify the source dataset for each input sample. We provide insights
into the impact of these TEs on both the form (words) and content (sound event
types) of the generated captions. The resulting model, named CoNeTTE, an
unbiased CNext-trans model enriched with dataset-specific Task Embeddings,
achieved SPIDEr scores of 44.1% and 30.5% on AC and CL, respectively. Code
available: this https URL.
Source: arXiv, 2309.00454
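The Task Embedding mechanism described above can be sketched in a few lines: a learned embedding vector per training dataset is prepended to the encoder's output sequence, so the decoder knows which dataset's captioning style to imitate. The sketch below is illustrative only and assumes nothing from the released code; the dimensions, names, and random vectors are stand-ins (in the real model, the task embeddings would be trained jointly with the network).

```python
import random

random.seed(0)

# Hypothetical dimensions (not from the paper): embedding size d, audio frame count T.
d, T = 8, 5
datasets = ["audiocaps", "clotho", "macs", "wavcaps"]

def rand_vec(n):
    return [random.random() for _ in range(n)]

# One task-embedding vector per training dataset (random stand-ins here).
task_embeddings = {name: rand_vec(d) for name in datasets}

def prepend_task_embedding(audio_embeddings, dataset_name):
    """Prepend the dataset's task-embedding token to the audio embedding
    sequence, so the decoder can identify each sample's source dataset."""
    return [task_embeddings[dataset_name]] + audio_embeddings

audio = [rand_vec(d) for _ in range(T)]   # stand-in for ConvNeXt encoder output
decoder_input = prepend_task_embedding(audio, "clotho")
print(len(decoder_input), len(decoder_input[0]))  # 6 8
```

At inference time, choosing which dataset token to prepend lets the same model produce captions in the style of AudioCaps, Clotho, MACS, or WavCaps.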