Captioning is the intermodal translation task of describing the human-perceived information in a medium, e.g. images (image captioning) or audio (audio captioning), using free text [1, 2, 3, 4]. In particular, audio captioning was first introduced in [4]. It does not involve speech transcription, and focuses on identifying the human-perceived information in a general audio signal and expressing it through text, using natural language. This information includes identification of sound events, acoustic scenes, spatiotemporal relationships of sources, foreground versus background discrimination, concepts, and physical properties of objects and environment. For example, given an audio signal, an audio captioning system would be able to generate captions like “a door creaks as it slowly revolves back and forth”.
The dataset used for training an audio captioning method defines to a great extent what the method can learn [1, 5]. Diversity in captions allows the method to learn and exploit the perceptual differences in the content (e.g. a thin plastic rattling could be perceived as a fire crackling) [1]. Also, the evaluation of the method becomes more objective and general by having more captions per audio signal [5].
Recently, two different datasets for audio captioning were presented, Audio Caption and AudioCaps [6, 7]. Audio Caption is partially released, and contains 3710 domain-specific (hospital) video clips with their audio tracks, and annotations that were originally obtained in Mandarin Chinese and afterwards translated to English using machine translation [6]. The annotators had access to and viewed the videos. The annotations contain descriptions of the speech content (e.g. “The patient inquired about the location of the doctors police station”). The AudioCaps dataset has 46 000 audio samples from AudioSet [8], annotated with one caption each using the crowdsourcing platform Amazon Mechanical Turk (AMT) and automated quality and location control of the annotators [7]. The authors of AudioCaps did not use categories of sounds for which, as they claimed, visuals were required for correct recognition, e.g. “inside small room”. Annotators of AudioCaps were provided with the word labels (by AudioSet) and viewed the accompanying videos of the audio samples.
The perceptual ambiguity of sounds can be hampered by providing contextual information (e.g. word labels) to annotators, making them aware of the actual source and not letting them describe their own perceived information. Using visual stimuli (e.g. video) introduces a bias, since annotators may describe what they see and not what they hear. Also, a single caption per file impedes the learning and evaluation of diverse descriptions of information, and the domain-specific data of previous audio captioning datasets have an observed significant impact on the performance of methods [6]. Finally, unique words (i.e. words appearing only once) affect both the learning and the evaluation process (e.g. if a word is unique, it will be either in the training or in the evaluation data). An audio captioning dataset should at least provide some information on the unique words contained in its captions.
In this paper we present the freely available audio captioning dataset Clotho, with 4981 audio samples and 24 905 captions. All audio samples are from the Freesound platform [9], and have a duration from 15 to 30 seconds. Each audio sample has five captions with a length of eight to 20 words, collected using AMT and a specific protocol for crowdsourcing audio annotations, which ensures diversity and reduced grammatical errors [1]. During annotation, no information other than the audio signal (e.g. video or word tags) was available to the annotators. The rest of the paper is organized as follows. Section 2 presents the creation of Clotho, i.e. gathering and processing of the audio samples and captions, and the splitting of the data to development, evaluation, and testing splits. Section 3 presents the baseline method used, the process followed for its evaluation using Clotho, and the obtained results. Section 4 concludes the paper.
2.1 Audio data collection and processing
We collect an initial set of audio samples, together with their corresponding metadata (e.g. tags that indicate their content, and a short textual description), from the online platform Freesound [9]. This initial set was obtained by randomly sampling audio files from Freesound fulfilling the following criteria: lossless file type; audio quality of at least 44.1 kHz and 16-bit; a duration of 10 s ≤ d(x) ≤ […] s (where d(x) is the duration of x); a textual description whose first sentence does not have spelling errors according to US and UK English dictionaries (as an indication of the correctness of the metadata, e.g. tags); and no tags that indicate music, sound effects, or speech. As tags indicating speech files we consider those like “speech”, “speak”, and “woman”. We normalize each sample to the range [−1, 1], trim the silence (60 dB below the maximum amplitude) from the beginning and end, and resample to 44.1 kHz. Finally, we keep the samples that are longer than 15 s after this processing. This results in the processed set of audio samples.
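The normalization, silence trimming, and duration check described above can be sketched as follows. This is a minimal illustration and not the code used for Clotho: the function names are hypothetical, NumPy is assumed to be available, resampling is left out, and treating "60 dB below the maximum amplitude" as a per-sample magnitude threshold is an assumption.

```python
import numpy as np


def normalize_and_trim(x, threshold_db=60.0):
    """Scale x to [-1, 1] and trim leading and trailing silence.

    Silence is taken as samples whose magnitude is more than
    `threshold_db` dB below the maximum magnitude (an assumption about
    how the 60 dB criterion is applied).
    """
    x = np.asarray(x, dtype=np.float64)
    peak = np.max(np.abs(x))
    if peak == 0.0:
        return x[:0]  # an all-zero signal is entirely silence
    x = x / peak  # the peak magnitude is now exactly 1
    # Amplitude ratio that corresponds to -threshold_db relative to the peak.
    floor = 10.0 ** (-threshold_db / 20.0)
    active = np.flatnonzero(np.abs(x) >= floor)
    return x[active[0]:active[-1] + 1]


def is_long_enough(x, sample_rate=44100, min_seconds=15.0):
    """Check whether a (trimmed) signal is longer than `min_seconds`."""
    return len(x) / sample_rate > min_seconds


# Toy signal: 1 s of silence, 16 s of a 440 Hz tone, 1 s of silence.
sr = 44100
t = np.arange(16 * sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
signal = np.concatenate([np.zeros(sr), tone, np.zeros(sr)])
trimmed = normalize_and_trim(signal)
```

In this toy case the two silent seconds are removed and the remaining tone is just under 16 s long, so the sample would be kept.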
For enhancing the diversity of the audio content, we aim to create a subset of the processed audio samples based on their tags, targeting the most uniform possible distribution of the tags of the audio samples in that subset. We first create the bag of tags T by collecting all the tags of the sounds in the processed set. We omit tags that describe time or recording equipment and process (e.g. “autumn”, “field-recording”). Then, we calculate the normalized frequency of all tags in T and keep the tags with a normalized frequency of at least 0.01. We randomly sample 10^[…] sets (with overlap) of audio samples from the processed set, and keep the set that has the maximum entropy for the retained tags. This process results in the set of audio samples having the most uniform tag distribution and, hence, the most diverse content. The resulting distribution of the tags is illustrated in Figure 1. The 10 most common tags are: ambient, water, nature, birds, noise, rain, city, wind, metal, and people.
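The entropy-based selection can be illustrated with a small sketch. It is written under stated assumptions: each file is represented only by its list of tags, the subset size and number of trials are toy values, and all function names are hypothetical rather than taken from the Clotho code.

```python
import math
import random
from collections import Counter


def frequent_tags(files, min_freq=0.01):
    """Tags whose normalized frequency over all tag occurrences is >= min_freq."""
    counts = Counter(tag for tags in files for tag in tags)
    total = sum(counts.values())
    return {tag for tag, c in counts.items() if c / total >= min_freq}


def tag_entropy(files, kept_tags):
    """Shannon entropy (in bits) of the distribution of `kept_tags` over `files`."""
    counts = Counter(tag for tags in files for tag in tags if tag in kept_tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_diverse_subset(files, subset_size, n_trials, seed=0):
    """Draw `n_trials` random subsets and keep the one with maximum tag entropy."""
    rng = random.Random(seed)
    kept = frequent_tags(files)
    best, best_entropy = None, -1.0
    for _ in range(n_trials):
        # Subsets are drawn independently, so different subsets may overlap.
        candidate = rng.sample(files, subset_size)
        entropy = tag_entropy(candidate, kept)
        if entropy > best_entropy:
            best, best_entropy = candidate, entropy
    return best, best_entropy


# Toy pool: every entry is the tag list of one file.
pool = [["water", "rain"], ["water"], ["birds", "nature"], ["water", "rain"],
        ["wind"], ["city", "people"], ["water"], ["metal"]]
subset, entropy = most_diverse_subset(pool, subset_size=4, n_trials=200)
```

A higher entropy means that the retained tags are spread more evenly over the chosen files, which is the notion of diversity used in the text.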
Figure 1: Distribution of tags. Tags are sorted according to their frequency.
We aim for the durations of the audio samples x to be uniformly distributed between 15 and 30 s. Thus, we further process the selected set, keeping the files with a maximum duration of 30 s and cutting a segment from the rest. We randomly select a set of values for the duration of the segments that will maximize the entropy of the duration of the files, discretizing the durations with a resolution of 0.05 s. In order not to pick a segment without activity, we sample the files by taking a window with the selected duration that maximizes the energy of the sample. Finally, we apply a 512-point Hamming window to the beginning and the end of the samples, smoothing the effect of sampling. The above process results in a set of audio samples whose durations are approximately uniformly distributed between 15 and 30 s.
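A possible way to pick the maximum-energy segment and to smooth its edges is sketched below. The names are hypothetical and NumPy is assumed. Using the rising half of the Hamming window as a fade-in and the falling half as a fade-out is one interpretation of the 512-point window mentioned above, not a confirmed detail.

```python
import numpy as np


def max_energy_segment(x, segment_len):
    """Return the contiguous segment of `segment_len` samples with maximum energy."""
    x = np.asarray(x, dtype=np.float64)
    if segment_len >= len(x):
        return x.copy()
    # A cumulative sum of squared samples gives every window energy in O(n).
    csum = np.concatenate([[0.0], np.cumsum(x ** 2)])
    energies = csum[segment_len:] - csum[:-segment_len]
    start = int(np.argmax(energies))
    return x[start:start + segment_len].copy()


def fade_edges(x, window_len=512):
    """Taper both ends of x with the two halves of a Hamming window.

    Assumes len(x) >= window_len. The fade-in/fade-out reading of the
    512-point window is an assumption.
    """
    x = np.array(x, dtype=np.float64)
    half = window_len // 2
    window = np.hamming(window_len)
    x[:half] *= window[:half]
    x[-half:] *= window[half:]
    return x


# Toy signal: an active block of 1000 samples surrounded by silence.
signal = np.concatenate([np.zeros(1500), np.ones(1000), np.zeros(1500)])
segment = max_energy_segment(signal, 1000)
faded = fade_edges(segment)
```

Here the selected window coincides with the active block, and only the first and last 256 samples are attenuated by the taper.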
2.2 Captions collection and processing
We use AMT and a novel three-step based framework [1] for crowdsourcing the annotation of the audio samples, acquiring a set of captions for each audio sample, where each caption is eight to 20 words long. In a nutshell, each audio sample is annotated by five different annotators in the first step of the framework. The annotators have access only to the audio sample and not to any other information. In the second step, different annotators are instructed to correct any grammatical errors and typos, and/or rephrase the captions. This process results in 2 × 5 = 10 captions per audio sample. Finally, three (again different) annotators have access to the audio sample and its captions, and score each caption in terms of the accuracy of the description and fluency of English, using a scale from 1 to 4 (the higher the better). The captions for each audio sample are sorted (first according to accuracy of description and then according to fluency), and two groups are formed: the top five and the bottom five captions. The top five captions are selected as the captions of the audio sample. We manually sanitize these captions further, e.g. by replacing “it’s” with “it is” or “its”, making consistent hyphenation and compound words (e.g. “nonstop”, “non-stop”, and “non stop”), removing words or rephrasing captions pertaining to the content of speech (e.g. “French”, “foreign”), and removing/replacing named entities (e.g. “Windex”).
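The ranking in the third step can be sketched as a two-key sort. The function name and the toy captions and scores are hypothetical, and averaging the scores of the three raters is an assumption, since the text only states the order of the sorting keys.

```python
from statistics import mean


def split_top_bottom(scored_captions, n_top):
    """Sort captions by mean accuracy, then by mean fluency, and split them.

    `scored_captions` holds (caption, accuracy_scores, fluency_scores)
    tuples with one score from 1 to 4 per rater.
    """
    ranked = sorted(
        scored_captions,
        key=lambda item: (mean(item[1]), mean(item[2])),
        reverse=True,
    )
    return ranked[:n_top], ranked[n_top:]


# Toy example with four candidate captions and three raters each.
candidates = [
    ("a door creaks as it slowly moves", [4, 4, 3], [3, 3, 3]),
    ("door noise", [2, 2, 1], [4, 4, 4]),
    ("a door creaks while it slowly moves back and forth", [4, 4, 3], [4, 4, 4]),
    ("something squeaks repeatedly in a room", [3, 3, 3], [4, 3, 4]),
]
top, bottom = split_top_bottom(candidates, n_top=2)
top_captions = [caption for caption, _, _ in top]
```

The first and third captions tie on accuracy, so fluency decides their order, and the bottom group is kept as a fallback pool.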
Finally, we observe that some captions include transcription of speech. To remove it, we employ extra annotators (not from AMT) who had access only to the captions. We instruct the annotators to remove the transcribed speech and rephrase the caption. If the result is shorter than eight words, we check the bottom captions for that audio sample. If they include a caption that has been rated with at least 3 by all the annotators for both accuracy and fluency, and does not contain transcribed speech, we use that caption. Otherwise, we completely remove the audio sample. This process yields the final set of audio samples and captions.
An audio sample should belong to only one split of data (e.g., training, development, testing). This means that if a word appears only in the captions of one audio sample, then this word will appear in only one of the splits. Having a word that appears only in the training split leads to a sub-optimal learning procedure, because resources are spent on words unused in validation and testing. If a word does not appear in the training split, then the evaluation procedure suffers by having to evaluate on words not known during training. For that reason, for each audio sample we construct the set of words of its captions. Then, we merge all these sets and we identify all words that appear only once (i.e. having a frequency of one) in the merged set. We employ an extra annotator (not from AMT) who has access only to the captions, and has the instructions to replace all words with a frequency of one with synonyms that already occur in the captions and (if necessary) rephrase the caption. The result is a set of captions in which every word has a frequency of at least two. Each word will appear in the development set and in at least one of the evaluation or testing splits. This process yields the data of the Clotho dataset, D.
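Finding the words with a frequency of one can be sketched as follows. This is a minimal illustration in which the file names, captions, and function names are hypothetical, each occurrence of a word in a caption is counted, and lower-casing and punctuation removal are assumed as normalization.

```python
import string
from collections import Counter


def words_of(caption):
    """Lower-case a caption, strip punctuation, and split it into words."""
    table = str.maketrans("", "", string.punctuation)
    return caption.lower().translate(table).split()


def singleton_words(captions_per_sample):
    """Return the words that occur exactly once over all captions."""
    counts = Counter(
        word
        for captions in captions_per_sample.values()
        for caption in captions
        for word in words_of(caption)
    )
    return {word for word, count in counts.items() if count == 1}


# Toy data: two audio samples with two captions each.
toy_captions = {
    "a.wav": ["Rain falls on a tin roof.", "Water drips on a roof"],
    "b.wav": ["Rain is pouring heavily", "Heavy rain falls"],
}
rare = singleton_words(toy_captions)
```

The resulting set lists the candidates that an annotator would replace with synonyms or rephrase.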
2.3 Data splitting
We split D into three non-overlapping splits of 60%-20%-20%, termed development, evaluation, and testing, respectively. Every word in the captions of D appears in the development split and in at least one of the other two splits.
For each audio sample we construct the set of unique words from its captions, using all letters in lower case and excluding punctuation. We merge all these sets into a bag and calculate the frequency f_w of each word w. We use multi-label stratification [10], having as labels for each audio sample the corresponding words, and split D 2000 times in sets of splits of 60%-40%, where 60% corresponds to the development split. We reject the sets of splits that have at least one word appearing only in one of the splits. Ideally, the words with f_w = 2 should appear only once in the development split. The other appearance of the word should be in the evaluation or testing splits. This will prevent having unused words in the training (i.e. words appearing only in the development split) or unknown words in the evaluation/testing process (i.e. words not appearing in the development split). The words with f_w ≥ 3 should appear ⌊0.6 f_w⌋ times in the development split, where 0.6 is the percentage of data in the development split and ⌊·⌋ is the floor function. We calculate the frequency of words in the development split and observe that it is impossible to satisfy this target for the words with f_w ≥ 3. Therefore, we adopted a tolerance (i.e. a deviation) to the targeted frequency. The tolerance means, for example, that we can tolerate a word appearing a total of 3 times in the whole Clotho dataset D to appear 2 times in the development split (appearing 0 times in the development split results in the rejection of the split set). This will result in this word appearing in either the evaluation or the testing split, but still this word will not appear only in one split. To pick the best set of splits, we count the number of words whose frequency in the development split is outside the tolerated range. We score, in an ascending fashion, the sets of splits according to that number of words and we pick the top 50 ones. For each of the 50 sets of splits, we further separate the 40% split into 20% and 20%, 1000 times. That is, we end up with 50 000 sets of splits of 60%, 20%, 20%, corresponding to development, evaluation, and testing splits, respectively. We want to score each of these 50 000 sets of splits, in order to select the one with the smallest number of words that deviate from the ideal split. We calculate the frequency of appearance of each word in the development, evaluation, and testing splits. Then, we create the sets of words Ψ, one for each split, containing the words whose frequency in that split falls outside the corresponding tolerated range. Finally, we calculate Γ, the sum of the weighted distances of the frequencies of words from the targeted frequencies (for words being in the development split or not, respectively). We sort all 50 000 sets of splits according to Γ in ascending fashion, and we pick the top one. This set of splits is the final split for the Clotho dataset, containing 2893 audio samples and 14 465 captions in the development split, 1045 audio samples and 5225 captions in the evaluation split, and 1043 audio samples and 5215 captions in the testing split. The development and evaluation splits are freely available online. The testing split is withheld for potential usage in scientific challenges. A fully detailed description of the Clotho dataset can be found online. Figure 2 shows a histogram of the percentage of words in the three different splits.
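The rejection rule and the count of deviating words for a 60%-40% split can be sketched as below. This is an illustration under stated assumptions: every sample is represented by its set of unique words, the function names are hypothetical, and the constant tolerance is a placeholder, since the tolerance values used for Clotho are not reproduced here. The stratified sampling itself [10] is not shown.

```python
from collections import Counter


def word_counts(word_sets):
    """Count in how many samples each word occurs, given one word set per sample."""
    return Counter(word for words in word_sets for word in words)


def is_valid_split(dev_sets, rest_sets):
    """Reject a split if any word occurs in only one of its two parts."""
    return set(word_counts(dev_sets)) == set(word_counts(rest_sets))


def target_dev_frequency(total):
    """Ideal development frequency: 1 for total == 2, else floor(0.6 * total)."""
    if total == 2:
        return 1
    return (6 * total) // 10  # floor(0.6 * total) in exact integer arithmetic


def count_deviating_words(dev_sets, rest_sets, tolerance=1):
    """Count words whose development frequency is outside target +/- tolerance.

    The constant `tolerance` is a placeholder assumption.
    """
    dev, rest = word_counts(dev_sets), word_counts(rest_sets)
    deviating = 0
    for word in set(dev) | set(rest):
        total = dev[word] + rest[word]
        if abs(dev[word] - target_dev_frequency(total)) > tolerance:
            deviating += 1
    return deviating


# Toy split: three samples in development, two in the remaining 40%.
dev_sets = [{"rain", "roof"}, {"rain", "wind"}, {"wind", "roof"}]
rest_sets = [{"rain", "roof"}, {"wind"}]
valid = is_valid_split(dev_sets, rest_sets)
n_deviating = count_deviating_words(dev_sets, rest_sets, tolerance=1)
```

In the toy split every word has a total frequency of 3 and appears twice in the development part, which matches the tolerated case given in the text. Candidate splits would then be ranked by this count in ascending order.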
In order to provide an example of how to employ Clotho and some initial (baseline) results, we use a previously utilized method for audio captioning [4], which is based on an encoder-decoder scheme with attention. The method accepts as input a length-T sequence of 64 log mel-band energies, which is fed to a DNN that outputs a probability distribution of words. The generated caption is constructed from the output of the DNN, as in [4]. We optimize the parameters of the method using the development split of Clotho and we evaluate it using the evaluation and the testing splits, separately.
Figure 2: Percentage of words in the three different splits.
We first extract 64 log mel-band energies, using a Hamming window of 46 ms, with 50% overlap. We tokenize the captions of the development split, using a one-hot encoding of the words. Since all the words in the development split appear in the other two splits as well, there are no unknown tokens/words. We also employ start- and end-of-sequence tokens, in order to signify the start and end of a caption.
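The tokenization with one-hot vectors and sequence delimiters can be sketched as follows. The token strings `<sos>` and `<eos>`, the function names, and the whitespace-based splitting are assumptions made for illustration.

```python
def build_vocabulary(captions, sos="<sos>", eos="<eos>"):
    """Map every word of the captions, plus the two special tokens, to an index."""
    words = sorted({word for caption in captions for word in caption.lower().split()})
    return {token: index for index, token in enumerate([sos, eos] + words)}


def encode(caption, vocabulary, sos="<sos>", eos="<eos>"):
    """Turn a caption into a list of one-hot vectors framed by SOS and EOS."""
    tokens = [sos] + caption.lower().split() + [eos]
    vectors = []
    for token in tokens:
        one_hot = [0] * len(vocabulary)
        one_hot[vocabulary[token]] = 1
        vectors.append(one_hot)
    return vectors


# Toy development captions.
dev_captions = ["water pouring into a container", "birds are chirping"]
vocabulary = build_vocabulary(dev_captions)
encoded = encode(dev_captions[1], vocabulary)
```

Each caption becomes a sequence that is two vectors longer than its word count, and every vector has exactly one non-zero entry.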
The encoder is a series of bi-directional gated recurrent units (bi-GRUs) [11], similarly to [4]. The output dimensionality for the GRU layers (forward and backward GRUs have the same dimensionality) is {256, 256, 256}. The output of the encoder is processed by an attention mechanism and its output is given as an input to the decoder. The attention mechanism is a feed-forward neural network (FNN) and the decoder is a GRU. Then, the output of the decoder is given as an input to another FNN with a softmax nonlinearity, which acts as a classifier and outputs the probability distribution of words for the i-th timestep. To optimize the parameters of the employed method, we use each audio sample five times, using each of its five different captions as the targeted output. We jointly optimize the parameters of the encoder, attention mechanism, decoder, and classifier, using 150 epochs, the cross-entropy loss, and the Adam optimizer [12] with the proposed hyper-parameters. Also, in each batch we pad the captions to the length of the longest caption in the batch, using the end-of-sequence token, and the input audio features to the length of the longest ones, by prepending zeros.
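The batch padding described above can be sketched with plain lists. The function names, the EOS index, and the two-band toy features are hypothetical, and a real pipeline would more likely operate on arrays or tensors.

```python
def pad_captions(captions, eos_index):
    """Right-pad token-index sequences with the EOS index to the batch maximum."""
    longest = max(len(caption) for caption in captions)
    return [caption + [eos_index] * (longest - len(caption)) for caption in captions]


def pad_features(features, n_bands=64):
    """Left-pad feature sequences (lists of frames) with all-zero frames."""
    longest = max(len(frames) for frames in features)
    return [
        [[0.0] * n_bands for _ in range(longest - len(frames))] + frames
        for frames in features
    ]


# Toy batch: two captions (index 1 is EOS) and two feature sequences with
# two bands per frame instead of 64, to keep the example readable.
batch_captions = [[0, 5, 7, 1], [0, 4, 1]]
batch_features = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [[0.7, 0.8]]]
padded_captions = pad_captions(batch_captions, eos_index=1)
padded_features = pad_features(batch_features, n_bands=2)
```

Prepending zeros keeps the last frames of every feature sequence aligned at the end of the batch, while the captions stay aligned at their start.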
We assess the performance of the method on the evaluation and testing splits, using the machine translation metrics BLEU-n (for n up to 4), METEOR, CIDEr, and ROUGE-L for comparing the output of the method and the reference captions for the input audio sample. In a nutshell, BLEU-n measures a modified precision of n-grams (e.g. BLEU-2 for 2-grams), METEOR measures a harmonic mean-based score of the precision and recall for unigrams, CIDEr measures a weighted cosine similarity of n-grams, and ROUGE-L is a longest common subsequence-based score.
Table 1: Translation metrics for the evaluation and testing splits. B, C, M, and R correspond to BLEU-n, CIDEr, METEOR, and ROUGE-L, respectively.
Table 1 presents the scores of the employed metrics for the evaluation and testing splits. As can be seen from the BLEU scores in Table 1, the method has started identifying the content of the audio samples by outputting words that exist in the reference captions. For example, the method outputs “water is running into a container into a”, while the closest reference caption is “water pouring into a container with water in it already”, or “birds are of chirping the chirping and various chirping” while the closest reference is “several different kinds of birds are chirping and singing”. The scores of the remaining metrics reveal that the structure of the sentence and the order of the words are not correct. These are issues that can be tackled by adopting either a pre-calculated or a jointly learnt language model. In any case, the results show that the Clotho dataset can effectively be used for research on audio captioning, providing useful data for tackling the challenging task of audio content description.
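The word overlap in the first example above can be quantified with the clipped (modified) n-gram precision that underlies BLEU. The sketch below is a simplified illustration with hypothetical function names: it omits the brevity penalty and the geometric mean over n-gram orders, so its values are not BLEU scores and are not the numbers reported in Table 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate caption against its references."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    # For every n-gram, the maximum count observed in any single reference.
    max_ref = Counter()
    for reference in references:
        for gram, count in ngrams(reference.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


candidate = "water is running into a container into a"
references = ["water pouring into a container with water in it already"]
p1 = modified_precision(candidate, references, 1)
p2 = modified_precision(candidate, references, 2)
```

For this pair, four of the eight candidate unigrams and two of the seven candidate bigrams are matched after clipping, which is consistent with the observation that the words are largely right while their order is not.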
In this work we present a novel dataset for audio captioning, named Clotho, that contains 4981 audio samples and five captions for each file (24 905 captions in total). During the creation of Clotho, care has been taken to promote diversity of captions, eliminate named entities and words that appear only once, and provide data splits that do not hamper the training or evaluation process. Also, there is an example of the usage of Clotho, using a method proposed in the original work on audio captioning. The baseline results indicate that the baseline method started learning the content of the input audio, but more tuning is needed in order to express the content properly. Future work includes the employment of Clotho and the development of novel methods for audio captioning.
The research leading to these results has received funding from the European Research Council under the European Union’s H2020 Framework Programme through ERC Grant Agreement 637422 EVERYSOUND. Part of the computations leading to these results were performed on a TITAN-X GPU donated by NVIDIA to K. Drossos. The authors also wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.
[1] S. Lipping, K. Drossos, and T. Virtanen, “Crowdsourcing a dataset of audio captions,” in Detection and Classification of Acoustic Scenes and Events (DCASE) 2019, Oct. 2019.
[2] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 664–676, Apr. 2017.
[3] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[4] K. Drossos, S. Adavanne, and T. Virtanen, “Automated audio captioning with recurrent neural networks,” in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2017, pp. 374–378.
[5] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014.
[6] M. Wu, H. Dinkel, and K. Yu, “Audio caption: Listen and tell,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 830–834.
[7] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 119–132, Association for Computational Linguistics.
[8] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “AudioSet: An ontology and human-labeled dataset for audio events,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 776–780.
[9] F. Font, G. Roma, and X. Serra, “Freesound technical demo,” in ACM International Conference, Barcelona, Spain, Oct. 2013, pp. 411–412, ACM.
[10] K. Sechidis, G. Tsoumakas, and I. Vlahavas, “On the stratification of multi-label data,” in Machine Learning and Knowledge Discovery in Databases, D. Gunopulos, T. Hofmann, D. Malerba, and M. Vazirgiannis, Eds., Berlin, Heidelberg, 2011, pp. 145–158, Springer Berlin Heidelberg.
[11] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Oct 2014, pp. 1724–1734.
[12] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference for Learning Representations, May 2015.