AudioLM [1] is a model that generates audio in the waveform domain.
It uses two tokenizers: SoundStream [2] to compute the acoustic tokens and w2v-BERT [3] to compute the semantic tokens.
SoundStream [2] is a SOTA neural audio codec. The model has three parts: a convolutional encoder, a residual vector quantizer (RVQ), and a convolutional decoder.
The encoder maps a single-channel waveform $x \in \mathbb{R}^T$ to a sequence of embeddings, and the decoder reconstructs $\hat{x} \in \mathbb{R}^T$ from their quantized versions. The embeddings are discretized by the RVQ, a cascade of $N_q$ vector quantizers, each with a vocabulary of $N$ symbols.
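A minimal sketch of the RVQ encoding step in PyTorch (the function name, shapes, and distance computation are illustrative assumptions, not SoundStream's actual implementation): each quantizer encodes the residual left by the previous one, so the quantized embedding is the sum of the selected codewords.

```python
import torch

def rvq_encode(z, codebooks):
    """Residual vector quantization of embeddings z: (frames, dim).

    codebooks: list of N_q tensors, each (N, dim) -- one codebook per quantizer.
    Returns token indices (frames, N_q) and the quantized embeddings (frames, dim).
    """
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)   # (frames, N) pairwise distances
        idx = dists.argmin(dim=-1)                # nearest codeword per frame
        codewords = codebook[idx]                 # (frames, dim)
        quantized = quantized + codewords
        residual = residual - codewords           # next quantizer sees what is left
        indices.append(idx)
    return torch.stack(indices, dim=-1), quantized
```

The $N_q$ indices produced per frame are the acoustic tokens that AudioLM models.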
The adversarial loss is computed with an STFT-based discriminator. The input to the STFT discriminator is the complex-valued STFT of the input waveform (real and imaginary parts), and the output is the logits.
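A short sketch of how such a discriminator input could be assembled (the window and hop sizes are placeholder values, not taken from the paper): the complex STFT is split into real and imaginary parts and stacked as two input channels.

```python
import torch

def stft_discriminator_input(waveform, n_fft=1024, hop_length=256):
    """Prepare the STFT discriminator input from mono audio.

    waveform: (batch, samples).
    Returns a (batch, 2, freq_bins, frames) tensor: real and imaginary channels.
    """
    window = torch.hann_window(n_fft, device=waveform.device)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      window=window, return_complex=True)  # (batch, freq, frames)
    return torch.stack([spec.real, spec.imag], dim=1)      # two input channels
```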
w2v-BERT [3] is a Transformer-based model for learning self-supervised audio representations. It maps an input waveform to a set of linguistic features.
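In AudioLM [1], the semantic tokens are derived from these features by k-means clustering of embeddings from an intermediate w2v-BERT layer, replacing each frame by its cluster index. A minimal sketch with scikit-learn (the feature extraction is model-specific and stubbed out with random data here; the cluster count is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for (num_frames, dim) embeddings from an intermediate w2v-BERT
# layer; extracting them is model-specific and omitted here.
features = np.random.randn(10_000, 1024).astype(np.float32)

# Fit k-means on a corpus of embeddings (cluster count is illustrative).
kmeans = KMeans(n_clusters=1024, random_state=0).fit(features)

# Semantic tokens = the cluster index assigned to each frame's embedding.
semantic_tokens = kmeans.predict(features)  # (num_frames,) ints in [0, 1023]
```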
SoundStream's RVQ supports bitrate scalability via quantizer dropout: during training, $n_q$ is sampled at random from $[1{:}N_q]$ and only the first $n_q$ quantizers $Q_i,\ i = 1, \dots, n_q$ are used.
At inference, set $n_q$ to select the desired bitrate.
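The bitrate follows directly from these quantities: each frame costs $n_q \log_2 N$ bits. A worked example with assumed SoundStream-like values (24 kHz input, 320x encoder striding, hence 75 frames per second, and $N = 1024$):

```python
import math

frame_rate = 24_000 / 320             # 75 embeddings per second (assumed striding)
bits_per_quantizer = math.log2(1024)  # 10 bits per token with N = 1024

for n_q in (2, 4, 8):
    bitrate = n_q * bits_per_quantizer * frame_rate
    print(f"n_q = {n_q}: {bitrate / 1000:.1f} kbps")
# n_q = 2: 1.5 kbps, n_q = 4: 3.0 kbps, n_q = 8: 6.0 kbps
```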
[1] AudioLM: a Language Modeling Approach to Audio Generation
[2] SoundStream: An End-to-End Neural Audio Codec
[3] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training