audiolm-google-torch

AudioLM: a Language Modeling Approach to Audio Generation

AudioLM is a model that generates audio in the waveform domain.

It uses two tokenizers: SoundStream to compute the acoustic tokens and w2v-BERT to compute the semantic tokens.

SoundStream: Acoustic Tokens

SoundStream [2] is a SOTA neural audio codec. The model has three parts:

The convolutional encoder/decoder takes a single-channel waveform $x \in \mathbb{R}^T$ and reconstructs it as $\hat{x} \in \mathbb{R}^T$ from the quantized embeddings. The embeddings are discretized using a residual vector quantizer (RVQ) with $Q$ vector quantizers, each with a vocabulary of $N$ symbols.
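Below is a minimal sketch of how an RVQ discretizes an embedding: each quantizer codes the residual left by the previous one. The codebooks, shapes, and `rvq_encode` helper are illustrative assumptions, not the repo's actual implementation.

```python
# Minimal RVQ sketch: each quantizer codes the residual of the previous one.
import torch

def rvq_encode(z, codebooks):
    """z: (B, D) embeddings; codebooks: list of (N, D) tensors.
    Returns one (B,) tensor of code indices per quantizer."""
    residual = z
    indices = []
    for cb in codebooks:
        d = torch.cdist(residual, cb)   # (B, N) distances to codebook vectors
        idx = d.argmin(dim=-1)          # nearest code for the current residual
        indices.append(idx)
        # Subtract the chosen vector; the next quantizer codes what remains.
        residual = residual - cb[idx]
    return indices

Q, N, D = 8, 1024, 128                  # illustrative sizes
codebooks = [torch.randn(N, D) for _ in range(Q)]
z = torch.randn(4, D)
codes = rvq_encode(z, codebooks)        # Q tensors of shape (4,)
```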

Parameters:

Discriminator

The adversarial loss is computed with a STFT-based discriminator. The input to the STFTDiscriminator is the complex-valued STFT of the input waveform (real and imaginary parts), and the output is the logits.
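As a rough sketch of this idea (the layer sizes and channel counts below are illustrative assumptions, not the paper's exact architecture), the discriminator stacks the real and imaginary STFT parts as two input channels and maps them to a grid of logits:

```python
# Sketch of an STFT-based discriminator; architecture details are assumed.
import torch
import torch.nn as nn

class STFTDiscriminator(nn.Module):
    def __init__(self, n_fft=1024, hop=256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),   # per-patch logits
        )

    def forward(self, x):                                 # x: (B, T) waveform
        window = torch.hann_window(self.n_fft, device=x.device)
        spec = torch.stft(x, self.n_fft, hop_length=self.hop,
                          window=window, return_complex=True)
        # Stack real and imaginary parts as two channels: (B, 2, F, T').
        feats = torch.stack([spec.real, spec.imag], dim=1)
        return self.net(feats)

disc = STFTDiscriminator()
logits = disc(torch.randn(4, 24_000))   # 1 s of audio at 24 kHz
```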

w2v-BERT: Semantic Tokens

w2v-BERT [3] is a Transformer-based model for learning self-supervised audio representations. It maps an input waveform to a set of linguistic features.
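In AudioLM, the semantic tokens are obtained by k-means clustering the embeddings of an intermediate w2v-BERT layer. The sketch below shows only the quantization step on placeholder features; the `kmeans_tokenize` helper and the sizes are assumptions, and a real pipeline would use the pretrained model's activations and centroids fitted offline.

```python
# Semantic tokenization sketch: assign each frame to its nearest k-means centroid.
import torch

def kmeans_tokenize(feats, centroids):
    """feats: (frames, D) intermediate-layer embeddings;
    centroids: (K, D) k-means centroids. Returns (frames,) token ids."""
    return torch.cdist(feats, centroids).argmin(dim=-1)

K, D = 1024, 1024                  # illustrative cluster count / feature dim
centroids = torch.randn(K, D)      # would come from offline k-means
feats = torch.randn(50, D)         # stand-in for w2v-BERT features
semantic_tokens = kmeans_tokenize(feats, centroids)
```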

Training

At each training step, the number of quantizers $n_q$ is sampled uniformly at random from $[1, N_q]$, and only the first quantizers $Q_i, \ i = 1, \dots, n_q$ are applied (quantizer dropout), so a single model can serve multiple bitrates.
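A minimal sketch of the sampling step (the `sample_n_q` helper name is hypothetical):

```python
# Quantizer dropout: pick how many quantizers to apply this training step.
import torch

def sample_n_q(N_q: int) -> int:
    """Draw n_q uniformly from {1, ..., N_q}."""
    return int(torch.randint(1, N_q + 1, (1,)))

N_q = 8
n_q = sample_n_q(N_q)
# Only quantizers Q_1 ... Q_{n_q} are applied this step; skipping the rest
# trains the codec to degrade gracefully when fewer codes are used.
```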

Inference

At inference, set $n_q$ to select the desired bitrate.
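To make the trade-off concrete: each frame emits $n_q$ codes of $\log_2 N$ bits, so the bitrate is $n_q \cdot \log_2 N \cdot f$ for frame rate $f$. A small worked example, assuming SoundStream's typical $N = 1024$ codebook entries and a 75 Hz frame rate (24 kHz audio with 320x total striding):

```python
# Bitrate as a function of the number of quantizers used at inference.
import math

def bitrate_bps(n_q: int, N: int = 1024, frame_rate_hz: float = 75.0) -> float:
    """Bits per second: n_q codes per frame, log2(N) bits per code."""
    return n_q * math.log2(N) * frame_rate_hz

print(bitrate_bps(8))    # 6000.0  -> 6 kbps
print(bitrate_bps(16))   # 12000.0 -> 12 kbps
```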

References

[1] AudioLM: a Language Modeling Approach to Audio Generation

[2] SoundStream: An End-to-End Neural Audio Codec

[3] w2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training