AudioLDM
AudioLDM was proposed in AudioLDM: Text-to-Audio Generation with Latent Diffusion Models by Haohe Liu et al. Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
The abstract from the paper is:
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.
The original codebase can be found at haoheliu/AudioLDM.
Tips
When constructing a prompt, keep in mind:
- Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, “high quality” or “clear”) and make the prompt context specific (for example, “water stream in a forest” instead of “stream”).
- It’s best to use general terms like “cat” or “dog” instead of specific names or abstract objects the model may not be familiar with.
During inference:
- The quality of the predicted audio sample can be controlled by the num_inference_steps argument; more steps give higher-quality audio at the expense of slower inference.
- The length of the predicted audio sample can be controlled by varying the audio_length_in_s argument.
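The two inference knobs above can be sketched as follows. This is a minimal illustration rather than reference usage: generate_clip and expected_num_samples are hypothetical helper names, and the 16 kHz sampling rate is an assumption taken from the rate=16000 used in the example on this page.

```python
SAMPLE_RATE_HZ = 16000  # assumption: matches rate=16000 in this page's example


def expected_num_samples(audio_length_in_s, sample_rate=SAMPLE_RATE_HZ):
    """Number of waveform samples for a clip of the requested length."""
    return int(audio_length_in_s * sample_rate)


def generate_clip(pipe, prompt, num_inference_steps=10, audio_length_in_s=5.0):
    """Run an already-loaded AudioLDMPipeline and return the first waveform.

    More steps usually give higher-quality audio but slower inference;
    audio_length_in_s sets the clip duration in seconds.
    """
    output = pipe(
        prompt,
        num_inference_steps=num_inference_steps,
        audio_length_in_s=audio_length_in_s,
    )
    return output.audios[0]
```

With a pipeline loaded via AudioLDMPipeline.from_pretrained, a call such as generate_clip(pipe, "water stream in a forest", num_inference_steps=50, audio_length_in_s=10.0) trades speed for quality and should return a waveform of roughly expected_num_samples(10.0) samples.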
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
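Both ideas can be sketched in a few lines. The helpers swap_scheduler and share_components are hypothetical names introduced here for illustration; they assume the pipeline exposes a scheduler attribute and a components mapping, and that scheduler classes provide from_config, as diffusers pipelines and schedulers generally do.

```python
def swap_scheduler(pipe, scheduler_cls):
    """Replace pipe.scheduler with scheduler_cls, reusing the current config."""
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    return pipe


def share_components(pipe, pipeline_cls):
    """Build a second pipeline from components that are already in memory."""
    return pipeline_cls(**pipe.components)


# Example usage (requires diffusers and downloads model weights):
#   from diffusers import AudioLDMPipeline, LMSDiscreteScheduler
#   pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")
#   swap_scheduler(pipe, LMSDiscreteScheduler)
#   second_pipe = share_components(pipe, AudioLDMPipeline)
```

Reusing the existing scheduler config keeps the noise schedule consistent with the checkpoint, so only the sampling algorithm changes when comparing speed and quality.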
AudioLDMPipeline
class diffusers.AudioLDMPipeline
( vae: AutoencoderKL text_encoder: ClapTextModelWithProjection tokenizer: typing.Union[transformers.models.roberta.tokenization_roberta.RobertaTokenizer, transformers.models.roberta.tokenization_roberta_fast.RobertaTokenizerFast] unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers vocoder: SpeechT5HifiGan )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode mel-spectrograms to and from latent representations.
- text_encoder (ClapTextModelWithProjection) — Frozen text encoder (ClapTextModelWithProjection, specifically the laion/clap-htsat-unfused variant).
- tokenizer (PreTrainedTokenizer) — A RobertaTokenizer to tokenize text.
- unet (UNet2DConditionModel) — A UNet2DConditionModel to denoise the encoded audio latents.
- scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded audio latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
- vocoder (SpeechT5HifiGan) — Vocoder of class SpeechT5HifiGan.
Pipeline for text-to-audio generation using AudioLDM.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
(
prompt: typing.Union[str, typing.List[str]] = None
audio_length_in_s: typing.Optional[float] = None
num_inference_steps: int = 10
guidance_scale: float = 2.5
negative_prompt: typing.Union[str, typing.List[str], NoneType] = None
num_waveforms_per_prompt: typing.Optional[int] = 1
eta: float = 0.0
generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
latents: typing.Optional[torch.FloatTensor] = None
prompt_embeds: typing.Optional[torch.FloatTensor] = None
negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None
return_dict: bool = True
callback: typing.Union[typing.Callable[[int, int, torch.FloatTensor], NoneType], NoneType] = None
callback_steps: typing.Optional[int] = 1
cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None
output_type: typing.Optional[str] = 'np'
) → AudioPipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide audio generation. If not defined, you need to pass prompt_embeds.
- audio_length_in_s (float, optional, defaults to 5.12) — The length of the generated audio sample in seconds.
- num_inference_steps (int, optional, defaults to 10) — The number of denoising steps. More denoising steps usually lead to higher-quality audio at the expense of slower inference.
- guidance_scale (float, optional, defaults to 2.5) — A higher guidance scale value encourages the model to generate audio that is closely linked to the text prompt at the expense of lower sound quality. Guidance scale is enabled when guidance_scale > 1.
- negative_prompt (str or List[str], optional) — The prompt or prompts to guide what to not include in audio generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1).
- num_waveforms_per_prompt (int, optional, defaults to 1) — The number of waveforms to generate per prompt.
- eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.FloatTensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for audio generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
- return_dict (bool, optional, defaults to True) — Whether or not to return an AudioPipelineOutput instead of a plain tuple.
- callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.FloatTensor).
- callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
- output_type (str, optional, defaults to "np") — The output format of the generated audio. Choose between "np" to return a NumPy np.ndarray or "pt" to return a PyTorch torch.Tensor object.
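Several of these arguments are typically combined in one call. The sketch below is an illustration under stated assumptions, not the library's recommended recipe: generate_variations and log_progress are hypothetical helpers, and the caller is assumed to pass an already-loaded pipeline plus an optional seeded torch.Generator.

```python
def log_progress(step, timestep, latents):
    """Callback with the documented callback(step, timestep, latents) signature."""
    print(f"step={step} timestep={timestep} latents_shape={tuple(latents.shape)}")


def generate_variations(
    pipe,
    prompt,
    negative_prompt=None,
    generator=None,
    num_waveforms_per_prompt=3,
    callback=None,
    callback_steps=1,
):
    """Generate several candidate waveforms for one prompt."""
    output = pipe(
        prompt,
        negative_prompt=negative_prompt,
        num_waveforms_per_prompt=num_waveforms_per_prompt,
        guidance_scale=2.5,  # > 1, so guidance (and the negative prompt) is active
        generator=generator,  # seeded generator makes the run reproducible
        callback=callback,
        callback_steps=callback_steps,
    )
    return output.audios


# Example usage (requires torch, diffusers and a loaded pipeline):
#   generator = torch.Generator("cuda").manual_seed(0)
#   audios = generate_variations(
#       pipe, "a dog barking in a park", negative_prompt="low quality",
#       generator=generator, callback=log_progress, callback_steps=5,
#   )
```

Passing the same seeded generator and prompt again should reproduce the same waveforms, which makes it easier to compare the effect of a negative prompt or a different guidance scale.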
Returns
AudioPipelineOutput or tuple
If return_dict is True, AudioPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated audio.
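The two return styles can be handled uniformly. A minimal sketch, assuming only what is described above (an object with an audios attribute, or a tuple whose first element holds the generated audio); first_waveform is a hypothetical helper name.

```python
def first_waveform(result):
    """Return the first generated waveform from either return style."""
    if isinstance(result, tuple):
        # return_dict=False: the first tuple element holds the generated audio
        return result[0][0]
    # return_dict=True: AudioPipelineOutput exposes the audio as .audios
    return result.audios[0]


# Example usage with a loaded pipeline:
#   first_waveform(pipe(prompt))                     # AudioPipelineOutput
#   first_waveform(pipe(prompt, return_dict=False))  # plain tuple
```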
The call function to the pipeline for generation.
Examples:
>>> from diffusers import AudioLDMPipeline
>>> import torch
>>> import scipy.io.wavfile
>>> repo_id = "cvssp/audioldm-s-full-v2"
>>> pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
>>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
>>> # save the audio sample as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
disable_vae_slicing
Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
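One way to use sliced decoding is to enable it only around a large batched call. This is a sketch under the assumption that the pipeline exposes enable_vae_slicing and disable_vae_slicing as documented here; generate_with_vae_slicing is a hypothetical helper name.

```python
def generate_with_vae_slicing(pipe, prompts, **call_kwargs):
    """Run a batched generation with sliced VAE decoding, then restore the default."""
    pipe.enable_vae_slicing()
    try:
        return pipe(prompts, **call_kwargs).audios
    finally:
        # Always go back to one-step decoding, even if generation fails
        pipe.disable_vae_slicing()


# Example usage with a loaded pipeline:
#   prompts = ["rain on a tin roof", "a cat purring", "distant thunder"]
#   audios = generate_with_vae_slicing(pipe, prompts, num_inference_steps=10)
```

Slicing reduces peak memory during decoding, so it is mainly worth enabling when the batch (prompts times num_waveforms_per_prompt) is large.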
AudioPipelineOutput
class diffusers.AudioPipelineOutput
( audios: ndarray )
Output class for audio pipelines.