Transformers documentation

Bark

Join the Model Database community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Bark

Overview

Bark is a transformer-based text-to-speech model proposed by Suno AI in suno-ai/bark.

Bark is made of 4 main models:

BarkSemanticModel (also referred to as the ‘text’ model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
BarkCoarseModel (also referred to as the ‘coarse acoustics’ model): a causal autoregressive transformer, that takes as input the results of the BarkSemanticModel model. It aims at predicting the first two audio codebooks necessary for EnCodec.
BarkFineModel (the ‘fine acoustics’ model), this time a non-causal autoencoder transformer, which iteratively predicts the last codebooks based on the sum of the previous codebooks embeddings.
having predicted all the codebook channels from the EncodecModel, Bark uses it to decode the output audio array.

It should be noted that each of the first three modules can support conditional speaker embeddings to condition the output sound according to specific predefined voice.

Optimizing Bark

Bark can be optimized with just a few extra lines of code, which significantly reduces its memory footprint and accelerates inference.

Using half-precision

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision.

from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

Using 🤗 Better Transformer

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

model =  model.to_bettertransformer()

Note that 🤗 Optimum must be installed before using this feature. Here’s how to install it.

Using CPU offload

As mentioned above, Bark is made up of 4 sub-models, which are called up sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you’re using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the GPU’s submodels when they’re idle. This operation is called CPU offloading. You can use it with one line of code.

model.enable_cpu_offload()

Note that 🤗 Accelerate must be installed before using this feature. Here’s how to install it.

Combining optimizaton techniques

You can combine optimization techniques, and use CPU offload, half-precision and 🤗 Better Transformer all at once.

from transformers import BarkModel
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load in fp16
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to(device)

# convert to bettertransformer
model = BetterTransformer.transform(model, keep_original_model=False)

# enable CPU offload
model.enable_cpu_offload()

Find out more on inference optimization techniques here.

Tips

Suno offers a library of voice presets in a number of languages here. These presets are also uploaded in the hub here or here.

>>> from transformers import AutoProcessor, BarkModel

>>> processor = AutoProcessor.from_pretrained("suno/bark")
>>> model = BarkModel.from_pretrained("suno/bark")

>>> voice_preset = "v2/en_speaker_6"

>>> inputs = processor("Hello, my dog is cute", voice_preset=voice_preset)

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects.

>>> # Multilingual speech - simplified Chinese
>>> inputs = processor("惊人的！我会说中文")

>>> # Multilingual speech - French - let's use a voice_preset as well
>>> inputs = processor("Incroyable! Je peux générer du son.", voice_preset="fr_speaker_5")

>>> # Bark can also generate music. You can help it out by adding music notes around your lyrics.
>>> inputs = processor("♪ Hello, my dog is cute ♪")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

The model can also produce nonverbal communications like laughing, sighing and crying.

>>> # Adding non-speech cues to the input text
>>> inputs = processor("Hello uh ... [clears throat], my dog is cute [laughter]")

>>> audio_array = model.generate(**inputs)
>>> audio_array = audio_array.cpu().numpy().squeeze()

To save the audio, simply take the sample rate from the model config and some scipy utility:

>>> from scipy.io.wavfile import write as write_wav

>>> # save audio to disk, but first take the sample rate from the model config
>>> sample_rate = model.generation_config.sample_rate
>>> write_wav("bark_generation.wav", sample_rate, audio_array)

This model was contributed by Yoach Lacombe (ylacombe) and Sanchit Gandhi (sanchit-gandhi). The original code can be found here.

Transformers

Bark

Overview

Optimizing Bark

Using half-precision

Using 🤗 Better Transformer

Using CPU offload

Combining optimizaton techniques

Tips

BarkConfig

class transformers.BarkConfig

from_sub_model_configs

BarkProcessor

class transformers.BarkProcessor

__call__

from_pretrained

save_pretrained

BarkModel

class transformers.BarkModel

generate

enable_cpu_offload

BarkSemanticModel

class transformers.BarkSemanticModel

forward

BarkCoarseModel

class transformers.BarkCoarseModel

forward

BarkFineModel

class transformers.BarkFineModel

forward

BarkCausalModel

class transformers.BarkCausalModel

forward

BarkCoarseConfig

class transformers.BarkCoarseConfig

BarkFineConfig

class transformers.BarkFineConfig

BarkSemanticConfig

class transformers.BarkSemanticConfig

call