Trainer
At TRL we support PPO (Proximal Policy Optimisation) with an implementation that largely follows the structure introduced in the paper “Fine-Tuning Language Models from Human Preferences” by D. Ziegler et al. [paper, code].
The Trainer and model classes are largely inspired by the transformers.Trainer and transformers.AutoModel classes and adapted for RL.
We also support a RewardTrainer that can be used to train a reward model.
PPOConfig
class trl.PPOConfig
< source >( task_name: typing.Optional[str] = None model_name: typing.Optional[str] = None steps: typing.Optional[int] = 20000 learning_rate: typing.Optional[float] = 1e-05 adap_kl_ctrl: typing.Optional[bool] = True init_kl_coef: typing.Optional[float] = 0.2 kl_penalty: typing.Optional[str] = 'kl' target: typing.Optional[float] = 6 horizon: typing.Optional[float] = 10000 gamma: typing.Optional[float] = 1 lam: typing.Optional[float] = 0.95 cliprange: typing.Optional[float] = 0.2 cliprange_value: typing.Optional[float] = 0.2 vf_coef: typing.Optional[float] = 0.1 batch_size: typing.Optional[int] = 256 forward_batch_size: typing.Optional[int] = None mini_batch_size: typing.Optional[int] = 1 backward_batch_size: typing.Optional[int] = 1 gradient_accumulation_steps: typing.Optional[int] = 1 ppo_epochs: typing.Optional[int] = 4 remove_unused_columns: typing.Optional[bool] = True log_with: typing.Optional[str] = None tracker_kwargs: typing.Optional[dict] = <factory> accelerator_kwargs: typing.Optional[dict] = <factory> project_kwargs: typing.Optional[dict] = <factory> tracker_project_name: typing.Optional[str] = 'trl' max_grad_norm: typing.Optional[float] = None seed: typing.Optional[int] = 0 optimize_cuda_cache: typing.Optional[bool] = False early_stopping: typing.Optional[bool] = False target_kl: typing.Optional[float] = 0.1 push_to_hub_if_best_kwargs: typing.Optional[dict] = <factory> compare_steps: typing.Optional[int] = 1 ratio_threshold: typing.Optional[float] = 10.0 use_score_scaling: typing.Optional[bool] = False use_score_norm: typing.Optional[bool] = False score_clip: typing.Optional[float] = None )
Configuration class for PPOTrainer
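The adap_kl_ctrl, init_kl_coef, target and horizon fields point to an adaptive KL coefficient of the kind described by Ziegler et al. The following is a minimal sketch of that idea, not the library's implementation: the class name AdaptiveKLController and the clipping range of 0.2 are assumptions taken from the paper's description.

```python
class AdaptiveKLController:
    """Hypothetical sketch of an adaptive KL coefficient (after Ziegler et al.)."""

    def __init__(self, init_kl_coef: float, target: float, horizon: float):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error between observed KL and the target, clipped to [-0.2, 0.2].
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Grow the coefficient when KL is above target, shrink it when below.
        self.value *= 1.0 + error * n_steps / self.horizon
        return self.value


# Values mirror the PPOConfig defaults listed above (init_kl_coef, target, horizon, batch_size).
ctrl = AdaptiveKLController(init_kl_coef=0.2, target=6.0, horizon=10000.0)
coef_after_high_kl = ctrl.update(current_kl=12.0, n_steps=256)

ctrl_low = AdaptiveKLController(init_kl_coef=0.2, target=6.0, horizon=10000.0)
coef_after_low_kl = ctrl_low.update(current_kl=3.0, n_steps=256)
```

With this scheme the penalty tightens when the policy drifts further from the reference model than the target and relaxes otherwise.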
PPOTrainer
class trl.PPOTrainer
< source >( config: PPOConfig = None model: PreTrainedModelWrapper = None ref_model: typing.Optional[trl.models.modeling_base.PreTrainedModelWrapper] = None tokenizer: PreTrainedTokenizerBase = None dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset, NoneType] = None optimizer: typing.Optional[torch.optim.optimizer.Optimizer] = None data_collator: typing.Optional[typing.Callable] = None num_shared_layers: typing.Optional[int] = None lr_scheduler: typing.Optional[torch.optim.lr_scheduler._LRScheduler] = None )
Parameters
- **config** (PPOConfig) — Configuration object for PPOTrainer. Check the documentation of PPOConfig for more details.
- **model** (PreTrainedModelWrapper) — Model to be optimized, a Hugging Face transformer model with a value head. Check the documentation of PreTrainedModelWrapper for more details.
- **ref_model** (PreTrainedModelWrapper, optional) — Reference model to be used for the KL penalty, a Hugging Face transformer model with a causal language modelling head. Check the documentation of PreTrainedModelWrapper for more details. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized, with shared layers.
- **tokenizer** (PreTrainedTokenizerBase) — Tokenizer to be used for encoding the data. Check the documentation of transformers.PreTrainedTokenizer and transformers.PreTrainedTokenizerFast for more details.
- **dataset** (Union[torch.utils.data.Dataset, datasets.Dataset], optional) — PyTorch dataset or Hugging Face dataset. This is used to create a PyTorch dataloader. If no dataset is provided, the dataloader must be created outside the trainer: users need to design their own dataloader and make sure the batch size that is used is the same as the one specified in the configuration object.
- **optimizer** (torch.optim.Optimizer, optional) — Optimizer to be used for training. If no optimizer is provided, the trainer will create an Adam optimizer with the learning rate specified in the configuration object.
- **data_collator** (DataCollatorForLanguageModeling, optional) — Data collator to be used for training and passed along the dataloader.
- **num_shared_layers** (int, optional) — Number of layers to be shared between the model and the reference model, if no reference model is passed. If no number is provided, all the layers will be shared.
- **lr_scheduler** (torch.optim.lr_scheduler, optional) — Learning rate scheduler to be used for training.
The PPOTrainer uses Proximal Policy Optimization to optimise language models. Note, this trainer is heavily inspired by the original OpenAI learning to summarize work here: https://github.com/openai/summarize-from-feedback
batched_forward_pass
< source >( model: PreTrainedModelWrapper queries: Tensor responses: Tensor model_inputs: dict return_logits: bool = False response_masks: typing.Optional[torch.Tensor] = None ) → (tuple)
Parameters
- queries (torch.LongTensor) — List of tensors containing the encoded queries, shape (batch_size, query_length)
- responses (torch.LongTensor) — List of tensors containing the encoded responses, shape (batch_size, response_length)
- return_logits (bool, optional, defaults to False) — Whether to return all_logits. Set to False if logits are not needed to reduce memory consumption.
Returns
(tuple)
- all_logprobs (torch.FloatTensor): Log probabilities of the responses, shape (batch_size, response_length)
- all_ref_logprobs (torch.FloatTensor): Log probabilities of the responses, shape (batch_size, response_length)
- all_values (torch.FloatTensor): Values of the responses, shape (batch_size, response_length)
Calculate model outputs in multiple batches.
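The idea of computing outputs in multiple batches can be sketched without any model at all. In the example below, batched_apply and the lambda standing in for a forward pass are hypothetical names used for illustration only; the real method works on tensors and a PreTrainedModelWrapper.

```python
from typing import Callable, List, Sequence


def batched_apply(
    items: Sequence[int],
    batch_size: int,
    fn: Callable[[Sequence[int]], List[float]],
) -> List[float]:
    """Apply fn to consecutive slices of at most batch_size items and join the results."""
    outputs: List[float] = []
    for start in range(0, len(items), batch_size):
        outputs.extend(fn(items[start:start + batch_size]))
    return outputs


calls = []


def fake_forward(chunk: Sequence[int]) -> List[float]:
    # Stand-in for a model forward pass: records the chunk size, returns one value per input.
    calls.append(len(chunk))
    return [x * 0.5 for x in chunk]


outputs = batched_apply([1, 2, 3, 4, 5], batch_size=2, fn=fake_forward)
```

Peak memory then scales with the chunk size rather than with the full batch, at the cost of more forward calls.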
compute_rewards
< source >( scores: FloatTensor logprobs: FloatTensor ref_logprobs: FloatTensor masks: LongTensor )
Compute per token rewards from scores and KL-penalty.
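A minimal sketch of how per-token rewards might be assembled from a scalar score and a KL penalty is given below. It assumes the simple log-ratio estimate of the KL term and that the score is added at the last unmasked response token; the function name per_token_rewards and the plain-list inputs are illustrative, not the trainer's actual tensor code.

```python
from typing import List


def per_token_rewards(
    score: float,
    logprobs: List[float],
    ref_logprobs: List[float],
    mask: List[int],
    kl_coef: float,
) -> List[float]:
    """KL-penalised reward per response token, with the score added at the last unmasked token.

    Assumes at least one entry of mask is 1.
    """
    # Penalise each token by the (scaled) log-ratio between policy and reference model.
    rewards = [
        -kl_coef * (lp - ref_lp) * m
        for lp, ref_lp, m in zip(logprobs, ref_logprobs, mask)
    ]
    last = max(i for i, m in enumerate(mask) if m == 1)
    rewards[last] += score
    return rewards


rewards = per_token_rewards(
    score=1.0,
    logprobs=[-1.0, -2.0, -0.5],
    ref_logprobs=[-1.5, -2.0, -1.0],
    mask=[1, 1, 1],
    kl_coef=0.2,
)
```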
create_model_card
< source >( path: str model_name: typing.Optional[str] = 'TRL Model' )
Creates and saves a model card for a TRL model.
gather_stats
< source >( stats ) → dict[str, Any]
Gather stats from all processes. Useful in the context of distributed training.
generate
< source >( query_tensor: typing.Union[torch.Tensor, typing.List[torch.Tensor]] length_sampler: typing.Callable = None batch_size: int = 4 return_prompt: bool = True **generation_kwargs ) → torch.LongTensor
Parameters
- query_tensor (torch.LongTensor) — A tensor of shape (batch_size, seq_len) containing query tokens.
- generation_kwargs (dict[str, Any]) — Keyword arguments for generation.
- length_sampler (Callable, optional) — Callable that returns the number of newly generated tokens.
- batch_size (int, optional) — Batch size used for generation, defaults to 4.
- return_prompt (bool, optional) — If set to False the prompt is not returned but only the newly generated tokens, defaults to True.
Returns
torch.LongTensor
A tensor of shape (batch_size, gen_len) containing response tokens.
Generate a response with the model given the query tensor by calling the generate method of the model.
log_stats
< source >( stats: dict batch: dict rewards: typing.List[torch.FloatTensor] columns_to_log: typing.List[str] = ['query', 'response'] )
A function that logs all the training stats. Call it at the end of each epoch.
loss
< source >( old_logprobs: FloatTensor values: FloatTensor logits: FloatTensor vpreds: FloatTensor logprobs: FloatTensor mask: LongTensor advantages: FloatTensor returns: FloatTensor )
Parameters
- old_logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
- values (torch.FloatTensor) — Values of the value head, shape (batch_size, response_length)
- rewards (torch.FloatTensor) — Rewards from the reward model, shape (batch_size, response_length)
- logits (torch.FloatTensor) — Logits of the model, shape (batch_size, response_length, vocab_size)
- v_pred (torch.FloatTensor) — Values of the value head, shape (batch_size, response_length)
- logprobs (torch.FloatTensor) — Log probabilities of the model, shape (batch_size, response_length)
Calculate policy and value losses.
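The cliprange, cliprange_value and vf_coef settings in PPOConfig correspond to the standard clipped PPO objective. The sketch below shows that objective on plain Python lists for a single flattened sequence; the function name ppo_losses and the exact way the terms are averaged are assumptions for illustration, not a copy of the trainer's tensor code.

```python
import math
from typing import List, Tuple


def ppo_losses(
    old_logprobs: List[float],
    logprobs: List[float],
    values: List[float],
    vpreds: List[float],
    advantages: List[float],
    returns: List[float],
    mask: List[int],
    cliprange: float = 0.2,
    cliprange_value: float = 0.2,
    vf_coef: float = 0.1,
) -> Tuple[float, float, float]:
    """Return (policy_loss, value_loss, total_loss), averaged over unmasked tokens."""
    pg_terms, vf_terms = [], []
    for olp, lp, v, vp, adv, ret, m in zip(
        old_logprobs, logprobs, values, vpreds, advantages, returns, mask
    ):
        if not m:
            continue
        # Probability ratio between the new and the old policy for this token.
        ratio = math.exp(lp - olp)
        clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
        # Pessimistic (larger) of the unclipped and clipped policy losses.
        pg_terms.append(max(-adv * ratio, -adv * clipped_ratio))
        # Keep the new value prediction within cliprange_value of the old value.
        vp_clipped = max(v - cliprange_value, min(v + cliprange_value, vp))
        vf_terms.append(0.5 * max((vp - ret) ** 2, (vp_clipped - ret) ** 2))
    n = len(pg_terms)
    pg_loss = sum(pg_terms) / n
    vf_loss = sum(vf_terms) / n
    return pg_loss, vf_loss, pg_loss + vf_coef * vf_loss


pg_loss, vf_loss, total_loss = ppo_losses(
    old_logprobs=[-1.0, -1.0],
    logprobs=[-1.0, -0.5],
    values=[0.0, 0.0],
    vpreds=[0.1, 0.5],
    advantages=[1.0, 1.0],
    returns=[1.0, 1.0],
    mask=[1, 1],
)
```

In the second token the ratio exp(0.5) exceeds 1 + cliprange, so the clipped term is used and the policy gets no extra credit for moving further.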
prepare_dataloader
< source >( dataset: typing.Union[torch.utils.data.dataset.Dataset, datasets.arrow_dataset.Dataset] data_collator = None ) → torch.utils.data.DataLoader
Parameters
- dataset (Union[torch.utils.data.Dataset, datasets.Dataset]) — PyTorch dataset or Hugging Face dataset. If a Hugging Face dataset is passed, the dataset will be preprocessed by removing the columns that are not used by the model.
- data_collator (Optional[function]) — Data collator function.
Returns
torch.utils.data.DataLoader
PyTorch dataloader
Prepare the dataloader for training.
record_step_stats
< source >( kl_coef: float **data ) → stats (dict)
Record training step statistics.
step
< source >( queries: typing.List[torch.LongTensor] responses: typing.List[torch.LongTensor] scores: typing.List[torch.FloatTensor] response_masks: typing.Optional[typing.List[torch.LongTensor]] = None ) → dict[str, Any]
Parameters
- queries (List[torch.LongTensor]) — List of tensors containing the encoded queries of shape (query_length)
- responses (List[torch.LongTensor]) — List of tensors containing the encoded responses of shape (response_length)
- scores (List[torch.FloatTensor]) — List of tensors containing the scores.
- response_masks (List[torch.FloatTensor], optional) — List of tensors containing masks of the response tokens.
Returns
dict[str, Any]
A summary of the training statistics
Run a PPO optimisation step given a list of queries, model responses, and rewards.
train_minibatch
< source >( old_logprobs: FloatTensor values: FloatTensor logprobs: FloatTensor logits: FloatTensor vpreds: FloatTensor mask: LongTensor advantages: FloatTensor returns: FloatTensor ) → train_stats (dict[str, torch.Tensor])
Parameters
- logprobs (torch.FloatTensor) — Log probabilities of the model, shape [batch_size, response_length]
- values (torch.FloatTensor) — Values of the value head, shape [batch_size, response_length]
- query (torch.LongTensor) — Encoded queries, shape [batch_size, query_length]
- response (torch.LongTensor) — Encoded responses, shape [batch_size, response_length]
- model_input (torch.LongTensor) — Concatenated queries and responses, shape [batch_size, query_length+response_length]
Returns
train_stats (dict[str, torch.Tensor])
Dictionary of training statistics
Train one PPO minibatch
RewardTrainer
class trl.RewardTrainer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None args: TrainingArguments = None data_collator: typing.Optional[DataCollator] = None train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None max_length: typing.Optional[int] = None peft_config: typing.Optional[typing.Dict] = None )
The RewardTrainer can be used to train your custom Reward Model. It is a subclass of the transformers.Trainer class and inherits all of its attributes and methods. It is recommended to use an AutoModelForSequenceClassification as the reward model. The reward model should be trained on a dataset of paired examples, where each example is a tuple of two sequences. The reward model should be trained to predict which example in the pair is more relevant to the task at hand.
The reward trainer expects a very specific format for the dataset. The dataset should contain at least 4 entries if you don’t use the default RewardDataCollatorWithPadding data collator. The entries should be named:
- input_ids_chosen
- attention_mask_chosen
- input_ids_rejected
- attention_mask_rejected
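To make the expected format concrete, the sketch below builds one dataset row with the four entries and shows a pairwise preference loss of the form -log(sigmoid(r_chosen - r_rejected)), which is a common choice for training on chosen/rejected pairs. The token ids are made up, and pairwise_reward_loss is a hypothetical helper rather than part of the library.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(reward_chosen - reward_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    # softplus(-diff) equals -log(sigmoid(diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# One row of a paired dataset; the ids below are arbitrary placeholders.
example = {
    "input_ids_chosen": [101, 7592, 102],
    "attention_mask_chosen": [1, 1, 1],
    "input_ids_rejected": [101, 2053, 102, 0],
    "attention_mask_rejected": [1, 1, 1, 0],
}

loss_when_tied = pairwise_reward_loss(0.0, 0.0)
loss_when_correct = pairwise_reward_loss(2.0, 0.0)
loss_when_wrong = pairwise_reward_loss(0.0, 2.0)
```

The loss shrinks as the reward model scores the chosen sequence above the rejected one, and grows when the ordering is reversed.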
SFTTrainer
class trl.SFTTrainer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, str] = None args: TrainingArguments = None data_collator: typing.Optional[DataCollator] = None train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None compute_metrics: typing.Union[typing.Callable[[transformers.trainer_utils.EvalPrediction], typing.Dict], NoneType] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None peft_config: typing.Optional[typing.Dict] = None dataset_text_field: typing.Optional[str] = None packing: typing.Optional[bool] = False formatting_func: typing.Optional[typing.Callable] = None max_seq_length: typing.Optional[int] = None infinite: typing.Optional[bool] = False num_of_sequences: typing.Optional[int] = 1024 chars_per_token: typing.Optional[float] = 3.6 dataset_num_proc: typing.Optional[int] = None dataset_batch_size: int = 1000 )
Parameters
- model (Union[transformers.PreTrainedModel, nn.Module, str]) — The model to train, can be a PreTrainedModel, a torch.nn.Module or a string with the model name to load from cache or download. The model can be also converted to a PeftModel if a PeftConfig object is passed to the peft_config argument.
- args (Optional[transformers.TrainingArguments]) — The arguments to tweak for training. Please refer to the official documentation of transformers.TrainingArguments for more information.
- data_collator (Optional[transformers.DataCollator]) — The data collator to use for training.
- train_dataset (Optional[datasets.Dataset]) — The dataset to use for training. We recommend users to use trl.trainer.ConstantLengthDataset to create their dataset.
- eval_dataset (Optional[Union[datasets.Dataset, Dict[str, datasets.Dataset]]]) — The dataset to use for evaluation. We recommend users to use trl.trainer.ConstantLengthDataset to create their dataset.
- tokenizer (Optional[transformers.PreTrainedTokenizer]) — The tokenizer to use for training. If not specified, the tokenizer associated to the model will be used.
- model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
- compute_metrics (Callable[[transformers.EvalPrediction], Dict], optional, defaults to compute_accuracy) — The metrics to use for evaluation. If no metrics are specified, the default metric (compute_accuracy) will be used.
- callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
- optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
- preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
- peft_config (Optional[PeftConfig]) — The PeftConfig object to use to initialize the PeftModel.
- dataset_text_field (Optional[str]) — The name of the text field of the dataset. If this is passed by a user, the trainer will automatically create a ConstantLengthDataset based on the dataset_text_field argument.
- formatting_func (Optional[Callable]) — The formatting function to be used for creating the ConstantLengthDataset.
- max_seq_length (Optional[int]) — The maximum sequence length to use for the ConstantLengthDataset and for automatically creating the Dataset. Defaults to 512.
- infinite (Optional[bool]) — Whether to use an infinite dataset or not. Defaults to False.
- num_of_sequences (Optional[int]) — The number of sequences to use for the ConstantLengthDataset. Defaults to 1024.
- chars_per_token (Optional[float]) — The number of characters per token to use for the ConstantLengthDataset. Defaults to 3.6. You can check how this is computed in the stack-llama example: https://github.com/huggingface/trl/blob/08f550674c553c36c51d1027613c29f14f3676a5/examples/stack_llama/scripts/supervised_finetuning.py#L53.
- packing (Optional[bool]) — Used only in case dataset_text_field is passed. This argument is used by the ConstantLengthDataset to pack the sequences of the dataset.
- dataset_num_proc (Optional[int]) — The number of workers to use to tokenize the data. Only used when packing=False. Defaults to None.
- dataset_batch_size (int) — The number of examples to tokenize per batch. If batch_size <= 0 or batch_size == None, tokenize the full dataset as a single batch. Defaults to 1000.
Class definition of the Supervised Finetuning Trainer (SFT Trainer).
This class is a wrapper around the transformers.Trainer class and inherits all of its attributes and methods.
The trainer takes care of properly initializing the PeftModel in case a user passes a PeftConfig object.
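The packing option and the ConstantLengthDataset mentioned in the parameters rely on joining tokenized examples and cutting the result into fixed-length chunks. The sketch below illustrates that idea on plain lists; pack_constant_length and the choice to drop the trailing remainder are assumptions for illustration and may differ from what the library does.

```python
from typing import Iterable, Iterator, List


def pack_constant_length(
    token_seqs: Iterable[List[int]],
    seq_length: int,
    eos_token_id: int,
) -> Iterator[List[int]]:
    """Join sequences with an end-of-sequence token, then yield chunks of exactly seq_length."""
    buffer: List[int] = []
    for seq in token_seqs:
        buffer.extend(seq)
        buffer.append(eos_token_id)
    # Any remainder shorter than seq_length is dropped in this sketch.
    for start in range(0, len(buffer) - seq_length + 1, seq_length):
        yield buffer[start:start + seq_length]


chunks = list(
    pack_constant_length([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_length=4, eos_token_id=0)
)
```

Packing avoids padding short examples, so every position in a training batch carries a real token.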
DPOTrainer
class trl.DPOTrainer
< source >( model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module] = None ref_model: typing.Union[transformers.modeling_utils.PreTrainedModel, torch.nn.modules.module.Module, NoneType] = None beta: float = 0.1 args: TrainingArguments = None data_collator: typing.Optional[DataCollator] = None label_pad_token_id: int = -100 padding_value: int = 0 truncation_mode: str = 'keep_end' train_dataset: typing.Optional[datasets.arrow_dataset.Dataset] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, typing.Dict[str, datasets.arrow_dataset.Dataset], NoneType] = None tokenizer: typing.Optional[transformers.tokenization_utils_base.PreTrainedTokenizerBase] = None model_init: typing.Union[typing.Callable[[], transformers.modeling_utils.PreTrainedModel], NoneType] = None callbacks: typing.Optional[typing.List[transformers.trainer_callback.TrainerCallback]] = None optimizers: typing.Tuple[torch.optim.optimizer.Optimizer, torch.optim.lr_scheduler.LambdaLR] = (None, None) preprocess_logits_for_metrics: typing.Union[typing.Callable[[torch.Tensor, torch.Tensor], torch.Tensor], NoneType] = None max_length: typing.Optional[int] = None max_prompt_length: typing.Optional[int] = None peft_config: typing.Optional[typing.Dict] = None disable_dropout: bool = True )
Parameters
- model (transformers.PreTrainedModel) — The model to train, preferably an AutoModelForCausalLM.
- ref_model (PreTrainedModelWrapper) — Hugging Face transformer model with a causal language modelling head. Used for implicit reward computation and loss. If no reference model is provided, the trainer will create a reference model with the same architecture as the model to be optimized.
- beta (float, defaults to 0.1) — The beta factor in DPO loss. Higher beta means less divergence from the initial policy.
- args (transformers.TrainingArguments) — The arguments to use for training.
- data_collator (transformers.DataCollator) — The data collator to use for training. If None is specified, the default data collator (DPODataCollatorWithPadding) will be used which will pad the sequences to the maximum length of the sequences in the batch, given a dataset of paired sequences.
- label_pad_token_id (int, defaults to -100) — The label pad token id. This argument is required if you want to use the default data collator.
- padding_value (int, defaults to 0) — The padding value. This argument is required if you want to use the default data collator.
- truncation_mode (str, defaults to keep_end) — The truncation mode to use, either keep_end or keep_start. This argument is required if you want to use the default data collator.
- train_dataset (datasets.Dataset) — The dataset to use for training.
- eval_dataset (datasets.Dataset) — The dataset to use for evaluation.
- tokenizer (transformers.PreTrainedTokenizerBase) — The tokenizer to use for training. This argument is required if you want to use the default data collator.
- model_init (Callable[[], transformers.PreTrainedModel]) — The model initializer to use for training. If None is specified, the default model initializer will be used.
- callbacks (List[transformers.TrainerCallback]) — The callbacks to use for training.
- optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR]) — The optimizer and scheduler to use for training.
- preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor]) — The function to use to preprocess the logits before computing the metrics.
- max_length (int, defaults to None) — The maximum length of the sequences in the batch. This argument is required if you want to use the default data collator.
- max_prompt_length (int, defaults to None) — The maximum length of the prompt. This argument is required if you want to use the default data collator.
- peft_config (Dict, defaults to None) — The PEFT configuration to use for training. If you pass a PEFT configuration, the model will be wrapped in a PEFT model.
- disable_dropout (bool, defaults to True) — Whether or not to disable dropouts in model and ref_model.
Initialize DPOTrainer.
concatenated_forward
< source >( model: Module batch: typing.Dict[str, typing.Union[typing.List, torch.LongTensor]] )
Run the given model on the given batch of inputs, concatenating the chosen and rejected inputs together.
We do this to avoid doing two forward passes, because it’s faster for FSDP.
concatenated_inputs
< source >( batch: typing.Dict[str, typing.Union[typing.List, torch.LongTensor]] )
Concatenate the chosen and rejected inputs into a single tensor.
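Concatenating chosen and rejected inputs requires padding both groups to a common length so they can be stacked into one batch. The sketch below shows this on plain lists; pad_to_length and concatenate_pairs are hypothetical helpers, and the ordering (chosen rows first, rejected rows second) is an assumption made for the example.

```python
from typing import List


def pad_to_length(ids: List[int], length: int, pad_value: int) -> List[int]:
    """Right-pad ids with pad_value up to the requested length."""
    return ids + [pad_value] * (length - len(ids))


def concatenate_pairs(
    chosen: List[List[int]],
    rejected: List[List[int]],
    pad_value: int = 0,
) -> List[List[int]]:
    """Stack chosen rows followed by rejected rows, all padded to the longest sequence."""
    max_len = max(len(seq) for seq in chosen + rejected)
    return [pad_to_length(seq, max_len, pad_value) for seq in chosen + rejected]


batch = concatenate_pairs(chosen=[[5, 6, 7]], rejected=[[5, 8]], pad_value=0)
# The first half of the rows holds the chosen inputs, the second half the rejected ones.
n_pairs = len(batch) // 2
chosen_rows, rejected_rows = batch[:n_pairs], batch[n_pairs:]
```

A single forward pass over the stacked batch can then be split back into chosen and rejected halves by row index.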
dpo_loss
< source >( policy_chosen_logps: FloatTensor policy_rejected_logps: FloatTensor reference_chosen_logps: FloatTensor reference_rejected_logps: FloatTensor reference_free: bool = False ) → A tuple of three tensors
Returns
A tuple of three tensors
(losses, chosen_rewards, rejected_rewards). The losses tensor contains the DPO loss for each example in the batch. The chosen_rewards and rejected_rewards tensors contain the rewards for the chosen and rejected responses, respectively.
Compute the DPO loss for a batch of policy and reference model log probabilities.
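For a single preference pair, the DPO objective can be written as -log(sigmoid(beta * ((policy_chosen - policy_rejected) - (reference_chosen - reference_rejected)))). The sketch below follows that formula on scalar log probabilities; dpo_loss_single is a hypothetical helper, and treating reference_free as a zero reference log-ratio is an assumption about how the flag behaves.

```python
import math
from typing import Tuple


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dpo_loss_single(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    reference_chosen_logp: float,
    reference_rejected_logp: float,
    beta: float = 0.1,
    reference_free: bool = False,
) -> Tuple[float, float, float]:
    """Return (loss, chosen_reward, rejected_reward) for one preference pair."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    # Assumption: in reference-free mode the reference log-ratio is treated as zero.
    ref_logratio = 0.0 if reference_free else reference_chosen_logp - reference_rejected_logp
    loss = -log_sigmoid(beta * (pi_logratio - ref_logratio))
    # Implicit rewards: scaled log-ratio of the policy against the reference model.
    chosen_reward = beta * (policy_chosen_logp - reference_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - reference_rejected_logp)
    return loss, chosen_reward, rejected_reward


loss, chosen_reward, rejected_reward = dpo_loss_single(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

When the policy and the reference model assign identical log probabilities, the loss equals log(2) and both implicit rewards are zero.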
get_batch_metrics
< source >( model batch: typing.Dict[str, typing.Union[typing.List, torch.LongTensor]] train_eval: typing.Literal['train', 'eval'] = 'train' )
Compute the DPO loss and other metrics for the given batch of inputs for train or test.
Generate samples from the model and reference model for the given batch of inputs.
log
< source >( logs: typing.Dict[str, float] )
Log logs on the various objects watching training, including stored metrics.
DDPOConfig
class trl.DDPOConfig
< source >( run_name: typing.Optional[str] = '' seed: typing.Optional[int] = 42 logdir: typing.Optional[str] = 'logs' log_with: typing.Optional[str] = None tracker_kwargs: typing.Optional[dict] = <factory> accelerator_kwargs: typing.Optional[dict] = <factory> project_kwargs: typing.Optional[dict] = <factory> tracker_project_name: typing.Optional[str] = 'trl' num_epochs: typing.Optional[int] = 100 save_freq: typing.Optional[int] = 1 num_checkpoint_limit: typing.Optional[int] = 5 mixed_precision: typing.Optional[str] = 'fp16' allow_tf32: typing.Optional[bool] = True resume_from: typing.Optional[str] = '' sample_num_steps: typing.Optional[int] = 50 sample_eta: typing.Optional[float] = 1.0 sample_guidance_scale: typing.Optional[float] = 5.0 sample_batch_size: typing.Optional[int] = 1 sample_num_batches_per_epoch: typing.Optional[int] = 2 train_batch_size: typing.Optional[int] = 1 train_use_8bit_adam: typing.Optional[bool] = False train_learning_rate: typing.Optional[float] = 0.0003 train_adam_beta1: typing.Optional[float] = 0.9 train_adam_beta2: typing.Optional[float] = 0.999 train_adam_weight_decay: typing.Optional[float] = 0.0001 train_adam_epsilon: typing.Optional[float] = 1e-08 train_gradient_accumulation_steps: typing.Optional[int] = 1 train_max_grad_norm: typing.Optional[float] = 1.0 train_num_inner_epochs: typing.Optional[int] = 1 train_cfg: typing.Optional[bool] = True train_adv_clip_max: typing.Optional[float] = 5 train_clip_range: typing.Optional[float] = 0.0001 train_timestep_fraction: typing.Optional[float] = 1.0 per_prompt_stat_tracking: typing.Optional[bool] = False per_prompt_stat_tracking_buffer_size: typing.Optional[int] = 16 per_prompt_stat_tracking_min_count: typing.Optional[int] = 16 async_reward_computation: typing.Optional[bool] = False max_workers: typing.Optional[int] = 2 negative_prompts: typing.Optional[str] = '' )
Configuration class for DDPOTrainer
DDPOTrainer
class trl.DDPOTrainer
< source >( config: DDPOConfig reward_function: typing.Callable[[torch.Tensor, typing.Tuple[str], typing.Tuple[typing.Any]], torch.Tensor] prompt_function: typing.Callable[[], typing.Tuple[str, typing.Any]] sd_pipeline: DDPOStableDiffusionPipeline image_samples_hook: typing.Union[typing.Callable[[typing.Any, typing.Any, typing.Any], typing.Any], NoneType] = None )
Parameters
- **config** (DDPOConfig) — Configuration object for DDPOTrainer. Check the documentation of DDPOConfig for more details.
- **reward_function** (Callable[[torch.Tensor, Tuple[str], Tuple[Any]], torch.Tensor]) — Reward function to be used.
- **prompt_function** (Callable[[], Tuple[str, Any]]) — Function to generate prompts to guide model.
- **sd_pipeline** (DDPOStableDiffusionPipeline) — Stable Diffusion pipeline to be used for training.
- **image_samples_hook** (Optional[Callable[[Any, Any, Any], Any]]) — Hook to be called to log images.
The DDPOTrainer uses Denoising Diffusion Policy Optimization to optimise diffusion models. Note that this trainer is heavily inspired by https://github.com/kvablack/ddpo-pytorch and that, as of now, only Stable Diffusion based pipelines are supported.
calculate_loss
< source >( latents timesteps next_latents log_probs advantages embeds )
Parameters
- latents (torch.Tensor) — The latents sampled from the diffusion model, shape: [batch_size, num_steps, …]
- timesteps (torch.Tensor) — The timesteps sampled from the diffusion model, shape: [batch_size]
- next_latents (torch.Tensor) — The next latents sampled from the diffusion model, shape: [batch_size, num_steps, …]
- log_probs (torch.Tensor) — The log probabilities of the latents, shape: [batch_size]
- advantages (torch.Tensor) — The advantages of the latents, shape: [batch_size]
- embeds (torch.Tensor) — The embeddings of the prompts, shape: [2*batch_size or batch_size, …] Note: the “or” is because if train_cfg is True, the expectation is that negative prompts are concatenated to the embeds
Calculate the loss for a batch of an unpacked sample
step
< source >( epoch: int global_step: int ) → global_step (int)
Perform a single step of training.
Side Effects:
- Model weights are updated
- Logs the statistics to the accelerator trackers.
- If self.image_samples_callback is not None, it will be called with the prompt_image_pairs, global_step, and the accelerator tracker.
Train the model for a given number of epochs
set_seed
Helper function for reproducible behavior to set the seed in random, numpy, and torch.
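A seeding helper of this kind typically looks like the sketch below. It is an illustration under assumptions, not the library's code: the name set_seed_sketch is hypothetical, and numpy and torch are treated as optional so the example also runs where they are not installed.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed Python's random module and, when available, numpy and torch."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch

        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draw.
set_seed_sketch(0)
first_draw = random.random()
set_seed_sketch(0)
second_draw = random.random()
```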