Run training on Amazon SageMaker
This guide will show you how to train a 🤗 Transformers model with the HuggingFace
SageMaker Python SDK. Learn how to:
- Install and set up your training environment.
- Prepare a training script.
- Create a Model Database Estimator.
- Run training with the fit method.
- Access your trained model.
- Perform distributed training.
- Create a spot instance.
- Load a training script from a GitHub repository.
- Collect training metrics.
Installation and setup
Before you can train a 🤗 Transformers model with SageMaker, you need to sign up for an AWS account. If you don’t have an AWS account yet, learn more here.
Once you have an AWS account, get started using one of the following:
- SageMaker Studio
- SageMaker notebook instance
- Local environment
To start training locally, you need to set up an appropriate IAM role.
Upgrade to the latest sagemaker version:
pip install sagemaker --upgrade
SageMaker environment
Set up your SageMaker environment as shown below:
import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
Note: The execution role is only available when running a notebook within SageMaker. If you run get_execution_role in a notebook not on SageMaker, expect a region error.
Local environment
Set up your local environment as shown below:
import sagemaker
import boto3
iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()
Prepare a 🤗 Transformers fine-tuning script
Our training script is very similar to a training script you might run outside of SageMaker. However, you can access useful properties about the training environment through various environment variables (see here for a complete list), such as:
- SM_MODEL_DIR: A string representing the path to which the training job writes the model artifacts. After training, artifacts in this directory are uploaded to S3 for model hosting. SM_MODEL_DIR is always set to /opt/ml/model.
- SM_NUM_GPUS: An integer representing the number of GPUs available to the host.
- SM_CHANNEL_XXXX: A string representing the path to the directory that contains the input data for the specified channel. For example, when you specify train and test in the Model Database Estimator fit method, the environment variables are set to SM_CHANNEL_TRAIN and SM_CHANNEL_TEST.
The hyperparameters defined in the Model Database Estimator are passed as named arguments and processed by ArgumentParser().
import transformers
import datasets
import argparse
import os
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=32)
    parser.add_argument("--model_name_or_path", type=str)

    # data, model, and output directories
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
Note that SageMaker doesn’t support argparse actions. For example, if you want to use a boolean hyperparameter, specify type as bool in your script and provide an explicit True or False value.
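As a minimal sketch of this, assuming a hypothetical do_eval hyperparameter and a small string-to-bool helper that is not part of the official example scripts, the value passed by the estimator can be parsed explicitly:
# hyperparameters arrive as command-line strings (e.g. --do_eval True), so parse them explicitly
def str_to_bool(value):
    return str(value).lower() in ("true", "1", "yes")

parser.add_argument("--do_eval", type=str_to_bool, default=False)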
Look here for a complete example of a 🤗 Transformers training script.
Training Output Management
If output_dir in the TrainingArguments is set to '/opt/ml/model', the Trainer saves all training artifacts, including logs, checkpoints, and models. Amazon SageMaker archives the whole '/opt/ml/model' directory as model.tar.gz and uploads it to Amazon S3 at the end of the training job. Depending on your hyperparameters and TrainingArguments, this could lead to a large artifact (> 5GB), which can slow down deployment for Amazon SageMaker Inference.
You can control how checkpoints, logs, and artifacts are saved by customizing the TrainingArguments. For example, setting save_total_limit in the TrainingArguments limits the total number of checkpoints that are kept: older checkpoints in output_dir are deleted when new ones are saved and the limit is reached.
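For example, a minimal sketch of such a configuration inside the training script (the values are illustrative):
from transformers import TrainingArguments

# keep at most two checkpoints; older checkpoints in output_dir are deleted as new ones are saved
training_args = TrainingArguments(
    output_dir="/opt/ml/model",
    save_total_limit=2,
    num_train_epochs=3,
    per_device_train_batch_size=32,
)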
In addition to the options already mentioned above, there is another option to save the training artifacts during the training session. Amazon SageMaker supports Checkpointing, which allows you to continuously save your artifacts during training to Amazon S3 rather than at the end of your training. To enable Checkpointing, you need to provide the checkpoint_s3_uri parameter pointing to an Amazon S3 location in the HuggingFace estimator and set output_dir to /opt/ml/checkpoints.
Note: If you set output_dir to /opt/ml/checkpoints, make sure to call trainer.save_model("/opt/ml/model") or model.save_pretrained("/opt/ml/model")/tokenizer.save_pretrained("/opt/ml/model") at the end of your training to be able to deploy your model seamlessly to Amazon SageMaker for Inference.
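A minimal sketch of the end of such a training script, assuming a Trainer instance named trainer and a tokenizer:
# intermediate checkpoints written to output_dir=/opt/ml/checkpoints are synced to checkpoint_s3_uri
trainer.train()

# save the final model and tokenizer to /opt/ml/model so SageMaker packages them as model.tar.gz for inference
trainer.save_model("/opt/ml/model")
tokenizer.save_pretrained("/opt/ml/model")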
Create a Model Database Estimator
Run 🤗 Transformers training scripts on SageMaker by creating a Model Database Estimator. The Estimator handles end-to-end SageMaker training. There are several parameters you should define in the Estimator:
- entry_point specifies which fine-tuning script to use.
- instance_type specifies an Amazon instance to launch. Refer here for a complete list of instance types.
- hyperparameters specifies training hyperparameters. View additional available hyperparameters here.
The following code sample shows how to train with a custom script train.py with three hyperparameters (epochs, per_device_train_batch_size, and model_name_or_path):
from sagemaker.huggingface import HuggingFace
# hyperparameters which are passed to the training job
hyperparameters={'epochs': 1,
'per_device_train_batch_size': 32,
'model_name_or_path': 'distilbert-base-uncased'
}
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
role=role,
transformers_version='4.26',
pytorch_version='1.13',
py_version='py39',
hyperparameters = hyperparameters
)
If you are running a TrainingJob locally, define instance_type='local' or instance_type='local_gpu' for GPU usage. Note that this will not work with SageMaker Studio.
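A minimal local-mode sketch, assuming Docker is installed and reusing the script and hyperparameters from the example above:
from sagemaker.huggingface import HuggingFace

# local mode runs the training container on your own machine instead of launching EC2 instances
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='local',      # use 'local_gpu' to train on a local GPU
    instance_count=1,
    role=role,
    transformers_version='4.26',
    pytorch_version='1.13',
    py_version='py39',
    hyperparameters=hyperparameters
)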
Execute training
Start your TrainingJob by calling fit on a Model Database Estimator. Specify your input training data in fit. The input training data can be a:
- S3 URI such as s3://my-bucket/my-training-data.
- FileSystemInput for Amazon Elastic File System or FSx for Lustre. See here for more details about using these file systems as input (a minimal sketch follows the S3 example below).
Call fit to begin training:
huggingface_estimator.fit(
{'train': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/train',
'test': 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/test'}
)
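If your data lives on a file system instead of S3, a minimal sketch (assuming a hypothetical EFS file system ID and an estimator created with subnets and security groups that can reach it) might look like:
from sagemaker.inputs import FileSystemInput

# hypothetical EFS file system containing the training data
train_fs = FileSystemInput(
    file_system_id='fs-0123456789abcdef0',   # placeholder ID
    file_system_type='EFS',
    directory_path='/my-training-data',
    file_system_access_mode='ro'
)

huggingface_estimator.fit({'train': train_fs})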
SageMaker starts and manages all the required EC2 instances and initiates the TrainingJob by running:
/opt/conda/bin/python train.py --epochs 1 --model_name_or_path distilbert-base-uncased --per_device_train_batch_size 32
Access trained model
Once training is complete, you can access your model through the AWS console or download it directly from S3.
from sagemaker.s3 import S3Downloader
S3Downloader.download(
s3_uri=huggingface_estimator.model_data, # S3 URI where the trained model is located
    local_path='.', # local path where *.tar.gz is saved
sagemaker_session=sess # SageMaker session used for training the model
)
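After downloading, the model.tar.gz archive can be unpacked locally, for example:
import tarfile

# unpack the downloaded archive into a local "model" directory
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(path="model")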
Distributed training
SageMaker provides two strategies for distributed training: data parallelism and model parallelism. Data parallelism splits a training set across several GPUs, while model parallelism splits a model across several GPUs.
Data parallelism
The Model Database Trainer supports SageMaker’s data parallelism library. If your training script uses the Trainer API, you only need to define the distribution parameter in the Model Database Estimator:
# configuration for running training on smdistributed data parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3dn.24xlarge',
instance_count=2,
role=role,
transformers_version='4.26.0',
pytorch_version='1.13.1',
py_version='py39',
hyperparameters = hyperparameters,
distribution = distribution
)
📓 Open the notebook for an example of how to run the data parallelism library with TensorFlow.
Model parallelism
The Model Database Trainer also supports SageMaker’s model parallelism library. If your training script uses the Trainer API, you only need to define the distribution parameter in the Model Database Estimator (see here for more detailed information about using model parallelism):
# configuration for running training on smdistributed model parallel
mpi_options = {
"enabled" : True,
"processes_per_host" : 8
}
smp_options = {
"enabled":True,
"parameters": {
"microbatches": 4,
"placement_strategy": "spread",
"pipeline": "interleaved",
"optimize": "speed",
"partitions": 4,
"ddp": True,
}
}
distribution={
"smdistributed": {"modelparallel": smp_options},
"mpi": mpi_options
}
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3dn.24xlarge',
instance_count=2,
role=role,
transformers_version='4.26.0',
pytorch_version='1.13.1',
py_version='py39',
hyperparameters = hyperparameters,
distribution = distribution
)
📓 Open the notebook for an example of how to run the model parallelism library.
Spot instances
The Model Database extension for the SageMaker Python SDK means we can benefit from fully-managed EC2 spot instances. This can help you save up to 90% of training costs!
Note: Unless your training job completes quickly, we recommend you use checkpointing with managed spot training. In this case, you need to define the checkpoint_s3_uri.
Set use_spot_instances=True and define your max_wait and max_run time in the Estimator to use spot instances:
# hyperparameters which are passed to the training job
hyperparameters={'epochs': 1,
'train_batch_size': 32,
'model_name':'distilbert-base-uncased',
'output_dir':'/opt/ml/checkpoints'
}
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
    checkpoint_s3_uri=f's3://{sess.default_bucket()}/checkpoints',
use_spot_instances=True,
# max_wait should be equal to or greater than max_run in seconds
max_wait=3600,
max_run=1000,
role=role,
transformers_version='4.26',
pytorch_version='1.13',
py_version='py39',
hyperparameters = hyperparameters
)
# Training seconds: 874
# Billable seconds: 262
# Managed Spot Training savings: 70.0%
📓 Open the notebook for an example of how to use spot instances.
Git repository
The Model Database Estimator can load a training script stored in a GitHub repository. Provide the relative path to the training script in entry_point and the relative path to the directory in source_dir.
If you are using git_config to run the 🤗 Transformers example scripts, you need to configure the 'branch' in git_config to match your transformers_version (e.g. if you use transformers_version='4.4.2' you have to use 'branch': 'v4.4.2').
Tip: Save your model to S3 by setting output_dir=/opt/ml/model in the hyperparameters of your training script.
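For example, the hyperparameters passed to run_glue.py might look like the following sketch (the exact arguments depend on the example script and version):
# hyperparameters which are passed to run_glue.py as command-line arguments
hyperparameters = {
    'model_name_or_path': 'distilbert-base-uncased',
    'task_name': 'sst2',
    'do_train': True,
    'do_eval': True,
    'num_train_epochs': 1,
    'output_dir': '/opt/ml/model'
}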
# configure git settings
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.4.2'} # v4.4.2 refers to the transformers_version you use in the estimator
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='run_glue.py',
source_dir='./examples/pytorch/text-classification',
git_config=git_config,
instance_type='ml.p3.2xlarge',
instance_count=1,
role=role,
transformers_version='4.26',
pytorch_version='1.13',
py_version='py39',
hyperparameters=hyperparameters
)
SageMaker metrics
SageMaker metrics automatically parses training job logs for metrics and sends them to CloudWatch. If you want SageMaker to parse the logs, you must specify the metric’s name and a regular expression for SageMaker to use to find the metric.
# define metrics definitions
metric_definitions = [
{"Name": "train_runtime", "Regex": "train_runtime.*=\D*(.*?)$"},
{"Name": "eval_accuracy", "Regex": "eval_accuracy.*=\D*(.*?)$"},
{"Name": "eval_loss", "Regex": "eval_loss.*=\D*(.*?)$"},
]
# create the Estimator
huggingface_estimator = HuggingFace(
entry_point='train.py',
source_dir='./scripts',
instance_type='ml.p3.2xlarge',
instance_count=1,
role=role,
transformers_version='4.26',
pytorch_version='1.13',
py_version='py39',
metric_definitions=metric_definitions,
hyperparameters = hyperparameters)
📓 Open the notebook for an example of how to capture metrics in SageMaker.
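After the job completes, the captured metrics can be queried via the SageMaker SDK, for example (a minimal sketch):
from sagemaker.analytics import TrainingJobAnalytics

# retrieve the metrics captured through metric_definitions as a pandas DataFrame
df = TrainingJobAnalytics(training_job_name=huggingface_estimator.latest_training_job.name).dataframe()
print(df.head())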