Overview
Let’s have a quick look at the 🤗 Hosted Inference API.
Main features:
- Leverage 150,000+ Transformers, Diffusers, or Timm models (T5, Blenderbot, Bart, GPT-2, Pegasus...)
- Upload, manage and serve your own models privately
- Run Classification, NER, Conversational, Summarization, Translation, Question-Answering, and Embeddings Extraction tasks
- Get up to 10x inference speedup to reduce user latency
- Accelerated inference for a number of supported models on CPU
- Run large models that are challenging to deploy in production
- Scale up to 1,000 requests per second with automatic scaling built-in
- Ship new NLP, CV, Audio, or RL features faster as new models become available
- Build your business on a platform powered by the reference open source project in ML
Get your API Token
To get started you need to:
- Register or Login.
- Get a User Access or API token in your Model Database profile settings.
You should see a token hf_xxxxx (old tokens are api_XXXXXXXX or api_org_XXXXXXX).
If you do not submit your API token when sending requests to the API, you will not be able to run inference on your private models.
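As a minimal sketch (assuming the token is stored in an environment variable named HF_API_TOKEN, a name used here only for illustration), the token is passed as a Bearer token in the Authorization header of every API request:

import os

# HF_API_TOKEN is an illustrative environment variable name; any secure way
# of storing your hf_xxxxx token works.
API_TOKEN = os.environ["HF_API_TOKEN"]

# Every request to the Inference API should carry this header; without it,
# you cannot run inference on your private models.
headers = {"Authorization": f"Bearer {API_TOKEN}"}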
Running Inference with API Requests
The first step is to choose which model you are going to run. Go to the Model Hub and select the model you want to use. If you are unsure where to start, make sure to check the recommended models for each ML task available, or the Tasks overview.
ENDPOINT = https://api-inference.huggingface.co/models/<MODEL_ID>
Let’s use gpt2 as an example. To run inference, simply use this code:
import json
import requests

API_TOKEN = "hf_xxxxx"  # your User Access token (see "Get your API Token" above)
API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(payload):
    # Serialize the payload, POST it to the model endpoint, and decode the JSON response
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("Can you please let us know more details about your ")
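For a text-generation model like gpt2, the response is typically a list of dictionaries with a generated_text field; here is a small sketch of reading it with the query helper above (the exact response shape depends on the task the model runs):

# Text generation typically returns a list such as [{"generated_text": "..."}]
result = query("Can you please let us know more details about your ")
print(result[0]["generated_text"])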
API Options and Parameters
Depending on the task (aka pipeline) the model is configured for, the request will accept specific parameters. When sending requests to run any model, API options allow you to specify the caching and model loading behavior. All API options and parameters are detailed here: detailed_parameters.
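As an illustration, here is a sketch of a text-generation request that combines task-specific parameters with API options; the parameter names (max_new_tokens, temperature) and options (use_cache, wait_for_model) used below follow detailed_parameters, and other tasks accept different parameters:

import json
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # token from "Get your API Token"

# "parameters" is task-specific (text generation here); "options" controls
# caching and model loading behavior for any model.
payload = {
    "inputs": "Can you please let us know more details about your ",
    "parameters": {"max_new_tokens": 50, "temperature": 0.7},
    "options": {"use_cache": True, "wait_for_model": False},
}

response = requests.post(API_URL, headers=headers, data=json.dumps(payload))
print(response.json())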
Using CPU-Accelerated Inference
If you are an API customer, your API token automatically enables CPU-Accelerated inference on your requests if the model type is supported. For instance, if you compare gpt2 inference through our CPU-Accelerated API to running the model out of the box on a local setup, you should measure a ~10x speedup. The specific performance boost depends on the model, the input payload, and your local hardware.
To verify you are using the CPU-Accelerated version of a model, you can check the x-compute-type header of your requests, which should be cpu+optimized. If you do not see it, it simply means that not all optimizations are turned on. This can be due to various factors: the model might have been added to transformers recently, or the model can be optimized in several different ways and the best one depends on your use case.
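For example, a quick way to inspect that header with requests (a sketch reusing the gpt2 endpoint and token from the sections above):

import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # token from "Get your API Token"

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "Can you please let us know more details about your "},
)

# "cpu+optimized" indicates the CPU-Accelerated path; any other value means
# not all optimizations are turned on for this model.
print(response.headers.get("x-compute-type"))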
If you contact us at [email protected], we’ll be able to increase the inference speed for you, depending on your actual use case.
Model Loading and Latency
The Hosted Inference API can serve predictions on-demand from over 100,000 models deployed on the Model Database Hub, dynamically loaded on shared infrastructure. If the requested model is not loaded in memory, the Hosted Inference API returns a 503 response while it loads the model; once the model is loaded, requests are answered with predictions.
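One way to handle this, sketched below for the gpt2 example, is to retry when a 503 comes back; alternatively, the wait_for_model option described in detailed_parameters asks the API to hold the request until the model is loaded instead of returning 503:

import time
import requests

API_URL = "https://api-inference.huggingface.co/models/gpt2"
headers = {"Authorization": f"Bearer {API_TOKEN}"}  # token from "Get your API Token"
payload = {"inputs": "Can you please let us know more details about your "}

# Naive retry loop: while the model is still loading, the API answers 503,
# so wait a little and try again until a prediction comes back.
response = requests.post(API_URL, headers=headers, json=payload)
while response.status_code == 503:
    time.sleep(10)
    response = requests.post(API_URL, headers=headers, json=payload)

print(response.json())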
If your use case requires large volume or predictable latencies, you can use our paid solution Inference Endpoints to easily deploy your models on dedicated, fully-managed infrastructure. With Inference Endpoints you can quickly create endpoints on the cloud, region, CPU or GPU compute instance of your choice.