NEW: Deploy Llama 2 (Chat 7B and 13B) in a few clicks on Inference Endpoints

Machine Learning At Your Service

With 🤗 Inference Endpoints, easily deploy Transformers, Diffusers or any model on dedicated, fully managed infrastructure. Keep your costs low with our secure, compliant and flexible production solution.

Production Inference Made Easy

Deploy models on dedicated and secure infrastructure without dealing with containers and GPUs

Deploy models with just a few clicks

Turn your models into production-ready APIs, without having to deal with infrastructure or MLOps.

Keep your production costs down

Leverage a fully-managed production solution for inference and just pay as you go for the raw compute you use.

Enterprise Security

Deploy models into secure offline endpoints only accessible via a direct connection to your Virtual Private Cloud (VPC).

How It Works

Deploy models for production in a few simple steps

1. Select your model

Select the model you want to deploy. You can deploy a custom model or any of the 60,000+ Transformers, Diffusers or Sentence Transformers models available on the 🤗 Hub for NLP, computer vision, or speech tasks.

2. Choose your cloud

Pick your cloud and select a region close to your data in compliance with your requirements (e.g. Europe, North America or Asia Pacific).

3. Select your security level

Protected Endpoints are accessible from the Internet and require valid authentication.

Public Endpoints are accessible from the Internet and do not require authentication.

Private Endpoints are only available through an intra-region secured AWS or Azure PrivateLink direct connection to a VPC and are not accessible from the Internet.
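
In practice, calling a Protected Endpoint means sending a Hugging Face access token as a bearer token in the Authorization header. The snippet below is a minimal sketch: the endpoint URL is a placeholder to copy from your endpoint's overview page, and the payload assumes a simple text task.

```python
import os

import requests

# Placeholder URL; use the one shown on your endpoint's overview page.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"

response = requests.post(
    ENDPOINT_URL,
    headers={
        # Protected Endpoints reject requests without a valid token.
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={"inputs": "I love this movie!"},
)
response.raise_for_status()
print(response.json())
```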

4. Create and manage your endpoint

Click create, and your new endpoint is ready in a couple of minutes. Define autoscaling, access logs, and monitoring; set custom metrics routes; manage endpoints programmatically with the API or CLI; and roll back models, all in a few clicks.
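
As a sketch of the programmatic route, the snippet below uses the huggingface_hub Python client to create and list endpoints. The endpoint name, model, and instance values are illustrative; available instance types and sizes vary by cloud and region.

```python
from huggingface_hub import create_inference_endpoint, list_inference_endpoints

# Illustrative values; pick the model, cloud, region, and instance you need.
endpoint = create_inference_endpoint(
    "sentiment-demo",
    repository="distilbert-base-uncased-finetuned-sst-2-english",
    framework="pytorch",
    task="text-classification",
    accelerator="cpu",
    vendor="aws",
    region="us-east-1",
    type="protected",           # "public", "protected", or "private"
    instance_size="x2",
    instance_type="intel-icl",
)
endpoint.wait()                 # block until the endpoint is running
print(endpoint.url)

# Endpoints can be listed, paused, updated, or deleted the same way.
for ep in list_inference_endpoints():
    print(ep.name, ep.status)
```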

A Better Way to Go to Production

Scale your machine learning while keeping your costs low

Before

🤼 Struggle with MLOps and building the right infrastructure for production.

🐢 Wasted time deploying models slows down ML development.

😓 Deploying models in a compliant and secure way is difficult and time-consuming.

❌ 87% of data science projects never make it into production.

After

🤝 Don't worry about infrastructure or MLOps; spend more time building models.

🚀 A fully-managed solution for model inference accelerates your ML roadmap.

🔒 Easily deploy your models in a secure and compliant environment.

✅ Seamless model deployment bridges the gap from research to production.

Customer Success Stories

Learn how leading AI teams use 🤗 Inference Endpoints to deploy models

Endpoints for Music

Customer

Musixmatch is the world's leading music data company.

Use Case

Custom text embeddings generation pipeline

Models Deployed

distilbert-base-uncased-finetuned-sst-2-english

facebook/wav2vec2-base-960h

Custom model based on sentence transformers

"The coolest thing was how easy it was to define a complete custom interface from the model to the inference process. It just took us a couple of hours to adapt our code, and have a functioning and totally custom endpoint."
Andrea Boscarino
Data Scientist at Musixmatch
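
The custom interface Andrea describes corresponds to the custom handler mechanism of Inference Endpoints: a handler.py file exposing an EndpointHandler class at the root of the model repository. The sketch below assumes a sentence-transformers embedding model; it is illustrative, not Musixmatch's actual pipeline.

```python
# handler.py, placed at the root of the model repository
from typing import Any, Dict, List

from sentence_transformers import SentenceTransformer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local checkout of the model repository on the endpoint.
        self.model = SentenceTransformer(path)

    def __call__(self, data: Dict[str, Any]) -> List[List[float]]:
        # Inference Endpoints pass the parsed JSON payload here; the text to
        # embed arrives under the "inputs" key.
        texts = data["inputs"]
        if isinstance(texts, str):
            texts = [texts]
        return self.model.encode(texts).tolist()
```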

Pricing

Pay for CPU & GPU compute resources

🛍 Self-serve

  • Inference Endpoints

    Pay for compute resources uptime by the minute, billed monthly.

    As low as $0.06 per CPU core/hr and $0.60 per GPU/hr. For example, a 2-core CPU endpoint running around the clock comes to roughly 2 × $0.06 × 24 × 30 ≈ $86 per month.

  • Email Support

    Email support and no SLAs.

Deploy your first model

🏢 Enterprise

  • Inference Endpoints

    Custom pricing based on volume commit and annual contracts.

  • Dedicated Support & SLAs

    Dedicated support, 24/7 SLAs, and uptime guarantees.

Request a Quote

Start now with 🤗 Inference Endpoints!

Deploy models in a few clicks 🤯

Pay for compute resources uptime, by the minute.