Requirements
Before you begin, make sure you have:
- A Hugging Face account and an access token with read permission.
- A Hugging Face repository (private or public) containing all required files for the model: the weights in safetensors format and any required processor (e.g., tokenizer, image processor, feature extractor).
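To sanity-check a repository before deploying, you can inspect its file listing for safetensors weights. A minimal sketch (the helper and the hardcoded listing are illustrative; in practice you could fetch the real listing with `huggingface_hub.list_repo_files`, assuming the library is installed and your token has read access):

```python
def has_safetensors_weights(files):
    """Return True if any file in the listing is a safetensors weight file."""
    return any(f.endswith(".safetensors") for f in files)

# Example listing from a typical text-generation repo (illustrative)
repo_files = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model.safetensors",
]
print(has_safetensors_weights(repo_files))  # True
```

A repo with only `pytorch_model.bin` would fail this check and would need its weights converted to safetensors before deployment.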
Deploy a model
Deploy a model from the web platform.
Important Notes
- We do not support custom code on Hugging Face repositories. However, we are working on adding this feature.
- Model weights must be in safetensors format.
- The `pipeline_tag` in the README.md file is required to deploy the model. For example: `pipeline_tag: text-generation`
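For reference, the tag lives in the YAML front matter at the top of the repository's README.md. A minimal example (the value depends on your model's task):

```yaml
---
pipeline_tag: text-generation
---
```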
Readiness level
Choose how quickly model instances should be able to serve requests:
- Always ready
- At least one model instance is kept running and ready to serve.
- Uses our fast scaling technology for immediate capacity increases.
- You pay for this model 24/7.
- Super fast (recommended)
- Model instances are prepared to load very quickly when they receive traffic.
- Startup times are minimized while keeping costs lower than always-on.
- You pay for this model only when it has active instances.
- Cold start
- The model is downloaded from Hugging Face on demand before serving requests.
- Lowest baseline cost, highest first-request latency.
- You pay for this model only when it has active instances.
Precision
Select the numeric precision for running your model. Higher precision uses more memory, which can increase cost, while lower precision reduces memory usage.
- Examples: Float32 (higher memory), BFloat16 (lower memory)
- See examples and pricing notes in Core concepts.
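As a rough sketch of the memory impact, weight memory is parameter count times bytes per parameter (this ignores activations, KV cache, and runtime overhead, which add on top):

```python
# Rough estimate of weight memory: parameters * bytes per parameter.
# Activations and runtime overhead are ignored (simplifying assumption).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

def weight_memory_gb(num_params, dtype):
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

params = 7e9  # a 7B-parameter model, for illustration
print(weight_memory_gb(params, "float32"))   # 28.0 (GB)
print(weight_memory_gb(params, "bfloat16"))  # 14.0 (GB)
```

So for the same model, switching from Float32 to BFloat16 roughly halves the weight memory, which is why lower precision can reduce cost.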
Quantization
If your model supports quantization, you can select from available options such as EETQ or FP8. Quantization reduces memory usage and can improve throughput, with some impact on accuracy depending on the method.
Worker timeout
If a model instance receives no requests during the configured timeout period, it will shut down automatically. This helps control idle costs.
What you configure during deployment
- Model source: the Hugging Face repository and revision (branch, tag, or commit)
- Readiness level: Always ready, Super fast, or Cold start
- Precision and quantization (if supported)
- Resource limits and scaling policy (minimum/maximum instances)
- Worker timeout
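The settings above can be thought of as one deployment configuration. The structure below is illustrative only, grouping the options from this page into a plain dict; it is not the platform's actual configuration schema, and all keys and values are hypothetical:

```python
# Illustrative only: a plain dict grouping the deployment settings described
# above. This is NOT the platform's actual configuration schema.
deployment = {
    "model_source": {
        "repo_id": "org/model-name",   # hypothetical repository
        "revision": "main",            # branch, tag, or commit
    },
    "readiness_level": "super_fast",   # or "always_ready", "cold_start"
    "precision": "bfloat16",
    "quantization": None,              # e.g. "eetq" or "fp8", if supported
    "scaling": {"min_instances": 0, "max_instances": 3},
    "worker_timeout_seconds": 300,
}
print(deployment["readiness_level"])  # super_fast
```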
Need help? Feel free to contact support at any time at [email protected].