Deploying a model makes it ready to serve inference on SynapsAI Cloud. This page explains the requirements and the settings you configure when deploying.

Requirements

Before you begin, make sure you have:
  • A Hugging Face account and an access token with read permission.
  • A Hugging Face repository (private or public) containing all required files for the model, including weights in safetensors format.
    • The repository must also include any required processor files (e.g., tokenizer, image processor, feature extractor); a quick check is sketched after this list.
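
If you want to sanity-check a repository before deploying, a minimal sketch like the one below can help. It assumes the huggingface_hub Python library; the repository ID, token value, and the exact processor file names are placeholders that depend on your model.

  # Minimal repository check (not a SynapsAI API); requires: pip install huggingface_hub
  from huggingface_hub import HfApi

  api = HfApi(token="hf_xxx")  # access token with read permission (placeholder)
  files = api.list_repo_files("your-org/your-model")  # placeholder repository ID

  # Weights must be in safetensors format.
  has_weights = any(f.endswith(".safetensors") for f in files)

  # Processor files vary by model type (tokenizer, image processor, feature extractor).
  processor_files = {"tokenizer.json", "tokenizer_config.json", "preprocessor_config.json"}
  has_processor = any(f in processor_files for f in files)

  print("safetensors weights present:", has_weights)
  print("processor files present:", has_processor)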

Deploy a model

Deploy a model from the web platform.

Important Notes

  • We do not support custom code in Hugging Face repositories yet; we are working on adding this feature.
  • Model weights must be in safetensors format.
  • A pipeline tag in the README.md file is required to deploy the model, for example pipeline_tag: text-generation (see the snippet below).
If you need help preparing your repository, see: Set up a custom model.
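
For reference, the pipeline tag lives in the YAML front matter at the top of the README.md model card. A minimal example (the tag value depends on your model's task):

  ---
  pipeline_tag: text-generation
  ---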

Readiness level

Choose how quickly model instances should be able to serve requests:
  • Always ready
    • At least one model instance is kept running and ready to serve.
    • Uses our fast scaling technology for immediate capacity increases.
    • You pay for this model 24/7, even when it receives no traffic.
  • Super fast (recommended)
    • Model instances are prepared to load very quickly when they receive traffic.
    • Startup times are minimized while keeping costs lower than Always ready.
    • You pay for this model only when it has active instances.
  • Cold start
    • The model is downloaded from Hugging Face on demand before serving requests.
    • Lowest baseline cost, highest first-request latency.
    • You pay for this model only when it has active instances.

Precision

Select the numeric precision for running your model. Higher precision uses more memory, which can increase cost, while lower precision reduces memory usage.
  • Examples: Float32 (higher memory), BFloat16 (lower memory)
  • See examples and pricing notes in Core concepts.
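
As a rough illustration of how precision affects memory (assuming weights dominate and ignoring activations and runtime overhead), weight memory scales with bytes per parameter. The parameter count below is a placeholder:

  # Back-of-the-envelope weight memory for a 7B-parameter model (placeholder size).
  params = 7_000_000_000

  for name, bytes_per_param in [("Float32", 4), ("BFloat16", 2)]:
      gib = params * bytes_per_param / (1024 ** 3)
      print(f"{name}: ~{gib:.0f} GiB of weights")  # ~26 GiB vs ~13 GiB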

Quantization

If your model supports quantization, you can select from available options such as EETQ or FP8. Quantization reduces memory and can improve throughput, with some impact on accuracy depending on the method.
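
Following the same back-of-the-envelope math as in the Precision section, an 8-bit scheme (such as FP8, or EETQ's int8 weights) stores roughly one byte per parameter, so a 7B-parameter model needs on the order of 6-7 GiB for weights, about half of BFloat16.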

Worker timeout

If a model instance receives no requests during the configured timeout period, it will shut down automatically. This helps control idle costs.

What you configure during deployment

  • Model source: the Hugging Face repository and revision (branch, tag, or commit)
  • Readiness level: Always ready, Super fast, or Cold start
  • Precision and quantization (if supported)
  • Resource limits and scaling policy (minimum/maximum instances)
  • Worker timeout
All estimated costs are shown before you deploy.
Need help? Feel free to contact support at any time at [email protected].