There are a few ways to optimize costs for your model deployments:

Using quantization or lower precision

Quantization reduces model memory usage by representing weights and activations with fewer bits, which directly translates to lower infrastructure costs. It also typically increases inference speed by reducing memory bandwidth requirements and computational overhead. We support several quantization methods, including INT4, INT8, and FP8. We highly recommend enabling quantization for any model deployment that can support it.
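As one illustration, here is a minimal sketch of loading a model with 4-bit quantization using the Hugging Face transformers and bitsandbytes libraries. The model ID is a placeholder, and the exact loading mechanism for your deployment may differ.

```python
# A minimal sketch: load a causal LM with 4-bit (NF4) quantization via
# Hugging Face transformers + bitsandbytes. The model ID is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # hypothetical model ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4 bits
    bnb_4bit_quant_type="nf4",             # NF4 is a common 4-bit format
    bnb_4bit_compute_dtype=torch.float16,  # run compute in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on available GPUs automatically
)
```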

Memory and Cost Benefits

Different quantization methods offer varying levels of memory reduction (a rough memory estimate follows this list):
  • 4-bit quantization (like INT4): Reduces memory usage by approximately 75% compared to FP16. For example, you can run inference on a 33B parameter model in only 24GB of GPU memory, or a 65B parameter model in 46GB of GPU memory.
  • 8-bit quantization (INT8/FP8): Provides about 50% memory reduction while maintaining excellent model quality. This typically increases inference speed by 1.5x to 2x.
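To see where these savings come from, here is a rough back-of-the-envelope estimate of weight memory at different precisions. The overhead factor is an assumption covering framework buffers; the deployment figures above include additional runtime overhead for context and batching.

```python
# Rough weight-memory estimate per precision. The 1.2x overhead factor is an
# assumption for framework buffers; real usage varies by model and workload.
def weight_memory_gb(num_params_billion: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    bytes_per_weight = bits_per_weight / 8
    return num_params_billion * 1e9 * bytes_per_weight * overhead / 1024**3

for bits in (16, 8, 4):
    print(f"33B model at {bits}-bit: ~{weight_memory_gb(33, bits):.0f} GB")
# 16-bit: ~74 GB, 8-bit: ~37 GB, 4-bit: ~18 GB (weights plus assumed overhead)
```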

Using a shorter context window for LLMs

Using a shorter context window reduces the amount of GPU memory needed to store the context (primarily the KV cache), which directly translates to lower infrastructure costs. For example, if you have an LLM with a 32k-token context window and you reduce it to 16k tokens, you will typically allocate about 50% less VRAM for the context, which can save you a significant amount of money.
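As a rough illustration, the sketch below estimates KV-cache memory for a given context window. The architecture numbers (layers, heads, head dimension) are illustrative assumptions in the ballpark of a mid-sized model; substitute your model's actual values.

```python
# Estimate KV-cache memory for a given context length. Layer/head counts are
# illustrative assumptions; replace them with your model's configuration.
def kv_cache_gb(context_tokens: int, num_layers: int = 32,
                num_kv_heads: int = 32, head_dim: int = 128,
                bytes_per_value: int = 2, batch_size: int = 1) -> float:
    # Factor of 2 accounts for storing both keys and values at every layer.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * context_tokens * batch_size * bytes_per_value)
    return total_bytes / 1024**3

print(f"32k context: ~{kv_cache_gb(32_768):.1f} GB")  # ~16.0 GB
print(f"16k context: ~{kv_cache_gb(16_384):.1f} GB")  # ~8.0 GB, about half
```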

Spend limits

Spend limits allow you to set maximum spending thresholds for a model over a given time period. When a limit is reached, further usage may be restricted or paused to help you control costs and avoid unexpected charges. You can create up to 3 spend limits per model. For example, a spend limit might cap spending at $120 over 48 hours. To create a spend limit, go to the model page and click the “Spend limits” tab, then click “Add Rule” and fill out the form.
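As a conceptual illustration only (not the platform's API), a spend-limit rule can be thought of as a cap on cumulative spend within a rolling window:

```python
# Conceptual sketch of a spend-limit rule; illustration only, not the
# platform's API. Usage is paused once spend in the rolling window hits the cap.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SpendLimit:
    max_usd: float     # e.g. 120.0
    window: timedelta  # e.g. timedelta(hours=48)

def is_within_limit(charges: list[tuple[datetime, float]],
                    limit: SpendLimit, now: datetime) -> bool:
    """Return True while spend inside the rolling window stays below the cap."""
    window_start = now - limit.window
    spend = sum(amount for ts, amount in charges if ts >= window_start)
    return spend < limit.max_usd

# The example rule from above: cap spending at $120 over 48 hours.
limit = SpendLimit(max_usd=120.0, window=timedelta(hours=48))
```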