Data centers have *many* short intervals of unused compute time, lasting minutes or hours, that would otherwise go to waste.
Traditional schedulers cannot capture these brief windows efficiently, and once a window has passed, the opportunity is lost.
At EXXA, we have built a custom scheduler and orchestrator that aggregates these unused fragments across multiple data centers,
enabling us to run AI workloads efficiently on underutilized compute acquired at a discount.
We then pass those savings on to you.
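To make the idea concrete, here is a toy sketch of such a scheduler. This is not EXXA's actual implementation: the names, data shapes, and the simple greedy policy are all illustrative assumptions. The key property it relies on is that batch jobs are checkpointable, so one job can be sliced across many short idle windows in different data centers.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class IdleWindow:
    start: float                          # when the window opens (hours)
    length: float = field(compare=False)  # how long it stays idle (hours)
    site: str = field(compare=False)      # which data center offered it

@dataclass
class BatchJob:
    name: str
    est_hours: float                      # estimated GPU-hours of work
    done: float = 0.0                     # GPU-hours completed so far

def pack(windows: list[IdleWindow], jobs: list[BatchJob]) -> list[tuple[str, str, float]]:
    """Greedily assign slices of checkpointable batch jobs to idle windows.

    Windows are consumed in order of opening time; each window runs as much
    of the head-of-queue job as fits, and unfinished jobs simply wait for
    the next window to appear.
    """
    heapq.heapify(windows)                # earliest-opening window first
    plan = []
    while windows and jobs:
        w = heapq.heappop(windows)
        job = jobs[0]                     # simple FIFO queue of batch jobs
        slice_hours = min(w.length, job.est_hours - job.done)
        plan.append((job.name, w.site, slice_hours))
        job.done += slice_hours
        if job.done >= job.est_hours:
            jobs.pop(0)                   # job finished, move to the next
    return plan

windows = [IdleWindow(0.0, 0.5, "dc-eu-1"), IdleWindow(0.2, 2.0, "dc-us-2"),
           IdleWindow(1.0, 0.25, "dc-eu-1")]
jobs = [BatchJob("embed-corpus", 2.0), BatchJob("summarize-logs", 0.5)]
for name, site, hours in pack(windows, jobs):
    print(f"{name}: {hours:.2f}h on {site}")
```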
- Maximize use of intermittent, low-cost compute with a custom scheduler (a toy version is sketched above)
- Use optimal settings for each payload processed (incl. batch size, context size)
- Run a custom inference engine optimized for the batch API (incl. persistent KV cache, cross-platform and cross-GPU)
- Train a smaller draft model on large batches to reduce the workload of the larger models and gain efficiency (see the sketch after this list)
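The draft-model bullet describes what is commonly called speculative decoding; reading it that way is our interpretation, not a claim from this page about EXXA's exact scheme. Below is a minimal, self-contained toy: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, keeping the agreed prefix. Both "models" here are stand-in functions over a tiny vocabulary, not real LLMs.

```python
import random

VOCAB = list("abcdef")

def target_next(prefix: str) -> str:
    # Stand-in for the large target model: the ground truth we must match.
    # Seeding by prefix keeps its answers consistent within a run.
    random.seed(hash(prefix) % 10_000)
    return random.choice(VOCAB)

def draft_next(prefix: str) -> str:
    # Stand-in for the small draft model: agrees with the target ~80% of
    # the time, which is what makes speculation pay off.
    random.seed(hash(prefix) % 10_000 + 1)
    if random.random() < 0.8:
        return target_next(prefix)
    return random.choice(VOCAB)

def speculative_decode(prompt: str, n_tokens: int, k: int = 4) -> tuple[str, int]:
    """Greedy speculative decoding: draft proposes k tokens, target verifies.

    Returns the generated text and the number of target verification passes;
    with a good draft model this is far fewer than n_tokens.
    """
    out = prompt
    target_calls = 0
    while len(out) < len(prompt) + n_tokens:
        # 1) Draft model speculates k tokens autoregressively (cheap).
        spec, ctx = [], out
        for _ in range(k):
            t = draft_next(ctx)
            spec.append(t)
            ctx += t
        # 2) Target model checks each speculated position. A real engine
        #    scores all k positions in ONE batched forward pass, so we
        #    count the whole verification as a single expensive call.
        target_calls += 1
        ctx, accepted, correction = out, 0, None
        for t in spec:
            want = target_next(ctx)
            if want != t:
                correction = want          # target overrides first mismatch
                break
            accepted += 1
            ctx += t
        out += "".join(spec[:accepted])
        out += correction if correction is not None else ""
    return out[:len(prompt) + n_tokens], target_calls

text, calls = speculative_decode("ab", n_tokens=12, k=4)
print(text, f"(target passes: {calls} vs 12 for plain decoding)")
```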
| Base model | Context window | Delay | Input tokens | Prompt caching | Output tokens |
|---|---|---|---|---|---|
| llama-3.1-8b instruct-fp16 | 128K tokens | 24h | $0.10 / M tokens | Write: $0.10 / M tokens, Read: $0.02 / M tokens | $0.15 / M tokens |
| llama-3.3-70b instruct-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens, Read: $0.06 / M tokens | $0.50 / M tokens |
| deepseek-r1-distill llama-3.3-70b-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens, Read: $0.06 / M tokens | $0.50 / M tokens |
| llama-3.1-nemotron-70b instruct-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens, Read: $0.06 / M tokens | $0.50 / M tokens |
| beta:qwen-2-vl-72b instruct-fp16 | 32K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens, Read: $0.06 / M tokens | $0.50 / M tokens |
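For concreteness, here is a small helper that prices a batch against the table above. The per-million-token rates are copied verbatim from the table; the hyphenated dictionary keys, the request shape, and the assumption that cache reads are billed per read are our own illustration.

```python
# Per-million-token rates (USD) from the pricing table above.
RATES = {
    "llama-3.1-8b-instruct-fp16":  {"input": 0.10, "cache_write": 0.10,
                                    "cache_read": 0.02, "output": 0.15},
    "llama-3.3-70b-instruct-fp16": {"input": 0.30, "cache_write": 0.30,
                                    "cache_read": 0.06, "output": 0.50},
}

def batch_cost(model: str, input_toks: int, cache_write_toks: int,
               cache_read_toks: int, output_toks: int) -> float:
    """Price a batch job in USD given token counts in each billing bucket."""
    r = RATES[model]
    return (input_toks * r["input"] + cache_write_toks * r["cache_write"]
            + cache_read_toks * r["cache_read"]
            + output_toks * r["output"]) / 1_000_000

# Example: 10M fresh input tokens, a 50K-token shared prompt cached once
# and read 1,000 times, and 2M output tokens on the 8B model.
cost = batch_cost("llama-3.1-8b-instruct-fp16",
                  input_toks=10_000_000, cache_write_toks=50_000,
                  cache_read_toks=50_000 * 1_000, output_toks=2_000_000)
print(f"${cost:.2f}")  # $2.31
```

Note how cheap the cached reads are: 50M tokens of re-read prompt cost $1.00 here, versus $5.00 if the same tokens were billed as fresh input.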