Batch API

The most affordable batch API

Now available with DeepSeek R1-Distill

Solution

How we do it


Data centers have *many* short intervals of unused compute time—minutes or hours—that go to waste. Traditional systems cannot efficiently capture these brief windows, and once they are gone, the opportunity is lost.

At EXXA, we have created a custom scheduler and orchestrator that aggregates these unused fragments across multiple data centers, enabling us to run AI workloads efficiently on underutilized compute acquired at a discount.

We then pass those savings on to you.
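To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of placement problem such a scheduler solves. The data shapes and the greedy policy are our assumptions for illustration, not EXXA's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class IdleWindow:
    site: str            # which data center the spare capacity is in
    gpu_minutes: int     # unused capacity remaining in this window
    cost_per_minute: float

@dataclass
class BatchJob:
    job_id: str
    gpu_minutes: int     # estimated compute the job needs

def place_jobs(jobs: list[BatchJob], windows: list[IdleWindow]) -> dict[str, str]:
    """Greedily assign each job to the cheapest idle window that can hold it."""
    placements: dict[str, str] = {}
    windows.sort(key=lambda w: w.cost_per_minute)           # cheapest capacity first
    for job in sorted(jobs, key=lambda j: -j.gpu_minutes):  # largest jobs first
        for window in windows:
            if window.gpu_minutes >= job.gpu_minutes:
                placements[job.job_id] = window.site
                window.gpu_minutes -= job.gpu_minutes       # consume the capacity
                break
    return placements
```

Because batch requests tolerate a 24-hour delay, the scheduler is free to wait for cheap windows rather than buy on-demand capacity.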

Custom scheduler

Maximize use of intermittent, low-cost compute with a custom scheduler.

Predictive inference optimizer

Select optimal settings for each payload, including batch size and context size.
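As a rough illustration (not EXXA's actual optimizer), a heuristic that derives a batch size from an assumed per-token KV-cache footprint and GPU memory budget might look like this:

```python
def pick_batch_size(avg_prompt_tokens: int,
                    max_output_tokens: int,
                    kv_bytes_per_token: int = 320_000,   # assumed KV footprint (~70B fp16)
                    gpu_mem_bytes: int = 80 * 2**30) -> int:
    """Toy heuristic: fit as many sequences as an assumed KV-cache budget allows."""
    tokens_per_sequence = avg_prompt_tokens + max_output_tokens
    kv_budget = int(gpu_mem_bytes * 0.3)   # assumed share of memory left for KV cache
    return max(1, kv_budget // (tokens_per_sequence * kv_bytes_per_token))
```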

Specialized inference engine

A custom inference engine optimized for the Batch API, including a persistent KV cache that is cross-platform and cross-GPU.
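For intuition, here is a toy sketch of the persistent-cache idea: attention state for a repeated prompt prefix is written once and read back on later requests. A real engine stores GPU tensors and manages eviction; everything below is an assumption for illustration:

```python
import hashlib

class PrefixKVCache:
    """Toy persistent KV cache keyed by prompt prefix (illustration only).

    The cached attention state is treated as an opaque bytes blob here.
    """

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _key(prefix_tokens: list[int]) -> str:
        return hashlib.sha256(str(prefix_tokens).encode()).hexdigest()

    def write(self, prefix_tokens: list[int], kv_state: bytes) -> None:
        # Corresponds to the "Write" rate in the pricing table below.
        self._store[self._key(prefix_tokens)] = kv_state

    def read(self, prefix_tokens: list[int]) -> bytes | None:
        # Corresponds to the cheaper "Read" rate: a hit skips prefill work.
        return self._store.get(self._key(prefix_tokens))
```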

Self-training draft model

Train a smaller draft model on large batches to reduce the workload of larger models and gain efficiency.
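This is the draft-and-verify pattern behind speculative decoding. A simplified greedy sketch follows, with draft_model and target_model as assumed stand-ins for next-token predictors; real implementations verify all k drafts in a single batched forward pass of the target model:

```python
def speculative_step(draft_model, target_model, context: list[int], k: int = 4) -> list[int]:
    """One round of draft-and-verify decoding (simplified greedy variant)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    drafts, ctx = [], list(context)
    for _ in range(k):
        token = draft_model(ctx)
        drafts.append(token)
        ctx.append(token)

    # 2. The large model checks the drafts; keep the longest agreeing prefix.
    accepted, ctx = [], list(context)
    for token in drafts:
        expected = target_model(ctx)
        if expected != token:
            accepted.append(expected)  # the large model's correction ends the round
            break
        accepted.append(token)
        ctx.append(token)
    return accepted
```

When the draft model agrees with the large model most of the time, most tokens are generated at the small model's cost, which is where the efficiency gain comes from.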

Pricing

Per-token rates

| Base model | Context window | Delay | Input tokens | Prompt caching | Output tokens |
| --- | --- | --- | --- | --- | --- |
| llama-3.1-8b-instruct-fp16 | 128K tokens | 24h | $0.10 / M tokens | Write: $0.10 / M tokens; Read: $0.02 / M tokens | $0.15 / M tokens |
| llama-3.3-70b-instruct-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens; Read: $0.06 / M tokens | $0.50 / M tokens |
| deepseek-r1-distill-llama-3.3-70b-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens; Read: $0.06 / M tokens | $0.50 / M tokens |
| llama-3.1-nemotron-70b-instruct-fp16 | 128K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens; Read: $0.06 / M tokens | $0.50 / M tokens |
| beta:qwen-2-vl-72b-instruct-fp16 | 32K tokens | 24h | $0.30 / M tokens | Write: $0.30 / M tokens; Read: $0.06 / M tokens | $0.50 / M tokens |
Note: If you want access to other models, please contact us at founders@withexxa.com
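As a worked example at the llama-3.1-8b-instruct-fp16 rates above, a batch with 2M input tokens and 0.5M output tokens (no prompt caching) would cost:

```python
# Worked example at the llama-3.1-8b-instruct-fp16 rates in the table above.
input_tokens = 2_000_000    # $0.10 per million
output_tokens = 500_000     # $0.15 per million
cost = (input_tokens / 1e6) * 0.10 + (output_tokens / 1e6) * 0.15
print(f"${cost:.3f}")       # -> $0.275
```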

Start using it today!

Get started

F.A.Q.

Use EXXA API
The EXXA API is live and available. The Batch API endpoint, as documented here, lets developers submit requests for asynchronous batch processing; requests are processed within 24 hours.
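For illustration, a submission might look roughly like the sketch below. The endpoint URL, header, and payload fields are assumptions here, so refer to the linked documentation for the actual request schema:

```python
import requests

# The endpoint URL, header, and payload fields below are illustrative
# assumptions; see the linked API documentation for the actual schema.
API_URL = "https://api.withexxa.com/v1/requests"  # assumed endpoint

response = requests.post(
    API_URL,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-3.1-8b-instruct-fp16",
        "messages": [{"role": "user", "content": "Summarize this report..."}],
    },
)
response.raise_for_status()
print(response.json())  # submitted asynchronously; results arrive within 24 hours
```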
Models supported
All models supported by the EXXA API are listed here, including the beta version of Qwen-2-VL-72B. If you are interested in other models or modalities, please contact us.
Output tokens
The maximum output size depends on the model; you can find it for each model here. If you need a larger output size, please contact us.
Rate limits
There is no hard rate limit on dataset size. However, you need enough credits on your account to launch your queries.
Pricing
When using the EXXA off-peak API, you only pay for input and output tokens. You can easily credit your account on the EXXA console. See the Pricing section above for details.
Countries supported
Payments are currently supported for the United States and France, with more countries to follow. Contact us if your country is not listed and you want access to the API.
Batch cancellation
You can manually cancel any batch or request at any moment. If a batch is cancelled, any queries that have already been processed can still be retrieved, and you are only charged for completed work.
Data retention
Data on the EXXA API endpoint is stored for seven days after completion. EXXA is committed to privacy and trust across all our solutions: you own and control all data shared with EXXA.
Private deployment
EXXA offers an enterprise solution to deploy this LLM inference orchestrator on-premise or in a Virtual Private Cloud, which is especially useful for maximizing GPU usage. Contact us to learn more.