Introducing Off-Peak Computing, the first asynchronous inference service for open-source models, optimized for cost and environmental impact.
Get high-quality output at the lowest price, for all use cases where you DO NOT need instantaneous answers!
Get results for all your requests in less than 24 hours.
$0.30 input / $0.50 output per million tokens for Llama-3.1-70b-Instruct.
Hard rate limit
E.g. Use Llama-3.1-70b as a judge to evaluate generation performance in a RAG application every night.
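As a sketch of what such a nightly LLM-as-judge batch could look like — the helper names, request fields, and model identifier below are illustrative assumptions, not EXXA's documented API:

```python
# Hypothetical sketch: turn a day's RAG logs into one batch of
# judge requests, each asking the model for a 1-5 quality score.

def build_judge_prompt(question: str, context: str, answer: str) -> str:
    """Compose a prompt asking the judge model to grade one RAG answer."""
    return (
        "You are an impartial judge. Rate the answer from 1 (poor) to 5 "
        "(excellent) for faithfulness to the context and relevance to the "
        "question. Reply with only the integer score.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )

def build_nightly_batch(records: list) -> list:
    """Build one asynchronous batch of judge requests from RAG logs.

    Each record is a dict with 'question', 'context', and 'answer' keys;
    'custom_id' lets results be matched back to the source record.
    """
    return [
        {
            "custom_id": f"judge-{i}",
            "model": "llama-3.1-70b-instruct",  # illustrative model id
            "messages": [
                {
                    "role": "user",
                    "content": build_judge_prompt(
                        r["question"], r["context"], r["answer"]
                    ),
                }
            ],
        }
        for i, r in enumerate(records)
    ]
```

Because the evaluation runs overnight and no answer is needed immediately, the whole batch can be submitted in one asynchronous job at off-peak pricing.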
E.g. Classify large datasets of documents, customer feedback, or news articles on a daily basis.
E.g. Translate large volumes of text into multiple languages using high-performing models like Llama-3-70B.
E.g. Extract data from large documents in a specific format, using the structured output feature of the EXXA API.
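A structured-output extraction request could be sketched as follows — the endpoint shape, field names, schema, and model identifier are illustrative assumptions, not the documented EXXA API:

```python
import json

# Hypothetical JSON Schema the model's output must conform to
# (an invoice-extraction example; the fields are illustrative).
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["vendor", "total", "currency"],
}

def build_extraction_request(document_text: str) -> dict:
    """Build one batch request asking the model to return JSON matching
    invoice_schema instead of free-form text."""
    return {
        "model": "llama-3.1-70b-instruct",  # illustrative model id
        "messages": [
            {
                "role": "system",
                "content": "Extract the invoice fields as JSON "
                           "matching the provided schema.",
            },
            {"role": "user", "content": document_text},
        ],
        # Constrain decoding to the schema rather than free text.
        "response_format": {"type": "json_schema", "json_schema": invoice_schema},
    }

request = build_extraction_request("Invoice #42 from ACME Corp, total 199.99 EUR.")
print(json.dumps(request, indent=2))
```

Constraining the response to a schema means the overnight results can be loaded straight into a database without a fragile post-processing step.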
E.g. Summarize customer or internal chatbot conversations on a daily basis to reduce storage requirements.
E.g. Run complex analyses on large documents, such as IP infringement investigations on patents.