Off-Peak Computing is the most cost-efficient batch inference API for open-source models. Get high-quality output at the lowest price, for every use case that can tolerate some delay!
Get all your requests completed in under 24 hours. Often much faster!
$0.30/$0.50 per million input/output tokens for Llama-3.1-70B-Instruct (FP16).
Hard rate limit
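To make the workflow concrete, here is a minimal Python sketch of queuing one off-peak request. The endpoint URL, model identifier, and response field below are assumptions for illustration, not the documented EXXA API; consult the actual API reference for the real batch submission format.

```python
import requests

API_URL = "https://api.withexxa.com/v1/requests"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "llama-3.1-70b-instruct-fp16",  # hypothetical model id
    "messages": [
        {"role": "user", "content": "Classify this ticket: 'My invoice total is wrong.'"},
    ],
}

# Submit the request to the off-peak queue; results arrive within
# 24 hours, often much faster.
resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
request_id = resp.json()["id"]  # hypothetical response field
print(f"Queued request {request_id}; poll for the result later.")
```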
E.g. Use Llama-3.1-70B as a judge to evaluate generation quality in a RAG application every night (see the first sketch after this list).
E.g. Use Llama-3.1-70B to generate a short piece of context for each chunk to improve the performance of RAG applications.
E.g. Classify large datasets of documents, customer feedback, or news articles on a daily basis.
E.g. Translate large volumes of text across multiple languages using high-performing models like Llama-3-70B.
E.g. Extract data from large documents in a specific format using the structured output feature of the EXXA API (see the second sketch after this list).
E.g. Summarize customer or internal chatbot conversations on a daily basis to reduce storage requirements.
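For the nightly LLM-as-judge use case above, a sketch might look like the following. The endpoint, model identifier, and response field are assumptions for illustration; only the overall pattern (queue one evaluation per logged interaction, collect scores later) is the point.

```python
import requests

API_URL = "https://api.withexxa.com/v1/requests"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

JUDGE_PROMPT = (
    "You are a strict evaluator. Given a question, the retrieved context, "
    "and a generated answer, rate the answer's faithfulness to the context "
    "from 1 to 5 and justify the score in one sentence."
)

def queue_judge_request(question: str, context: str, answer: str) -> str:
    """Queue one LLM-as-judge evaluation as an off-peak request; return its id."""
    payload = {
        "model": "llama-3.1-70b-instruct-fp16",  # hypothetical model id
        "messages": [
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"Question: {question}\nContext: {context}\nAnswer: {answer}",
            },
        ],
    }
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]  # hypothetical response field

# Nightly job: queue one evaluation per logged RAG interaction.
for qa in [{"question": "…", "context": "…", "answer": "…"}]:
    queue_judge_request(qa["question"], qa["context"], qa["answer"])
```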
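And for the structured data extraction use case, a sketch under the assumption that the structured output feature accepts a JSON Schema via a `response_format` field (a hypothetical field name borrowed from common chat-API conventions, not confirmed EXXA syntax):

```python
import requests

API_URL = "https://api.withexxa.com/v1/requests"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

# Hypothetical JSON Schema constraining the extraction output.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "date": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["vendor", "date", "total"],
}

payload = {
    "model": "llama-3.1-70b-instruct-fp16",  # hypothetical model id
    "messages": [
        {"role": "user", "content": "Extract the invoice fields from: <document text>"},
    ],
    # Hypothetical field name for the structured output feature.
    "response_format": {"type": "json_schema", "json_schema": invoice_schema},
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # hypothetical: a queued request to poll; its result conforms to invoice_schema
```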