
from octoai.client import Client

client = Client()

# Ask the Llama 2 13B chat model (served at fp16) for a short greeting
completion = client.chat.completions.create(
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Keep your responses limited."
        },
        {
            "role": "user",
            "content": "Hello world"
        }
    ],
    model="llama-2-13b-chat-fp16",
    max_tokens=1000,
    presence_penalty=0,
    temperature=1,
    top_p=1,
)

# Print the assistant's reply
print(completion.choices[0].message.content)
Token generation at human speed
OctoAI offers easy multi-GPU inference to unlock the largest, most capable models like Llama 2 70B, and also serves quantized versions of all base models for faster, lower-cost applications.
OctoAI lets customers run Llama 2 70B with a variety of options for hitting their latency and quality targets.
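In practice, picking a latency/quality point can be as simple as changing the model string in the same chat completions call. Here is a minimal sketch assuming a full-precision and a quantized 70B variant are exposed under names like llama-2-70b-chat-fp16 and llama-2-70b-chat-int4 (the int4 name is an assumption; check the OctoAI model list for the exact identifiers available on your account):

from octoai.client import Client

client = Client()

# Illustrative model names for two Llama 2 70B serving variants
FULL_PRECISION = "llama-2-70b-chat-fp16"  # highest quality, multi-GPU
QUANTIZED = "llama-2-70b-chat-int4"       # assumed quantized variant: faster, lower cost

def ask(model: str, prompt: str) -> str:
    # Same chat completions call as above; only the model string changes
    completion = client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        max_tokens=200,
    )
    return completion.choices[0].message.content

# Same prompt, two latency/quality trade-offs
print(ask(FULL_PRECISION, "Summarize Llama 2 in one sentence."))
print(ask(QUANTIZED, "Summarize Llama 2 in one sentence."))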
Your data's accuracy, our blazing speeds
Modify LLM behavior at runtime with lightning-fast application of LoRAs. Contact us to accelerate your checkpoints.
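For intuition, here is a conceptual sketch of what applying a LoRA at runtime means, using the standard LoRA formulation (Hu et al., 2021) rather than OctoAI's API: the frozen base weights stay resident, and a small low-rank update is added on top, so swapping adapters changes model behavior without reloading the full model.

import numpy as np

# Illustrative shapes: hidden dimensions and a small LoRA rank
d, k, r = 4096, 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen base weight matrix
A = rng.standard_normal((r, k)) * 0.01   # LoRA down-projection (trained)
B = rng.standard_normal((d, r)) * 0.01   # LoRA up-projection (trained)
alpha = 32.0                             # LoRA scaling hyperparameter

# Effective weight used at inference time. Swapping in a different (A, B)
# pair changes behavior while W, the multi-gigabyte base model, stays put.
W_effective = W + (alpha / r) * (B @ A)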
Fine-tune Llama 2 for your use case
You can use pre-fine-tuned models on OctoAI, like Llama 2 Chat, fine-tuned on public instruction datasets, or, coming soon to OctoAI, Code Llama, trained on 500B tokens of code in Python, C++, Java, JavaScript, C#, and Bash.
Llama 2 on OctoAI features
| Features | OctoAI |
|---|---|
| Bring your own fine-tunes and checkpoints | Coming soon, via fine-tuning |
| Token-based pricing and metering | Coming soon, on all models |