For a quick glance at the parameters supported by the Chat Completions API, see the API reference.

The Client class lets you run inference against any model that accepts JSON-formatted inputs, passed as a dictionary, and returns the JSON-formatted outputs as a dictionary. The Client class also supports the Chat Completions API and provides easy access to a set of highly optimized text models on OctoAI.

This guide will walk you through how to select your model of interest, how to call highly optimized text models on OctoAI using the Chat Completions API, and how to use the responses in both streaming and regular modes.

Requirements

  • Please create an OctoAI API token if you don’t have one already.
  • Please also verify you’ve completed Python SDK Installation & Setup.
    • If you use the OCTOAI_TOKEN environment variable for your token, you can instantiate the OctoAI client with client = Client() after importing the octoai package (see the sketch below).
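
For example, a minimal client setup looks like the sketch below. Passing the token explicitly through a token argument is an assumption here; if you rely on the OCTOAI_TOKEN environment variable, the no-argument form is all you need.

Python
import os

from octoai.client import Client

# Reads the token from the OCTOAI_TOKEN environment variable.
client = Client()

# Or pass a token explicitly (assumes Client accepts a token argument).
client = Client(token=os.environ["OCTOAI_TOKEN"])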

Supported Models

The Client().chat.completions.create() method described in the next section requires a model= argument. The following snippet shows you how to get a list of supported models.

Python
>>> import octoai
>>> octoai.chat.get_model_list()

The list of available models is also detailed in the API reference. You can specify the model= argument either as a string or as an octoai.chat.TextModel enum instance, such as TextModel.LLAMA_2_70B_CHAT.
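
For example, the two calls below are equivalent ways to select the model. This is a minimal sketch; the string identifier "llama-2-70b-chat" is illustrative, so confirm the exact name against octoai.chat.get_model_list().

Python
from octoai.chat import TextModel
from octoai.client import Client

client = Client()
messages = [{"role": "user", "content": "Hello"}]

# Model specified as a TextModel enum member.
completion = client.chat.completions.create(
    model=TextModel.LLAMA_2_70B_CHAT, messages=messages, max_tokens=20
)

# Model specified as a string (illustrative name; verify with get_model_list()).
completion = client.chat.completions.create(
    model="llama-2-70b-chat", messages=messages, max_tokens=20
)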

Text Generation

The following snippet shows you how to use the Chat Completions API to generate text using Llama2.

Python
import json

from octoai.chat import TextModel
from octoai.client import Client

client = Client()
completion = client.chat.completions.create(
    model=TextModel.LLAMA_2_70B_CHAT,
    messages=[
        {
            "role": "system",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
        },
        {"role": "user", "content": "Write a blog about Seattle"},
    ],
    max_tokens=150,
)

print(json.dumps(completion.dict(), indent=2))

The response is of type octoai.chat.ChatCompletion. If you print the response from this call as in the example above, it looks similar to the following:

{
  "id": "cmpl-8ea213aece0747aca6d0608b02b57196",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Founded in 1921, Seattle is the mother city of Pacific Northwest. Seattle is the densely populated second-largest city in the state of Washington along with Portland. A small city at heart, Seattle has transformed itself from a small manufacturing town to the contemporary Pacific Northwest hub to its east. The city's charm and frequent unpredictability draw tourists and residents alike. Here are my favorite things about Seattle.\n* Seattle has a low crime rate and high quality of life.\n* Seattle has rich history which included the building of the first Pacific Northwest harbor and the development of the Puget Sound irrigation system. Seattle is also home to legendary firm Boeing.\n",
        "function_call": null
      },
      "delta": null,
      "finish_reason": "length"
    }
  ],
  "created": 5399,
  "model": "llama2-70b",
  "object": "chat.completion",
  "system_fingerprint": null,
  "usage": {
    "completion_tokens": 150,
    "prompt_tokens": 571,
    "total_tokens": 721
  }
}

Note that billing is based on the prompt_tokens and completion_tokens counts shown above. See our pricing page for current prices.
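
If you only need the generated text or the token counts rather than the full JSON dump, you can read them directly from the ChatCompletion object. The attribute names below mirror the JSON structure shown above.

Python
# The generated text of the first (and here only) choice.
print(completion.choices[0].message.content)

# Token counts that determine billing.
usage = completion.usage
print(f"prompt={usage.prompt_tokens} completion={usage.completion_tokens} total={usage.total_tokens}")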

Streaming Responses

The following snippet shows you how to obtain the model’s response incrementally as it is generated, by passing stream=True.

Python
from octoai.chat import TextModel
from octoai.client import Client

client = Client()
for completion in client.chat.completions.create(
    model=TextModel.LLAMA_2_70B_CHAT,
    messages=[
        {
            "role": "system",
            "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
        },
        {"role": "user", "content": "Write a blog about Seattle"},
    ],
    max_tokens=150,
    stream=True,
):
    print(completion.choices[0].delta.content, end='', flush=True)

When using streaming mode, the return value is of type Iterable[ChatCompletion], so you can read each incremental response with a for loop over the returned object. The example above prints each chunk as it arrives; together, the chunks form the complete response as generation progresses.
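
If you also want the complete text once streaming finishes, you can collect the chunks as they arrive. This is a minimal sketch; the guard on delta.content is an assumption, since some chunks may carry metadata rather than text.

Python
from octoai.chat import TextModel
from octoai.client import Client

client = Client()
chunks = []
for completion in client.chat.completions.create(
    model=TextModel.LLAMA_2_70B_CHAT,
    messages=[{"role": "user", "content": "Write a blog about Seattle"}],
    max_tokens=150,
    stream=True,
):
    piece = completion.choices[0].delta.content
    if piece:  # assumption: skip chunks without text content
        print(piece, end="", flush=True)
        chunks.append(piece)

full_text = "".join(chunks)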

Additional Parameters

To learn about the additional parameters supported by the Client().chat.completions.create() method, see the API reference.
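
As an illustration, the sketch below passes a few commonly supported sampling parameters (temperature, top_p, presence_penalty, frequency_penalty). These names follow the usual Chat Completions conventions and are assumptions here; consult the API reference for the authoritative list and valid ranges.

Python
from octoai.chat import TextModel
from octoai.client import Client

client = Client()
completion = client.chat.completions.create(
    model=TextModel.LLAMA_2_70B_CHAT,
    messages=[{"role": "user", "content": "Write a haiku about Seattle"}],
    max_tokens=100,
    temperature=0.7,        # assumption: sampling temperature
    top_p=0.9,              # assumption: nucleus sampling
    presence_penalty=0.0,   # assumption: OpenAI-style penalty parameters
    frequency_penalty=0.0,
)
print(completion.choices[0].message.content)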