Guide: Chat Completions

This guide explains how to use the chat completions API and interpret the gateway information found under platform_extensions.

Retry information

Responses include a platform_extensions object from the gateway. It contains a
routing_results object that describes routing attempts and has the following fields:

latency - total latency across all attempts in milliseconds.
private_endpoint_enabled - whether a private endpoint was used.
retry_info - nested retry information, described below.

The retry_info object has the following fields:

retry_count - total number of retries attempted.
fallback_model - the model ultimately used after retries.
retries - array describing each failed attempt, in order. Each entry contains:
- index - the order of the attempt starting at 0.
- model - model that was called for that attempt.
- code - error code that triggered the retry.
- message - text describing the failure.
- latency - how long the attempt took in milliseconds.

Each element in retries corresponds to a routing attempt that failed before the final successful request.
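
The field descriptions above suggest a few consistency checks that a client can run on retry_info. The following is a minimal sketch, assuming retry_info has already been extracted as a plain dict. The helper name check_retry_info is hypothetical and not part of the API, and the expectation that retry_count equals the length of retries is an assumption drawn from the descriptions above, so a mismatch is probably best logged rather than treated as a hard error.

```python
def check_retry_info(retry_info):
    """Return a list of human-readable inconsistencies (empty if none found)."""
    problems = []
    retries = retry_info.get("retries", [])
    retry_count = retry_info.get("retry_count", 0)

    # Assumption: retry_count matches the number of entries in retries.
    if retry_count != len(retries):
        problems.append(
            f"retry_count is {retry_count} but retries has {len(retries)} entries"
        )

    # index is documented as the order of the attempt, starting at 0.
    for position, attempt in enumerate(retries):
        if attempt.get("index") != position:
            problems.append(
                f"entry at position {position} has index {attempt.get('index')}"
            )
    return problems


sample = {
    "retry_count": 2,
    "fallback_model": "openai/gpt-4o-2024-11-20",
    "retries": [
        {"index": 0, "model": "anthropic/claude-3-sonnet", "code": 503,
         "message": "Service unavailable", "latency": 812},
        {"index": 1, "model": "anthropic/claude-3-sonnet", "code": 429,
         "message": "Rate limit exceeded", "latency": 905},
    ],
}
print(check_retry_info(sample))
```

An empty list means the object matches the documented shape; each string in a non-empty list names one mismatch.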

Example response

{
  "id": "chatcmpl-abc",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello!"
      }
    }
  ],
  "platform_extensions": {
    "routing_results": {
      "private_endpoint_enabled": false,
      "latency": 1717,
      "retry_info": {
        "retry_count": 2,
        "fallback_model": "openai/gpt-4o-2024-11-20",
        "retries": [
          {
            "index": 0,
            "model": "anthropic/claude-3-sonnet",
            "code": 503,
            "message": "Service unavailable",
            "latency": 812
          },
          {
            "index": 1,
            "model": "anthropic/claude-3-sonnet",
            "code": 429,
            "message": "Rate limit exceeded",
            "latency": 905
          }
        ]
      }
    }
  }
}
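
To experiment with this structure without calling the API, a trimmed copy of the example response can be loaded as a plain dict. The sketch below counts failed attempts per error code and per model and adds up the time spent in them. The function name summarize_retries is hypothetical, and the JSON is shortened to the fields the function reads.

```python
import json
from collections import Counter

# Trimmed copy of the example response, keeping only the fields used below.
EXAMPLE = json.loads("""
{
  "platform_extensions": {
    "routing_results": {
      "latency": 1717,
      "retry_info": {
        "retry_count": 2,
        "retries": [
          {"index": 0, "model": "anthropic/claude-3-sonnet", "code": 503,
           "message": "Service unavailable", "latency": 812},
          {"index": 1, "model": "anthropic/claude-3-sonnet", "code": 429,
           "message": "Rate limit exceeded", "latency": 905}
        ]
      }
    }
  }
}
""")


def summarize_retries(response):
    """Aggregate the failed attempts listed under retry_info.retries."""
    routing = response.get("platform_extensions", {}).get("routing_results", {})
    retries = routing.get("retry_info", {}).get("retries", [])
    return {
        "codes": Counter(attempt["code"] for attempt in retries),
        "models": Counter(attempt["model"] for attempt in retries),
        "failed_latency_ms": sum(attempt["latency"] for attempt in retries),
    }


summary = summarize_retries(EXAMPLE)
print(summary["codes"])
print(summary["failed_latency_ms"])
```

Chained .get calls with empty-dict defaults keep the function safe for responses that carry no retry information at all.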

Parsing retry information in Python

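from openai import OpenAI

# Assumption: an OpenAI-compatible client pointed at the gateway. The base URL
# and API key are read from the environment (OPENAI_BASE_URL, OPENAI_API_KEY).
client = OpenAI()

response = client.chat.completions.create(
    model="anthropic/claude-3-sonnet",
    messages=[{"role": "user", "content": "Hello"}],
)

# platform_extensions is a gateway-specific field; fall back to an empty dict
# if it is missing so the lookups below do not raise.
resp_info = getattr(response, "platform_extensions", None) or {}
routing_results = resp_info.get("routing_results", {})
print("total latency (ms):", routing_results.get("latency"))
print("private endpoint:", routing_results.get("private_endpoint_enabled"))

retry_info = routing_results.get("retry_info", {})
print("retry count:", retry_info.get("retry_count", 0))
print("fallback model:", retry_info.get("fallback_model"))

# Each entry describes one failed attempt that preceded the final request.
for attempt in retry_info.get("retries", []):
    print(
        attempt["index"],
        attempt["model"],
        attempt["code"],
        attempt["message"],
        attempt["latency"],
    )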
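
In larger codebases it may be more convenient to convert the nested dicts into typed objects once, instead of repeating .get calls. The following is a sketch under that assumption. The class names RetryAttempt and RetryInfo are hypothetical and not provided by any SDK; their fields mirror the ones documented above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RetryAttempt:
    index: int
    model: str
    code: int
    message: str
    latency: int  # milliseconds


@dataclass
class RetryInfo:
    retry_count: int = 0
    fallback_model: Optional[str] = None
    retries: List[RetryAttempt] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data):
        """Build a RetryInfo from the retry_info dict, ignoring unknown keys."""
        attempts = [
            RetryAttempt(
                index=entry["index"],
                model=entry["model"],
                code=entry["code"],
                message=entry["message"],
                latency=entry["latency"],
            )
            for entry in data.get("retries", [])
        ]
        return cls(
            retry_count=data.get("retry_count", 0),
            fallback_model=data.get("fallback_model"),
            retries=attempts,
        )


info = RetryInfo.from_dict({
    "retry_count": 1,
    "fallback_model": "openai/gpt-4o-2024-11-20",
    "retries": [
        {"index": 0, "model": "anthropic/claude-3-sonnet", "code": 503,
         "message": "Service unavailable", "latency": 812},
    ],
})
print(info.fallback_model, info.retries[0].code)
```

Fields are read by name, not unpacked with **entry, so an extra key added to a retry entry later would not break construction.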