Audio

Guide: Audio APIs

Introduction

This guide explains how to use our APIs for speech and other audio.

Supported Models

For a complete list of available models, please refer to the Models section and filter by the desired capability. Each model has specific features and limitations detailed in their respective documentation.

TTS - create speech

This API creates an audio stream from input text.

Using Python

from openai import OpenAI

client = OpenAI(
  api_key="sk-...",  # CHANGEME: set your API key
  base_url="https://api.platform.a15t.com/v1",
)

speech_file_path = "out.wav"
response = client.audio.speech.create(
  model="skt/axtts-2-6",
  voice="aria",
  input="안녕하세요",
  response_format="wav"
)
response.stream_to_file(speech_file_path)
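
Note: in recent versions of the openai Python package, calling stream_to_file on the plain response object is deprecated. A streaming variant, shown below as a sketch that assumes a recent SDK version and the same client as above, writes the audio while it is being received:

with client.audio.speech.with_streaming_response.create(
    model="skt/axtts-2-6",
    voice="aria",
    input="안녕하세요",
    response_format="wav",
) as response:
    # Stream the audio to disk chunk by chunk as it arrives
    response.stream_to_file(speech_file_path)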

Using cURL

curl -s https://api.platform.a15t.com/v1/audio/speech \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d '{
        "input": "안녕하세요!",
        "model": "skt/axtts-2-6",
        "voice": "aria",
        "response_format": "wav"
    }' > out.wav

Configuration options

Additional request parameters not supported by the OpenAI API can be provided via the model_extensions field.

curl -s https://api.platform.a15t.com/v1/audio/speech \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d '{
        "input": "안녕하세요!",
        "model": "skt/axtts-2-6",
        "voice": "aria",
        "response_format": "wav",
        "speed": 1.0,
        "model_extensions": {
            "sr": 16000
        }
    }' > out.wav
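
If you are using the OpenAI Python SDK, fields it does not model natively can be passed through its extra_body argument. The snippet below is a sketch that reuses the client from the TTS example above and forwards model_extensions in the request body:

response = client.audio.speech.create(
    model="skt/axtts-2-6",
    voice="aria",
    input="안녕하세요",
    response_format="wav",
    speed=1.0,
    # extra_body fields are merged into the JSON request body as-is
    extra_body={"model_extensions": {"sr": 16000}},
)
response.stream_to_file("out.wav")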

Realtime STT - Live transcription of speech to text

This API creates a bidirectional stream with speech audio as input and text as output. A WebSocket is used so that the input speech can be sent as a stream.

API interface

The API requires the model query parameter to be set in the request; it specifies the model to use for transcription.
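
For example, with the model used in the full example at the end of this guide, the connection URL would be:

wss://api.platform.a15t.com/v1/audio/transcriptions-realtime?model=assemblyai/assemblyai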

Before the client sends the audio stream, it must send an init message containing the information needed to decode the stream, and optionally additional configuration.

The init message must be sent as a string that can be decoded into a JSON object with the following fields:

  • sample_rate - The sample rate of the input audio stream (required)
  • encoding - The encoding of the stream (optional). If not set, the model's default is used.

Example init message

'{"sample_rate": 16000}'
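
An init message can also set the optional encoding field. The encoding value below is only illustrative; the accepted values depend on the model in use:

'{"sample_rate": 16000, "encoding": "pcm_s16le"}'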

After the init message is sent the client should continuously send audio chunks as binary websocket messages.

When the client has finished streaming the audio input, it should send a termination message to indicate that the audio stream has ended. This allows the server to finish generating the final text and gracefully close the connection to the client. The termination message should be a JSON string with a single field, done.

Example termination message

'{"done": true}'

The STT stream will return JSON string messages with the following fields:

  • audio_start (number) - Start time of this audio clip since session start in milliseconds
  • audio_end (number) - End time of this audio clip since session start in milliseconds
  • confidence (number) - How confident the model was in this transcription (0-1)
  • text (string) - The transcription for the respective part of the audio.
  • words (array) - List of word objects for each word in the transcription.
    • start (number) - Start of this word in milliseconds.
    • end (number) - End of this word in milliseconds.
    • confidence (number) - How confident the model was in transcribing this word.
    • text (string) - The word as text.
  • created (string) - Transcript's timestamp.
  • message_type (string) - The type of message: either PartialTranscript for an interim part of the transcript, or FinalTranscript once the audio sent so far has finished transcribing.
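
To illustrate the message shape, the handler below is a sketch (handle_message is a hypothetical helper; the field names are the ones listed above) that prints a final transcript together with its per-word timings:

import json

def handle_message(raw: str) -> None:
    # Decode a transcript message and print word-level details
    msg = json.loads(raw)
    if msg["message_type"] != "FinalTranscript" or not msg["text"]:
        return
    print(f'{msg["audio_start"]}-{msg["audio_end"]} ms '
          f'(confidence {msg["confidence"]:.2f}): {msg["text"]}')
    for word in msg.get("words", []):
        print(f'  {word["start"]}-{word["end"]} ms  {word["text"]}')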

Example - Python

import json
from threading import Thread

import pyaudio
import websocket

# CHANGEME: set API key
API_KEY = "sk-..."

MODEL = "assemblyai/assemblyai"
HOST = "wss://api.platform.a15t.com"

FRAMES_PER_BUFFER = 3200
FORMAT = pyaudio.paInt16
CHANNELS = 1
SAMPLE_RATE = 16000
p = pyaudio.PyAudio()

stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)


def on_message(ws, message):
    """
    Called for every message received from the server.
    """
    transcript = json.loads(message)
    text = transcript["text"]
    if text != "":
        if transcript["message_type"] == "PartialTranscript":
            print(f"Partial transcript received: {text}")
        elif transcript["message_type"] == "FinalTranscript":
            print(f"Final transcript received: {text}")


def on_error(ws, error):
    print(error)


def on_close(ws, close_status_code, close_msg):
    print("### closed ###")


def on_open(ws):
    print("Opened connection")

    init = {"sample_rate": 16000}
    ws.send_text(json.dumps(init))

    def send_data():
        while True:
            # read from the microphone
            data = stream.read(FRAMES_PER_BUFFER)

            # binary data can be sent directly
            ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)

    # Start a thread where we send data to avoid blocking the 'read' thread
    Thread(target=send_data).start()


ws = websocket.WebSocketApp(
    url=f"{HOST}/v1/audio/transcriptions-realtime?model={MODEL}",
    header={"Authorization": f"Bearer {API_KEY}"},
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
    on_open=on_open,
)

# In this example we run forever, though in a real application the websocket should be closed
# via a termination message.
ws.run_forever()
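
To end a session cleanly, stop sending audio, send the termination message, and close the socket once the final transcript arrives. The sketch below reuses stream, ws and FRAMES_PER_BUFFER from the example above; the stop_event flag and the close-on-final-transcript policy are assumptions about how a client might shut down, not part of the API:

import threading

stop_event = threading.Event()

def send_data():
    # Stream microphone audio until stop_event is set, then tell the
    # server the audio has ended so it can flush the final transcript.
    while not stop_event.is_set():
        data = stream.read(FRAMES_PER_BUFFER)
        ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)
    ws.send(json.dumps({"done": True}))

# In on_message, once the FinalTranscript that follows the done message
# has been handled, call ws.close() to shut the connection down gracefully.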