Guide: Audio APIs
Introduction
This guide explains how to use our speech and audio APIs.
Supported Models
For a complete list of available models, please refer to the Models section and filter by the desired capability. Each model has specific features and limitations detailed in their respective documentation.
TTS - create speech
This API creates an audio stream from input text.
Using Python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://api.platform.a15t.com/v1",
)

speech_file_path = "out.wav"

response = client.audio.speech.create(
    model="skt/axtts-2-6",
    voice="aria",
    input="안녕하세요",
    response_format="wav",
)
# Write the returned audio to a local file.
response.stream_to_file(speech_file_path)
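Note that recent versions of the openai Python SDK deprecate calling stream_to_file directly on the response. Assuming a reasonably current SDK, a streaming variant would look like the following sketch:

with client.audio.speech.with_streaming_response.create(
    model="skt/axtts-2-6",
    voice="aria",
    input="안녕하세요",
    response_format="wav",
) as response:
    # Stream the audio to disk chunk by chunk instead of buffering it all in memory.
    response.stream_to_file(speech_file_path)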
Using cURL
curl -s https://api.platform.a15t.com/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "input": "안녕하세요!",
    "model": "skt/axtts-2-6",
    "voice": "aria",
    "response_format": "wav"
  }' > out.wav
Configuration options
Additional request parameters not supported by the OpenAI API can be provided via the model_extensions field.
curl -s https://api.platform.a15t.com/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "input": "안녕하세요!",
    "model": "skt/axtts-2-6",
    "voice": "aria",
    "response_format": "wav",
    "speed": 1.0,
    "model_extensions": {
      "sr": 16000
    }
  }' > out.wav
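The same request can be made from Python. The openai SDK lets you pass fields it does not model itself via extra_body, so a sketch (reusing the client from the TTS example above) would be:

response = client.audio.speech.create(
    model="skt/axtts-2-6",
    voice="aria",
    input="안녕하세요!",
    response_format="wav",
    speed=1.0,
    # extra_body fields are merged into the JSON request body.
    extra_body={"model_extensions": {"sr": 16000}},
)
response.stream_to_file("out.wav")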
Realtime STT - Live transcription of speech to text
This API creates a bidirectional stream with speech audio as input and text as output. To support streaming the input speech, the API is served over a WebSocket.
API interface
The API requires the model query parameter to be set in the request. This parameter specifies the model to be used for transcription.
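For example, with the model used in the Python example below, the connection URL would be:

wss://api.platform.a15t.com/v1/audio/transcriptions-realtime?model=assemblyai/assemblyai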
Before the client sends the audio stream, it must send an init message containing the information necessary for decoding the stream, and optionally additional configuration.
The init message must be sent as a string that can be decoded into a JSON object with the following fields:
- sample_rate - The sample rate of the input audio stream (required).
- encoding - The encoding of the stream. If not set, the model's default setting is used.
Example init message
'{"sample_rate": 16000}'
After the init message is sent, the client should continuously send audio chunks as binary websocket messages.
When the client has finished streaming the audio input, it should send a termination message to indicate that the audio stream has ended. This allows the server to finish generating the final text and gracefully close the connection to the client. The termination message should be a JSON string with a single field, done.
Example termination message
'{"done": true}'
The STT stream will return JSON string messages with the following fields:
- audio_start (number) - Start time of this audio clip since session start, in milliseconds.
- audio_end (number) - End time of this audio clip since session start, in milliseconds.
- confidence (number) - How confident the model was in this transcription (0-1).
- text (string) - The transcription for the respective part of the audio.
- words (array) - List of word objects for each word in the transcription:
  - start (number) - Start of this word in milliseconds.
  - end (number) - End of this word in milliseconds.
  - confidence (number) - How confident the model was in transcribing this word.
  - text (string) - The word as text.
- created (string) - The transcript's timestamp.
- message_type (string) - The type of message: either PartialTranscript for a partial piece of the transcript, or FinalTranscript when the last segment of audio sent has finished transcribing.
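For illustration, a FinalTranscript message might look like the following (all values are hypothetical):

{
  "audio_start": 0,
  "audio_end": 1500,
  "confidence": 0.97,
  "text": "hello there",
  "words": [
    {"start": 120, "end": 480, "confidence": 0.98, "text": "hello"},
    {"start": 520, "end": 900, "confidence": 0.96, "text": "there"}
  ],
  "created": "2024-01-01T00:00:00Z",
  "message_type": "FinalTranscript"
}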
Example - Python
import json
import websocket
import pyaudio
from threading import Thread

# CHANGEME: set API key
API_KEY = "sk-..."
MODEL = "assemblyai/assemblyai"
HOST = "wss://api.platform.a15t.com"

FRAMES_PER_BUFFER = 3200
FORMAT = pyaudio.paInt16
CHANNELS = 1
SAMPLE_RATE = 16000

# Open the microphone for 16-bit mono PCM capture.
p = pyaudio.PyAudio()
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=SAMPLE_RATE,
    input=True,
    frames_per_buffer=FRAMES_PER_BUFFER,
)

def on_message(ws, message):
    """Called for every message received from the server."""
    transcript = json.loads(message)
    text = transcript["text"]
    if text != "":
        if transcript["message_type"] == "PartialTranscript":
            print(f"Partial transcript received: {text}")
        elif transcript["message_type"] == "FinalTranscript":
            print(f"Final transcript received: {text}")

def on_error(ws, error):
    print(error)

def on_close(ws, close_status_code, close_msg):
    print("### closed ###")

def on_open(ws):
    print("Opened connection")
    # The init message must be sent before any audio.
    init = {"sample_rate": 16000}
    ws.send_text(json.dumps(init))

    def send_data():
        while True:
            # Read from the microphone.
            data = stream.read(FRAMES_PER_BUFFER)
            # Binary data can be sent directly.
            ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)

    # Start a thread where we send data to avoid blocking the 'read' thread.
    Thread(target=send_data).start()

ws = websocket.WebSocketApp(
    url=f"{HOST}/v1/audio/transcriptions-realtime?model={MODEL}",
    header={"Authorization": f"Bearer {API_KEY}"},
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
    on_open=on_open,
)

# In this example we run forever, though in a real application the websocket
# should be closed via a termination message.
ws.run_forever()
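As a sketch of a graceful shutdown (assuming the ws, stream, and p objects defined above, and that the send_data loop has been stopped), the client could terminate the session like this:

def shutdown():
    # Tell the server the audio stream has ended; it will emit the final
    # transcript and then close the websocket.
    ws.send(json.dumps({"done": True}))
    # Release the microphone resources.
    stream.stop_stream()
    stream.close()
    p.terminate()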