Streaming

WebSocket protocol for real-time voice synthesis and dialogue.

Overview

The WebSocket API provides real-time, bidirectional communication for streaming synthesis and interactive dialogue. Connect once, then send multiple synthesis requests over the same connection instead of paying HTTP setup overhead for each one.

Connecting

Open a WebSocket connection to the streaming endpoint. Authenticate by passing your API key as the token query parameter or in the first message.

WS /v1/ws/stream (AUTH)

WebSocket endpoint for real-time streaming synthesis.

Open a WebSocket connection

javascript

const ws = new WebSocket(
  'wss://originneural.ai/v1/ws/stream?token=origin_sk_your_key'
);

ws.onopen = () => {
  console.log('Connected');
};

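ws.onmessage = (event) => {
  // Handle incoming audio chunks or status messages
  const data = JSON.parse(event.data);
  if (data.type === 'audio') {
    // data.audio is base64-encoded PCM audio.
    // playAudioChunk is an application-defined playback function.
    playAudioChunk(data.audio);
  } else if (data.type === 'error') {
    // Log the whole message; it carries a code and a description.
    console.error('Synthesis error:', data);
  }
};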

ws.onerror = (error) => {
  console.error('WebSocket error:', error);
};
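
The handler above passes data.audio to playAudioChunk, which this page does not define. As a minimal sketch of the decoding step such a function might start with, the hypothetical helper decodePcmChunk below turns one base64 payload into floating-point samples. It assumes 16-bit signed little-endian PCM; this page does not state the sample format, sample rate, or channel count, so verify those before relying on it.

```javascript
// Hypothetical helper (not part of the API): decode one base64 "audio"
// payload into samples in the range [-1, 1).
// ASSUMPTION: 16-bit signed little-endian PCM. The format is not documented here.
function decodePcmChunk(base64Audio) {
  // atob is available in browsers and as a global in Node 16+.
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  // Read through a DataView so the byte order is explicit.
  const view = new DataView(bytes.buffer);
  const sampleCount = Math.floor(bytes.length / 2);
  const samples = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    // Scale int16 values to floats, the form the Web Audio API expects.
    samples[i] = view.getInt16(i * 2, true) / 32768;
  }
  return samples;
}
```

The returned Float32Array can be copied into a Web Audio AudioBuffer for playback once the real sample rate is known.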

Message Types

Every message is a JSON object with a type field. The types below are listed by direction: client to server first, then server to client.

Client → Server: "synthesize" — Start synthesis with text and voice_id.
Client → Server: "stop" — Cancel the current synthesis.
Client → Server: "ping" — Keep-alive ping.
Server → Client: "audio" — Base64-encoded PCM audio chunk.
Server → Client: "done" — Synthesis complete.
Server → Client: "error" — Error message with code and description.
Server → Client: "pong" — Response to ping.
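
One way to handle the server-to-client types listed above is a small dispatcher that routes each message by its type field. This is a sketch, not part of the API: dispatchServerMessage and the shape of the handlers object (including the invalid and unknown callbacks) are hypothetical names chosen for illustration.

```javascript
// The four server-to-client types listed above.
const SERVER_TYPES = new Set(['audio', 'done', 'error', 'pong']);

// Hypothetical dispatcher: parse one frame and call the matching handler.
// Returns the type that was dispatched, or 'invalid' / 'unknown'.
function dispatchServerMessage(rawData, handlers) {
  let message;
  try {
    message = JSON.parse(rawData);
  } catch (err) {
    message = null;
  }
  // Frames that are not JSON objects are reported instead of thrown, so one
  // bad frame does not break the socket's onmessage handler.
  if (message === null || typeof message !== 'object') {
    if (typeof handlers.invalid === 'function') handlers.invalid(rawData);
    return 'invalid';
  }
  if (SERVER_TYPES.has(message.type) && typeof handlers[message.type] === 'function') {
    handlers[message.type](message);
    return message.type;
  }
  if (typeof handlers.unknown === 'function') handlers.unknown(message);
  return 'unknown';
}
```

With this in place, an onmessage handler reduces to a single call such as dispatchServerMessage(event.data, { audio: ..., done: ..., error: ... }).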

Message format example

json

{
  "type": "synthesize",
  "text": "Hello world",
  "voice_id": "default",
  "engine": "kokoro",
  "speed": 1.0
}
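
The client-to-server messages can be built with small helpers so that every send uses the same field names as the example above. This is a sketch under assumptions: buildSynthesize, buildStop, and buildPing are hypothetical names, engine and speed are treated as optional, and "stop" and "ping" are assumed to need only a type field because this page does not show their full shape.

```javascript
// Hypothetical builders for the three client-to-server message types.
function buildSynthesize(text, voiceId, options = {}) {
  if (typeof text !== 'string' || text.length === 0) {
    throw new TypeError('text must be a non-empty string');
  }
  if (typeof voiceId !== 'string' || voiceId.length === 0) {
    throw new TypeError('voiceId must be a non-empty string');
  }
  const message = { type: 'synthesize', text, voice_id: voiceId };
  // engine and speed appear in the example above; ASSUMED optional here.
  if (options.engine !== undefined) message.engine = options.engine;
  if (options.speed !== undefined) message.speed = options.speed;
  return JSON.stringify(message);
}

// ASSUMPTION: these two messages carry no fields besides type.
function buildStop() {
  return JSON.stringify({ type: 'stop' });
}

function buildPing() {
  return JSON.stringify({ type: 'ping' });
}
```

For example, calling ws.send(buildSynthesize('Hello world', 'default', { engine: 'kokoro', speed: 1.0 })) inside ws.onopen sends the message shown above.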

Dialogue Sessions

The Echo engine (moshi) supports interactive dialogue sessions. Open a dialogue WebSocket and send text turns — the engine maintains conversational context across turns.

Dialogue sessions use the Echo engine (moshi) exclusively.
Context is maintained for the duration of the WebSocket connection.
Max session duration is 10 minutes. Reconnect to start a new session.
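
Because a session is limited to 10 minutes and context lasts only as long as the connection, a client may want to track how much session time is left and reconnect between turns instead of in the middle of a response. The sketch below shows one way to do that bookkeeping; sessionRemainingMs, shouldReconnect, and the 30-second margin are hypothetical choices, and how the server behaves at the limit is not described on this page.

```javascript
// Session limit stated above: 10 minutes per dialogue connection.
const MAX_SESSION_MS = 10 * 60 * 1000;

// Hypothetical helper: milliseconds left before the session limit.
// startedAt and now are epoch milliseconds, e.g. from Date.now().
function sessionRemainingMs(startedAt, now = Date.now()) {
  return Math.max(0, MAX_SESSION_MS - (now - startedAt));
}

// Hypothetical helper: true when no more than marginMs remains.
// The 30-second default margin is an illustrative value.
function shouldReconnect(startedAt, now = Date.now(), marginMs = 30000) {
  return sessionRemainingMs(startedAt, now) <= marginMs;
}
```

Record Date.now() in ws.onopen and check shouldReconnect before sending each new turn.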

WS /v1/ws/dialogue (AUTH)

WebSocket endpoint for interactive voice dialogue.

Dialogue session

javascript

const ws = new WebSocket(
  'wss://originneural.ai/v1/ws/dialogue?token=origin_sk_your_key'
);

ws.onopen = () => {
  // Start a dialogue turn
  ws.send(JSON.stringify({
    type: 'turn',
    text: 'Tell me about voice synthesis.',
    voice_id: 'default',
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'response_audio') {
    playAudioChunk(data.audio);
  } else if (data.type === 'response_text') {
    console.log('Response:', data.text);
  }
};
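
The example above sends its turn from onopen, where the socket is known to be open. For later turns, a small guard avoids calling send on a socket that is still connecting or already closed. This is a sketch: sendTurn is a hypothetical helper, and the message fields simply mirror the example above.

```javascript
// readyState value for an open socket in the standard WebSocket API.
const WS_OPEN = 1;

// Hypothetical helper: send one dialogue turn only if the socket is open.
// Returns true if the turn was sent, false if the caller should retry later.
function sendTurn(ws, text, voiceId = 'default') {
  if (ws.readyState !== WS_OPEN) {
    return false;
  }
  ws.send(JSON.stringify({ type: 'turn', text, voice_id: voiceId }));
  return true;
}
```

A false return is a signal to queue the turn and resend it after the next onopen event.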

Best Practices

Follow these best practices for reliable WebSocket connections:

Send ping messages every 30 seconds to keep the connection alive.
Implement reconnection logic with exponential backoff.
Buffer audio chunks client-side before playback to avoid gaps.
Close the connection gracefully when done by calling ws.close(1000); 1000 is the WebSocket status code for a normal closure.
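
Two of the practices above, keep-alive pings and reconnection with exponential backoff, can be sketched as follows. The helper names, the 500 ms base delay, the 30-second cap, and the use of random jitter are illustrative assumptions and not documented values; only the 30-second ping interval and the "ping" message type come from this page.

```javascript
// Hypothetical helper: delay before reconnect attempt number `attempt`
// (0-based). The delay ceiling doubles each attempt up to maxMs.
function backoffDelayMs(attempt, options = {}) {
  const baseMs = options.baseMs !== undefined ? options.baseMs : 500;
  const maxMs = options.maxMs !== undefined ? options.maxMs : 30000;
  const random = options.random || Math.random;
  const ceiling = Math.min(maxMs, baseMs * Math.pow(2, attempt));
  // Jitter: pick a random delay below the ceiling so that many clients
  // do not all reconnect at the same instant.
  return Math.floor(random() * ceiling);
}

// Hypothetical helper: send a "ping" every 30 seconds while the socket
// is open. Returns a function that stops the timer.
function startKeepAlive(ws, intervalMs = 30000) {
  const timer = setInterval(() => {
    if (ws.readyState === 1) { // 1 is WebSocket.OPEN
      ws.send(JSON.stringify({ type: 'ping' }));
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A typical arrangement is to start the keep-alive in onopen and reset the attempt counter there, then in onclose stop the keep-alive and schedule the next connection attempt with setTimeout and backoffDelayMs, incrementing the counter each time.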
