Synthesize Speech
The primary synthesis endpoint. Send text and a voice ID, and get back audio (WAV by default; MP3 and OGG are also available). The engine is selected automatically based on your request parameters and the voice type, or you can specify it explicitly.
/v1/speak (AUTH)
Generate speech audio from text. Returns audio/wav binary data.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | The text to synthesize. Max length depends on your plan. |
| voice_id* | string | Voice identifier. Use "default" or a specific voice ID. |
| engine | string | Engine to use: kokoro (Spark), f5tts (Flux), fish_speech (Tide). Default: auto |
| speed | number | Playback speed multiplier (0.5–2.0). Default: 1.0 |
| format | string | Output format: wav, mp3, ogg. Default: wav |
Request Body
```json
{
  "text": "Hello world",
  "voice_id": "default",
  "speed": 1.0,
  "format": "wav"
}
```
Basic synthesis
```bash
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default"}' \
  --output speech.wav
```
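The optional parameters compose as you would expect. A quick sketch (values are illustrative) requesting MP3 output at 1.25x speed on the low-latency Spark engine:

```bash
# Request MP3 at 1.25x speed, explicitly on the Spark (kokoro) engine.
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default", "engine": "kokoro", "speed": 1.25, "format": "mp3"}' \
  --output speech.mp3
```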
Streaming Synthesis
For real-time applications, use the streaming endpoint. Audio chunks are sent as they're generated, reducing time-to-first-byte significantly. See the Streaming page for WebSocket details.
- Streaming is only supported by the Spark engine (kokoro — Swift Lane). Other engines will fall back to non-streaming mode.
- The response uses chunked transfer encoding — read the stream progressively.
/v1/speak/stream (AUTH)
Stream audio chunks as they are generated. Returns audio via chunked transfer encoding.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | The text to synthesize. |
| voice_id* | string | Voice identifier. |
| engine | string | Engine to use: kokoro (Spark). Default: kokoro |
| chunk_size | number | Audio chunk size in samples. Default: 4096 |
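As a sketch of consuming the stream with curl (assuming the endpoint accepts a POST with the same JSON body style as /v1/speak), the -N flag disables output buffering so chunks are written as they arrive:

```bash
# Stream synthesis; chunks are written to disk progressively.
# Assumption: POST with a JSON body mirroring /v1/speak.
curl -N -X POST https://originneural.ai/v1/speak/stream \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default", "engine": "kokoro", "chunk_size": 4096}' \
  --output stream.wav
```

Piping the output to a player that reads stdin (e.g. `ffplay -`) instead of a file would start playback before synthesis finishes.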
Async Synthesis
For long-form content (articles, books, podcasts), use async synthesis. Submit a job and poll for status. Completed audio is available for download for 24 hours.
/v1/speak/async (AUTH)
Submit a long-form synthesis job. Returns a job ID for status polling.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | Text to synthesize. Up to 100,000 characters. |
| voice_id* | string | Voice identifier. |
| engine | string | Engine to use: f5tts (Flux). Default: f5tts |
| webhook_url | string | URL to POST when the job completes. |
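A hedged example submission (assuming a POST with a JSON body, mirroring /v1/speak; the webhook URL here is a placeholder):

```bash
# Submit a long-form job; the JSON response carries a job_id to poll.
# Assumption: POST with a JSON body, as on /v1/speak. webhook_url is optional.
curl -X POST https://originneural.ai/v1/speak/async \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Chapter one. It was a dark and stormy night...", "voice_id": "default", "webhook_url": "https://example.com/hooks/tts-done"}'
```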
Response
```json
{
  "job_id": "job_abc123",
  "status": "queued",
  "estimated_seconds": 45
}
```
/v1/speak/async/{job_id} (AUTH)
Check the status of an async synthesis job.
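A polling sketch using the job_id from the submission response (assuming a plain GET with the same bearer token, including for the download URL returned on completion):

```bash
# Check job status; repeat until "status" is "completed".
# Assumption: plain GET with the same Authorization header.
curl https://originneural.ai/v1/speak/async/job_abc123 \
  -H "Authorization: Bearer origin_sk_your_key"

# Once completed, fetch the audio from download_url before expires_at.
curl https://originneural.ai/v1/speak/async/job_abc123/download \
  -H "Authorization: Bearer origin_sk_your_key" \
  --output longform.wav
```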
Response
```json
{
  "job_id": "job_abc123",
  "status": "completed",
  "download_url": "https://originneural.ai/v1/speak/async/job_abc123/download",
  "duration_seconds": 32.5,
  "expires_at": "2026-02-08T12:00:00Z"
}
```
Engine Selection
ORIGIN Neural runs four synthesis engines, each optimized for different use cases. The engine parameter lets you choose explicitly, or set it to "auto" to let the system pick the best engine for your request.
- Spark (kokoro) — Swift Lane. Lowest latency (~200ms). Best for real-time and streaming. Supports all base voices.
- Flux (f5tts) — Clarity Lane. Highest quality. Best for long-form content and cloned voices. ~1.5s latency.
- Tide (fish_speech) — Clarity Lane. Multi-lingual specialist. Best for non-English content. ~1.2s latency.
- Echo (moshi) — Dialogue Lane. Conversational AI. Best for interactive dialogue and turn-based chat.
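For example, forcing the multilingual Tide engine for French text rather than relying on auto selection (the engine value is documented above; the text is illustrative):

```bash
# Explicitly select Tide (fish_speech) for non-English content.
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Bonjour tout le monde", "voice_id": "default", "engine": "fish_speech"}' \
  --output bonjour.wav
```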
Demo Endpoint
The demo endpoint is a simplified version of /v1/speak designed for the website voice demo. It has a lower character limit (200 characters) but works with the same voices.
/api/demo/speak (AUTH)
Simplified synthesis for the web demo. Max 200 characters.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | Text to synthesize. Max 200 characters. |
| voice_id* | string | Voice identifier. |
| speed | number | Playback speed (0.5–2.0). Default: 1.0 |
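A sketch of a demo request, assuming the endpoint shares the main API's base URL, bearer auth, and JSON body shape:

```bash
# Demo synthesis: same voices, capped at 200 characters.
# Assumptions: same base URL, auth, and body shape as /v1/speak.
curl -X POST https://originneural.ai/api/demo/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Try a voice in the demo.", "voice_id": "default", "speed": 1.0}' \
  --output demo.wav
```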