Synthesize Speech
The primary synthesis endpoint. Send text and a voice ID, and get back audio (WAV by default; MP3 and OGG are also available). The engine is selected automatically based on your request parameters and the voice type, or you can specify it explicitly.
/v1/speak (AUTH)
Generate speech audio from text. Returns audio/wav binary data.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | The text to synthesize. Max length depends on your plan. |
| voice_id* | string | Voice identifier. Use "default" or a specific voice ID. |
| engine | string | Engine to use: kokoro (Spark), f5tts (Flux), fish_speech (Tide). Default: auto |
| speed | number | Playback speed multiplier (0.5–2.0). Default: 1.0 |
| format | string | Output format: wav, mp3, ogg. Default: wav |
Request Body
```json
{
  "text": "Hello world",
  "voice_id": "default",
  "speed": 1.0,
  "format": "wav"
}
```
Basic synthesis
```bash
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default"}' \
  --output speech.wav
```
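The optional parameters compose as you would expect. A quick sketch (values are illustrative) requesting MP3 output at 1.25x speed on the low-latency Spark engine:

```bash
# Request MP3 at 1.25x speed, explicitly on the Spark (kokoro) engine.
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default", "engine": "kokoro", "speed": 1.25, "format": "mp3"}' \
  --output speech.mp3
```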
Streaming Synthesis
For real-time applications, use the streaming endpoint. Audio chunks are sent as they're generated, reducing time-to-first-byte significantly. See the Streaming page for WebSocket details.
- Streaming is only supported by the Spark engine (kokoro — Swift Lane). Other engines will fall back to non-streaming mode.
- The response uses chunked transfer encoding — read the stream progressively.
/v1/speak/stream (AUTH)
Stream audio chunks as they are generated. Returns audio via chunked transfer encoding.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | The text to synthesize. |
| voice_id* | string | Voice identifier. |
| engine | string | Engine to use: kokoro (Spark). Default: kokoro |
| chunk_size | number | Audio chunk size in samples. Default: 4096 |
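As a sketch of consuming the stream with curl (assuming the endpoint accepts a POST with the same JSON body style as /v1/speak), the -N flag disables output buffering so chunks are written as they arrive:

```bash
# Stream synthesis; chunks are written to disk progressively.
# Assumption: POST with a JSON body mirroring /v1/speak.
curl -N -X POST https://originneural.ai/v1/speak/stream \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello world", "voice_id": "default", "engine": "kokoro", "chunk_size": 4096}' \
  --output stream.wav
```

Piping the output to a player that reads stdin (e.g. `ffplay -`) instead of a file would start playback before synthesis finishes.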
Async Synthesis
For long-form content (articles, books, podcasts), use async synthesis. Submit a job and poll for status. Completed audio is available for download for 24 hours.
/v1/speak/async (AUTH)
Submit a long-form synthesis job. Returns a job ID for status polling.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | Text to synthesize. Up to 100,000 characters. |
| voice_id* | string | Voice identifier. |
| engine | string | Engine to use: f5tts (Flux). Default: f5tts |
| webhook_url | string | URL to POST when the job completes. |
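A hedged example submission (assuming a POST with a JSON body, mirroring /v1/speak; the webhook URL here is a placeholder):

```bash
# Submit a long-form job; the JSON response carries a job_id to poll.
# Assumption: POST with a JSON body, as on /v1/speak. webhook_url is optional.
curl -X POST https://originneural.ai/v1/speak/async \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Chapter one. It was a dark and stormy night...", "voice_id": "default", "webhook_url": "https://example.com/hooks/tts-done"}'
```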
Response
```json
{
  "job_id": "job_abc123",
  "status": "queued",
  "estimated_seconds": 45
}
```
/v1/speak/async/{job_id} (AUTH)
Check the status of an async synthesis job.
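A polling sketch using the job_id from the submission response (assuming a plain GET with the same bearer token, including for the download URL returned on completion):

```bash
# Check job status; repeat until "status" is "completed".
# Assumption: plain GET with the same Authorization header.
curl https://originneural.ai/v1/speak/async/job_abc123 \
  -H "Authorization: Bearer origin_sk_your_key"

# Once completed, fetch the audio from download_url before expires_at.
curl https://originneural.ai/v1/speak/async/job_abc123/download \
  -H "Authorization: Bearer origin_sk_your_key" \
  --output longform.wav
```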
Response
```json
{
  "job_id": "job_abc123",
  "status": "completed",
  "download_url": "https://originneural.ai/v1/speak/async/job_abc123/download",
  "duration_seconds": 32.5,
  "expires_at": "2026-02-08T12:00:00Z"
}
```
Engine Selection
ORIGIN Neural runs four synthesis engines, each optimized for different use cases. The engine parameter lets you choose explicitly, or set it to "auto" to let the system pick the best engine for your request.
- Spark (kokoro) — Swift Lane. Lowest latency (~200ms). Best for real-time and streaming. Supports all base voices.
- Flux (f5tts) — Clarity Lane. Highest quality. Best for long-form content and cloned voices. ~1.5s latency.
- Tide (fish_speech) — Clarity Lane. Multi-lingual specialist. Best for non-English content. ~1.2s latency.
- Echo (moshi) — Dialogue Lane. Conversational AI. Best for interactive dialogue and turn-based chat.
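For example, forcing the multilingual Tide engine for French text rather than relying on auto selection (the engine value is documented above; the text is illustrative):

```bash
# Explicitly select Tide (fish_speech) for non-English content.
curl -X POST https://originneural.ai/v1/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Bonjour tout le monde", "voice_id": "default", "engine": "fish_speech"}' \
  --output bonjour.wav
```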
Demo Endpoint
The demo endpoint is a simplified version of /v1/speak designed for the website voice demo. It has a lower character limit (200 characters) but works with the same voices.
/api/demo/speak (AUTH)
Simplified synthesis for the web demo. Max 200 characters.
Parameters
| Name | Type | Description |
|---|---|---|
| text* | string | Text to synthesize. Max 200 characters. |
| voice_id* | string | Voice identifier. |
| speed | number | Playback speed (0.5–2.0). Default: 1.0 |
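A sketch of a demo request, assuming the endpoint shares the main API's base URL, bearer auth, and JSON body shape:

```bash
# Demo synthesis: same voices, capped at 200 characters.
# Assumptions: same base URL, auth, and body shape as /v1/speak.
curl -X POST https://originneural.ai/api/demo/speak \
  -H "Authorization: Bearer origin_sk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"text": "Try a voice in the demo.", "voice_id": "default", "speed": 1.0}' \
  --output demo.wav
```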