Text-to-speech
1. Use Cases
A Text-to-Speech (TTS) model converts input text into speech output, generating natural, fluent, and expressive audio. Typical application scenarios include:
- Providing audio narration for blog articles
- Generating multilingual speech content
- Supporting real-time streaming audio output
2. API Usage Guide
- Endpoint: /audio/speech. For details, refer to the API documentation.
- Key request parameters:
  - model: The model used for speech synthesis. See the supported model list in Section 3.
  - input: The text content to be converted into audio.
  - voice: Reference voice, supporting system preset voices, user preset voices, and user dynamic voices. For detailed parameters, see: Create Text-to-Speech Request.
  - speed: Controls the audio speed. Type: float. Default value: 1.0. Range: [0.25, 4.0].
  - gain: Audio gain in dB, controlling the volume of the audio. Type: float. Default value: 0.0. Range: [-10, 10].
  - response_format: Controls the output format. Supported formats include mp3, opus, wav, and pcm. The available sampling rates vary depending on the output format.
  - sample_rate: Controls the output sampling rate. The default value and available range vary by output format:
    - opus: Currently supports 48000 Hz only.
    - wav, pcm: Supports 8000, 16000, 24000, 32000, 44100. Default: 44100.
    - mp3: Supports 32000, 44100. Default: 44100.
2.1 System Preset Voices
The system currently provides the following 8 preset voices:
- Male Voices:
  - Steady Male Voice: alex
  - Deep Male Voice: benjamin
  - Magnetic Male Voice: charles
  - Cheerful Male Voice: david
- Female Voices:
  - Steady Female Voice: anna
  - Passionate Female Voice: bella
  - Gentle Female Voice: claire
  - Cheerful Female Voice: diana
To use a system preset voice in a request, prefix the voice name with the model name. For example, FunAudioLLM/CosyVoice2-0.5B:alex refers to the alex voice of the FunAudioLLM/CosyVoice2-0.5B model.
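This model:voice naming convention can be captured in a small helper. The eight voice names are the presets listed above; the helper itself is just an illustrative convenience, not part of the API.

```python
# The eight system preset voices listed above.
PRESET_VOICES = {
    "alex", "benjamin", "charles", "david",   # male voices
    "anna", "bella", "claire", "diana",       # female voices
}


def preset_voice(model, name):
    """Build the model-prefixed voice string, e.g. 'FunAudioLLM/CosyVoice2-0.5B:alex'."""
    if name not in PRESET_VOICES:
        raise ValueError(f"unknown preset voice: {name}")
    return f"{model}:{name}"
```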
2.2 User Preset Voices
2.2.1 Upload User Preset Voice via base64 Encoding
The uri field returned in the response is the ID of the custom voice. Pass it as the voice parameter in subsequent text-to-speech requests to use this preset voice.
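The base64 upload step can be sketched as follows. The base64 data-URI encoding is standard, but the request field names other than model (customName, text) are illustrative assumptions; consult the API reference for the exact upload schema.

```python
import base64


def encode_reference_audio(path, mime="audio/mpeg"):
    """Base64-encode a local reference clip as a data URI."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def build_voice_upload_body(model, audio_data_uri, name, transcript):
    # Field names other than "model" are assumptions; check the API reference
    # for the exact schema of the upload endpoint.
    return {
        "model": model,
        "audio": audio_data_uri,
        "customName": name,
        "text": transcript,  # what is spoken in the reference clip
    }
```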
2.2.2 Upload User Preset Voice via File
The uri field returned in the response is the ID of the custom voice. Pass it as the voice parameter in subsequent text-to-speech requests to use this preset voice.
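A file upload is typically sent as multipart/form-data. The sketch below builds (but does not send) such a request so the wire shape is visible; the upload URL and the form field names other than model are assumptions, not confirmed by this guide.

```python
import requests

UPLOAD_URL = "https://api.example.com/v1/uploads/audio/voice"  # assumed path


def prepare_voice_upload(api_key, audio_bytes, model, name, transcript):
    """Build (but do not send) a multipart/form-data voice upload request.

    Form field names other than "model" are illustrative assumptions.
    Send the returned prepared request with requests.Session().send(...).
    """
    req = requests.Request(
        "POST",
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        files={"file": ("reference.mp3", audio_bytes, "audio/mpeg")},
        data={"model": model, "customName": name, "text": transcript},
    )
    return req.prepare()
```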
2.3 Retrieve User Dynamic Voice List
The uri field of each returned entry is the ID of a dynamic voice. Pass it as the voice parameter in subsequent text-to-speech requests.
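Extracting the usable voice IDs from the list response might look like this; the {"result": [...]} envelope is an assumed response shape, so adjust the key names to the actual schema.

```python
def extract_voice_uris(response_json):
    """Pull the uri of each dynamic voice from the list response.

    The {"result": [...]} envelope is an assumption; adjust to the actual schema.
    """
    return [item["uri"] for item in response_json.get("result", [])]
```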
2.4 Use User Dynamic Voices
To use a dynamic voice, pass its uri as the voice parameter in the text-to-speech request.
2.5 Delete User Dynamic Voice
The uri field in the request parameters is the ID of the custom voice to delete.
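The deletion call can be sketched the same way, built but not sent so the request shape is visible. Only the uri request parameter comes from this guide; the deletion URL is an assumed placeholder.

```python
import requests

DELETE_URL = "https://api.example.com/v1/audio/voice/deletions"  # assumed path


def prepare_voice_delete(api_key, voice_uri):
    """Build (but do not send) the deletion request; uri identifies the custom voice."""
    req = requests.Request(
        "POST",
        DELETE_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"uri": voice_uri},
    )
    return req.prepare()
```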
3. Supported Model List
3.1 FunAudioLLM/CosyVoice2-0.5B Series Models
- Cross-language speech synthesis: Enables speech synthesis across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect).
- Emotion control: Supports generating speech with various emotional expressions, such as happiness, excitement, sadness, and anger.
- Fine-grained control: Allows fine-grained control of emotions and prosody in generated speech through rich text or natural language input.
4. Best Practices for Reference Audio
Providing high-quality reference audio samples can improve voice cloning results.
4.1 Audio Quality Guidelines
- Single speaker only
- Clear articulation, stable volume, pitch, and emotion
- Short pauses (recommended: 0.5 seconds)
- Ideal conditions: No background noise, professional recording quality, no room echo
- Recommended duration: 8–10 seconds
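For WAV reference clips, the duration and single-speaker-friendly mono check can be automated with the standard-library wave module (which reads only WAV, one of the supported reference formats). The 8-10 second window comes from the guideline above; the function itself is just a convenience sketch.

```python
import wave


def check_reference_wav(path, min_seconds=8.0, max_seconds=10.0):
    """Check a WAV reference clip against the duration and mono guidelines above."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
        channels = w.getnchannels()
    return {
        "duration_s": round(duration, 2),
        "mono": channels == 1,
        "duration_ok": min_seconds <= duration <= max_seconds,
    }
```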
4.2 File Format
- Supported formats: mp3, wav, pcm, opus
- Recommended: Use mp3 with 192kbps or higher to avoid quality loss
- Uncompressed formats (e.g., WAV) offer limited additional benefits