1. Use Cases

A Text-to-Speech (TTS) model converts text into speech output. It generates natural, fluent, and expressive speech from input text, suitable for various application scenarios:

  • Providing audio narration for blog articles
  • Generating multilingual speech content
  • Supporting real-time streaming audio output

2. API Usage Guide

  • Endpoint: /audio/speech. For details, refer to the API documentation.
  • Key request parameters (a request sketch follows this list):
    • model: The model used for speech synthesis. See the supported model list in Section 3.
    • input: The text content to be converted into audio.
    • voice: Reference voice, supporting system preset voices, user preset voices, and user dynamic voices. For detailed parameters, see: Create Text-to-Speech Request.
    • speed: Controls the audio speed. Type: float. Default value: 1.0. Range: [0.25, 4.0].
    • gain: Audio gain in dB, controlling the volume of the audio. Type: float. Default value: 0.0. Range: [-10, 10].
    • response_format: Controls the output format. Supported formats include mp3, opus, wav, and pcm. The available sampling rates vary by output format.
    • sample_rate: Controls the output sampling rate. The default value and available range vary by output format:
      • opus: Currently supports 48000 Hz only.
      • wav, pcm: Supports 8000, 16000, 24000, 32000, and 44100 Hz. Default: 44100.
      • mp3: Supports 32000 and 44100 Hz. Default: 44100.
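
A minimal sketch of a direct HTTP request to the /audio/speech endpoint using the parameters above; the input text and output file name are illustrative, and the voice uses a system preset voice (see Section 2.1):

import requests

url = "https://api.ap.siliconflow.com/v1/audio/speech"
headers = {
    "Authorization": "Bearer your-api-key",  # Obtain from https://cloud.siliconflow.com/account/ak
    "Content-Type": "application/json"
}
payload = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",
    "input": "Hello, welcome to the text-to-speech service.",
    "voice": "FunAudioLLM/CosyVoice2-0.5B:alex",  # System preset voice
    "speed": 1.0,             # Range: [0.25, 4.0]
    "gain": 0.0,              # Range: [-10, 10] dB
    "response_format": "mp3",
    "sample_rate": 44100
}

response = requests.post(url, headers=headers, json=payload)
with open("speech.mp3", "wb") as f:
    f.write(response.content)  # Response body is the raw audio bytes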

2.1 System Preset Voices

The system currently provides the following 8 preset voices:

  • Male Voices:
    • Steady Male Voice: alex
    • Deep Male Voice: benjamin
    • Magnetic Male Voice: charles
    • Cheerful Male Voice: david
  • Female Voices:
    • Steady Female Voice: anna
    • Passionate Female Voice: bella
    • Gentle Female Voice: claire
    • Cheerful Female Voice: diana

To use a system preset voice in a request, prefix the voice name with the model name. For example:

FunAudioLLM/CosyVoice2-0.5B:alex indicates the alex voice from the FunAudioLLM/CosyVoice2-0.5B model.
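
A one-line sketch of assembling such a voice string in Python:

model = "FunAudioLLM/CosyVoice2-0.5B"
voice = f"{model}:alex"  # -> "FunAudioLLM/CosyVoice2-0.5B:alex"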

2.2 User Preset Voices

2.2.1 Upload User Preset Voice via base64 Encoding

import requests
import json

url = "https://api.ap.siliconflow.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key",  # Obtain from https://cloud.siliconflow.com/account/ak
    "Content-Type": "application/json"
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "audio": "data:audio/mpeg;base64,...",  # Base64 encoded reference audio
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}

response = requests.post(url, headers=headers, data=json.dumps(data))

# Print response status code and content
print(response.status_code)
print(response.json())  # If the response is in JSON format
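
The audio field expects a data URL wrapping the base64-encoded reference audio. A minimal helper for building it from a local mp3 file, to be dropped into the example above (the helper name and file path are illustrative):

import base64

def to_audio_data_url(path: str, mime: str = "audio/mpeg") -> str:
    # Read the reference audio and wrap it as a base64 data URL
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"

data["audio"] = to_audio_data_url("/path/to/audio.mp3")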

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

To use user preset voices in requests, see the example in Section 5.2.

2.2.2 Upload User Preset Voice via File

import requests

url = "https://api.ap.siliconflow.com/v1/uploads/audio/voice"
headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://cloud.siliconflow.com/account/ak
}
data = {
    "model": "FunAudioLLM/CosyVoice2-0.5B",  # Model name
    "customName": "your-voice-name",  # Custom audio name
    "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins."  # Text content of reference audio
}

with open("/path/to/audio.mp3", "rb") as audio_file:  # Reference audio file
    files = {"file": audio_file}
    response = requests.post(url, headers=headers, files=files, data=data)

print(response.status_code)
print(response.json())  # Print response content (if in JSON format)

The returned uri field in the response is the ID of the custom voice, which can be used as the voice parameter in subsequent requests.

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}

To use user preset voices in requests, see the example in Section 5.2.

2.3 Retrieve User Dynamic Voice List

import requests
url = "https://api.ap.siliconflow.com/v1/audio/voice/list"

headers = {
    "Authorization": "Bearer your-api-key"  # Obtain from https://cloud.siliconflow.com/account/ak
}
response = requests.get(url, headers=headers)

print(response.status_code)
print(response.json())  # Print response content (if in JSON format)

The uri field of each voice in the returned list is the ID of the custom voice, which can be used as the voice parameter in subsequent requests. A single entry looks like:

{'uri': 'speech:your-voice-name:cm04pf7az00061413w7kz5qxs:mjtkgbyuunvtybnsvbxd'}
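
A sketch of pulling the voice IDs out of the list response; the exact response shape assumed here (a result array of voice objects carrying customName and uri fields) is an assumption, so adjust the keys to the actual payload:

voices = response.json()
# Assumed shape: {"result": [{"customName": ..., "uri": ...}, ...]}
for item in voices.get("result", []):
    print(item.get("customName"), "->", item.get("uri"))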

To use user dynamic voices in requests, see Section 2.4 and the example in Section 5.3.

2.4 Use User Dynamic Voices

Note: Using user dynamic voices requires identity verification.

To use user dynamic voices in requests, pass the reference audio and its transcript directly in the request; see the example in Section 5.3.

2.5 Delete User Dynamic Voice

import requests

url = "https://api.ap.siliconflow.com/v1/audio/voice/deletions"
headers = {
    "Authorization": "Bearer your-api-key",
    "Content-Type": "application/json"
}
payload = {
    "uri": "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"
}

response = requests.post(url, json=payload, headers=headers)

print(response.status_code)
print(response.text)  # Print response content

The uri field in the request parameters is the ID of the custom voice.
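
To confirm a deletion took effect, one option is to re-query the voice list endpoint from Section 2.3 and verify the uri no longer appears; a rough sketch that deliberately avoids assuming the exact list structure:

import requests

headers = {"Authorization": "Bearer your-api-key"}  # Obtain from https://cloud.siliconflow.com/account/ak
list_url = "https://api.ap.siliconflow.com/v1/audio/voice/list"
deleted_uri = "speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd"

remaining = requests.get(list_url, headers=headers).json()
# Crude containment check over the serialized response
print(deleted_uri not in str(remaining))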

3. Supported Model List

Note: Supported TTS models may change. Please filter by the “Speech” tag on the Models page for the current list of supported models.
Billing: Charges are based on the UTF-8 byte count of the input text; an online byte counter demo is available.
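
Since billing counts UTF-8 bytes rather than characters, multi-byte scripts cost more per character than ASCII. A quick local check:

text = "Today is wonderful, the holidays are coming!"
print(len(text.encode("utf-8")))  # Billable UTF-8 byte count of the input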

3.1 FunAudioLLM/CosyVoice2-0.5B Series Models

  • Cross-language speech synthesis: Enables speech synthesis across different languages, including Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Zhengzhou dialect, Changsha dialect, Tianjin dialect).
  • Emotion control: Supports generating speech with various emotional expressions, such as happiness, excitement, sadness, and anger.
  • Fine-grained control: Allows fine-grained control of emotions and prosody in generated speech through rich text or natural language input, as illustrated below.
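
For illustration, the examples in Section 5 exercise two of these controls: a natural-language instruction terminated by the <|endofprompt|> token steers emotion, and inline tokens such as [laughter] inject paralinguistic effects. Sample input strings:

# Emotion control: instruction before <|endofprompt|>, text to synthesize after it
happy_input = "Can you say this with happiness? <|endofprompt|>Today is wonderful, the holidays are coming!"

# Fine-grained control: inline rich-text tokens within the synthesized text
laughing_input = "[laughter] Sometimes, watching the innocent actions of children [laughter], we can't help but smile."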

4. Best Practices for Reference Audio

Providing high-quality reference audio samples can improve voice cloning results.

4.1 Audio Quality Guidelines

  • Single speaker only
  • Clear articulation, stable volume, pitch, and emotion
  • Short pauses (recommended: 0.5 seconds)
  • Ideal conditions: No background noise, professional recording quality, no room echo
  • Recommended duration: 8–10 seconds

4.2 File Format

  • Supported formats: mp3, wav, pcm, opus
  • Recommended: Use mp3 with 192kbps or higher to avoid quality loss
  • Uncompressed formats (e.g., WAV) offer limited additional benefits

5. Examples

5.1 Using System Preset Voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://cloud.siliconflow.com/account/ak
    base_url="https://api.ap.siliconflow.com/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / GPT-SoVITS / CosyVoice2-0.5B
  voice="FunAudioLLM/CosyVoice2-0.5B:alex",  # System preset voice
  input="Can you say this with happiness? <|endofprompt|>Today is wonderful, the holidays are coming! I'm so happy, Spring Festival is coming!",
  response_format="mp3"  # Supported formats: mp3, wav, pcm, opus
) as response:
    response.stream_to_file(speech_file_path)

5.2 Using User Preset Voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://cloud.siliconflow.com/account/ak
    base_url="https://api.ap.siliconflow.com/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B",  # Supported models: fishaudio / GPT-SoVITS / CosyVoice2-0.5B
  voice="speech:your-voice-name:cm02pf7az00061413w7kz5qxs:mttkgbyuunvtybnsvbxd",  # Uploaded custom voice name
  input="Could you mimic a Cantonese accent? <|endofprompt|>Take care and rest early.",
  response_format="mp3"
) as response:
    response.stream_to_file(speech_file_path)

5.3 Using User Dynamic Voices

from pathlib import Path
from openai import OpenAI

speech_file_path = Path(__file__).parent / "siliconcloud-generated-speech.mp3"

client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://cloud.siliconflow.com/account/ak
    base_url="https://api.ap.siliconflow.com/v1"
)

with client.audio.speech.with_streaming_response.create(
  model="FunAudioLLM/CosyVoice2-0.5B", 
  voice="",  # Leave empty to use dynamic voices
  input="[laughter] Sometimes, watching the innocent actions of children [laughter], we can't help but smile.",
  response_format="mp3",
  extra_body={"references":[
        {
            "audio": "https://sf-maas-uat-prod.oss-cn-shanghai.aliyuncs.com/voice_template/fish_audio-Alex.mp3",  # Reference audio URL. Base64 format also supported
            "text": "In the midst of ignorance, a day in the dream ends, and a new cycle begins.",  # Text content of reference audio
        }
    ]}
) as response:
    response.stream_to_file(speech_file_path)
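
The use cases in Section 1 mention real-time streaming output. Instead of writing to a file, the OpenAI SDK's streaming response can be consumed chunk by chunk; a minimal sketch, assuming the SDK's iter_bytes iterator and buffering the chunks in memory as a stand-in for a real audio sink:

from openai import OpenAI

client = OpenAI(
    api_key="Your API KEY",  # Obtain from https://cloud.siliconflow.com/account/ak
    base_url="https://api.ap.siliconflow.com/v1"
)

buffer = bytearray()
with client.audio.speech.with_streaming_response.create(
    model="FunAudioLLM/CosyVoice2-0.5B",
    voice="FunAudioLLM/CosyVoice2-0.5B:alex",  # System preset voice
    input="Streaming audio can be consumed while it is still being generated.",
    response_format="pcm"  # Raw pcm chunks suit live playback
) as response:
    for chunk in response.iter_bytes(chunk_size=4096):
        buffer.extend(chunk)  # In a real application, feed each chunk to an audio device or socket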