Large Language Model (LLM) User Manual

1. Model Core Capabilities

1.1 Basic Functions

• Text Generation: Generate coherent natural language text based on context, supporting various styles and genres.
• Semantic Understanding: Deeply parse user intent, with multi-turn dialogue management to keep conversations coherent and accurate.
• Knowledge Q&A: Cover a wide range of knowledge domains, including science, technology, culture, and history, providing accurate answers.
• Code Assistance: Support code generation, explanation, and debugging for mainstream programming languages such as Python, Java, and C++.

1.2 Advanced Capabilities

• Long Text Processing: Support context windows from 4K to 64K tokens, suitable for long-document generation and complex dialogue scenarios.
• Instruction Following: Precisely understand complex task instructions, such as "compare the A/B schemes using a Markdown table."
• Style Control: Adjust the output style through system prompts, supporting academic, conversational, poetic, and other styles.
• Multimodal Support: Beyond text generation, support tasks such as image description and speech-to-text.

2. API Call Specifications

2.1 Basic Request Structure

You can make end-to-end API requests using the OpenAI SDK, as shown in the sketch below.
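A minimal sketch, assuming an OpenAI-compatible chat completions endpoint at the base URL used in section 6; the model name is a placeholder, so substitute your own key and any model from the Models page:

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.siliconflow.com/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself in one sentence."},
    ],
)
print(response.choices[0].message.content)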

2.2 Message Structure Explanation

| Message Type | Function Description | Example Content |
| --- | --- | --- |
| system | Model instructions, defining the AI's role and general behavior | e.g., "You are a pediatrician with 10 years of experience." |
| user | User input, passing the end user's message to the model | e.g., "How should a persistent fever in a toddler be treated?" |
| assistant | Model-generated historical responses, providing examples of how the model should respond to the current request | e.g., "I suggest measuring the temperature first…" |
When you want the model to follow hierarchical instructions, message roles can help you achieve better outputs. Role handling is not deterministic, however, so it is best to experiment and see which structure yields the best results. The sketch below shows all three roles combined in one request.
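A sketch of how these roles combine into a single messages list for a multi-turn request; the contents come from the table above, and the final user turn is a hypothetical follow-up:

messages = [
    {"role": "system", "content": "You are a pediatrician with 10 years of experience."},
    {"role": "user", "content": "How should a persistent fever in a toddler be treated?"},
    {"role": "assistant", "content": "I suggest measuring the temperature first..."},
    # The latest user turn; the model answers this with the earlier turns as context.
    {"role": "user", "content": "The temperature is 38.5°C. Should we see a doctor?"},
]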

3. Model Selection Guide

Visit the Models page and use the filter options on the left to find language models that support different functionalities. Each model's details page lists specifics such as pricing, model size, and maximum context length. You can also try the models in the Playground. Note that the Playground is only for model testing and does not retain conversation history; if you wish to keep a conversation, save it manually. For more usage details, refer to the API Documentation.

4. Detailed Explanation of Core Parameters

4.1 Creativity Control

# Temperature parameter (0.0~2.0)
temperature=0.5  # Balances creativity and reliability

# Nucleus sampling (top_p)
top_p=0.9  # Samples only from the smallest token set whose cumulative probability reaches 90%

4.2 Output Limits

max_tokens=1000  # Maximum number of tokens generated per request
stop=["\n##", "<|end|>"]  # Stop sequences; generation halts when any of these strings appears
frequency_penalty=0.5  # Penalizes repeated tokens (-2.0~2.0)
stream=True  # Streams the output; recommended for long outputs to avoid timeouts
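Since a streamed response arrives as incremental chunks, the client must iterate over them. A minimal sketch using the OpenAI SDK, combining the parameters from sections 4.1 and 4.2 with a placeholder model name:

from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.siliconflow.com/v1")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Write a long essay on renewable energy."}],
    temperature=0.5,
    top_p=0.9,
    max_tokens=1000,
    stop=["<|end|>"],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # guard against chunks that carry no choices
        # Each chunk carries an incremental delta of the generated text.
        print(chunk.choices[0].delta.content or "", end="", flush=True)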

4.3 Common Issues with Language Model Scenarios

1. Garbled Model Output
Some models may produce garbled output if sampling parameters are not set. To address this, try setting parameters such as temperature, top_k, top_p, and frequency_penalty explicitly, adjusting the request payload as follows:
payload = {
    "model": "Qwen/Qwen2.5-Math-72B-Instruct",
    "messages": [
        {
            "role": "user",
            "content": "1+1=?",
        }
    ],
    "max_tokens": 200,  # Adjust as needed
    "temperature": 0.7, # Adjust as needed
    "top_k": 50,        # Adjust as needed
    "top_p": 0.7,       # Adjust as needed
    "frequency_penalty": 0 # Adjust as needed
}
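A minimal sketch of sending the payload above over raw HTTP with the requests library; the chat completions path under the base URL from section 6 is an assumption based on the OpenAI-compatible API shape:

import requests

url = "https://api.siliconflow.com/v1/chat/completions"  # assumed OpenAI-compatible path
headers = {
    "Authorization": "Bearer YOUR_KEY",
    "Content-Type": "application/json",
}
response = requests.post(url, json=payload, headers=headers)
print(response.json()["choices"][0]["message"]["content"])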
2. Explanation of max_tokens
The upper limit of max_tokens is the model's context length. Since some model inference services are still being updated, do not set max_tokens all the way to that maximum when making a request; it is recommended to reserve around 10K tokens of space for the input content.
3. Explanation of context_length
The context_length varies across LLM models. Search for a specific model on the Models page to view its details.
4. Output Truncation Issues in Model Inference
Troubleshoot the issue from the following aspects:
  • When encountering output truncation through API requests:
    • Max Tokens Setting: Set max_tokens to an appropriate value; output beyond max_tokens is truncated.
    • Stream Request Setting: Non-streaming requests are prone to 504 timeouts when the output is long; prefer stream=True.
    • Client Timeout Setting: Increase the client timeout so the connection is not closed before the output completes (see the sketch after this list).
  • When encountering output truncation through third-party client requests:
    • Cherry Studio has a default max_tokens of 4,096. Enable the “Enable Message Length Limit” switch to set max_tokens to an appropriate value.
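A minimal sketch of raising the client-side timeout with the OpenAI SDK; the 300-second value is an illustrative assumption, not an official recommendation:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.siliconflow.com/v1",
    timeout=300,  # seconds; illustrative value
)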
5. Error Code Handling

| Error Code | Common Cause | Solution |
| --- | --- | --- |
| 400 | Parameter format error | Check the range of parameters such as temperature |
| 401 | API key not correctly set | Verify the API key |
| 403 | Insufficient permissions | Commonly requires real-name authentication; refer to error messages for other cases |
| 429 | Request rate limit exceeded | Implement an exponential backoff retry mechanism (see the sketch below) |
| 503/504 | Model overload | Switch to a backup model node |
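For the 429 case, a minimal exponential-backoff sketch; the retry count and delays are illustrative assumptions, and the bare Exception catch stands in for the SDK's specific rate-limit error:

import time
import random

def call_with_backoff(make_request, max_retries=5):
    # Retries make_request with exponentially growing waits plus jitter.
    for attempt in range(max_retries):
        try:
            return make_request()
        except Exception:  # ideally catch the SDK's RateLimitError instead
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter

It can wrap any request, e.g. call_with_backoff(lambda: client.chat.completions.create(...)).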

5. Billing and Quota Management

5.1 Billing Formula

Total Cost = (Input Tokens × Input Unit Price) + (Output Tokens × Output Unit Price)
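For example, with hypothetical unit prices of ¥2 per million input tokens and ¥6 per million output tokens, a request consuming 1,200 input tokens and 800 output tokens would cost:

Total Cost = 1,200 × (2 ÷ 1,000,000) + 800 × (6 ÷ 1,000,000) = ¥0.0024 + ¥0.0048 = ¥0.0072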

5.2 Example Pricing for Different Series

Specific model prices can be found on the Models page, under each model's details.

6. Application Scenarios

6.1 Technical Documentation Generation

from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.siliconflow.com/v1")
response = client.chat.completions.create(  
    model="Qwen/Qwen2.5-Coder-32B-Instruct",  
    messages=[{  
        "role": "user",  
        "content": "Write a Python tutorial on asynchronous web scraping, including code examples and precautions."  
    }],  
    temperature=0.7,  
    max_tokens=4096  
)  
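The generated text can then be read from the response object:

print(response.choices[0].message.content)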

6.2 Data Analysis Reports

from openai import OpenAI
client = OpenAI(api_key="YOUR_KEY", base_url="https://api.siliconflow.com/v1")
response = client.chat.completions.create(  
    model="Qwen/QVQ-72B-Preview",  
    messages=[    
        {"role": "system", "content": "You are a data analysis expert. Output results in Markdown."},  
        {"role": "user", "content": "Analyze the sales trends of new energy vehicles in 2023."}  
    ],  
    temperature=0.7,  
    max_tokens=4096  
)  
Model capabilities are continuously updated. It is recommended to visit the Models page regularly for the latest information.