1. Usage Scenarios
Vision-Language Models (VLM) are large language models capable of processing both visual (image) and linguistic (text) input modalities. Based on VLMs, you can input images and text, and the model can simultaneously understand the content of the images and the context while following instructions to respond. For example:- Visual Content Interpretation: The model can interpret and describe the information in an image, such as objects, text, spatial relationships, colors, and atmosphere.
- Multi-turn Conversations Combining Visual Content and Context.
- Partial Replacement of Traditional Machine Vision Models like OCR.
- Future Applications: With continuous improvements in model capabilities, VLMs can be applied to areas such as visual agents and robotics.
2. Usage Method
For VLM models, you can invoke the/chat/completions
API by constructing a message
containing either an image URL
or a base64-encoded image
. The detail
parameter can be used to control how the image is preprocessed.
2.1 Explanation of Image Detail Control Parameters
SiliconFlow provides three options for thedetail
parameter: low
, high
, and auto
.
For currently supported models, if detail
is not specified or is set to high
, the model will use the high
(“high resolution”) mode. If set to low
or auto
, the model will use the low
(“low resolution”) mode.
2.2 Example Formats for message
Containing Images
2.2.1 Using Image URLs
2.2.2 Base64 Format
2.2.3 Multiple Images, Each in Either Format
Please note that the
DeepseekVL2
series models are suitable for handling short contexts. It is recommended to input no more than 2 images. If more than 2 images are provided, the model will automatically resize them to 384x384, and the specified detail
parameter will be ignored. 3. Supported Models
Currently supported VLM models:- THUDM series:
- THUDM/GLM-4.1V-9B-Thinking
- Qwen Series:
- Qwen/Qwen2-VL-72B-Instruct
- DeepseekVL2 Series:
- deepseek-ai/deepseek-vl2
Note: The list of supported VLM models may change. Please filter by the “Visual” tag in the “Models” to check the supported model list.
4. Billing for Visual Input Content
For visual inputs like images, the model converts them into tokens, which are combined with textual information as part of the model’s output context. This means visual inputs are also billed. Different models use different methods for converting visual content, as outlined below.4.1 Qwen Series
Rules:Qwen
supports a maximum pixel area of3584 * 3584 = 12845056
and a minimum pixel area of56 * 56 = 3136
. Each image’s longer and shorter sides are first scaled to multiples of 28(h * 28) * (w * 28)
. If the dimensions fall outside the minimum and maximum pixel ranges, the image is proportionally resized to fit within the range.
- When
detail=low
, all images are resized to448 * 448
, consuming256 tokens
. - When
detail=high
, the image is proportionally scaled, with its dimensions rounded up to the nearest multiple of 28, then resized to fit within the pixel range(3136, 12845056)
, ensuring both dimensions are multiples of 28.
- Images with dimensions
224 * 448
,1024 x 1024
, and3172 x 4096
consume256 tokens
whendetail=low
. - An image with dimensions
224 * 448
consumes(224/28) * (448/28) = 8 * 16 = 128 tokens
whendetail=high
. - An image with dimensions
1024 * 1024
is rounded to1036 * 1036
and consumes(1036/28) * (1036/28) = 1369 tokens
whendetail=high
. - An image with dimensions
3172 * 4096
is resized to3136 * 4060
and consumes(3136/28) * (4060/28) = 16240 tokens
whendetail=high
.
4.2 DeepseekVL2 Series
Rules: For each image,DeepseekVL2
processes two parts: global_view
and local_view
. The global_view
resizes the original image to 384x384
, while the local_view
divides the image into blocks of 384x384
. Additional tokens are added between blocks to maintain continuity.
- When
detail=low
, all images are resized to384x384
. - When
detail=high
, images are resized to dimensions that are multiples of384
, ensuring1 <= h * w <= 9
.
-
The scaling dimensions
(h, w)
are chosen based on:- Both
h
andw
are integers, and1 <= h * w <= 9
. - The resized image’s pixel count is compared to the original image’s pixel count, minimizing the difference.
- Both
-
Token consumption is calculated as:
(h * w + 1) * 196 + (w + 1) * 14 + 1 tokens
.
-
Images with dimensions
224 x 448
,1024 x 1024
, and2048 x 4096
consume421 tokens
whendetail=low
. -
An image with dimensions
384 x 768
consumes(1 * 2 + 1) * 196 + (2 + 1) * 14 + 1 = 631 tokens
whendetail=high
. -
An image with dimensions
1024 x 1024
is resized to1152 x 1152
and consumes(3 * 3 + 1) * 196 + (3 + 1) * 14 + 1 = 2017 tokens
whendetail=high
. -
An image with dimensions
2048 x 4096
is resized to768 x 1536
and consumes(2 * 4 + 1) * 196 + (4 + 1) * 14 + 1 = 1835 tokens
whendetail=high
. -
Images with dimensions
224 * 448
,1024 * 1024
, and2048 * 4096
, whendetail=low
is selected, will consume256 tokens
each; -
An image with dimensions
224 * 448
, whendetail=high
is selected, has an aspect ratio of1:2
, and will be resized to448 x 896
. At this point,h = 1, w = 2
, consuming(h * w + 1) * 256 = 768 tokens
; -
An image with dimensions
1024 * 1024
, whendetail=high
is selected, has an aspect ratio of1:1
, and will be resized to1344 * 1344 (h = w = 3)
. Since1024 * 1024 > 0.5 * 1344 * 1344
, at this point,h = w = 3
, consuming(3 * 3 + 1) * 256 = 2560 tokens
; -
An image with dimensions
2048 * 4096
, whendetail=high
is selected, has an aspect ratio of1:2
, and under the condition1 <= h * w <= 12
, the largest(h, w)
combination ish = 2, w = 4
. Therefore, it will be resized to896 * 1792
, consuming(2 * 4 + 1) * 256 = 2304 tokens
. */}
4.2 DeepseekVL2 series
Rules:DeepseekVL2
processes each image into two parts: global_view and local_view. global_view resizes the original image to 384*384
pixels, while local_view divides the image into multiple 384*384
blocks. Additional tokens are added to connect the blocks based on the width.
- When
detail=low
, all images will be resized to384*384
pixels. - When
detail=high
, the images will be resized to dimensions that are multiples of384(OpenAI uses 512)
,(h*384) * (w * 384)
, and1 <= h*w <= 9
.
-
The scaling dimensions
h * w
will be chosen according to the following rules:-
Both
h
andw
are integers, and within the constraint1 <= h*w <= 9
, traverse the combinations of(h, w)
. -
Resize the image to
(h*384, w*384)
pixels and compare with the original image’s pixels. Take the minimum value between the new image’s pixels and the original image’s pixels as the effective pixel value. Take the difference between the original image’s pixels and the effective pixel value as the invalid pixel value. If the effective pixel value exceeds the previously determined effective pixel value, or if the effective pixel value is the same but the invalid pixel value is smaller, choose the current(h*384, w*384)
combination. -
Token consumption will follow the following rules:
(h*w + 1) * 196 + (w+1) * 14 + 1 token
-
Both
- Images with dimensions
224 x 448
,1024 x 1024
, and2048 x 4096
, whendetail=low
is selected, will consume421 tokens
each. - An image with dimensions
384 x 768
, whendetail=high
is selected, has an aspect ratio of1:1
and will be resized to384 x 768
. At this point,h=1, w=2
, consuming(1*2 + 1) * 196 + (2+1) * 14 + 1 = 631 tokens
. - An image with dimensions
1024 x 1024
, whendetail=high
is selected, will be resized to1152*1152(h=w=3)
, consuming(3*3 + 1) * 196 + (3+1) * 14 + 1 = 2017 tokens
. - An image with dimensions
2048 x 4096
, whendetail=high
is selected, has an aspect ratio of1:2
and will be resized to768*1536(h=2, w=4)
,consuming (2*4 + 1) * 196 + (4+1) * 14 + 1 = 1835 tokens
.
4.3 GLM-4.1V-9B-Thinking
Rules:GLM-4.1V
supports a minimum pixel size of 28 * 28
, scaling image dimensions proportionally to the nearest integer multiple of 28
pixels.
If the scaled pixel size is smaller than 112 * 112
or larger than 4816894
, adjust the dimensions proportionally to fit within the range while maintaining multiples of 28
.
detail=low
: Resize all images to448*448
pixels, resulting in256 tokens
.detail=high
: Scale proportionally by first rounding the dimensions to the nearest28-pixel
multiple, then adjusting to fit within the pixel range(12544, 4816894)
while ensuring both dimensions remain multiples of28
.
224 x 448
,1024 x 1024
,3172 x 4096
: Withdetail=low
, all consume256 tokens
.224 x 448
: Withdetail=high
, since dimensions are within range and multiples of28
,tokens = (224//28) * (448//28) = 8 * 16 = 128 tokens
.1024 x 1024
: With detail=high, dimensions are rounded to1036*1036
(within range),tokens = (1036//28) * (1036//28) = 1369 tokens
.3172 x 4096
: With detail=high, rounded to3192 x 4088
(exceeds maximum), then scaled proportionally to1932 x 2464
,tokens = (1932//28) * (2464//28) = 6072 tokens
.