Call the `/chat/completions` API by constructing a `message` containing either an image URL or a base64-encoded image. The `detail` parameter can be used to control how the image is preprocessed.

The `detail` parameter accepts three values: `low`, `high`, and `auto`. For currently supported models, if `detail` is not specified or is set to `high`, the model uses the `high` ("high resolution") mode; if it is set to `low` or `auto`, the model uses the `low` ("low resolution") mode.
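A message carrying an image can be assembled in the OpenAI-compatible format described above. The sketch below shows both the URL and base64 variants; the helper names `image_message` and `data_uri` are illustrative, not part of any SDK:

```python
import base64

def image_message(text: str, image_source: str, detail: str = "auto") -> dict:
    """Build a /chat/completions user message carrying an image.

    `image_source` may be an http(s) URL or a base64 data URI; `detail`
    is one of "low", "high", or "auto".
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {
                "type": "image_url",
                "image_url": {"url": image_source, "detail": detail},
            },
        ],
    }

def data_uri(path: str) -> str:
    """Embed a local image file as a base64 data URI (JPEG assumed here)."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"
```

The resulting dict is passed as one element of the `messages` array in the request body.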
## `message` Containing Images

The DeepseekVL2 series models are suited to short contexts: it is recommended to input no more than 2 images. If more than 2 images are provided, the model automatically resizes them to 384x384, and the specified `detail` parameter is ignored.

### Qwen

The Qwen series supports a maximum pixel area of `3584 * 3584 = 12845056` and a minimum pixel area of `56 * 56 = 3136`. Each image's longer and shorter sides are first scaled to multiples of 28, i.e. `(h * 28) * (w * 28)`. If the dimensions fall outside the minimum and maximum pixel ranges, the image is proportionally resized to fit within the range.

- `detail=low`: all images are resized to `448 x 448`, consuming `256` tokens.
- `detail=high`: the image is proportionally scaled, its dimensions rounded up to the nearest multiple of 28, then resized to fit within the pixel range `(3136, 12845056)`, with both dimensions kept as multiples of 28.

Examples:

- `224 x 448`, `1024 x 1024`, and `3172 x 4096` each consume `256` tokens when `detail=low`.
- `224 x 448` consumes `(224/28) * (448/28) = 8 * 16 = 128` tokens when `detail=high`.
- `1024 x 1024` is rounded to `1036 x 1036` and consumes `(1036/28) * (1036/28) = 1369` tokens when `detail=high`.
- `3172 x 4096` is resized to `3136 x 4060` and consumes `(3136/28) * (4060/28) = 16240` tokens when `detail=high`.
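The Qwen accounting above can be sketched in Python. This is an illustrative estimate reproducing the worked examples, not the official preprocessing code; in particular, the choice to rescale from the already-rounded dimensions when an image exceeds the pixel range is inferred from the `3172 x 4096` example:

```python
import math

def qwen_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate vision-token usage for a Qwen-series image (sketch)."""
    MIN_PIXELS = 56 * 56        # 3136
    MAX_PIXELS = 3584 * 3584    # 12845056
    if detail == "low":
        # Every image becomes 448 x 448 -> (448/28)^2 = 256 tokens.
        return (448 // 28) * (448 // 28)
    # detail=high: round each side up to a multiple of 28 ...
    w = math.ceil(width / 28) * 28
    h = math.ceil(height / 28) * 28
    # ... then, if out of range, rescale proportionally while keeping
    # both sides multiples of 28.
    if w * h > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / (w * h))
        w = math.floor(w * scale / 28) * 28
        h = math.floor(h * scale / 28) * 28
    elif w * h < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / (w * h))
        w = math.ceil(w * scale / 28) * 28
        h = math.ceil(h * scale / 28) * 28
    # One token per 28 x 28 patch of the final image.
    return (w // 28) * (h // 28)
```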
### DeepseekVL2

DeepseekVL2 processes each image into two parts: `global_view` and `local_view`. The `global_view` resizes the original image to `384 x 384` pixels, while the `local_view` divides the image into multiple `384 x 384` blocks. Additional tokens are added to connect the blocks, based on the width.

- `detail=low`: all images are resized to `384 x 384` pixels.
- `detail=high`: images are resized to dimensions that are multiples of 384 (OpenAI uses 512), i.e. `(h * 384) * (w * 384)` with `1 <= h * w <= 9`.

`(h, w)` is chosen according to the following rules:

- `h` and `w` are integers; within the constraint `1 <= h * w <= 9`, traverse the candidate combinations of `(h, w)`.
- For each candidate, downscale the original image proportionally to fit within `(h * 384) x (w * 384)`, and take the minimum of the downscaled image's pixel count and the original image's pixel count as the effective pixel value; the candidate resolution's pixel count minus the effective pixel value is the invalid pixel value. If the effective pixel value exceeds the best value found so far, or if it is equal but the invalid pixel value is smaller, choose the current `(h, w)` combination.

Each image then consumes `(h * w + 1) * 196 + (w + 1) * 14 + 1` tokens.
Examples:

- `224 x 448`, `1024 x 1024`, and `2048 x 4096` each consume `421` tokens when `detail=low`.
- `384 x 768`, with `detail=high`, has an aspect ratio of 1:2 and keeps its `384 x 768` size (`h = 1, w = 2`), consuming `(1 * 2 + 1) * 196 + (2 + 1) * 14 + 1 = 631` tokens.
- `1024 x 1024`, with `detail=high`, is resized to `1152 x 1152` (`h = w = 3`), consuming `(3 * 3 + 1) * 196 + (3 + 1) * 14 + 1 = 2017` tokens.
- `2048 x 4096`, with `detail=high`, has an aspect ratio of 1:2 and is resized to `768 x 1536` (`h = 2, w = 4`), consuming `(2 * 4 + 1) * 196 + (4 + 1) * 14 + 1 = 1835` tokens.
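The tile-selection rule can be sketched in Python. The aspect-preserving downscale used in the comparison follows DeepseekVL2's LLaVA-style best-resolution search and reproduces the worked examples above, but this is an illustrative sketch rather than the official preprocessing code (dimensions are given as height x width):

```python
def deepseek_vl2_tokens(height: int, width: int, detail: str = "high") -> int:
    """Estimate vision-token usage for a DeepseekVL2 image (sketch)."""
    if detail == "low":
        h, w = 1, 1  # only the 384 x 384 global_view
    else:
        best_key, h, w = None, 1, 1
        for hc in range(1, 10):
            for wc in range(1, 10):
                if hc * wc > 9:
                    continue
                canvas_h, canvas_w = hc * 384, wc * 384
                # Downscale the original proportionally to fit the canvas.
                scale = min(canvas_h / height, canvas_w / width)
                fitted = int(height * scale) * int(width * scale)
                effective = min(fitted, height * width)
                invalid = canvas_h * canvas_w - effective
                # Prefer more effective pixels; break ties on less waste.
                key = (effective, -invalid)
                if best_key is None or key > best_key:
                    best_key, h, w = key, hc, wc
    return (h * w + 1) * 196 + (w + 1) * 14 + 1
```

Note that `detail=low` yields the same count as an `h = w = 1` image, i.e. `2 * 196 + 2 * 14 + 1 = 421` tokens.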
### GLM-4.1V

GLM-4.1V supports a minimum pixel unit of `28 x 28`, scaling image dimensions proportionally to the nearest integer multiple of 28 pixels. If the scaled pixel area is smaller than `112 * 112 = 12544` or larger than `4816894`, the dimensions are adjusted proportionally to fit within the range while remaining multiples of 28.

- `detail=low`: all images are resized to `448 x 448` pixels, consuming `256` tokens.
- `detail=high`: the image is scaled proportionally by first rounding its dimensions to the nearest 28-pixel multiple, then adjusting them to fit within the pixel range `(12544, 4816894)` while keeping both dimensions multiples of 28.

Examples:

- `224 x 448`, `1024 x 1024`, and `3172 x 4096` each consume `256` tokens when `detail=low`.
- `224 x 448`, with `detail=high`: the dimensions are already within range and multiples of 28, so tokens = `(224/28) * (448/28) = 8 * 16 = 128`.
- `1024 x 1024`, with `detail=high`, is rounded to `1036 x 1036` (within range), so tokens = `(1036/28) * (1036/28) = 1369`.
- `3172 x 4096`, with `detail=high`, is rounded to `3192 x 4088` (exceeding the maximum), then scaled proportionally to `1932 x 2464`, so tokens = `(1932/28) * (2464/28) = 6072`.
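These rules can be sketched as follows. This is an inference from the worked examples rather than the official preprocessing code, and the exact rounding at the range boundaries may differ (note the last example above rounds 3172 up to 3192 but 4096 down to 4088, so out-of-range inputs are not covered exactly by this sketch):

```python
import math

def glm41v_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate vision-token usage for a GLM-4.1V image (sketch)."""
    MIN_PIXELS = 112 * 112      # 12544
    MAX_PIXELS = 4816894
    if detail == "low":
        # Every image becomes 448 x 448 -> 256 tokens.
        return (448 // 28) * (448 // 28)
    # Round each side to the nearest multiple of 28.
    w = round(width / 28) * 28
    h = round(height / 28) * 28
    # If the area is out of range, rescale proportionally while keeping
    # both sides multiples of 28.
    if w * h > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / (w * h))
        w = math.floor(w * scale / 28) * 28
        h = math.floor(h * scale / 28) * 28
    elif w * h < MIN_PIXELS:
        scale = math.sqrt(MIN_PIXELS / (w * h))
        w = math.ceil(w * scale / 28) * 28
        h = math.ceil(h * scale / 28) * 28
    # One token per 28 x 28 patch of the final image.
    return (w // 28) * (h // 28)
```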