← Research · Right-sizing

There's a right size for a screenshot

More detail helps the model read, up to a point. Reading accuracy climbs from 29% to 77% with resolution, but fine-OCR peaks and then declines.

Bigger is not better. There's a window where a screenshot is large enough for the model to read, but not so large that the model downscales it anyway and you've paid for tokens that bought nothing. Noru sizes every capture to that window.

Detail helps, up to a point

Wang et al.'s Qwen2-VL study sweeps the same model across rising image-token budgets and watches accuracy move¹¹Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Wang et al., 2024. On InfographicVQA, reading-heavy accuracy climbs steeply with detail:

Image detail (tokens)	InfoVQA	Fine OCR (OCRBench)
64	28.9%	572
576	65.7%	828
1600	75.0%	824
3136	77.3%	786

Two patterns stand out. Reading accuracy nearly triples as detail rises, so resolution clearly matters. But the fine-OCR score peaks around the middle and then declines. The authors are explicit: "merely increasing the image size does not always lead to improved performance… it is more important to choose an appropriate resolution," and note that over-enlarging small images pushes them out of distribution and hurts¹. There is a sweet spot.

The model will downscale you anyway

Past a threshold, sending more pixels is simply wasted. Anthropic's vision guidance caps Claude's input: images with a long edge beyond ~1568px are downscaled before the model ever sees them, and a ~1.15-megapixel image already costs on the order of 1,600 tokens²²Claude vision documentation (image sizing & token cost) Anthropic, 2025. OpenAI's GPT-4o tiles images into 512px blocks and bills per tile, so resolution maps directly to token cost³³Images and vision: tile-based tokenization OpenAI, 2025. Oversize the screenshot and you pay more for an image the model shrinks back down.

Model names age faster than this math. The caps and tiling are properties of how vision models ingest pixels, not of any one release: a newer model still downscales an oversized image and still bills you for the pixels it threw away.

Too small fails differently

The opposite error is just as real. On VBench (questions about small details in high-resolution images) even strong models land around chance, because static encoders downsample the fine detail away⁴⁴V: Guided Visual Search as a Core Mechanism in Multimodal LLMs Wu & Xie, 2023. Too small and the detail is gone before reasoning starts.

Too small and the detail's lost; too big and it's downscaled and billed for nothing. The right size is a narrow band, roughly Claude's ~1568px line, and hitting it every time is exactly the kind of fiddly, easy-to-get-wrong work Noru does for you.

Sources

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, Wang et al., 2024
Claude vision documentation (image sizing & token cost), Anthropic, 2025
Images and vision: tile-based tokenization, OpenAI, 2025
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs, Wu & Xie, 2023