← Research · Right-sizing
There's a right size for a screenshot
More detail helps the model read, up to a point. Reading accuracy climbs from 29% to 77% with resolution, but fine-OCR peaks and then declines.
Bigger is not better. There's a window where a screenshot is large enough for the model to read, but not so large that the model downscales it anyway and you've paid for tokens that bought nothing. Noru sizes every capture to that window.
Detail helps, up to a point
Wang et al.'s Qwen2-VL study sweeps the same model across rising image-token budgets and watches accuracy move11Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution . On InfographicVQA, reading-heavy accuracy climbs steeply with detail:
| Image detail (tokens) | InfoVQA | Fine OCR (OCRBench) |
|---|---|---|
| 64 | 28.9% | 572 |
| 576 | 65.7% | 828 |
| 1600 | 75.0% | 824 |
| 3136 | 77.3% | 786 |
Two patterns stand out. Reading accuracy nearly triples as detail rises, so resolution clearly matters. But the fine-OCR score peaks around the middle and then declines. The authors are explicit: "merely increasing the image size does not always lead to improved performance… it is more important to choose an appropriate resolution," and note that over-enlarging small images pushes them out of distribution and hurts1. There is a sweet spot.
The model will downscale you anyway
Past a threshold, sending more pixels is simply wasted. Anthropic's vision guidance caps Claude's input: images with a long edge beyond ~1568px are downscaled before the model ever sees them, and a ~1.15-megapixel image already costs on the order of 1,600 tokens22Claude vision documentation (image sizing & token cost) . OpenAI's GPT-4o tiles images into 512px blocks and bills per tile, so resolution maps directly to token cost33Images and vision: tile-based tokenization . Oversize the screenshot and you pay more for an image the model shrinks back down.
Model names age faster than this math. The caps and tiling are properties of how vision models ingest pixels, not of any one release: a newer model still downscales an oversized image and still bills you for the pixels it threw away.
Too small fails differently
The opposite error is just as real. On VBench (questions about small details in high-resolution images) even strong models land around chance, because static encoders downsample the fine detail away44V: Guided Visual Search as a Core Mechanism in Multimodal LLMs . Too small and the detail is gone before reasoning starts.
Too small and the detail's lost; too big and it's downscaled and billed for nothing. The right size is a narrow band, roughly Claude's ~1568px line, and hitting it every time is exactly the kind of fiddly, easy-to-get-wrong work Noru does for you.
Sources
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution, Wang et al., 2024
- Claude vision documentation (image sizing & token cost), Anthropic, 2025
- Images and vision: tile-based tokenization, OpenAI, 2025
- V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs, Wu & Xie, 2023