Why the text layer earns its place
Image and text are complementary: adding the on-screen text to a capped-resolution image lifts DocVQA accuracy on Qwen2-VL from 84.4% to 91.2%.
If the image is so good (see the previous piece), why bother sending the on-screen text at all? Because the two modalities are complementary: each recovers accuracy the other misses. This is the core of Noru's formatting layer: image first, with the text layer alongside.
Adding the text rescues a downsized image
Real screenshots get sent at capped resolution to control cost (more on that in the resolution piece). At low token budgets, the model can't read small text from pixels alone, and that is where the text layer pays off. Shpigel Nacson et al. show that integrating an OCR/text modality into a vision model, at a tight 448×448 budget, lifts DocVQA accuracy from 56.0% to 86.6% on InternVL2 and 84.4% to 91.2% on Qwen2-VL [1]. The image sets the scene; the text fills in the characters the downscaling blurred.
The models named here are from 2024, but the complementarity is structural, not a quirk of one generation. Characters lost to downscaling have to come from somewhere, and a sharper model doesn't change that: the text layer is the cheapest place to put them back.
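To make the downscaling cost concrete, here is a minimal sketch of the arithmetic. The 448×448 cap is the budget from the study cited above; the 1920×1080 screenshot, the 12 px label height, and the `fit_within` helper are illustrative assumptions, not Noru's actual resizing code.

```python
def fit_within(width: int, height: int, max_side: int = 448) -> tuple[int, int, float]:
    """Scale (width, height) to fit inside a max_side x max_side box.

    Keeps the aspect ratio and never upscales. Returns the new width,
    the new height, and the scale factor that was applied.
    """
    scale = min(max_side / width, max_side / height, 1.0)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


# Assumed example: a full-HD screenshot with a 12 px tall UI label.
new_w, new_h, scale = fit_within(1920, 1080)
label_height = 12 * scale

print(f"resized to {new_w}x{new_h}, scale {scale:.3f}")
print(f"a 12 px label is now about {label_height:.1f} px tall")
```

Under these assumptions the label ends up under 3 px tall, far too small to carry legible glyphs, which is the gap the text layer is there to close.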
It's structure, not just characters
Dumping raw OCR words isn't the same as sending well-formatted text. The same work finds that layout-aware text (words plus their positions and reading order) beats a flat string of OCR tokens at the same token budget [1]. That's why Noru doesn't just attach a blob of recognized text: it preserves order and structure, in a form the model can actually use.
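Here is a minimal sketch of what "order and structure" can mean in practice, assuming an OCR step has already produced words with pixel bounding boxes. The `Word` type, the `reading_order` and `to_text` helpers, and the line-grouping tolerance are all hypothetical; this is a simple single-column heuristic, not Noru's implementation and not the method from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # box width
    h: float  # box height

    @property
    def center_y(self) -> float:
        return self.y + self.h / 2


def reading_order(words: list[Word], line_tolerance: float = 0.5) -> list[list[Word]]:
    """Group words into lines by vertical center, then sort each line left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda item: item.center_y):
        if lines:
            current = lines[-1]
            line_center = sum(item.center_y for item in current) / len(current)
            # Same line if the centers are within a fraction of the word height.
            if abs(word.center_y - line_center) <= line_tolerance * word.h:
                current.append(word)
                continue
        lines.append([word])
    return [sorted(line, key=lambda item: item.x) for line in lines]


def to_text(words: list[Word]) -> str:
    """Render words as text, one visual line per output line."""
    return "\n".join(" ".join(item.text for item in line) for line in reading_order(words))


# OCR engines often return words in arbitrary order; these boxes are made up.
scrambled = [
    Word("Cancel", 120, 40, 50, 12),
    Word("Save", 120, 10, 40, 12),
    Word("Open", 10, 41, 35, 12),
    Word("File", 10, 11, 30, 12),
]
print(to_text(scrambled))
```

With the made-up boxes above, the output puts "File Save" on the first line and "Open Cancel" on the second, regardless of the order the words arrived in. Multi-column pages, tables, and rotated text need more than this heuristic.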
Google's ScreenAI reaches the same conclusion for screens and infographics specifically: adding the OCR text as an extra input improves answer accuracy by up to ~4.5% across screen- and document-QA tasks [2].
Where each modality wins
| Content | Text layer alone | The image |
|---|---|---|
| Dense, text-heavy document | Often enough to match vision | Adds little |
| Chart / infographic / diagram | Misses the visual relationships | Decisive |
| UI with small labels at low res | Recovers the unreadable text | Sets the layout |
Lee et al.'s Pix2Struct work captures the boundary: on text-heavy documents a layout-aware text model is competitive with a pure-vision model, but the moment layout, charts, or infographics enter, the visual signal dominates [3]. Neither modality is universally best, which is the whole argument for sending both.
The screenshot tells the model what it's looking at; the text layer guarantees it can read every character even after the image is right-sized. Noru ships them together, image plus its text in reading order, so you never have to choose.
Sources
- [1] DocVLM: Make Your VLM an Efficient Reader, Shpigel Nacson et al. (AWS AI Labs), 2024
- [2] ScreenAI: A Vision-Language Model for UI and Infographics Understanding, Baechler et al. (Google DeepMind), 2024
- [3] Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, Lee et al., 2023