Why the text layer earns its place

Image and text are complementary: add the on-screen text to a capped-resolution image and DocVQA accuracy on Qwen2-VL jumps from 84.4% to 91.2%.

If the image is so good (see the previous piece), why bother sending the on-screen text at all? Because the two modalities are complementary: each recovers accuracy the other misses. This is the core of Noru's formatting layer: image first, with the text layer alongside.

Adding the text rescues a downsized image

Real screenshots get sent at capped resolution to control cost (more on that in the resolution piece). At low token budgets, the model can't read small text from pixels alone, and that is where the text layer pays off. Shpigel Nacson et al. show that integrating an OCR/text modality into a vision model, at a tight 448×448 budget, lifts DocVQA accuracy from 56.0% to 86.6% on InternVL2 and from 84.4% to 91.2% on Qwen2-VL¹. The image sets the scene; the text fills in the characters the downscaling blurred.
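To put those two results on the same footing, it helps to look at both the gain in percentage points and the share of remaining errors the text layer removes. The snippet below is a small worked calculation on the figures quoted above; the `summarize` helper and the "errors removed" framing are ours, not the paper's.

```python
# DocVQA accuracy (%) at a 448x448 input, without and with the text modality,
# as quoted above from the DocVLM paper.
results = {
    "InternVL2": (56.0, 86.6),
    "Qwen2-VL": (84.4, 91.2),
}


def summarize(before: float, after: float) -> tuple[float, float]:
    """Return (gain in percentage points, share of errors removed in %)."""
    gain = after - before
    # Errors before the text layer are (100 - before); the gain is the part
    # of those errors that the text layer fixed.
    errors_removed = gain / (100.0 - before) * 100.0
    return round(gain, 1), round(errors_removed, 1)


for model, (before, after) in results.items():
    gain, removed = summarize(before, after)
    print(f"{model}: +{gain} points, {removed}% of errors removed")
# InternVL2: +30.6 points, 69.5% of errors removed
# Qwen2-VL: +6.8 points, 43.6% of errors removed
```

Even on the stronger baseline, where the headline gain looks modest, the text layer removes a little over two fifths of the remaining errors.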

The models named here are from 2024, but the complementarity is structural, not a quirk of one generation. Characters lost to downscaling have to come from somewhere, and a sharper model doesn't change that: the text layer is the cheapest place to put them back.

It's structure, not just characters

Dumping raw OCR words isn't the same as sending well-formatted text. The DocVLM work finds that layout-aware text (words plus their positions and reading order) beats a flat string of OCR tokens at the same token budget¹. That's why Noru doesn't just attach a blob of recognized text: it preserves order and structure, in a form the model can actually use.
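As a rough illustration of what "reading order" means in practice, here is a minimal sketch that turns OCR word boxes into ordered lines. It is not Noru's implementation: the `Word` structure, the `to_reading_order` function, and the `line_tolerance` heuristic are all hypothetical, and the approach assumes a single-column layout (multi-column pages need a real layout analysis step).

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box in pixels (hypothetical structure)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def to_reading_order(words: list[Word], line_tolerance: float = 0.5) -> str:
    """Group words into lines by vertical position, then read left to right.

    A word joins the current line when its vertical centre lies within
    `line_tolerance` word-heights of that line's running centre.
    """
    lines: list[list[Word]] = []
    centres: list[float] = []
    for word in sorted(words, key=lambda w: w.y + w.height / 2):
        centre = word.y + word.height / 2
        if lines and abs(centre - centres[-1]) <= line_tolerance * word.height:
            lines[-1].append(word)
            # Keep the running centre as the mean of the line's word centres.
            centres[-1] += (centre - centres[-1]) / len(lines[-1])
        else:
            lines.append([word])
            centres.append(centre)
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x))
        for line in lines
    )


# OCR engines do not always emit words in visual order.
words = [
    Word("Total", 10, 50, 40, 12),
    Word("Invoice", 10, 10, 60, 12),
    Word("$42.00", 200, 51, 50, 12),
    Word("#1001", 80, 11, 45, 12),
]
print(to_reading_order(words))
# Invoice #1001
# Total $42.00
```

Joined in detection order, the same words would read "Total Invoice $42.00 #1001"; grouping by line keeps the amount next to its label.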

Google's ScreenAI reaches the same conclusion for screens and infographics specifically: adding the OCR text as an extra input improves answer accuracy by up to ~4.5% across screen- and document-QA tasks².

Where each modality wins

Content	Text layer alone	The image
Dense, text-heavy document	Often enough to match vision	Adds little
Chart / infographic / diagram	Misses the visual relationships	Decisive
UI with small labels at low res	Recovers the unreadable text	Sets the layout

Lee et al.'s Pix2Struct work captures the boundary: on text-heavy documents a layout-aware text model is competitive with a pure-vision model, but the moment layout, charts, or infographics enter, the visual signal dominates³. Neither modality is universally best, which is the whole argument for sending both.

The screenshot tells the model what it's looking at; the text layer guarantees it can read every character even after the image is right-sized. Noru ships them together, image plus its text in reading order, so you never have to choose.
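A sketch of what "shipping them together" could look like, under stated assumptions: the `ScreenPayload` type, the `fit_within` and `build_payload` helpers, and the 448-pixel default cap (borrowed from the benchmark setting above) are illustrative, not Noru's actual API or configuration.

```python
from dataclasses import dataclass


def fit_within(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_side.

    Aspect ratio is preserved; images already within the cap are left alone.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


@dataclass
class ScreenPayload:
    """Hypothetical bundle: a right-sized image plus its text layer."""
    image_size: tuple[int, int]   # target size after downscaling
    text: str                     # on-screen text, already in reading order


def build_payload(width: int, height: int, ordered_text: str,
                  max_side: int = 448) -> ScreenPayload:
    """Pair the capped image dimensions with the full-fidelity text."""
    return ScreenPayload(fit_within(width, height, max_side), ordered_text)


payload = build_payload(1920, 1080, "Invoice #1001\nTotal $42.00")
print(payload.image_size)  # (448, 252)
```

The image shrinks to fit the budget, while the text layer travels at full fidelity, so nothing readable on the original screen is lost to the resize.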

Sources

1. DocVLM: Make Your VLM an Efficient Reader, Shpigel Nacson et al. (AWS AI Labs), 2024
2. ScreenAI: A Vision-Language Model for UI and Infographics Understanding, Baechler et al. (Google DeepMind), 2024
3. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding, Lee et al., 2023