
A transcript is not the same as hearing

A transcript is the right first audio layer, but native audio models exist because timing, speakers, and non-speech sounds can carry context that text drops.

The safest first audio surface is text: transcribe what the user said, attach it to the capture, and let the model read it. That is already useful. It is also incomplete.

The distinction matters for Noru's roadmap. A transcript is a strong formatting layer for speech. Raw audio and system audio are separate surfaces, with more signal and more risk.

Transcription got good enough to matter

The Whisper paper is the clean baseline. Radford et al. trained speech recognition on 680,000 hours of multilingual, multitask supervision and found the models generalized well across benchmarks, often competitively with supervised systems in a zero-shot setting¹. That scale is why transcript-first workflows became practical: speech can become ordinary model-readable text.

For Noru, this is the near-term layer. A mic transcript can travel next to the screenshot the same way ordered OCR does: cheap to inspect, easy to quote, easy to redact, and broadly compatible with text-first AI tools.

But audio can carry more than words

Modern audio-language models are not just ASR wrapped around a chatbot. Qwen2-Audio is described as accepting audio signals directly, with separate modes for voice chat and audio analysis; the paper gives examples where an audio segment contains sounds, multiple speakers, and a command, and the model responds to the content of that audio².

Gemini 1.5 points the same direction at long-context scale: the technical report describes multimodal models that reason over millions of tokens of context, including long documents and hours of video and audio³. GPT-4o's system card likewise frames audio, image, video, and text as inputs to the same model family⁴.

That is the reason not to flatten the roadmap into "transcripts forever." A transcript captures words. Direct audio can, when supported by the destination model, preserve timing, speaker turns, and non-speech events that never become text cleanly.

Audio layer	Good for	Honest boundary
Mic transcript	User intent and narration	Drops timing and non-speech sound
Raw mic audio	Turn-taking, interruptions, ambiguous speech	Requires model support and clearer consent
System audio	Alerts, media, app sounds, calls	Planned surface; do not claim until built
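
To make the first row of the table concrete, here is a minimal Python sketch of what flattening loses. The `AudioEvent` type, its fields, and the sample events are hypothetical and exist only for illustration; they are not a Noru schema or the output format of any particular model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AudioEvent:
    """One timed event in a recording (hypothetical structure)."""
    start_s: float
    end_s: float
    kind: str                      # "speech" or "sound"
    speaker: Optional[str] = None  # only meaningful for speech
    text: str = ""                 # spoken words, or a label for a sound


def flatten_to_transcript(events: List[AudioEvent]) -> str:
    """Keep only the spoken words in time order, as a plain transcript would."""
    speech = sorted((e for e in events if e.kind == "speech"),
                    key=lambda e: e.start_s)
    return " ".join(e.text for e in speech)


def interruptions(events: List[AudioEvent]) -> List[Tuple[str, str]]:
    """Pairs (first speaker, interrupting speaker) where speech overlaps in time."""
    speech = sorted((e for e in events if e.kind == "speech"),
                    key=lambda e: e.start_s)
    return [(a.speaker, b.speaker)
            for a, b in zip(speech, speech[1:])
            if b.start_s < a.end_s and a.speaker != b.speaker]


# Invented sample: two overlapping speakers and one non-speech sound.
events = [
    AudioEvent(0.0, 2.4, "speech", speaker="user", text="Why did the build fail"),
    AudioEvent(1.9, 3.0, "speech", speaker="colleague", text="Check the logs"),
    AudioEvent(3.2, 3.6, "sound", text="notification chime"),
]

print(flatten_to_transcript(events))  # Why did the build fail Check the logs
print(interruptions(events))          # [('user', 'colleague')]
```

The flattened string keeps the words but not who spoke, the half-second overlap, or the chime. Those are the signals the table places in the raw and system audio rows.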

The payload should degrade gracefully

The robust design is not "always send every modality." It is a layered payload: image, ordered text, optional mic transcript, and later optional audio only when it changes the answer. If a destination cannot accept audio, the transcript remains useful. If it can, raw audio can become an additional source rather than a replacement for text.
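
One way to picture that layering is the sketch below. It is an illustration under stated assumptions, not Noru's actual payload format: `CapturePayload`, `Destination`, `build_parts`, and the `audio_changes_answer` flag are all hypothetical names, and how a real product would decide that audio "changes the answer" is left open.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Part = Tuple[str, Union[str, bytes]]


@dataclass(frozen=True)
class Destination:
    """A model or tool that receives the capture (hypothetical)."""
    name: str
    accepts_audio: bool


@dataclass
class CapturePayload:
    """Layers of one capture, from always-present to optional (hypothetical)."""
    image_png: bytes
    ordered_text: str
    mic_transcript: Optional[str] = None
    mic_audio_wav: Optional[bytes] = None


def build_parts(payload: CapturePayload,
                dest: Destination,
                audio_changes_answer: bool = False) -> List[Part]:
    """Assemble the parts to send, dropping layers the destination cannot use."""
    parts: List[Part] = [
        ("image", payload.image_png),
        ("ordered_text", payload.ordered_text),
    ]
    # The transcript travels whenever it exists: it is plain text.
    if payload.mic_transcript:
        parts.append(("mic_transcript", payload.mic_transcript))
    # Raw audio is additive, never a replacement for the transcript.
    if (payload.mic_audio_wav is not None
            and dest.accepts_audio
            and audio_changes_answer):
        parts.append(("mic_audio", payload.mic_audio_wav))
    return parts


payload = CapturePayload(
    image_png=b"<png bytes>",
    ordered_text="Build failed: exit code 1",
    mic_transcript="Why did the build fail",
    mic_audio_wav=b"<wav bytes>",
)

text_only = Destination("text-first tool", accepts_audio=False)
audio_capable = Destination("audio-capable model", accepts_audio=True)

print([name for name, _ in build_parts(payload, text_only, True)])
print([name for name, _ in build_parts(payload, audio_capable, True)])
```

In this sketch the text-only destination still receives the image, ordered text, and transcript, while the audio-capable one receives the same three parts plus the raw audio.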

This mirrors the screen side of Noru. The image and OCR are complementary; neither is the whole truth. Audio is the same. The transcript is the text layer. The sound itself is the perceptual layer.

Treat transcription as the first audio layer, not the final one. It makes speech portable today, while leaving room for direct audio when the model and product surface are ready.

Sources

¹ Robust Speech Recognition via Large-Scale Weak Supervision, Radford et al., 2022
² Qwen2-Audio Technical Report, Chu et al., 2024
³ Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, Gemini Team, Google, 2024
⁴ GPT-4o System Card, OpenAI, 2024