
Perceptual context needs a boundary

The more perceptual the payload gets, the more scoped it has to be. Screen and audio capture are powerful precisely because they can expose the wrong thing.

Perceptual context is useful because it is close to what the user actually saw and heard. That is also the risk. A tool that can capture a screen frame, user narration, and eventually system audio should not behave like background telemetry.

The right product shape is explicit, scoped capture: one user action, one bounded handoff, visible enough that the user knows what is leaving the machine.

Audio expands the privacy surface

The W3C Screen Capture specification treats audio capture as a special privacy case. It notes that capturing audio alongside video can expose extra information about system applications, and that the audio source set is not necessarily the same as the shared video source set.¹ A window capture could otherwise be paired with unrelated system audio. The spec's direction is clear: audio should be optional and consented separately.

That maps directly to Noru. Screen capture is already sensitive. System audio is more sensitive because it can include other apps, calls, media, or people near the machine. Until that surface is implemented, the site should keep calling it planned; when it is implemented, it should be opt-in and visibly scoped.
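As a minimal sketch of "optional and consented separately", the function below builds capture options in which audio is never implied by screen consent. The `CaptureConsent` type and `buildDisplayCaptureOptions` name are hypothetical and do not come from Noru. The returned shape mirrors the `video` and `audio` members that the browser's `getDisplayMedia()` accepts, which is an assumption about the capture backend; a native Mac app would likely go through a different API.

```typescript
// Hypothetical consent state; field names are illustrative, not Noru's.
interface CaptureConsent {
  screen: boolean;      // the user explicitly triggered a screen capture
  systemAudio: boolean; // the user separately opted in to system audio
}

// Mirrors the `video` / `audio` members accepted by getDisplayMedia().
interface DisplayCaptureOptions {
  video: boolean;
  audio: boolean;
}

function buildDisplayCaptureOptions(consent: CaptureConsent): DisplayCaptureOptions {
  if (!consent.screen) {
    throw new Error("Screen capture was not requested by the user");
  }
  // Audio is never inferred from screen consent; it needs its own opt-in.
  return { video: true, audio: consent.systemAudio };
}

// In a browser context the result could be passed along as:
//   navigator.mediaDevices.getDisplayMedia(buildDisplayCaptureOptions(consent))
```

Keeping the two consent flags separate means a future system-audio surface cannot be switched on as a side effect of sharing a window.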

Surface	Default posture	Why
Screen frame	Explicit shortcut capture	Screens contain secrets and unrelated work
Mic narration	User-initiated, short-lived	Speech can include bystanders or private intent
System audio	Future opt-in surface	Audio may not match the selected window
AI recap	Editable, not authoritative	Summaries can miss or misplace details
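
The capture rows of this table could also live in code as a small policy map, so that a planned surface cannot be captured by accident. This is only a sketch: the `SURFACE_POLICY` and `mayCapture` names are hypothetical, and the `status` values are an assumption drawn from the text above, which treats system audio as planned. The AI recap row is left out because it describes an output rather than a capture surface.

```typescript
type CaptureSurface = "screenFrame" | "micNarration" | "systemAudio";

interface SurfacePolicy {
  status: "available" | "planned"; // assumption: only system audio is still planned
  posture: string;                 // default posture, copied from the table
}

const SURFACE_POLICY: Record<CaptureSurface, SurfacePolicy> = {
  screenFrame: { status: "available", posture: "Explicit shortcut capture" },
  micNarration: { status: "available", posture: "User-initiated, short-lived" },
  systemAudio: { status: "planned", posture: "Future opt-in surface" },
};

// A surface may be captured only when the user started the capture
// and the surface is actually available, not merely planned.
function mayCapture(surface: CaptureSurface, userInitiated: boolean): boolean {
  return userInitiated && SURFACE_POLICY[surface].status === "available";
}
```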

Audio models have their own safety class

OpenAI's GPT-4o system card names audio-specific risks including speaker identification, unauthorized voice generation, ungrounded inference, and disallowed audio content.² The point is not that every app hits those risks equally. The point is that serious model providers treat audio as a distinct safety surface, not just another text field.

Noru should do the same in product language. "Hear" can be an honest direction of travel, but it should not imply hidden recording, background meeting capture, or perfect interpretation of every sound on the Mac.

Recaps are useful, but not neutral

The same caution applies after capture. Asthana et al. evaluated an LLM-powered meeting recap system with seven users at Microsoft and found summaries, highlights, and action items valuable in different contexts, while user editing behavior revealed varying alignment between AI recaps and what people actually needed.³ The small study is not a universal benchmark, but it is a useful warning: audio-derived summaries should be treated as editable working context, not ground truth.

For a developer handoff, that means the raw ingredients matter. The screenshot, ordered text, and transcript should remain inspectable. If Noru adds system audio, the payload should make the source clear enough for the user and the receiving AI to reason about where each piece came from.
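
One way to keep the source of each piece clear is to tag every item in the handoff with where it came from and whether it is AI-derived. The sketch below is illustrative only: `HandoffPayload`, `HandoffItem`, `describeSources`, and the sample values are hypothetical and are not Noru's actual payload format.

```typescript
// Hypothetical payload shape; none of these names come from Noru itself.
type SourceKind = "screen" | "microphone" | "systemAudio";

interface HandoffItem {
  kind: "screenshot" | "orderedText" | "transcript" | "recap";
  source: SourceKind;
  derived: boolean; // true for AI-generated content such as a recap
  content: string;  // text, or a reference to the captured file
}

interface HandoffPayload {
  capturedAt: string; // ISO 8601 timestamp of the single user action
  items: HandoffItem[];
}

// One human-readable line per item, so the user can inspect
// the payload before it leaves the machine.
function describeSources(payload: HandoffPayload): string[] {
  return payload.items.map((item) => {
    const label = item.derived ? "AI-derived, editable" : "raw ingredient";
    return `${item.kind} from ${item.source} (${label})`;
  });
}

// Made-up sample data for illustration.
const example: HandoffPayload = {
  capturedAt: "2025-01-01T12:00:00Z",
  items: [
    { kind: "screenshot", source: "screen", derived: false, content: "frame-001.png" },
    { kind: "transcript", source: "microphone", derived: false, content: "The button overlaps the sidebar." },
    { kind: "recap", source: "microphone", derived: true, content: "User reports a layout overlap." },
  ],
};

console.log(describeSources(example).join("\n"));
```

The same labels can be shown to the user and passed to the receiving AI, so a recap is never mistaken for the raw screenshot or transcript it was built from.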

Perceptual context is not "capture everything." It means capturing the smallest visible and audible slice that changes the answer, then handing it over in a form the user can inspect.

Sources

1. Screen Capture, W3C, 2025
2. GPT-4o System Card, OpenAI, 2024
3. Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system, Asthana et al., 2025