Perceptual context needs a boundary
The more perceptual the payload gets, the more scoped it has to be. Screen and audio capture are powerful precisely because they can expose the wrong thing.
Perceptual context is useful because it is close to what the user actually saw and heard. That is also the risk. A tool that can capture a screen frame, user narration, and eventually system audio should not behave like background telemetry.
The right product shape is explicit, scoped capture: one user action, one bounded handoff, visible enough that the user knows what is leaving the machine.
Audio expands the privacy surface
The W3C Screen Capture specification treats audio capture as a special privacy case. It notes that capturing audio alongside video can expose extra information about system applications, and that the audio source set is not necessarily the same as the shared video source set (Screen Capture, W3C). Without that separation, a window capture could be paired with unrelated system audio. The spec's direction is clear: audio should be optional and consented to separately.
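As a rough illustration of "optional and consented to separately," here is a minimal TypeScript sketch that builds the `video`/`audio` options a browser's `getDisplayMedia()` call accepts. The `CaptureConsent` shape and the `buildDisplayCaptureOptions` helper are hypothetical names invented for this example, and the option type is declared locally rather than taken from the DOM typings. Nothing here describes how Noru is actually built.

```typescript
// Local stand-in for the `video` / `audio` keys that getDisplayMedia() accepts.
interface DisplayCaptureOptions {
  video: boolean;
  audio: boolean;
}

// Hypothetical record of what the user agreed to in the app's own UI.
interface CaptureConsent {
  screen: boolean;
  systemAudio?: boolean; // absent means "never asked", treated as no
}

function buildDisplayCaptureOptions(consent: CaptureConsent): DisplayCaptureOptions {
  if (!consent.screen) {
    // No screen consent means no capture request at all.
    throw new Error("Screen capture requires an explicit user action");
  }
  return {
    video: true,
    // Audio is requested only after a separate, explicit yes.
    audio: consent.systemAudio === true,
  };
}

// In a browser, the result could then be passed to:
//   navigator.mediaDevices.getDisplayMedia(options)
```

Defaulting `audio` to `false` keeps a screen-only consent from silently widening into an audio capture. Even when audio is requested, the user agent still decides what the user is offered and whether any audio track is returned.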
That maps directly to Noru. Screen capture is already sensitive. System audio is more sensitive because it can include other apps, calls, media, or people near the machine. Until that surface is implemented, the site should keep calling it planned; when it is implemented, it should be opt-in and visibly scoped.
| Surface | Default posture | Why |
|---|---|---|
| Screen frame | Explicit shortcut capture | Screens contain secrets and unrelated work |
| Mic narration | User-initiated, short-lived | Speech can include bystanders or private intent |
| System audio | Future opt-in surface | Audio may not match the selected window |
| AI recap | Editable, not authoritative | Summaries can miss or misplace details |
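One way to make the table enforceable rather than aspirational is to encode it as data and check every capture request against it. The sketch below is a hypothetical TypeScript example: `SURFACE_RULES`, `CaptureRequest`, and `mayCapture` are illustrative names, and the `available` flags are assumptions based on the text above (system audio is still planned). The AI recap row is an output rather than a capture surface, so it is left out.

```typescript
type CaptureSurface = "screenFrame" | "micNarration" | "systemAudio";

interface SurfaceRule {
  available: boolean;     // false while the surface is only planned
  requiresOptIn: boolean; // needs a separate, standing opt-in
}

// Assumed values, mirroring the "Default posture" column of the table.
const SURFACE_RULES: Record<CaptureSurface, SurfaceRule> = {
  screenFrame: { available: true, requiresOptIn: false },
  micNarration: { available: true, requiresOptIn: false },
  systemAudio: { available: false, requiresOptIn: true },
};

interface CaptureRequest {
  surface: CaptureSurface;
  userInitiated: boolean; // e.g. triggered by a shortcut press
  optedIn: boolean;       // the user enabled this surface beforehand
}

function mayCapture(req: CaptureRequest, rules = SURFACE_RULES): boolean {
  const rule = rules[req.surface];
  if (!rule.available) return false;      // planned surfaces never capture
  if (!req.userInitiated) return false;   // no background capture
  if (rule.requiresOptIn && !req.optedIn) return false;
  return true;
}
```

With these assumed rules, a shortcut-triggered screen frame passes, a background request for any surface fails, and a system audio request fails regardless of opt-in until the surface is marked available.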
Audio models have their own safety class
OpenAI's GPT-4o system card names audio-specific risks including speaker identification, unauthorized voice generation, ungrounded inference, and disallowed audio content (GPT-4o System Card, OpenAI). The point is not that every app hits those risks equally. The point is that serious model providers treat audio as a distinct safety surface, not just another text field.
Noru should do the same in product language. "Hear" can be an honest direction of travel, but it should not imply hidden recording, background meeting capture, or perfect interpretation of every sound on the Mac.
Recaps are useful, but not neutral
The same caution applies after capture. Asthana et al. evaluated an LLM-powered meeting recap system with seven users at Microsoft and found summaries, highlights, and action items valuable in different contexts, while user editing behavior revealed varying alignment between AI recaps and what people actually needed (Summaries, Highlights, and Action items, Asthana et al.). The small study is not a universal benchmark, but it is a useful warning: audio-derived summaries should be treated as editable working context, not ground truth.
For a developer handoff, that means the raw ingredients matter. The screenshot, ordered text, and transcript should remain inspectable. If Noru adds system audio, the payload should make the source clear enough for the user and the receiving AI to reason about where each piece came from.
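A minimal sketch of what "make the source clear" could look like in a payload, assuming a simple list of labeled parts. Every name here (`HandoffPart`, `HandoffPayload`, `describeHandoff`, the `source` values) is hypothetical and chosen for illustration. It is not Noru's actual format.

```typescript
type PartSource =
  | "screenFrame"
  | "screenText"
  | "micTranscript"
  | "systemAudioTranscript"
  | "aiRecap";

interface HandoffPart {
  source: PartSource;      // where this piece came from
  content: string;         // text, or a reference to an image
  derivedByModel: boolean; // true for recaps, false for raw capture
}

interface HandoffPayload {
  parts: HandoffPart[];
}

// One human-readable line per part, so the user can see what is leaving
// the machine and the receiving AI can tell raw input from derived text.
function describeHandoff(payload: HandoffPayload): string[] {
  return payload.parts.map((part) => {
    const kind = part.derivedByModel ? "model-derived, editable" : "raw capture";
    return `${part.source} (${kind})`;
  });
}

const example: HandoffPayload = {
  parts: [
    { source: "screenFrame", content: "frame-001.png", derivedByModel: false },
    { source: "micTranscript", content: "The save button does nothing.", derivedByModel: false },
    { source: "aiRecap", content: "User reports a broken save button.", derivedByModel: true },
  ],
};

console.log(describeHandoff(example).join("\n"));
```

Keeping the recap as one labeled part beside the raw screenshot and transcript, rather than replacing them, is what lets it stay editable working context instead of ground truth.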
Perceptual context is not "capture everything." It means capturing the smallest visible and audible slice that changes the answer, then handing it over in a form the user can inspect.
Sources
- Screen Capture, W3C, 2025
- GPT-4o System Card, OpenAI, 2024
- Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system, Asthana et al., 2025