# Self-Reported Reattribution

How SegmentStream makes invisible channels visible — LLM classification of free-text survey responses, stitched to sessions via identity graph and synthetic touchpoints.

---

Attribution measures clicks. But entire categories of marketing — podcasts, TV, word of mouth, out-of-home — work through influence, not clicks. They are invisible to every attribution model. Self-reported reattribution makes them visible.

## 01 — The Visibility Gap

Click-based attribution — as described in the [cross-channel attribution](/measurement-engine/cross-channel-attribution) methodology — tracks what it can see: ad clicks, organic search clicks, referral links. Every touchpoint in the journey needs a click with a URL parameter. If a channel does not produce a trackable click, it does not exist in your attribution data.

Entire categories of marketing never produce that click. Podcasts, TV, word of mouth, out-of-home, influencer content, radio, events — these channels create awareness and drive action, but the action they drive is almost never a direct click. The user hears about you on a podcast, then searches your brand name. They see a billboard, then type your URL directly.

Other channels fall in between. YouTube, TikTok, and AI chat (ChatGPT, Perplexity) generate some trackable clicks, but clicks capture only a fraction of their influence. A user watches a YouTube review, then searches your brand name a week later. They read a ChatGPT recommendation, then navigate directly to your site. A referrer header may exist for the occasional click-through, but the full influence is invisible to click-based attribution.

Attribution records the brand search click, the direct visit, the organic landing. The channel that actually introduced the user is invisible. The credit flows to the last trackable touchpoint — which is almost always brand search, organic brand, or direct.

> **[Interactive: SRA Comparison Table]**
> Static table showing channel distribution before and after SRA correction. Nine base channels with columns for "Without SRA" and "With SRA" conversions plus percentage change. Direct/None drops from 847 to 339 conversions. Five new channels appear below a separator (Word of Mouth: 266, Podcast: 194, TV/Radio: 97, OOH: 73, Influencer: 73) — these had zero attribution before SRA.

### The absorption problem

Brand search, organic brand, and direct traffic function as sponges. They absorb credit from every awareness channel that does not produce a click. A TV campaign that drives thousands of brand searches shows up as "Google Ads — Brand Search" in attribution. A podcast mention that sends listeners to your site shows up as "Direct / None." Word of mouth — your most valuable channel — is completely invisible.

The distortion compounds in two directions. Awareness channels appear to produce zero return, making them impossible to justify in budget discussions. Meanwhile, brand search and direct show artificially high performance, absorbing credit they did not earn. Budget flows toward the channels that are easiest to measure, not the ones that are most effective.

### Why standalone SRA is not enough

The obvious fix — ask the customer "How did you hear about us?" — is the right idea but incomplete on its own. Standalone self-reported attribution has structural limitations:

- **Fragmented coverage** — even at 85-95% response rates, some users do not answer. The gap is not random — users in a hurry, mobile users, and returning customers skip the question more often.
- **Response bias** — not all channels have equal response rates. Users who discovered you through memorable channels (a friend, a specific podcast) recall and report more reliably than those who saw a display ad they have already forgotten.
- **Channel-level only** — a user can tell you "I heard about you on a podcast," but not which campaign or ad group drove the visit. SRA provides channel attribution, not campaign-level granularity.
- **The triangulation fallacy** — a common response is to average SRA with click-based attribution, hoping the truth lies in the middle. It does not. Averaging two different measurement methods with different biases does not cancel the biases — it produces a third number with no clear meaning.

Self-reported data is most valuable as a *correction layer* on top of click-based attribution — not as a replacement for it. The hybrid approach: attribution handles every channel that produces a trackable click. Self-reported data corrects the channels that attribution cannot see. On conflicts — attribution says Facebook, the user says search — the paid click wins. A verified click is stronger evidence than a recalled impression.

SRA also measures something no click-based system can: true brand awareness. "Already knew about you," "word of mouth," and "a friend recommended you" quantify the organic strength of your brand — PR impact, customer referrals, and unprompted recognition.

> Attribution sees clicks. SRA sees influence. Neither is complete alone. Combined as a hybrid — with attribution as the foundation and SRA as the correction layer — the invisible channels become visible without compromising the channels that are already well-tracked.

## 02 — Deploying the Survey

One question. Free-text input. Placed as early as possible in the conversion flow. That is the entire deployment.

### Placement: earlier is better

The survey should appear at the first moment the user provides identifying information — email capture, account registration, or checkout. Post-purchase is the most common placement and the worst.

Response rates vary dramatically by placement:

- **During registration or checkout** (optional field) — 85-95% response rate. The user is already filling out a form. One more optional field adds no friction.
- **Post-purchase** — 50-60% response rate. The user has already completed their goal. A follow-up survey is an interruption, and response drops accordingly.

Earlier placement also captures a broader audience. A post-purchase survey only captures buyers. A registration-time survey captures every user who creates an account — including those who never purchase. For businesses with long consideration cycles (B2B, high-ticket e-commerce), post-purchase misses the majority of the addressable audience.

> **[Interactive: Survey Placement Comparison]**
> Side-by-side comparison of two survey placements. Left panel "Registration / Checkout" shows a form with a "How did you hear about us?" field embedded naturally alongside name and email fields — labeled "User must complete this step." Right panel "Post-Purchase" shows a standalone survey card after order confirmation — labeled "User already got what they came for."

### Free-text over dropdowns

Dropdowns are simpler to analyze but worse for measurement. Five reasons:

1. **Channel discovery.** A dropdown with pre-defined options cannot surface channels you did not think to include. Free-text reveals unexpected sources — specific podcast names, AI assistants (ChatGPT, Perplexity, Copilot), niche communities, and channels that did not exist when the dropdown was created. AI chat is showing meaningful volumes in production data.
2. **Retroactive reclassification.** Free-text responses can be re-classified at any time. When a new channel category emerges, you re-run the classifier on historical responses and retroactively surface the data. Dropdown responses are locked to the options available when the user answered.
3. **No priming bias.** A dropdown primes the user with options. If "Facebook" is listed first, users who are unsure will select it because it looks familiar. Free-text forces recall, which produces more accurate responses.
4. **Richer signal.** "My colleague Sarah mentioned you at a conference" is far more informative than a dropdown selection of "Word of mouth." Free-text captures context that structured responses cannot.
5. **Position bias.** Users disproportionately select the first option in a dropdown. Randomizing order helps but does not eliminate the effect. Free-text has no position bias.

> **[Interactive: Free-Text vs Dropdown]**
> Side-by-side comparison. Left: free-text input field capturing rich, unstructured responses. Right: dropdown selector spanning two columns, showing how pre-defined options limit discovery and introduce position bias.

The cost of free-text is classification complexity — raw text needs to be normalized into channel groups before it can be used in attribution. This is the problem the next section solves.

## 03 — Classification and Mapping

Collecting free-text responses is easy. Making them usable for attribution is the hard part. A single survey question generates 1,000+ unique text strings: "my friend told me," "heard on Joe Rogan," "saw an ad on Instagram," "google," "already knew about you." Each one needs to be mapped to a channel group that attribution can use.

### LLM classification

An LLM classifier normalizes raw text into channel groups. The input is the user's free-text response. The output is a standardized channel label: "Word of Mouth," "Podcast," "TV," "AI Chat," "Out-of-Home."

The classifier uses human-in-the-loop learning. Each human correction — "this response was classified as Social Media but should be Influencer" — enhances the classification prompt. Over time, the classifier converges on the project's specific channel vocabulary. Corrections are auditable via MCP: you can query the classifier to see every correction, every reclassification, and the current prompt.
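
The correction loop can be pictured as prompt assembly: each accepted human correction becomes a few-shot example the classifier sees on every subsequent run. This is an illustrative sketch only — the function name, correction format, and base instruction are assumptions, not SegmentStream's actual implementation.

```python
# Sketch of human-in-the-loop prompt assembly (illustrative; the real
# classifier internals are not described in detail in the methodology).

BASE_INSTRUCTION = (
    "Classify the survey response into exactly one channel group, e.g. "
    "'Word of Mouth', 'Podcast', 'TV', 'AI Chat', 'Out-of-Home'."
)

def build_classifier_prompt(corrections, response):
    """Assemble the prompt: base instruction + correction examples + new response."""
    lines = [BASE_INSTRUCTION, ""]
    for text, wrong, right in corrections:
        # Each correction teaches the model the project-specific mapping.
        lines.append(
            f'Response: "{text}" -> {right} (previously misclassified as {wrong})'
        )
    lines += ["", f'Response: "{response}" ->']
    return "\n".join(lines)

# One accumulated correction: "Social Media" reclassified as "Influencer".
corrections = [("saw her unboxing video", "Social Media", "Influencer")]
prompt = build_classifier_prompt(corrections, "my friend told me")
```

Because the corrections live in the prompt rather than in model weights, re-running the classifier over historical responses immediately applies every correction retroactively — which is what makes free-text reclassification possible.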

> **[Interactive: LLM Classification Flow]**
> Three-column responsive layout. Left column: messy free-text responses in speech bubbles ("my friend told me", "saw on tiktok", "heard on podcast xyz", "google", "idk", "some ad somewhere"). Center: LLM classifier node with connecting lines. Right column: clean channel groups with triage categories color-coded — actionable (Friend/WOM, TikTok, Podcast), auto-ignore (idk), ambiguous (google, some ad somewhere).

### Triage: what to use, what to ignore

Not every survey response carries attribution signal. The classifier sorts responses into three categories:

- **Actionable** — maps to exactly one unambiguous channel. "My friend recommended you" maps to Word of Mouth. "Heard on a podcast" maps to Podcast. These responses override attribution where applicable.
- **Auto-ignore** — carries no useful signal. Null responses, garbage input ("n/a," "asdf"), "Other," "Organic," and opt-outs are excluded from the override logic entirely.
- **Ambiguous** — could mean multiple channels. "Social media" could be paid Meta, organic Instagram, or influencer content. These require data-driven resolution — checking which platforms carry ad spend and which dominate the account's activity — before they can be classified.
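
The three-way triage above can be sketched as a small routing function. The keyword rules and channel labels here are illustrative stand-ins for the LLM's judgment, not the production logic:

```python
# Minimal triage sketch: route each response to actionable / auto-ignore /
# ambiguous, as described above. The lookup tables are hypothetical.

AUTO_IGNORE = {"", "n/a", "asdf", "idk", "other", "organic"}
AMBIGUOUS = {"social media", "google", "search engine", "some ad somewhere"}
CHANNEL_MAP = {
    "my friend recommended you": "Word of Mouth",
    "heard on a podcast": "Podcast",
    "saw on tiktok": "TikTok",
}

def triage(response: str):
    """Return (category, channel) for one survey response."""
    text = response.strip().lower()
    if text in AUTO_IGNORE:
        return ("auto-ignore", None)      # no usable signal: exclude entirely
    if text in AMBIGUOUS:
        return ("ambiguous", None)        # needs data-driven resolution
    channel = CHANNEL_MAP.get(text)
    if channel:
        return ("actionable", channel)    # exactly one unambiguous channel
    return ("ambiguous", None)            # unknown phrasing: never guess
```

The key design choice is the default: an unrecognized response falls to "ambiguous" rather than being force-mapped, so only unambiguous signal ever reaches the override logic.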

### The search engine problem

"Google" and "search engine" deserve special treatment. When a user says "I found you through Google," they could mean brand search (paid), generic search (paid), or organic search. These are three fundamentally different channels with different strategic implications. No amount of spend data resolves this ambiguity — the question is *what* the user searched for, not *which* engine they used.

The correct handling: ignore "search engine" responses for attribution override. Keep the original click-based attribution, which at least knows whether the click was paid or organic, brand or generic. Self-reported data adds nothing when click-based tracking already has the answer.

### Conflict resolution

When attribution and survey data disagree, the rule is simple: paid clicks always win. If attribution recorded a paid Facebook click and the user says "search engine," the Facebook click stands. A verified click is stronger evidence than a recalled impression.

Override logic applies to two categories of traffic only:

- **Brand search and organic brand** — these channels absorb awareness credit. A user who heard about you on a podcast searches your brand name, clicks the paid ad or the organic result, and attribution credits the search. The SRA response reveals the true discovery channel.
- **Non-paid traffic** — direct, organic, and referral sessions where no paid click exists. There is no paid evidence to protect, so the self-reported channel is the best available signal.

Everything else — paid non-brand traffic where a verified click exists — keeps its original attribution. The override is a correction, not a replacement.
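The full decision — paid clicks win, search-engine responses are ignored, brand and non-paid traffic are overridden — reduces to a few lines. This is a sketch under assumed field names (`attributed_channel`, `is_paid`, `is_brand`), not the actual implementation:

```python
# Sketch of the conflict-resolution rules described above.

SEARCH_ENGINE_RESPONSES = {"Search Engine"}  # classified "google"-style answers

def resolve(attributed_channel, is_paid, is_brand, sra_channel):
    """Return the channel that should win for this conversion."""
    if sra_channel is None or sra_channel in SEARCH_ENGINE_RESPONSES:
        return attributed_channel       # no usable survey signal
    if is_paid and not is_brand:
        return attributed_channel       # verified paid click always wins
    # Brand search, organic brand, direct, organic, referral:
    return sra_channel                  # survey reveals the true discovery channel

resolve("Google Ads - Brand Search", True, True, "Podcast")
# brand search is overridden by the self-reported Podcast channel
resolve("Facebook Ads - Prospecting", True, False, "Word of Mouth")
# paid non-brand click is protected: original attribution stands
```
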

This is the classification and mapping methodology the SegmentStream Measurement Engine implements. Responses are collected, classified by an LLM with human-in-the-loop learning, triaged for signal quality, and mapped to channel groups with conflict resolution rules that protect verified paid clicks.

## 04 — The Synthetic Touchpoint

Classification determines *what* channel the user came from. The synthetic touchpoint determines *how* that information enters the attribution pipeline. The goal: make self-reported channels appear in attribution reports alongside click-tracked channels, with no special handling required downstream.

### How it works

When a user's survey response produces an actionable channel and the override rules approve it, the pipeline creates a synthetic session record. This record looks exactly like a regular session in the attribution table — same schema, same fields — but with a fabricated timestamp: one second before the user's first recorded visit.

First-click attribution picks up the synthetic touchpoint as the earliest session in the user's journey. The self-reported channel becomes the attributed source. No changes to the attribution logic, no special cases, no parallel reporting systems. The correction integrates directly into the existing pipeline.

### Identity graph bridging

The survey is typically completed on one device — the one where the user registered or checked out. The first visit may have happened on a different device entirely. The [identity graph](/measurement-engine/identity-graph) bridges the gap: the user's `universal_id` links the survey device to the first-visit device. Without identity resolution, the synthetic touchpoint would have no journey to attach to.
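Conceptually, the bridge is a lookup from device to `universal_id` and back to sessions. The graph structure below is a simplified assumption for illustration — the real identity graph is described in the linked methodology:

```python
# Sketch of cross-device bridging: map each device to a universal_id,
# then collect every session belonging to the same person.

identity_graph = {
    "phone-abc": "u-123",    # device where the survey was answered
    "laptop-xyz": "u-123",   # device of the first visit
}

sessions = [
    {"device_id": "laptop-xyz", "ts": 1},   # first visit, different device
    {"device_id": "phone-abc", "ts": 2},    # registration + survey
]

def journey_for_survey(survey_device, sessions, graph):
    """All sessions belonging to the survey respondent, across devices."""
    uid = graph[survey_device]
    return [s for s in sessions if graph.get(s["device_id"]) == uid]

journey = journey_for_survey("phone-abc", sessions, identity_graph)
```

Both sessions resolve to the same `universal_id`, so the synthetic touchpoint has a first visit to attach to even though it happened on another device.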

### Override rules

The synthetic touchpoint is only created when override conditions are met. Three types of traffic are treated differently:

- **Brand search and organic brand** — overridden. These channels absorb awareness-channel credit by design. The self-reported response reveals the true discovery source.
- **Direct / none** — overridden. No existing attribution signal to protect. The self-reported channel is the only available evidence.
- **Paid non-brand** — protected. A verified paid click is stronger evidence than a survey response. The original attribution stands.

> **[Illustration: Synthetic Touchpoint Before/After]**
> Two horizontal timelines showing the same user journey. "Before SRA" timeline: three touchpoints (Direct/None visit, Organic search, Conversion) with first-touch attribution crediting Direct/None. "After SRA" timeline: a synthetic "SRA Signal" touchpoint is prepended one second before the first visit, attributing to TikTok. First-touch attribution now credits TikTok instead of Direct/None.

### The pipeline

Five stages, each feeding the next:

1. **Survey collection** — free-text response captured at registration, checkout, or email capture
2. **LLM classifier** — raw text normalized to channel groups with human-in-the-loop learning
3. **Identity graph** — survey user linked to their full cross-device journey via `universal_id`
4. **Synthetic touchpoint** — session record created one second before first visit, with the self-reported channel as traffic source
5. **Attribution reports** — standard first-click reports now include self-reported channels alongside click-tracked channels

> **[Illustration: SRA Pipeline Flow]**
> Left-to-right process diagram with five stages connected by arrows: Survey (free-text input) → LLM Classifier (channel mapping) → Identity Graph (user matching) → Synthetic Touchpoint (session creation) → Attribution Reports (unified output).

> The synthetic touchpoint is what makes SRA a correction layer instead of a parallel reporting system. Self-reported channels appear in the same reports, the same tables, and the same budget optimization logic as every click-tracked channel. One pipeline. One source of truth.

## 05 — What Goes Wrong and How to Avoid It

1. **Asking post-purchase instead of during registration.** Post-purchase surveys reach 50-60% of users. Registration-time placement reaches 85-95%. The gap is not just volume — post-purchase misses every user who creates an account but does not buy, which in B2B and high-ticket e-commerce is most of the audience.
2. **Using dropdowns instead of free-text.** Dropdowns are easier to analyze but cannot discover new channels, cannot be reclassified retroactively, and introduce priming and position bias. The classification problem is solvable. The missing-data problem is not.
3. **Overriding all traffic.** Only brand search and non-paid traffic should be overridden. Paid non-brand clicks are verified evidence — overriding them with survey responses destroys accurate data to replace it with less accurate data.
4. **Trusting "search engine" or "Google" at face value.** These responses are fundamentally ambiguous. They could mean brand search, generic paid, or organic. Click-based attribution already distinguishes between these — the survey response adds no information. Ignore them.
5. **Using raw text without classification.** Raw survey responses contain hundreds of variations for the same channel. Without LLM classification, you end up with a fragmented long-tail that is impossible to use for attribution.
6. **Treating SRA as standalone measurement.** Self-reported data has coverage gaps, response bias, and no campaign-level granularity. It is a correction layer for channels that attribution cannot see — not a replacement for click-based attribution.
7. **Skipping identity resolution.** The survey is completed on one device. The first visit may have happened on another. Without the identity graph bridging survey responses to first-visit sessions, the synthetic touchpoint has no journey to correct.
8. **Expecting campaign-level granularity.** A user can tell you "I heard about you on a podcast." They cannot tell you which ad group or bid strategy drove their awareness. SRA provides channel-level correction. Campaign-level optimization still depends on click-based attribution.

## 06 — See It Work

Everything described above runs inside the SegmentStream MCP server. Four MCP methods expose the full SRA pipeline. The first returns the project's SRA configuration — extraction SQL, classifier settings, and override rules. The second shows raw survey answers with their LLM classification. The third aggregates by channel to show distribution after SRA correction. The fourth traces individual user journeys to show the before-and-after: original attribution versus corrected attribution.

> **[Interactive: SRA Terminal Demo]**
> Four-tab terminal demo in Claude Code style. Tab 1 "SRA Settings" runs `mcp__segmentstream__get_sra_settings` showing pipeline configuration and override rules. Tab 2 "SRA Answers" runs `mcp__segmentstream__get_sra_answers` showing raw survey responses with LLM classification results. Tab 3 "SRA Channels" runs `mcp__segmentstream__get_sra_channels` showing aggregated channel distribution with percentages. Tab 4 "SRA Overrides" runs `mcp__segmentstream__get_sra_overrides` showing before-and-after channel shifts for individual users.

### What the methods show

`get_sra_settings` returns the project's pipeline configuration — whether classification is enabled, which classifier is active, and how override rules are structured. `get_sra_answers` shows every raw survey response and its LLM classification, making it easy to spot misclassifications and verify the triage logic.

`get_sra_channels` aggregates by classified channel — which channels appeared and what share of conversions they account for. Channels that were previously invisible — Word of Mouth, Podcast, TV, Out-of-Home — appear alongside click-tracked channels with real conversion counts. Underreported channels like YouTube, TikTok, and AI Chat show their true scale. The "Direct / None" bucket shrinks as absorbed credit is redistributed to the channels that actually drove it.

`get_sra_overrides` exposes the correction at the individual level: the original attributed channel, the self-reported channel, and the override decision. You can see exactly which users were reattributed, from which channel to which, and whether the override was triggered by the brand-search rule or the non-paid rule.

### Validating the output

Every override is auditable. You can trace a reattribution to a specific `universal_id`, see the original survey response, the LLM classification, the override rule that fired, and the synthetic touchpoint that was created. The data lives in your BigQuery warehouse. The override logic is deterministic and configurable per project.

### What changes in practice

Teams that deploy SRA correction typically see three shifts in their attribution data:

- **Invisible channels appear** — Podcast, TV, Word of Mouth, Out-of-Home, and other channels with zero click tracking show up in reports for the first time with real conversion data. These were always driving value — attribution simply could not see them.
- **Underreported channels show their true scale** — YouTube, TikTok, AI Chat, and prospecting social have some click tracking, but clicks capture only a fraction of their influence. SRA reveals the full contribution, typically 2-5x higher than click-only attribution.
- **Brand search and direct shrink** — the credit sponges lose their artificially concentrated attribution as credit flows back to the channels that actually earned it.

SRA also measures true brand awareness. "Already knew about you" and "word of mouth" responses quantify the channels that no click-based system can see: PR impact, organic brand strength, and customer referrals.
