# Identity Graph: Stitching Users Across Devices and Browsers

How SegmentStream resolves fragmented user journeys using deterministic identity stitching — without third-party cookies or statistical inference.

---

One person uses three devices. Analytics sees three strangers. Your attribution model scores all three journeys wrong — and every downstream system inherits the error.

## 01 — The Problem

### Your Attribution Is Only as Good as Your Identity Graph

Attribution models don't fail because of bad math. They fail because they don't know who they're measuring.

### One person, many cookies

A single user's real journey — three touchpoints, three separate cookies, zero connection:

1. **Instagram in-app browser** — user taps an ad. The in-app browser runs in a sandboxed environment with its own cookie.
2. **Safari on iPhone** — later that evening, the user browses your site directly. Safari uses a separate cookie jar from the in-app browser.
3. **Chrome on desktop** — next day at work, the user converts. A third device, a third cookie.

Analytics treats each as a separate visitor. Three sessions, three anonymous IDs, zero connection between them. The ad gets no credit. The conversion appears organic. The user's journey is invisible.

> **[Interactive: Journey Stitching Toggle]**
> Before/after toggle showing the same three visits (Instagram in-app, Safari iPhone, Chrome Desktop) as disconnected raw sessions versus a stitched user journey with linking signals. Toggle switches between "Raw Sessions" view (three separate cards with separate cookie IDs) and "Stitched Journey" view (unified timeline connected by identity signals).

### The in-app browser problem

This is the most damaging source of identity fragmentation in paid social.

When someone taps an ad on Instagram, Facebook, or TikTok, it opens in the app's embedded browser — a completely separate cookie sandbox from Safari or Chrome. The user sees your landing page, browses a few pages, then switches to their default mobile browser and converts.

Analytics records:

- **Session 1** (in-app browser): source = `instagram / paid`, no conversion
- **Session 2** (Safari): source = `direct / none`, conversion credited

Facebook claims the conversion via its own pixel. Your analytics credits "direct." The actual first-touch channel — the ad that introduced the user — gets nothing.

This isn't an edge case. On mobile-heavy sites, a significant share of paid social traffic arrives through in-app browsers.

> **[Illustration: In-App Browser Attribution Diagram]**
> Horizontal user journey showing Instagram opening an in-app browser (cookie_X) which links to Safari (cookie_Y) via shared IP address. Below the journey, two outcome boxes compare "With Identity Graph" (correctly attributes to instagram/paid) versus "Without Identity Graph" (misattributes to direct/none).

### Targeting identity graphs solve the wrong problem

The probabilistic identity resolution industry — LiveRamp, UID2, Experian — was built for **ad targeting**. Targeting tolerates false positives. If you show an ad to someone who isn't actually the same person, you wasted a fraction of a cent. The cost of a wrong match is trivial.

Measurement cannot tolerate false positives. A single incorrect identity link can:

- Credit the wrong channel for a conversion
- Inflate one campaign's ROAS while deflating another's
- Corrupt budget optimization inputs
- Cascade errors through every downstream model

Different problems require different tools. Targeting identity graphs optimize for **reach** (link as many identifiers as possible). Measurement identity graphs must optimize for **precision** (only link identifiers when confidence is high).

### Statistical matching error compounds with journey length

Identity accuracy degrades exponentially across multi-touch journeys.

At 80% identity accuracy (a generous estimate for statistical matching methods):

| Touchpoints | Accuracy |
|-------------|----------|
| 2 | 80% |
| 3 | 64% |
| 4 | 51% |
| 5 | 41% |

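Each identity link between touchpoints is another chance of a wrong match, and the probabilities multiply: a journey with n touchpoints has n - 1 links, so at 80% per link its end-to-end accuracy is `0.8^(n-1)`. By the fourth touchpoint, statistical identity matching is barely better than a coin flip.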
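
The table follows from multiplying the per-link accuracy once for every link in the journey. A minimal sketch of that arithmetic, assuming links are independent and equally accurate (the function name `journey_accuracy` is illustrative, not part of any SegmentStream API):

```python
def journey_accuracy(touchpoints: int, link_accuracy: float = 0.80) -> float:
    """Probability that every identity link in a journey is correct.

    A journey with n touchpoints has n - 1 links between consecutive
    touchpoints; links are assumed to be independent.
    """
    if touchpoints < 1:
        raise ValueError("a journey needs at least one touchpoint")
    return link_accuracy ** (touchpoints - 1)


for n in range(2, 6):
    print(f"{n} touchpoints: {journey_accuracy(n):.0%}")
```

Changing `link_accuracy` shows how quickly even a seemingly strong per-link match rate erodes over longer journeys.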

This is why SegmentStream uses **only deterministic signals** for identity stitching. Every link between two anonymous IDs requires a shared, verified identifier — a login, an `email_hash`, a `click_id`, or same-network activity within a tight time window.

---

## 02 — The Framework

### Three Resolution Levels. One Deterministic Graph.

SegmentStream's identity graph resolves identities at three levels — individual, household, and organizational. Each level uses different signals with different confidence windows. The same signal can create links at different levels depending on context — an IP address links devices for one person, family members at home, or coworkers at the office. All matching is deterministic: shared keys only, no statistical inference.

### Individual identity — one person, multiple devices

Four standard signals stitch one person's sessions across devices and browsers:

- **User ID** — authenticated login links the current anonymous session to every previous session where the user logged in. Gold standard for identity resolution. 180-day window.
- **Email Hash** — email captured at signup, checkout, or newsletter subscription is hashed with SHA-256 and used as an identity key. Same hash on a different device links the sessions. 180-day window. *Status:* supported in the pipeline but not yet deployed in production. Requires adding `?ehash=` parameters to marketing email URLs.
- **Click ID** — ad platforms append identifiers (`gclid`, `fbclid`, `ttclid`) to URLs. The SDK captures these on landing and preserves them across internal navigations — including the in-app browser to native browser bridge. 30-day window.
- **IP Address** — two anonymous sessions sharing the same IP within the confidence window are linked. Weakest standard signal due to shared networks. 3-day window. Safeguard: IPs shared by more anonymous IDs than the 99th percentile are flagged as outliers and all links via that IP are discarded.
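
One way to picture the confidence windows is as a lookup from signal type to the maximum allowed gap between two sightings. The sketch below is illustrative only: the dictionary, the function name, and the choice to measure the window as the time between two sightings of the same key value are assumptions, not SegmentStream's implementation.

```python
from datetime import datetime, timedelta

# Windows as listed above; the dict and its key names are illustrative.
CONFIDENCE_WINDOWS = {
    "user_id": timedelta(days=180),
    "email_hash": timedelta(days=180),
    "click_id": timedelta(days=30),
    "ip_address": timedelta(days=3),
}


def within_window(key_type: str, seen_a: datetime, seen_b: datetime) -> bool:
    """True if two sightings of the same key value are close enough to link."""
    return abs(seen_a - seen_b) <= CONFIDENCE_WINDOWS[key_type]


first_seen = datetime(2024, 1, 1)
print(within_window("ip_address", first_seen, datetime(2024, 1, 6)))  # False
print(within_window("user_id", first_seen, datetime(2024, 1, 6)))     # True
```

The short IP window reflects how weak that signal is: the same address five days apart is not treated as evidence of the same person, while the same login is.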

### Email link stitching — the full flow

Email hash stitching follows a specific technical flow:

1. **Capture:** User subscribes on Device A. Email is captured and hashed with SHA-256.
2. **Tag:** All marketing email URLs include an `?ehash=abc123` parameter containing the hashed email.
3. **Stitch:** User clicks the email link on Device B. The `ehash` parameter matches the hash from Device A. Both anonymous sessions are now linked to the same universal user.

This works because the email hash is the same regardless of which device opens the link. No cookies required. No third-party data.

```html
<!-- Before: standard marketing email link -->
<a href="https://example.com/promo?utm_source=newsletter&utm_medium=email">
  Shop now
</a>

<!-- After: with email hash parameter for identity stitching -->
<a href="https://example.com/promo?utm_source=newsletter&utm_medium=email&ehash={{SHA256_EMAIL}}">
  Shop now
</a>
```
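
On the sending side, the `ehash` value has to be computed and appended to each link; on the landing side, it has to be read back from the URL. A rough Python sketch of both halves, using only the standard library. The function names are hypothetical, and the normalization (trim and lowercase) is an assumption about what the hash input should look like:

```python
import hashlib
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def email_hash(email: str) -> str:
    """SHA-256 hex digest of a trimmed, lowercased email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def tag_url(url: str, email: str) -> str:
    """Append an ehash parameter to a marketing email link."""
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    query["ehash"] = [email_hash(email)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


def extract_ehash(landing_url: str) -> Optional[str]:
    """Read the ehash parameter from the URL opened on Device B."""
    values = parse_qs(urlsplit(landing_url).query).get("ehash")
    return values[0] if values else None


link = tag_url(
    "https://example.com/promo?utm_source=newsletter&utm_medium=email",
    " User@Example.com ",
)
print(extract_ehash(link) == email_hash("user@example.com"))  # True
```

Because the hash depends only on the email address, the value read on Device B matches the one captured on Device A without any shared cookie.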

> **[Illustration: Email Stitching Flow]**
> Horizontal user journey showing Device A (where email is captured and hashed) connecting through an email link (with `?ehash=` parameter) to Device B (where the hash matches). Below, the identity graph links both cookies via the shared email hash. SHA-256 hash only, no raw email stored, 180-day window.

### Household identity — family, shared context

Household-level identity links different people within the same family or home — not the same person across devices, but related users who share context.

- **IP Address** — family members on the same home Wi-Fi share an IP. Within the 3-day window, their sessions are linked at the household level.
- **Click ID** — when someone shares a campaign link with a family member (via WhatsApp, iMessage, or email), the `click_id` propagates with the URL. Both visits carry the same `click_id`, enabling household-level attribution.

### Organizational identity — company, buying committee

B2B journeys involve multiple people at the same company:

- **Account ID** — multi-tenant B2B products where multiple users share a product account.
- **Email Domain Hash** — a junior employee signs up for a trial, a manager requests a demo, a VP visits the pricing page. If all use `@company.com` emails, the domain hash connects them into one organizational journey. Always pass `account_id` alongside `email_domain_hash` so the identity graph can enforce hard boundaries between organizations sharing the same email domain (e.g. agencies using client domains).
- **IP Address** — employees at the same office share a corporate IP. Within the 3-day window, their sessions are linked at the organizational level. The 99th-percentile outlier safeguard prevents large offices from creating false merges.

> **[Illustration: Signals Grid]**
> Visual grid showing identity signals organized by three resolution levels. Individual level: User ID (180d, highest strength), Email Hash (180d, highest), Click ID (30d, high), IP Address (3d, moderate). Household level: Home Network IP (3d, moderate), Shared Link Click (30d, low). Organizational level: Account ID (180d, highest), Email Domain (180d, highest), Corporate Network IP (3d, moderate). Each card shows signal name, key name, time window, strength indicator, and best-for label.

### Custom identity keys

Beyond the standard signals, SegmentStream supports arbitrary custom identity keys. Any event property can be designated as an identity key with a configurable confidence window. Examples from production:

- `cart_id` — cross-device cart recovery. User starts checkout on one device, completes on another.
- `quote_id` — insurance or B2B quotes sent via email, opened on a different device
- `checkout_id` — cross-device checkout linking
- `loyalty_card_id` — retail loyalty programs
- `phone_hash` — SMS marketing attribution
- `order_id` — post-purchase support interactions linked to acquisition

---

## 03 — The Algorithm

### Connected Components on Deterministic Edges

The identity graph runs as a daily batch pipeline. It takes raw events from your analytics platform (GA4, for example) and the SegmentStream SDK, extracts identity keys, finds anonymous IDs that share keys, and groups them into universal user profiles.

### Five-stage pipeline

**Stage 1 — Load raw events and extract identity keys**

Two data sources feed the pipeline:

- **Analytics events** — page views, transactions, and custom events from any analytics platform (GA4, Adobe Analytics, Amplitude, Heap, Segment, or others)
- **SegmentStream SDK events** — lightweight client-side pings that capture click IDs and IP addresses across sessions, including the in-app browser to native browser bridge

From each event, the pipeline extracts all available identity keys: `user_id`, `email_hash`, `click_id` (`gclid`, `fbclid`, `ttclid`, etc.), `ip_address`, plus any configured custom keys. Each key is paired with the event's `anonymous_id` — the cookie-level identifier.
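
The extraction step can be pictured as flattening each event into `(anonymous_id, key_type, key_value)` rows. The sketch below assumes events arrive as flat dictionaries with the field names shown; that event shape, the function name, and the choice to prefix click IDs with their parameter name are all hypothetical.

```python
from typing import Iterator

CLICK_ID_PARAMS = ("gclid", "fbclid", "ttclid")
DIRECT_KEYS = ("user_id", "email_hash", "ip_address")


def extract_identity_keys(event: dict) -> Iterator[tuple[str, str, str]]:
    """Yield (anonymous_id, key_type, key_value) for each key on an event."""
    anonymous_id = event["anonymous_id"]
    for key_type in DIRECT_KEYS:
        value = event.get(key_type)
        if value:
            yield anonymous_id, key_type, str(value)
    for param in CLICK_ID_PARAMS:
        value = event.get(param)
        if value:
            # All ad-platform click IDs are treated as one key type here;
            # the prefix keeps values from different platforms distinct.
            yield anonymous_id, "click_id", f"{param}:{value}"


event = {"anonymous_id": "cookie_X", "ip_address": "1.2.3.4", "gclid": "abc"}
for row in extract_identity_keys(event):
    print(row)
```

Custom keys could be handled the same way by extending `DIRECT_KEYS` with the configured property names.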

**Stage 2 — Aggregate daily user profiles**

All events are grouped by date and `anonymous_id`, collecting the distinct identity keys observed for each visitor on each day. This produces a daily snapshot: "`anonymous_id` X was seen with keys [user_id=123, ip=1.2.3.4, gclid=abc]."
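
As a rough illustration of the grouping, the sketch below collapses extracted rows into one set of distinct keys per day and visitor. The row layout and function name are assumptions; in the warehouse this would more naturally be a `GROUP BY` query.

```python
from collections import defaultdict


def aggregate_daily_profiles(rows):
    """Group (date, anonymous_id, key_type, key_value) rows into profiles.

    Returns {(date, anonymous_id): set of (key_type, key_value)}.
    """
    profiles = defaultdict(set)
    for date, anonymous_id, key_type, key_value in rows:
        # Sets drop repeats, so each key is recorded once per day.
        profiles[(date, anonymous_id)].add((key_type, key_value))
    return dict(profiles)


rows = [
    ("2024-01-01", "X", "user_id", "123"),
    ("2024-01-01", "X", "ip_address", "1.2.3.4"),
    ("2024-01-01", "X", "ip_address", "1.2.3.4"),  # duplicate sighting
]
print(aggregate_daily_profiles(rows))
```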

**Stage 3 — Calculate user links**

The pipeline compares anonymous ID pairs. If two different anonymous IDs share at least one identity key within that key's confidence window, a link is created between them.

**Skewed key filtering:** Before creating links, any identity key shared by more anonymous IDs than the 99th-percentile cutoff is discarded entirely. All links through that key are dropped. This prevents a single corporate IP address or shared Wi-Fi network from merging hundreds of unrelated users.
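
A simplified sketch of link calculation with the skew filter applied first. The cutoff is passed in as a parameter rather than derived, confidence-window checks are left out for brevity, and the function name and input shape are hypothetical:

```python
from collections import defaultdict
from itertools import combinations


def calculate_links(sightings, max_ids_per_key):
    """Link anonymous IDs that share an identity key.

    sightings: iterable of (anonymous_id, key_type, key_value).
    Keys seen with more than max_ids_per_key anonymous IDs are skipped.
    Returns a set of (id_a, id_b, key_type) with id_a < id_b.
    """
    ids_by_key = defaultdict(set)
    for anonymous_id, key_type, key_value in sightings:
        ids_by_key[(key_type, key_value)].add(anonymous_id)

    links = set()
    for (key_type, _), ids in ids_by_key.items():
        if len(ids) > max_ids_per_key:
            continue  # skewed key: drop every link through it
        for a, b in combinations(sorted(ids), 2):
            links.add((a, b, key_type))
    return links


sightings = [
    ("A", "ip_address", "1.1.1.1"),
    ("B", "ip_address", "1.1.1.1"),
    ("C", "ip_address", "1.1.1.1"),
    ("A", "click_id", "g1"),
    ("B", "click_id", "g1"),
]
print(calculate_links(sightings, max_ids_per_key=2))
```

With a cutoff of 2, the IP shared by three IDs is discarded and only the `click_id` link between A and B survives.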

**Stage 4 — Calculate connected components**

The pipeline takes all pairwise links and finds connected components — groups of anonymous IDs that are transitively linked.

If identities are transitively connected, they are merged into a single component:

- A is linked to B (via shared `ip_address`)
- B is linked to C (via shared `email_hash`)
- Therefore A, B, and C are all the same user — even though A and C share no direct identity key

**Component size cap:** Connected components larger than the 99th percentile of component sizes are flagged as outliers and discarded. Components that large indicate shared infrastructure or a data quality issue, not real users.

**Hard constraints:** If two anonymous IDs have different `account_id` values, they are never linked, regardless of other shared keys. This prevents cross-account contamination in multi-tenant B2B products.
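
Connected components over pairwise links are commonly computed with a union-find (disjoint-set) structure. The sketch below uses that technique as an illustration; the source does not say which algorithm the production pipeline uses. It applies the `account_id` constraint only to direct links and omits the size cap, and the function name and inputs are hypothetical.

```python
def connected_components(links, account_ids=None):
    """Group anonymous IDs into components using union-find.

    links: iterable of (id_a, id_b) pairs.
    account_ids: optional {anonymous_id: account_id}. A pair whose two
    account IDs are both known and different is never joined directly.
    """
    account_ids = account_ids or {}
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        acc_a, acc_b = account_ids.get(a), account_ids.get(b)
        if acc_a and acc_b and acc_a != acc_b:
            continue  # hard constraint: different accounts
        parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# A-B via ip_address, B-C via email_hash: all three end up together.
print(connected_components([("A", "B"), ("B", "C")]))
```

Note that this sketch would still merge two accounts if they were bridged by a third ID with no `account_id`; a full implementation would need to check the constraint at the component level as well.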

**Stage 5 — Output to users table**

Each anonymous ID is mapped to a `universal_id` — the identifier of the canonical user. The universal ID is the anonymous ID with the earliest `first_visit` timestamp in the component. This ensures the user's identity anchors to their oldest known session.
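
Picking the canonical ID is a minimum over `first_visit` within each component. A small sketch, assuming components are sets of anonymous IDs and `first_visit` is a lookup of comparable timestamps; the function name and the tie-break on ID are assumptions added so the result is stable:

```python
def assign_universal_ids(components, first_visit):
    """Map each anonymous ID to the earliest-seen ID in its component.

    components: iterable of sets of anonymous IDs.
    first_visit: {anonymous_id: comparable timestamp}.
    """
    mapping = {}
    for component in components:
        # Ties on first_visit are broken by ID (an assumption).
        universal_id = min(component, key=lambda i: (first_visit[i], i))
        for anonymous_id in component:
            mapping[anonymous_id] = universal_id
    return mapping


components = [{"cookie_X", "cookie_Y", "cookie_Z"}]
first_visit = {"cookie_X": 1, "cookie_Y": 2, "cookie_Z": 3}
print(assign_universal_ids(components, first_visit))
```

Every ID in the component maps to `cookie_X`, the oldest session, including `cookie_X` itself.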

The output is written to the `users` table in the customer's BigQuery warehouse. Every downstream system — attribution, scoring, reporting — joins against this table to resolve anonymous IDs to universal users.

> **[Illustration: Identity Pipeline Flow]**
> Animated 3-part horizontal flow. Left: website wireframe generating cookie events. Center: identity graph circle where cookies connect and merge via shared identity parameters (ip_address, email_hash, user_id, click_id — one per round). Right: resolved users table accumulating universal_id rows. Four rounds demonstrate each stitching parameter in sequence.

### Key properties

**Deterministic only.** Every link requires a shared, verified identity key. No statistical inference, no behavioral similarity, no device fingerprinting.

**Transitive resolution.** If A links to B and B links to C, all three get the same universal ID — even if A and C share no direct key. This is what makes connected components powerful: identity signals compound across touchpoints.

**Warehouse-native.** The pipeline reads from and writes to the customer's BigQuery project. No data leaves their warehouse. No external identity graph services. The customer owns their identity data.

**Batch processing.** The pipeline runs daily. Approximately one day of lag between an event occurring and it being reflected in the identity graph. This is a deliberate tradeoff — batch processing enables the full connected-components algorithm at scale.

### Privacy by design

**Non-consent users are invisible.** Users who decline consent receive a `non-consent-{uuid}` anonymous ID that changes on every page load. These IDs are never linked to anything — the identity graph literally cannot see them.

**No raw emails stored.** Only SHA-256 hashes of email addresses are used as identity keys. The raw email is never written to the identity graph pipeline.

**All data stays in customer's BigQuery.** The pipeline reads from and writes to the customer's BigQuery project — no raw data or PII is transferred outside their infrastructure.

---

## 04 — In Practice

### See It Work

#### Querying the identity graph via MCP

Two MCP tools expose the identity graph directly. `get_identity_graph_statistics` returns signal coverage, stitching rates, and device counts for a given project. `get_user_journey` resolves a specific user's cross-device timeline — showing every anonymous session, the signals that linked them, and the full attribution path.

> **[Interactive: Identity Graph MCP Terminal]**
> Claude Code-style terminal with two rotating scripts. Script 1 runs `get_identity_graph_statistics` for Purchase conversion: shows a signal coverage table (click_id 56.2%, ip_address 100%, email_hash 73.2%, user_id 88.4%), converted user count (1,228), cross-device rate (47.6%), and average 4.6 devices per stitched user. Script 2 runs `get_user_journey` for the most recent conversion: shows a 3-device journey (iPhone in-app via instagram/paid_social, iPhone Safari via IP match, Desktop Chrome via user_id match) ending in a $248 purchase, with attribution impact analysis comparing "without" (direct/none gets 100% credit) versus "with" (instagram/paid_social credited as first touch).

#### Production benchmarks

Median stitching rates across production projects, grouped by the number of deterministic identity parameters passed:

> **[Interactive: Stitching Stats]**
> Animated bar chart showing two benchmarks. "2 parameters" (click_id + IP): 35% median stitching rate. "3 parameters" (click_id + IP + user_id/email_hash): 67% median stitching rate. Bars animate on scroll with count-up numbers.

#### What the numbers tell you

**Adding a third signal lifts stitching from 35% to 67%.** Projects passing only `click_id` and `ip_address` achieve a median 35% stitching rate across converted users. Adding a third deterministic signal (typically `user_id` or `email_hash` from a backend integration) pushes that to 67%: a 32 percentage point improvement from a single additional parameter.

**"2+ devices" means 2+ anonymous sessions stitched.** This includes true cross-device journeys (phone to laptop), but also same-device fragmentation: in-app browsers, cleared cookies, incognito sessions. All of these create orphaned sessions that need stitching regardless of cause.

**Coverage quality matters as much as key count.** Projects with GDPR consent constraints see `ip_address` and `click_id` coverage drop to ~70%, reducing their effective stitching rate even with 3 parameters. The benchmark assumes well-integrated projects with standard cookie consent.

#### The retargeting test

Here's a practical way to validate your identity graph is working:

Check first-touch attribution on retargeting and email campaigns. In a well-functioning identity graph, these should show **close to zero first-touch conversions**.

Why? Retargeting and email reach people who already visited your site. If the identity graph correctly stitches the retargeted user's new session to their original visit, first-touch credit goes to the channel that originally introduced them — not to "retargeting" or "email."

If retargeting shows significant first-touch conversions, it means the identity graph failed to link the retargeted session to the user's original visit. They look like new users when they're not.

Ideal result: zero first-touch conversions for retargeting campaigns. That means identity stitching is doing its job, and these campaigns are being correctly recognized as re-engagement, not acquisition.
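
The check can be scripted once sessions are resolved to universal users. A minimal sketch, assuming you can export touchpoints as `(universal_id, timestamp, channel)` tuples along with the set of converting users; these input shapes and the function name are hypothetical:

```python
from collections import Counter


def first_touch_conversions(touchpoints, converted_users):
    """Count converting users by the channel of their earliest touchpoint.

    touchpoints: iterable of (user_id, timestamp, channel).
    converted_users: set of user IDs that converted.
    """
    first = {}
    for user, ts, channel in touchpoints:
        if user not in converted_users:
            continue
        if user not in first or ts < first[user][0]:
            first[user] = (ts, channel)
    return Counter(channel for _, channel in first.values())


touchpoints = [
    ("u1", 1, "instagram / paid"),
    ("u1", 2, "retargeting"),
    ("u2", 1, "organic"),
]
counts = first_touch_conversions(touchpoints, converted_users={"u1"})
print(counts["retargeting"])  # 0
```

Running the same function keyed by raw `anonymous_id` instead of `universal_id` gives the "without identity graph" baseline to compare against.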

> **[Interactive: Attribution Split View]**
> Three-column comparison table showing 6 channels (Paid Social, Paid Search, Organic, Retargeting, Email, Direct) with first-touch conversion counts "Without Identity Graph" versus "With Identity Graph." Key shifts: Retargeting drops from 18 to 5, Email drops from 12 to 2, Direct drops from 51 to 38. Paid Social rises from 26 to 41, Paid Search from 22 to 32, Organic from 18 to 29. Numbers animate with count-up on scroll.

---

## 05 — Common Mistakes

### What Goes Wrong and How to Avoid It

#### Using statistical matching for measurement decisions

Statistical matching — behavioral fingerprinting, device graph lookups, inferred identity links — is designed for ad targeting where false positives are cheap. When applied to measurement, false links corrupt attribution data silently. You won't see an error. You'll see wrong numbers that look plausible.

**Fix:** Use only deterministic identity signals for measurement. Accept a lower stitching rate in exchange for trustworthy data.

#### Not installing the SegmentStream SDK

Without the SDK, `click_id` preservation doesn't work. The SDK captures `gclid`, `fbclid`, `ttclid`, and other click parameters on landing and preserves them across internal page navigations. Without it, these values are lost after the landing page.

**Fix:** Install the SDK on all pages. It's a single JavaScript snippet.

#### Not sending User ID on every authenticated page load

A common implementation mistake: sending `user_id` to GA4 only on the login page. If a user is already logged in and visits other pages, those sessions don't carry `user_id` — and the identity graph can't link them.

**Fix:** Set `user_id` in the GA4 config on every page load for authenticated users, not just the login event.

#### Inconsistent email normalization before hashing

If one system hashes `User@Example.com` and another hashes `user@example.com`, the SHA-256 outputs will differ. The identity graph sees two different keys and can't link them.

**Fix:** Normalize before hashing — lowercase, trim whitespace, strip dots from Gmail addresses (if applicable). Apply the same normalization everywhere email hashes are generated.
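
A sketch of one shared normalization routine, using Python's standard `hashlib`. The function names are illustrative, and whether to strip Gmail dots is a policy choice exposed here as a flag:

```python
import hashlib


def normalize_email(email: str, strip_gmail_dots: bool = True) -> str:
    """Lowercase, trim, and optionally drop dots in Gmail local parts."""
    email = email.strip().lower()
    if "@" not in email:
        raise ValueError("not an email address")
    local, _, domain = email.rpartition("@")
    if strip_gmail_dots and domain == "gmail.com":
        local = local.replace(".", "")
    return f"{local}@{domain}"


def email_hash(email: str) -> str:
    """SHA-256 hex digest of the normalized address."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()


print(email_hash(" User@Example.com ") == email_hash("user@example.com"))  # True
print(email_hash("j.doe@gmail.com") == email_hash("jdoe@gmail.com"))       # True
```

Whatever rules you choose, the point is that every system generating hashes calls the same routine, so identical addresses always produce identical keys.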

#### Expecting real-time stitching

The identity graph runs as a daily batch pipeline. If a user visits on Device A in the morning and Device B in the afternoon, the stitching won't appear until the next day's pipeline run.

**Fix:** Design your workflows around daily data freshness. Identity stitching is not suited for real-time personalization — it's built for accurate measurement and reporting.

#### Ignoring consent compliance

Non-consent users receive a rotating anonymous ID (`non-consent-{uuid}`) that changes every page load. This is by design — the identity graph cannot track users who haven't consented. If consent rates are low, your stitching rate will be structurally limited.

**Fix:** Optimize your consent flow. Higher consent rates directly improve identity graph coverage. There is no technical workaround — and there shouldn't be.

#### Skipping the corporate IP safeguard

If stitching rates look unusually high for your signal mix (for example, 40%+ when only `click_id` and `ip_address` are passed), check for corporate IP merges. A shared office IP can link hundreds of unrelated users. The pipeline's default safeguard, which discards identity keys shared beyond the 99th percentile of anonymous IDs, catches most cases, but misconfigured custom keys can still cause issues.

**Fix:** Monitor the key combination breakdown in `get_identity_graph_statistics`. If a single IP or custom key is driving a disproportionate share of stitching, investigate.

#### Agency cross-account contamination in B2B

The `account_id` signal is primarily used by B2B and SaaS products for organizational identity — linking multiple employees within the same company into a single buying committee journey. The identity graph treats `account_id` as a hard constraint: two anonymous IDs with different `account_id` values are never linked, regardless of other shared signals.

**Example:** john@accenture.com visits your site under `account_id` "accenture-client-a" and michael@accenture.com visits under `account_id` "accenture-client-b". Their `email_domain_hash` is identical (both @accenture.com), but because their `account_id` values differ, the identity graph will not stitch them. The hard constraint overrides the shared domain hash — cross-account contamination is blocked.

This creates a specific risk for agencies and consultancies (Dentsu, GroupM, Accenture) whose employees access multiple client accounts. If the CRM assigns a shared or agency-level `account_id` rather than a client-specific one, the identity graph will merge sessions across unrelated client accounts. The hard constraint only works when the boundary is correct upstream.

**Fix:** Ensure each `account_id` maps to a single end-client account, not an agency umbrella. Always pass `account_id` alongside `email_domain_hash` — this lets the identity graph automatically enforce hard boundaries between organizations, even when employees share the same email domain. Monitor component sizes via `get_identity_graph_statistics` — unusually large components often indicate cross-client leakage from shared account identifiers.
