Identity Graph

One person uses three devices. Analytics sees three strangers. Your attribution model scores all three journeys wrong — and every downstream system inherits the error.

12 min read | Updated March 2026

01. Your Attribution Is Only as Good as Your Identity Graph

Attribution models don't fail because of bad math. They fail because they don't know who they're measuring.

One person, many cookies

A single user's real journey — three touchpoints, three separate cookies, zero connection:

Instagram in-app browser — user taps an ad. The in-app browser runs in a sandboxed environment with its own cookie.
Safari on iPhone — later that evening, the user browses your site directly. Safari uses a separate cookie jar from the in-app browser.
Chrome on desktop — next day at work, the user converts. A third device, a third cookie.

Analytics treats each as a separate visitor. Three sessions, three anonymous IDs, zero connection between them. The ad gets no credit. The conversion appears organic. The user's journey is invisible.

Figure (interactive, before/after): the same visits shown as raw sessions versus a stitched user journey. Raw sessions view: 3 anonymous visitors, no clear conversion path.

Browser	Date	Visit	Cookie
Instagram in-app	Mar 1	Paid Social ad click	cookie_1
Safari iPhone	Mar 3	Direct browse	cookie_2
Chrome desktop	Mar 5	Conversion	cookie_3

The in-app browser problem

This is the most damaging source of identity fragmentation in paid social.

When someone taps an ad on Instagram, Facebook, or TikTok, it opens in the app's embedded browser — a completely separate cookie sandbox from Safari or Chrome. The user sees your landing page, browses a few pages, then switches to their default mobile browser and converts.

Analytics records:

Session 1 (in-app browser): source = instagram / paid, no conversion
Session 2 (Safari): source = direct / none, conversion credited

Facebook claims the conversion via its own pixel. Your analytics credits "direct." The actual first-touch channel — the ad that introduced the user — gets nothing.

This isn't an edge case. On mobile-heavy sites, a significant share of paid social traffic arrives through in-app browsers.

In-app browser creates a separate cookie sandbox — the ad click and the conversion appear as two unrelated visitors.

Targeting identity graphs solve the wrong problem

The probabilistic identity resolution industry — LiveRamp, UID2, Experian — was built for ad targeting. Targeting tolerates false positives. If you show an ad to someone who isn't actually the same person, you wasted a fraction of a cent. The cost of a wrong match is trivial.

Measurement cannot tolerate false positives. A single incorrect identity link can:

Credit the wrong channel for a conversion
Inflate one campaign's ROAS while deflating another's
Corrupt budget optimization inputs
Cascade errors through every downstream model

Different problems require different tools. Targeting identity graphs optimize for reach (link as many identifiers as possible). Measurement identity graphs must optimize for precision (only link identifiers when confidence is high).

Statistical matching error compounds with journey length

Identity accuracy degrades exponentially across multi-touch journeys.

At 80% accuracy per identity link (a generous estimate for statistical matching methods):

Touchpoints	Journey accuracy
2	80%
3	64%
4	51%
5	41%

Each identity link between touchpoints multiplies in another chance of error: a journey with n touchpoints needs n - 1 links, so the chance that every link is correct is 0.8^(n - 1). By the fourth touchpoint, statistical identity matching is barely better than a coin flip.
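The table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes every link is independent and equally accurate, which is a simplification; the function name `journey_accuracy` is illustrative.

```python
def journey_accuracy(link_accuracy: float, touchpoints: int) -> float:
    """Probability that every identity link in a journey is correct.

    A journey with n touchpoints needs n - 1 links. Links are assumed to be
    independent and equally accurate (an illustrative simplification).
    """
    if touchpoints < 1:
        raise ValueError("touchpoints must be at least 1")
    return link_accuracy ** (touchpoints - 1)


for n in range(2, 6):
    print(f"{n} touchpoints: {journey_accuracy(0.80, n):.0%}")
```

Changing `link_accuracy` shows how quickly even a small per-link error rate erodes long journeys.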

This is why SegmentStream uses only deterministic signals for identity stitching. Every link between two anonymous IDs requires a shared, verified identifier — a login, an email_hash, a click_id, or same-network activity within a tight time window.

02. Three Resolution Levels. One Deterministic Graph.

SegmentStream's identity graph resolves identities at three levels — individual, household, and organizational. Each level uses different signals with different confidence windows. The same signal can create links at different levels depending on context — an IP address links devices for one person, family members at home, or coworkers at the office. All matching is deterministic: shared keys only, no statistical inference.

Individual identity — one person, multiple devices

Four standard signals stitch one person's sessions across devices and browsers:

User ID — authenticated login links the current anonymous session to every previous session where the user logged in. Gold standard for identity resolution. 180-day window.
Email Hash — email captured at signup, checkout, or newsletter subscription is hashed with SHA-256 and used as an identity key. Same hash on a different device links the sessions. 180-day window. Status: supported in the pipeline but not yet deployed in production. Requires adding ?ehash= parameters to marketing email URLs.
Click ID — ad platforms append identifiers (gclid, fbclid, ttclid) to URLs. The SDK captures these on landing and preserves them across internal navigations — including the in-app browser to native browser bridge. 30-day window.
IP Address — two anonymous sessions sharing the same IP within the confidence window are linked. Weakest standard signal due to shared networks. 3-day window. Safeguard: IPs shared by more anonymous IDs than the 99th percentile are flagged as outliers and all links via that IP are discarded.
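The IP outlier safeguard can be sketched as follows. This is an illustration rather than the production implementation: the input shape (a mapping from IP to the set of anonymous IDs seen on it), the function name `outlier_ips`, and the nearest-rank percentile method are all assumptions.

```python
def outlier_ips(ids_by_ip: dict[str, set[str]], pct: int = 99) -> set[str]:
    """Return IPs shared by more anonymous IDs than the pct-th percentile."""
    counts = sorted(len(ids) for ids in ids_by_ip.values())
    if not counts:
        return set()
    # Nearest-rank percentile with integer arithmetic: rank = ceil(pct * n / 100).
    rank = max(1, (pct * len(counts) + 99) // 100)
    threshold = counts[rank - 1]
    # Every link that goes through one of these IPs would be discarded.
    return {ip for ip, ids in ids_by_ip.items() if len(ids) > threshold}
```

With 995 home IPs carrying two anonymous IDs each and five office IPs carrying hundreds, the threshold lands at two and only the office IPs are flagged.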

Email link stitching — the full flow

Email hash stitching follows a specific technical flow:

Capture: User subscribes on Device A. Email is captured and hashed with SHA-256.
Tag: All marketing email URLs include an ?ehash=abc123 parameter containing the hashed email.
Stitch: User clicks the email link on Device B. The ehash parameter matches the hash from Device A. Both anonymous sessions are now linked to the same universal user.

This works because the email hash is the same regardless of which device opens the link. No cookies required. No third-party data.

html

<!-- Before: standard marketing email link -->
<a href="https://example.com/promo?utm_source=newsletter&utm_medium=email">
  Shop now
</a>

<!-- After: with email hash parameter for identity stitching -->
<a href="https://example.com/promo?utm_source=newsletter&utm_medium=email&ehash={{SHA256_EMAIL}}">
  Shop now
</a>

Email hash stitching: capture, tag, stitch — across any device. SHA-256 hash only, no raw email stored, 180-day window.
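On the sending side, the tagging step amounts to appending the hash to each outgoing link. The sketch below assumes the hash has already been computed as a lowercase SHA-256 hex digest; the helper name `add_ehash` is hypothetical and not part of any SDK.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def add_ehash(url: str, ehash: str) -> str:
    """Return url with an ehash query parameter, keeping existing parameters."""
    if not SHA256_HEX.fullmatch(ehash):
        raise ValueError("ehash must be a 64-character lowercase SHA-256 hex digest")
    parts = urlsplit(url)
    # Drop any existing ehash so the link never carries two conflicting values.
    params = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k != "ehash"]
    params.append(("ehash", ehash))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Rejecting anything that is not a 64-character hex digest is a cheap guard against a raw email address being written into a URL by mistake.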

Household identity — family, shared context

Household-level identity links different people within the same family or home — not the same person across devices, but related users who share context.

IP Address — family members on the same home Wi-Fi share an IP. Within the 3-day window, their sessions are linked at the household level.
Click ID — when someone shares a campaign link with a family member (via WhatsApp, iMessage, or email), the click_id propagates with the URL. Both visits carry the same click_id, enabling household-level attribution.

Organizational identity — company, buying committee

B2B journeys involve multiple people at the same company:

Account ID — multi-tenant B2B products where multiple users share a product account.
Email Domain Hash — a junior employee signs up for a trial, a manager requests a demo, a VP visits the pricing page. If all use @company.com emails, the domain hash connects them into one organizational journey. Always pass account_id alongside email_domain_hash so the identity graph can enforce hard boundaries between organizations sharing the same email domain (e.g. agencies using client domains).
IP Address — employees at the same office share a corporate IP. Within the 3-day window, their sessions are linked at the organizational level. The 99th-percentile outlier safeguard prevents large offices from creating false merges.

Level	Signal	Key	Window	Typical use
Individual (same person across devices)	User ID	user_id	180d	Logged-in users
Individual	Email Hash	email_hash	180d	Email link clicks
Individual	Click ID	click_id	30d	Paid ad clicks
Individual	IP Address	ip_address	3d	Short-window matching
Household (family link-sharing or same home Wi-Fi)	Home Network IP	ip_address	3d	Same Wi-Fi household
Household	Shared Link Click	click_id	30d	Shared campaign links
Organizational (company-level identity)	Account ID	account_id	180d	B2B account linking
Organizational	Email Domain	email_domain_hash	180d	Company grouping
Organizational	Corporate Network IP	ip_address	3d	Corporate network

Identity signals organized by resolution level. All signals use deterministic matching only — no probabilistic inference, no fingerprinting.

Custom identity keys

Beyond the standard signals, SegmentStream supports arbitrary custom identity keys. Any event property can be designated as an identity key with a configurable confidence window. Examples from production:

cart_id — cross-device cart recovery. User starts checkout on one device, completes on another.
quote_id — insurance or B2B quotes sent via email, opened on a different device
checkout_id — cross-device checkout linking
loyalty_card_id — retail loyalty programs
phone_hash — SMS marketing attribution
order_id — post-purchase support interactions linked to acquisition

03. Connected Components on Deterministic Edges

The identity graph runs as a daily batch pipeline. It takes raw events from your analytics platform (GA4, for example) and the SegmentStream SDK, extracts identity keys, finds users that share keys, and groups them into universal user profiles.

Five-stage pipeline

Stage 1 — Load raw events and extract identity keys

Two data sources feed the pipeline:

Analytics events — page views, transactions, and custom events from any analytics platform (GA4, Adobe Analytics, Amplitude, Heap, Segment, or others)
SegmentStream SDK events — lightweight client-side pings that capture click IDs and IP addresses across sessions, including the in-app browser to native browser bridge

From each event, the pipeline extracts all available identity keys: user_id, email_hash, click_id (gclid, fbclid, ttclid, etc.), ip_address, plus any configured custom keys. Each key is paired with the event's anonymous_id — the cookie-level identifier.

Stage 2 — Aggregate daily user profiles

All events are grouped by date and anonymous_id, collecting the distinct identity keys observed for each visitor on each day. This produces a daily snapshot: "anonymous_id X was seen with keys [user_id=123, ip=1.2.3.4, gclid=abc]."
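As a rough illustration of this aggregation step, the sketch below groups events in memory. The event shape (date, anonymous_id, dictionary of keys) and the function name `daily_profiles` are assumptions made for the example; the real pipeline runs in the warehouse.

```python
from collections import defaultdict
from datetime import date

# Hypothetical event shape: (event_date, anonymous_id, {key_name: key_value}).
Event = tuple[date, str, dict[str, str]]


def daily_profiles(events: list[Event]) -> dict[tuple[date, str], set[tuple[str, str]]]:
    """Collect the distinct identity keys seen per (date, anonymous_id)."""
    profiles: dict[tuple[date, str], set[tuple[str, str]]] = defaultdict(set)
    for event_date, anonymous_id, keys in events:
        # A set keeps each (key_name, key_value) pair once per visitor per day.
        profiles[(event_date, anonymous_id)].update(keys.items())
    return dict(profiles)
```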

Stage 3 — Calculate user links

The pipeline compares anonymous ID pairs. If two different anonymous IDs share at least one identity key within that key's confidence window, a link is created between them.

Skewed key filtering: Before creating links, any identity key shared by more than 100 anonymous IDs is discarded entirely. All links through that key are dropped. This prevents a single corporate IP address or shared WiFi network from merging hundreds of unrelated users.
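The link calculation and the skewed-key filter might look like this in miniature. The sketch assumes that "within the confidence window" means the two sightings are at most that many days apart, and it uses a hypothetical flat list of (anonymous_id, key_name, key_value, day) sightings; neither detail is confirmed by the pipeline description.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations

# Confidence windows in days, taken from the standard signals described earlier.
WINDOW_DAYS = {"user_id": 180, "email_hash": 180, "click_id": 30, "ip_address": 3}

Sighting = tuple[str, str, str, date]  # (anonymous_id, key_name, key_value, day)


def calculate_links(sightings: list[Sighting],
                    windows: dict[str, int] = WINDOW_DAYS,
                    max_ids_per_key: int = 100) -> set[tuple[str, str]]:
    """Link pairs of anonymous IDs that share a key value within its window."""
    by_key: dict[tuple[str, str], list[tuple[str, date]]] = defaultdict(list)
    for anonymous_id, key_name, key_value, day in sightings:
        by_key[(key_name, key_value)].append((anonymous_id, day))

    links: set[tuple[str, str]] = set()
    for (key_name, _), seen in by_key.items():
        # Skewed key filtering: drop the key entirely if too many IDs share it.
        if len({anon for anon, _ in seen}) > max_ids_per_key:
            continue
        for (a, day_a), (b, day_b) in combinations(seen, 2):
            if a != b and abs((day_a - day_b).days) <= windows[key_name]:
                links.add((min(a, b), max(a, b)))  # store each pair once
    return links
```

The pairwise comparison is quadratic per key value, which is acceptable here only because the skew filter bounds how many IDs any one key can carry.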

Stage 4 — Calculate connected components

The pipeline takes all pairwise links and finds connected components — groups of anonymous IDs that are transitively linked.

If identities are transitively connected, they are merged into a single component:

A is linked to B (via shared ip_address)
B is linked to C (via shared email_hash)
Therefore A, B, and C are all the same user — even though A and C share no direct identity key

Component size cap: Connected components larger than the 99th percentile are flagged as outliers and discarded. Components that large indicate shared infrastructure or a data quality issue, not real users.

Hard constraints: If two anonymous IDs have different account_id values, they are never linked, regardless of other shared keys. This prevents cross-account contamination in multi-tenant B2B products.
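One common way to compute connected components from pairwise links is a union-find (disjoint-set) structure. The text does not say which algorithm the pipeline uses, so treat this as an illustrative sketch; it also leaves out the component size cap and the account_id hard constraint.

```python
def connected_components(links: list[tuple[str, str]]) -> list[set[str]]:
    """Group IDs that are transitively linked, using union-find."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: dict[str, set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

Feeding it the links A-B and B-C returns one component containing A, B, and C, which is the transitive case described above.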

Stage 5 — Output to users table

Each anonymous ID is mapped to a universal_id — the identifier of the canonical user. The universal ID is the anonymous ID with the earliest first_visit timestamp in the component. This ensures the user's identity anchors to their oldest known session.
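Choosing the canonical ID for each component can be sketched in a few lines. The `first_visit` lookup and the tie-break on the ID string are assumptions added so the example is deterministic.

```python
from datetime import datetime


def assign_universal_ids(components: list[set[str]],
                         first_visit: dict[str, datetime]) -> dict[str, str]:
    """Map every anonymous ID to the earliest-seen ID in its component."""
    mapping: dict[str, str] = {}
    for component in components:
        # Earliest first_visit wins; ties fall back to the ID itself (assumption).
        universal_id = min(component, key=lambda anon: (first_visit[anon], anon))
        for anon in component:
            mapping[anon] = universal_id
    return mapping
```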

The output is written to the users table in the customer's BigQuery warehouse. Every downstream system — attribution, scoring, reporting — joins against this table to resolve anonymous IDs to universal users.

Daily batch pipeline — cookie signals flow through the identity graph and resolve into the users table. Data stays in your BigQuery.

Key properties

Deterministic only. Every link requires a shared, verified identity key. No statistical inference, no behavioral similarity, no device fingerprinting.

Transitive resolution. If A links to B and B links to C, all three get the same universal ID — even if A and C share no direct key. This is what makes connected components powerful: identity signals compound across touchpoints.

Warehouse-native. The pipeline reads from and writes to the customer's BigQuery project. No data leaves their warehouse. No external identity graph services. The customer owns their identity data.

Batch processing. The pipeline runs daily, so expect approximately one day of lag between an event occurring and its appearance in the identity graph. This is a deliberate tradeoff: batch processing enables the full connected-components algorithm at scale.

Privacy by design

Non-consent users are invisible. Users who decline consent receive a non-consent-{uuid} anonymous ID that changes on every page load. These IDs are never linked to anything — the identity graph literally cannot see them.
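The rotating ID format can be illustrated as follows. Only the non-consent-{uuid} shape comes from the description above; the helper names and the explicit filter are assumptions, since in practice an ID that changes on every page load never repeats and so never shares a key.

```python
import uuid

NON_CONSENT_PREFIX = "non-consent-"


def new_non_consent_id() -> str:
    """Generate a fresh ID; calling this on every page load prevents any reuse."""
    return f"{NON_CONSENT_PREFIX}{uuid.uuid4()}"


def is_linkable(anonymous_id: str) -> bool:
    """Defensive filter (an assumption): never feed non-consent IDs into linking."""
    return not anonymous_id.startswith(NON_CONSENT_PREFIX)
```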

No raw emails stored. Only SHA-256 hashes of email addresses are used as identity keys. The raw email is never written to the identity graph pipeline.

All data stays in customer's BigQuery. The pipeline reads from and writes to the customer's BigQuery project — no raw data or PII is transferred outside their infrastructure.

04. See It Work

Querying the identity graph via MCP

Two MCP tools expose the identity graph directly. get_identity_graph_statistics returns signal coverage, stitching rates, and device counts for a given project. get_user_journey resolves a specific user's cross-device timeline — showing every anonymous session, the signals that linked them, and the full attribution path.

Identity Graph Statistics

Signal Coverage: Purchase (90d)

Signal	Coverage
click_id	56.2%
ip_address	100%
email_hash	73.2%
user_id	88.4%

Converted users: 1,228
2+ devices: 584 (47.6%)
Avg devices/user: 4.6

Live MCP queries — identity graph statistics and cross-device user journey.

Production benchmarks

Median stitching rates across production projects, grouped by the number of deterministic identity parameters passed:

Parameters	Signals	Median stitching rate
2	click_id, IP	35%
3	click_id, IP, user_id / email_hash	67%

Adding user_id or email_hash increases the median stitching rate by 32 percentage points. Each additional deterministic signal compounds the graph's ability to connect cross-device journeys.

More identity signals produce higher stitching rates — median benchmarks from production.

What the numbers tell you

Adding User ID lifts stitching from 35% to 67%. Projects passing only click_id and ip_address achieve a median 35% stitching rate across converted users. Adding a third deterministic signal — typically user_id or email_hash from a backend integration — pushes that to 67%. A +32 percentage point improvement from a single additional parameter.

"2+ devices" means 2+ anonymous sessions stitched. This includes true cross-device journeys (phone → laptop), but also same-device fragmentation: in-app browsers, cleared cookies, incognito sessions. All of these create orphaned sessions that need stitching regardless of cause.

Coverage quality matters as much as key count. Projects with GDPR consent constraints see ip_address and click_id coverage drop to ~70%, reducing their effective stitching rate even with 3 parameters. The benchmark assumes well-integrated projects with standard cookie consent.

The retargeting test

Here's a practical way to validate your identity graph is working:

Check first-touch attribution on retargeting and email campaigns. In a well-functioning identity graph, these should show close to zero first-touch conversions.

Why? Retargeting and email reach people who already visited your site. If the identity graph correctly stitches the retargeted user's new session to their original visit, first-touch credit goes to the channel that originally introduced them — not to "retargeting" or "email."

If retargeting shows significant first-touch conversions, it means the identity graph failed to link the retargeted session to the user's original visit. They look like new users when they're not.

Ideal result: zero first-click conversions for retargeting campaigns. That means identity stitching is doing its job — these campaigns are being correctly recognized as re-engagement, not acquisition.
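The retargeting test can be run as a small before-and-after comparison. This sketch assumes a hypothetical session tuple of (anonymous_id, timestamp, channel, converted) and a `users` mapping from anonymous ID to universal ID; it counts converted users by the channel of their earliest session.

```python
from collections import Counter

# Hypothetical session shape: (anonymous_id, timestamp, channel, converted).
Session = tuple[str, int, str, bool]


def first_touch_conversions(sessions: list[Session],
                            users: dict[str, str]) -> Counter:
    """Count converted users by the channel of their earliest session."""
    first_channel: dict[str, str] = {}
    converted: set[str] = set()
    for anon, _, channel, did_convert in sorted(sessions, key=lambda s: s[1]):
        # Unstitched IDs fall back to themselves, as if no graph existed.
        uid = users.get(anon, anon)
        first_channel.setdefault(uid, channel)
        if did_convert:
            converted.add(uid)
    return Counter(first_channel[uid] for uid in converted)
```

Passing an empty `users` mapping gives the "without identity graph" view. If retargeting still shows first-touch conversions once the real mapping is supplied, those sessions were not stitched to an earlier visit.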

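First-Click Attribution (interactive table): first-click conversions by channel (Paid Social, Paid Search, Organic, Retargeting, Email, Direct), without and with the identity graph. The values are rendered interactively and are not reproduced here.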

Attribution shifts from Direct/None to paid channels when cross-device journeys are resolved.

05. What Goes Wrong and How to Avoid It

Using statistical matching for measurement decisions

Statistical matching — behavioral fingerprinting, device graph lookups, inferred identity links — is designed for ad targeting where false positives are cheap. When applied to measurement, false links corrupt attribution data silently. You won't see an error. You'll see wrong numbers that look plausible.

Fix: Use only deterministic identity signals for measurement. Accept a lower stitching rate in exchange for trustworthy data.

Not installing the SegmentStream SDK

Without the SDK, click_id preservation doesn't work. The SDK captures gclid, fbclid, ttclid, and other click parameters on landing and preserves them across internal page navigations. Without it, these values are lost after the landing page.

Fix: Install the SDK on all pages. It's a single JavaScript snippet.

Not sending User ID on every authenticated page load

A common implementation mistake: sending user_id to GA4 only on the login page. If a user is already logged in and visits other pages, those sessions don't carry user_id — and the identity graph can't link them.

Fix: Set user_id in the GA4 config on every page load for authenticated users, not just the login event.

Inconsistent email normalization before hashing

If one system hashes an email address exactly as the user typed it (mixed case, stray whitespace) and another hashes a lowercased, trimmed copy of the same address, the SHA-256 outputs will differ. The identity graph sees two different keys and can't link them.

Fix: Normalize before hashing — lowercase, trim whitespace, strip dots from Gmail addresses (if applicable). Apply the same normalization everywhere email hashes are generated.
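A shared normalization helper is one way to keep every system hashing the same string. The sketch below applies the three rules from the fix; the function names and the choice to treat googlemail.com like gmail.com are assumptions.

```python
import hashlib

# Gmail ignores dots in the local part; googlemail.com is included as an assumption.
GMAIL_DOMAINS = {"gmail.com", "googlemail.com"}


def normalize_email(email: str, strip_gmail_dots: bool = True) -> str:
    """Lowercase, trim, and optionally strip dots from Gmail local parts."""
    local, _, domain = email.strip().lower().rpartition("@")
    if not local or not domain:
        raise ValueError("not an email address")
    if strip_gmail_dots and domain in GMAIL_DOMAINS:
        local = local.replace(".", "")
    return f"{local}@{domain}"


def email_hash(email: str) -> str:
    """SHA-256 hex digest of the normalized address."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()
```

Whatever rules you settle on, the important part is that the signup form, the checkout, and the email platform all call the same function.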

Expecting real-time stitching

The identity graph runs as a daily batch pipeline. If a user visits on Device A in the morning and Device B in the afternoon, the stitching won't appear until the next day's pipeline run.

Fix: Design your workflows around daily data freshness. Identity stitching is not suited for real-time personalization — it's built for accurate measurement and reporting.

Ignoring consent compliance

Non-consent users receive a rotating anonymous ID (non-consent-{uuid}) that changes every page load. This is by design — the identity graph cannot track users who haven't consented. If consent rates are low, your stitching rate will be structurally limited.

Fix: Optimize your consent flow. Higher consent rates directly improve identity graph coverage. There is no technical workaround — and there shouldn't be.

Skipping the corporate IP safeguard

If you're seeing unusually high stitching rates (40%+), check for corporate IP merges. A shared office IP can link hundreds of unrelated users. The pipeline's default safeguard — discarding identity keys shared beyond the 99th percentile of anonymous IDs — catches most cases, but misconfigured custom keys can still cause issues.

Fix: Monitor the key combination breakdown in get_identity_graph_statistics. If a single IP or custom key is driving a disproportionate share of stitching, investigate.

Agency cross-account contamination in B2B

The account_id signal is primarily used by B2B and SaaS products for organizational identity — linking multiple employees within the same company into a single buying committee journey. The identity graph treats account_id as a hard constraint: two anonymous IDs with different account_id values are never linked, regardless of other shared signals.

Example: one @accenture.com employee visits your site under account_id "accenture-client-a" and a second @accenture.com employee visits under account_id "accenture-client-b". Their email_domain_hash is identical (both @accenture.com), but because their account_id values differ, the identity graph will not stitch them. The hard constraint overrides the shared domain hash, so cross-account contamination is blocked.

This creates a specific risk for agencies and consultancies (Dentsu, GroupM, Accenture) whose employees access multiple client accounts. If the CRM assigns a shared or agency-level account_id rather than a client-specific one, the identity graph will merge sessions across unrelated client accounts. The hard constraint only works when the boundary is correct upstream.

Fix: Ensure each account_id maps to a single end-client account, not an agency umbrella. Always pass account_id alongside email_domain_hash — this lets the identity graph automatically enforce hard boundaries between organizations, even when employees share the same email domain. Monitor component sizes via get_identity_graph_statistics — unusually large components often indicate cross-client leakage from shared account identifiers.
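The hard constraint reduces to a pairwise check applied before any link is created. This is a minimal sketch; treating a missing account_id as "no constraint" is an assumption, as is the function name `may_link`.

```python
from typing import Optional


def may_link(account_a: Optional[str], account_b: Optional[str]) -> bool:
    """Return False when both IDs carry an account_id and the values differ."""
    if account_a is None or account_b is None:
        return True  # assumption: a missing account_id imposes no constraint
    return account_a == account_b


# The two visitors from the example share a domain hash but not an account.
print(may_link("accenture-client-a", "accenture-client-b"))
```

The check is only as good as the account_id values it receives: an agency-level account_id would pass it for visitors who belong to different clients.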
