USP derived fields

A derived field is one that doesn’t live as a column on the contacts table — it’s computed from event substrate at the moment a segment is evaluated. The Unified Substrate Plane (USP) provides the read-time machinery that makes this fast enough to use anywhere a stored column could be used.

Why read-time, not stored

The naive way to make analytical lenses queryable is to write them to per-user columns:

“Engagement on WhatsApp last clicked at” → column on contacts table
“Last attributed campaign” → column on contacts table
“Position in cart-to-checkout funnel” → column on contacts table
“Member of Q2-signups cohort” → column on contacts table

This breaks at scale. Every analytical lens you add becomes a daily backfill job that writes to millions of rows. The columns drift from the substrate (which is the actual source of truth). Schema churn is constant.

USP inverts this: the analytical lenses are computed at read time against the canonical substrate. Adding a lens means adding a FieldSource kind, not a column.

FieldSource pattern

Every queryable USP field maps to a FieldSource kind. The kind owns:

Where it reads from (PG / CH table + columns)
How it constructs predicates for each supported operator
What context it requires (some FieldSources are parameterised — e.g. ContactChannelEngagement embeds the channel name in the field path)


# pseudo-schema (Python)
class FieldSource:
    kind: Literal['ContactChannelEngagement', 'ConversionAttribution', ...]
    substrate: Literal['pg', 'ch']
    config: dict  # kind-specific (e.g. channel name, metric, lookback window)
 
    def build_predicate(self, operator: str, value: Any) -> SqlFragment: ...

The engine (CDP query engine) calls build_predicate per field reference during SQL construction.

Currently shipped FieldSource kinds

Kind	Reads from	Example field paths
`ContactsAttribute`	PG `contacts.*`	`contact.email`, `contact.phone`, `contact.first_name`
`EventsAggregate`	CH `events_with_ttl`	`event.transaction.purchase.count_30d`, `event.page.viewed.last_at`
`ContactChannelEngagement`	CH `delivery_events`	`engagement.whatsapp.last_clicked_at`, `engagement.email.opens_7d`
`ConversionAttribution`	CH `events_with_ttl WHERE event_type='conversion.attributed'`	`attribution.last_campaign`, `attribution.path_length`, `attribution.last_channel`
`FunnelPosition`	CH + funnel_definitions PG table	`funnel.cart_to_checkout.position`, `funnel.cart_to_checkout.completed`
`CohortMembership`	CH + cohort_definitions PG table	`cohort.q2_signups.is_member`
`LoyaltyAttribute`	PG `loyalty_*`	`loyalty.tier`, `loyalty.points_balance`, `loyalty.tier_expires_at`
`LoyaltySuppressionState`	PG `contacts.*`	`loyalty.consent_status`, `loyalty.suppression_score`, `loyalty.total_suppressions`
`MembershipAttribute`	PG `loyalty_membership_subscriptions`	`membership.tier`, `membership.next_renewal_at`, `membership.trial_expires_at`

Embedded config in field paths

Some kinds need parameters (which channel? which metric? what time window?). Rather than passing parameters at query time, the parameters are baked into the field path:

Field path	Decoded as
`engagement.whatsapp.last_clicked_at`	kind=ContactChannelEngagement, channel=whatsapp, metric=last_clicked_at
`attribution.last_campaign_30d`	kind=ConversionAttribution, metric=last_campaign, window=30d
`funnel.cart_to_checkout.position`	kind=FunnelPosition, funnel_id=cart_to_checkout, metric=position

This means the field path is a complete address — the segment-builder UI, the AI parser, and the SQL generator all interpret the same string the same way. No out-of-band parameters to drift.

The “ship raw, not pre-computed booleans” rule

When deciding what to expose, we prefer raw fields over pre-computed booleans wherever the operator might define the semantic differently.

For example: instead of shipping an is_dnc (do-not-contact) boolean, we ship five raw suppression-state fields (consent_status, is_auto_excluded, suppression_score, total_suppressions, last_suppression_date). Different workspaces define DNC differently — some treat any opt-out as DNC, others gate on suppression_score > 0.8. Shipping the raw fields lets operators express their own definition; shipping is_dnc would lock everyone into one.

Backfills

Adding a new FieldSource kind is non-destructive — no schema migration, no backfill. The kind is registered, the CompositeFieldRegistry discovers it, and the next segment evaluation that references it reads via the new kind.

The exception: when a kind reads from a table that needs hydration (e.g. funnel_definitions, cohort_definitions for the analytics→segment bridge), a one-off seed is required to populate those control tables. Per-contact data is not pre-computed; it’s read at query time.

Canonical event vocabulary

USP field paths that read from event substrate (EventsAggregate, ContactChannelEngagement, ConversionAttribution, etc.) bind to canonical event names — not raw provider-shape names. The translation lives in canonical_mapper.py:

CHANNEL_CANONICAL_EVENT_MAP — raw → canonical (despite the name, this dict holds every raw mapping, including non-channel events like page_view, screen_view, signup_completed, kyc_*, app_installed)
CANONICAL_TO_ACTIVITY_TYPE — canonical → USP (the FE Activity-Feed lens; some older tracker docs call this map _CANONICAL_TO_USP)

When adding a new mapping, search for the closest existing entry and add it adjacent — don’t rely on the dict’s name to find the right home.

Performance

Read-time computation works because:

ClickHouse aggregates are designed for the scan patterns these kinds use (per-contact event filtering with GROUP BY contact_id HAVING ...)
Postgres reads target indexed columns
Predicates are pushed down — only the contacts that match the rest of the segment definition trigger USP field evaluation
Aggregate queries are partitioned by workspace_id so cross-workspace queries are O(per-workspace), not O(global)

The trade-off: more CPU at read time, less storage and less drift. For Active Reach’s scale (low-thousands of workspaces, millions of contacts per workspace) this is the right balance.

Adding a new FieldSource kind — checklist

Define the kind in cdp_query_engine/ (substrate + config schema + build_predicate implementation)
Register the kind’s field paths in FIELD_SOURCES
Add to SYSTEM_VARIABLES so the FE surface knows about it (the parity drift guard enforces this)
Write a goldens test capturing the SQL emitted for each operator + sample value
Add a drift guard if the kind needs schema invariants from another table

What’s next

CDP query engine — the engine that calls FieldSource kinds
Segment substrate — where canonical membership lives
Data model — broader CDP architecture