CDP query engine
The CDP query engine is the internal layer that translates segment definitions and field references into SQL — for ClickHouse, Postgres, or both. Every system that asks the substrate a question goes through these primitives. They exist so that segment-rule logic isn’t reimplemented per call site (which historically produced drift, bugs, and SQL-injection vectors).
What the engine is responsible for
| Concern | How the engine handles it |
|---|---|
Translate operator vocabulary (equals, is_in, between, contains, is_null, relative_date) into SQL | Mapped at one place via the _build_pg_predicate and _build_ch_predicate helpers |
| Parameterise every operator argument | Values are bound via :param (Postgres) or {param:Type} (ClickHouse) — never interpolated |
| Coerce types correctly | Pythonic values → DB-native types per dialect (e.g. ISO strings → DateTime, lists → IN tuples) |
| Short-circuit edge cases | Empty IN lists return false immediately, not WHERE col IN () |
| Discover registered fields | CompositeFieldRegistry enumerates available field paths from app-layer registrations |
| Disambiguate dialect | Postgres-side resolution uses _build_pg_predicate; ClickHouse-side uses _build_ch_predicate; both share the same operator vocabulary |
Why a shared engine
Before the engine was extracted, each consumer (segment evaluator, suppression checker, audience sync, AI parser) generated SQL its own way. The result:
- 39 sites used SQL interpolation; several were unsafe
- Operator vocabulary diverged: some sites supported
between, others onlygt/lt transaction.purchasewas canonicalised in some places,order_placedin others- The same segment rule could produce different result sets depending on which consumer evaluated it
The engine resolves these by being the single point of SQL generation. Adding an operator means adding it once.
Operator vocabulary
The supported operators are the same across PG and CH resolvers:
| Operator | What it does |
|---|---|
eq / neq | Equality |
gt / gte / lt / lte | Numeric / date comparisons |
contains | Substring match (case-insensitive) |
between | Range — inclusive both ends |
in / not_in | Set membership |
is_null / is_not_null | Null checks |
relative_date | ”Last N days”, “next N days”, “is today”, “is this week” — see Relative date predicates below |
Relative date predicates
A class of operators that work across PG (for contacts.* attributes) and CH (for event-derived fields). Common forms:
| Predicate | Behaviour |
|---|---|
| ”Birthday in next 7 days” | Computes month/day window around now(), ignoring year |
| ”Anniversary today” | Auto-discovered for any contact attribute matching `(anniversary |
| ”Delivery yesterday” | Resolves actual_delivery vs scheduled delivery; supports delivery_recency_days |
| ”Loyalty tier expires next 14 days” | Reads loyalty_membership_subscriptions.tier_protection_expires_at |
The “Ambiguous phrases” feature (AMBIGUOUS_PHRASES) auto-mirrors tense — “delivery today” resolves whether the delivery is scheduled for today or already happened today, depending on context.
Field discovery
Every queryable field is registered via CompositeFieldRegistry. Registration declares:
- The field’s path (e.g.
contact.email,event.transaction.purchase.amount) - The source kind (ContactsAttribute, EventsAggregate, ContactChannelEngagement, FunnelPosition, etc.)
- Optional embedded config (e.g. the channel name for an engagement field, baked into the field path)
The is_filterable: true flag determines whether the field shows up in the segment-builder UI. The AI parser and the FE manual builder both read from the same registry — a registry-surface-parity drift guard ensures they stay in sync.
FieldSource pattern
Most field paths resolve to one of a small set of FieldSource kinds. Each kind handles its own SQL generation:
| FieldSource | Substrate | Example field path |
|---|---|---|
ContactsAttribute | Postgres contacts.* | contact.email |
EventsAggregate | ClickHouse events_with_ttl | event.transaction.purchase.count_30d |
ContactChannelEngagement | ClickHouse delivery_events | engagement.whatsapp.last_clicked_at |
ConversionAttribution | CH events_with_ttl WHERE event_type='conversion.attributed' | attribution.last_campaign |
FunnelPosition | CH + funnel def | funnel.cart_to_checkout.position |
CohortMembership | CH + cohort def | cohort.q2_signups.is_member |
LoyaltyAttribute | Postgres loyalty_* tables | loyalty.tier |
Each kind has a focused implementation. New analytical lenses become new kinds — the engine itself doesn’t need to change.
See USP derived fields for the read-time pattern that lets us add analytical lenses without storing per-user materialised columns.
Drift guards (CI)
Three CI tests prevent the engine from drifting away from its callers:
| Test | What it asserts |
|---|---|
test_field_registry_parity | FIELD_SOURCES keys are a subset of SYSTEM_VARIABLES (AI surface and manual surface agree on filterable fields) |
test_pg_resolver_vocabulary | Postgres resolver supports the same operator vocabulary as CH resolver |
test_canonical_purchase_event | No call site uses legacy order_placed; all converged on transaction.purchase |
These run on every PR and fail the build on regression.
SQL-injection posture
Every operator argument is bound, not interpolated. The engine refuses to accept a non-bound value — passing a raw string where a bound placeholder is expected is a TypeError, not an exception at SQL time. The 23-commit stack that extracted the engine closed 7 historical SQL-injection sites; the architecture is now safe-by-construction.
What’s next
- Segment substrate — what the engine queries against
- USP derived fields — the FieldSource pattern in depth
- Data model — broader CDP architecture
- Event governance — event naming and the canonical event vocabulary