@ulemons ulemons commented Jan 5, 2026

Introduce Segment Aggregate States for Scalable CDP Metrics

Summary

This PR introduces a new architecture for computing CDP dashboard metrics that is designed to be scalable, predictable, and safe at large data volumes.

The main change is the introduction of a segment-level aggregate states datasource, built daily via a COPY pipe, and three lightweight sinks that derive metrics for:

  • subprojects
  • projects
  • project groups

Heavy aggregation and DISTINCT logic are moved out of the sinks and into a single batch pipeline.


New Components

1. Datasource

cdp_segment_metrics_agg_states_ds

A new datasource that stores one row per subproject segment per daily snapshot, containing:

  • activity counters as aggregation states
  • DISTINCT member and organization states (uniqCombined)
  • hierarchy identifiers (segmentId, parentId, grandparentId)

This datasource uses AggregatingMergeTree and is optimized for hierarchical rollups via *Merge() functions.
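
To illustrate why the datasource stores aggregation states rather than finalized numbers, here is a minimal conceptual sketch in Python. It is not the Tinybird schema: `SegmentState`, `merge_states`, and the sample IDs are hypothetical, and an exact Python set stands in for a uniqCombined state (which is approximate in ClickHouse).

```python
from dataclasses import dataclass, field


@dataclass
class SegmentState:
    """Hypothetical stand-in for one row of segment-level aggregate states."""
    segment_id: str
    parent_id: str
    activity_count: int = 0
    # Exact set used as a stand-in for a uniqCombined state (assumption).
    member_ids: set = field(default_factory=set)


def merge_states(states):
    """Merge states as a *Merge() rollup would: add counters, union distinct states."""
    total = 0
    members = set()
    for s in states:
        total += s.activity_count
        members |= s.member_ids
    return total, len(members)


subprojects = [
    SegmentState("sub-a", "proj-1", 10, {"m1", "m2"}),
    SegmentState("sub-b", "proj-1", 5, {"m2", "m3"}),
]

activities, members = merge_states(subprojects)
# Summing finalized distinct counts would count member "m2" twice.
naive_members = sum(len(s.member_ids) for s in subprojects)
print(activities, members, naive_members)  # 15 3 4
```

Counters are additive, so they could be rolled up either way. Distinct counts are not, which is why the states have to be kept unfinalized until the rollup level is known.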


2. COPY Pipe (Core of the Change)

cdp_segment_metrics_agg_states_copy.pipe

This is the most complex and important component introduced in this PR.

Responsibilities:

  • Reads existing aggregate datasources:
    • cdp_member_segment_aggregates_ds
    • cdp_organization_segment_aggregates_ds
  • Finalizes per-(segment, member/org) metrics
  • Computes:
    • total activities
    • last-30-days activities
    • DISTINCT member states
    • DISTINCT organization states
  • Attaches hierarchy metadata from segments
  • Writes a complete daily snapshot into cdp_segment_metrics_agg_states_ds

This pipe runs once per day and centralizes all heavy computation and DISTINCT logic.
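
The responsibilities above can be sketched as a small batch step. This is a conceptual Python model under stated assumptions, not the actual pipe SQL: the row shapes, `build_snapshot`, and the sample data are hypothetical, organizations are omitted for brevity (they would follow the same pattern as members), and a set stands in for a DISTINCT state.

```python
from collections import defaultdict

# Hypothetical finalized per-(segment, member) rows:
# (segment_id, member_id, total_activities, last_30_days_activities)
member_rows = [
    ("sub-a", "m1", 7, 2),
    ("sub-a", "m2", 3, 0),
    ("sub-b", "m2", 5, 5),
    ("sub-x", "m9", 1, 1),  # segment missing from the hierarchy, skipped below
]

# Hypothetical hierarchy metadata: segment_id -> (parent_id, grandparent_id)
segments = {
    "sub-a": ("proj-1", "group-1"),
    "sub-b": ("proj-1", "group-1"),
    "sub-c": ("proj-2", "group-1"),  # no activity yet
}


def build_snapshot(snapshot_date, member_rows, segments):
    """Build one row per valid segment for a single daily snapshot."""
    acc = defaultdict(lambda: {"total": 0, "last30": 0, "members": set()})
    for segment_id, member_id, total, last30 in member_rows:
        if segment_id not in segments:
            continue  # keep only segments present in the hierarchy
        state = acc[segment_id]
        state["total"] += total
        state["last30"] += last30
        state["members"].add(member_id)
    # Segments without activity still get a row, carrying empty states.
    return [
        {
            "snapshot": snapshot_date,
            "segment_id": segment_id,
            "parent_id": parent_id,
            "grandparent_id": grandparent_id,
            **acc[segment_id],
        }
        for segment_id, (parent_id, grandparent_id) in segments.items()
    ]


snapshot = build_snapshot("2026-01-07", member_rows, segments)
by_segment = {row["segment_id"]: row for row in snapshot}
```

Because every valid segment gets a row in each snapshot, downstream readers only ever need the latest snapshot and never have to look back for missing segments.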


3. Sinks

All sinks are intentionally simple and fast.
They only:

  • read the latest daily snapshot
  • merge aggregation states
  • format the output
  • export to Kafka

cdp_dashboard_metrics_subproject_sink.pipe

Publishes metrics at subproject level by finalizing segment-level states.

cdp_dashboard_metrics_project_sink.pipe

Rolls up subproject states to project level using parentId and state merging.

cdp_dashboard_metrics_project_group_sink.pipe

Rolls up subproject states to project group level using grandparentId and state merging.
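
As a rough illustration of what the project and project group sinks do, the sketch below reads only the latest snapshot and merges states by a hierarchy key. It is a conceptual Python model, not the sink SQL: the row shape, `rollup`, and the sample data are hypothetical, a set stands in for a DISTINCT state, and the Kafka export step is left out.

```python
from collections import defaultdict

# Hypothetical snapshot rows; "members" is a set standing in for a DISTINCT state.
rows = [
    {"snapshot": "2026-01-06", "segment_id": "sub-a", "parent_id": "proj-1",
     "grandparent_id": "group-1", "total": 9, "members": {"m1"}},
    {"snapshot": "2026-01-07", "segment_id": "sub-a", "parent_id": "proj-1",
     "grandparent_id": "group-1", "total": 10, "members": {"m1", "m2"}},
    {"snapshot": "2026-01-07", "segment_id": "sub-b", "parent_id": "proj-1",
     "grandparent_id": "group-1", "total": 5, "members": {"m2", "m3"}},
    {"snapshot": "2026-01-07", "segment_id": "sub-c", "parent_id": "proj-2",
     "grandparent_id": "group-1", "total": 4, "members": {"m3"}},
]


def rollup(rows, key):
    """Merge latest-snapshot states by `key`, then finalize distinct counts."""
    latest = max(r["snapshot"] for r in rows)  # ISO dates sort lexicographically
    merged = defaultdict(lambda: {"total": 0, "members": set()})
    for r in rows:
        if r["snapshot"] != latest:
            continue  # older snapshots are ignored
        state = merged[r[key]]
        state["total"] += r["total"]
        state["members"] |= r["members"]
    return {
        k: {"total": v["total"], "members": len(v["members"])}
        for k, v in merged.items()
    }


projects = rollup(rows, "parent_id")
project_groups = rollup(rows, "grandparent_id")
print(projects)
print(project_groups)
```

Both rollups start from the same subproject-level states, so the project group level does not depend on the project sink having run first.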


Note

Introduces a scalable, state-based pipeline for CDP segment metrics.

  • New cdp_segment_metrics_ds (AggregatingMergeTree) storing per-segment daily states: count (total/last-30) and uniqCombined (members/orgs), with hierarchy IDs
  • COPY pipe cdp_segment_metrics_copy_pipe computes states from existing member/org aggregates and latest activities snapshot; restricts to valid segments, computes once per latest snapshot, reuses empty states; writes full daily snapshot
  • Three lightweight sinks finalize/roll up latest snapshot and export to Kafka (cdp_dashboard_metrics_per_segment_sink): cdp_segment_metrics_subproject_sink, cdp_segment_metrics_project_sink, cdp_segment_metrics_project_group_sink (rollups via parentId/grandparentId)
  • Scheduled runs: COPY at 09:00; sinks at 09:30/09:35/09:40

Written by Cursor Bugbot for commit 87dac21.

@ulemons ulemons self-assigned this Jan 5, 2026
@ulemons ulemons added the Feature Created by Linear-GitHub Sync label Jan 5, 2026
@ulemons ulemons requested a review from epipav January 5, 2026 15:01
@ulemons ulemons marked this pull request as ready for review January 7, 2026 09:06

github-actions bot commented Jan 7, 2026

⚠️ Jira Issue Key Missing

Your PR title doesn't contain a Jira issue key. Consider adding it for better traceability.

Example:

  • feat: add user authentication (CM-123)
  • feat: add user authentication (IN-123)

Projects:

  • CM: Community Data Platform
  • IN: Insights

Please add a Jira issue key to your PR title.

@ulemons ulemons force-pushed the feat/total-segment-metrics branch from 355ef96 to bd87c2e January 7, 2026 17:39
@ulemons ulemons force-pushed the feat/total-segment-metrics branch from bd87c2e to 4e416ec January 8, 2026 08:38
@ulemons ulemons changed the title from "feat: add infra for segments metrics" to "feat: add infra for segments metrics (CM-708)" Jan 8, 2026
@epipav epipav left a comment

Looks good 👍 added one comment and a nitpick

Comment on lines +24 to +26
FROM segments AS s
LEFT JOIN
cdp_segment_metrics_ds AS sa

NIT: AFAIK, it's better to keep the smaller table on the right-hand side of the join for performance. If we see bad performance, let's try swapping the two tables.

@ulemons ulemons Jan 8, 2026

At the moment it takes around 10 s. Would that be reasonable for now?
