
Requesting guidance - cardinality and metric structure for metrics for a large number of entities #7034

@kk2365

Description


I am trying to compare different ways to structure metrics for monitoring large-scale infrastructure, and I need some suggestions / help. Please see the details below, taking servers as an example:

Scope:

  • over 9,000 servers
  • 40 metrics / timeseries per server

I also need to collect the following reference data / metadata:

  • 7 locations
  • 2 environment categories - prod, non-prod
  • owned by approx 500 owner_ids
  • each server can have up to 2 user-friendly aliases

Objective:

  • send metrics for all these servers across different locations to a single Cortex plant.
  • The owner of each server should be able to easily query or alert on the metrics for their servers using conditions on attributes other than the server name (e.g. "owner_id" (an ID representing the organization that owns one or more servers), location, aliases, etc.)

Approaches I am considering fall into two broad categories:
a) enrich the metrics with additional labels for reference data or metadata at the time of collection, or
b) collect raw metrics without any labels and collect reference data separately through an info series and join the two in Cortex

Approach 1

  • Metric - <metric_name> (40 possible values)
  • Labels - server name (9,000 values), aliases (max 2 per server), owner_id (max 1 per server, 500 overall), location (max 1 per server, 7 in total)
  • total metric names - 40
  • Estimated cardinality per metric - 18,000 (9,000 servers × 2 aliases)
  • At query or alerting time, a user can use an exact match on the metric name and regexes for aliases, owner_id and server names
  • This approach looks most intuitive to me, but I am not sure whether a cardinality of 18K per metric is good or bad.
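As a sanity check on my own math, here is how I am estimating these numbers (a sketch in Python; it assumes every server exposes all 40 metrics and that both aliases materialize as separate label values, so each (server, alias) pair is a distinct series per metric):

```python
# Rough cardinality estimate for Approach 1 (assumption: every server
# exposes all 40 metrics and both aliases appear as label values).
servers = 9000
aliases_per_server = 2  # worst case
metrics = 40

# Each (server, alias) combination is a distinct series per metric.
series_per_metric = servers * aliases_per_server
total_active_series = series_per_metric * metrics

print(series_per_metric, total_active_series)  # 18000 720000
```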

Approach 2

  • Metric - server_<owner_id> (500 possible values)
  • Labels - server name (max 100 per owner_id, 9,000 globally), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
  • total metric names - 500
  • Cardinality per metric - 100 × 40 × 2 = 8,000 (worst case)
  • At query or alerting time, we would need a regex on the metric name, since it carries an owner_id and most of our users have multiple owner_ids. We may still need regexes for aliases and server names as well.
  • This approach looks less intuitive to me for end users because the actual metric name is captured as a label, but it does cut the worst-case per-metric cardinality by more than half.
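The 8,000 figure is a worst case; if servers were spread evenly across owners (purely an assumption for illustration), the typical per-metric cardinality would be much lower:

```python
# Approach 2: per-owner metric names, metric name demoted to a label.
max_servers_per_owner = 100
avg_servers_per_owner = 9000 / 500  # 18.0 if spread evenly (assumption)
metrics = 40
aliases = 2

worst_case_per_metric = max_servers_per_owner * metrics * aliases  # 8000
typical_per_metric = avg_servers_per_owner * metrics * aliases     # 1440.0

# The global total is unchanged: still one series per
# (server, metric, alias) combination, i.e. 9000 * 40 * 2 = 720000.
```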

Approach 3

  • Metric - server_<server_name> (9,000 possible values)
  • Labels - owner_id (max 1 per server), metric name (40), aliases (max 2 per server), location (max 1 per server, 7 in total)
  • total metric names - 9,000
  • Cardinality per metric - 40 × 2 = 80
  • This approach looks least intuitive to me for end users, but it also has the lowest per-metric cardinality.
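Putting the three label-based approaches side by side (my understanding: they only move identifiers between the metric name and the labels, so the total number of active series the Cortex plant has to hold should be identical in all three):

```python
servers, metrics, aliases, owners = 9000, 40, 2, 500

# Per-metric cardinality vs. number of metric names per approach.
per_metric = {
    "approach_1": servers * aliases,        # 18000 series across 40 names
    "approach_2": 100 * metrics * aliases,  # 8000 worst case, 500 names
    "approach_3": metrics * aliases,        # 80 series across 9000 names
}

total_series = servers * metrics * aliases  # 720000 in every approach
```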

Approach 4

  • Raw metrics - <metric_name> (40 possible values), with a single server name label
  • Info series - carries owner_id, with labels for server name, location and aliases
  • Each user will need to join the two in order to view or alert on their metrics. Also, when there is a one-to-many relationship between a server name and owner_id, we have seen the joins sometimes fail.
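To illustrate the failure mode (a toy simulation in Python, not PromQL; the server and owner values are made up), a group_left-style join breaks down as soon as one server name matches more than one owner_id in the info series:

```python
# Toy model of joining raw metrics to an info series on the server label.
metric_samples = {"srv-1": 0.42, "srv-2": 0.87}  # value keyed by server name
info_series = [                                   # (server, owner_id) rows
    ("srv-1", "owner-a"),
    ("srv-2", "owner-b"),
    ("srv-2", "owner-c"),  # one server, two owners: the problem case
]

def join_on_server(samples, info):
    # Index the info series by server name.
    owners = {}
    for server, owner in info:
        owners.setdefault(server, []).append(owner)
    joined = {}
    for server, value in samples.items():
        matches = owners.get(server, [])
        if len(matches) != 1:
            # PromQL-style one-to-one matching rejects ambiguous matches.
            raise ValueError(f"ambiguous match for {server}: {matches}")
        joined[(server, matches[0])] = value
    return joined
```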

My questions:

  • Am I calculating cardinality correctly?
  • Which approach is better?
  • How do I determine the cardinality level that becomes a problem for a Cortex plant?
  • Is it generally better to have more metric names with lower per-metric cardinality, or fewer metric names with higher cardinality?
  • I understand that with approach 4 we will likely have lower storage and ingestion overhead, but wouldn't the frequent joins put load on the queriers?

Thanks for your attention and help in advance!
