
Prometheus Debugging when Integrated with Cortex and Grafana


This document lists a few types of errors and provides information about the related configuration to update or investigate further.

Prometheus Error Logs

Error: Skipping resharding, last successful send was beyond threshold

Description: Prometheus, deployed on the Kafka cluster, was not able to send data to the Cortex instance and was logging the errors below:

Kafka Cluster Prometheus Logs:

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795732 minSendTimestamp=1671795741

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 
 minSendTimestamp=1671795771

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context deadline exceeded"

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 minSendTimestamp=1671795781

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context canceled"

component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="non-recoverable error" count=6 exemplarCount=0 err="context canceled"

component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="non-recoverable error" count=5 exemplarCount=0 err="context canceled"

Verification: The Nginx-ingress dashboard shows gaps, which indicate that no data was arriving at the Cortex ingress.
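The same signal can also be captured outside the dashboard. Below is a minimal sketch of a recording rule over the ingress-controller request metric; the group name, record name, and label selectors are illustrative assumptions taken from the dashboard queries visible (URL-encoded) in the ingress logs further below. A sustained drop for the Cortex push ingress mirrors the "gaps" in the dashboard:

  groups:
  - name: cortex-push-visibility            # hypothetical group name
    rules:
    # Request rate per ingress/status as seen by ingress-nginx;
    # a drop to zero for the Cortex ingress means pushes are not getting through.
    - record: ingress:nginx_requests:rate2m
      expr: |
        sum by (ingress, status) (
          rate(nginx_ingress_controller_requests{controller_class=~"k8s.io/ingress-nginx",controller_namespace=~"ingress-nginx"}[2m])
        )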

Root causes:

  1. Prometheus default remote_write config
  2. Cortex default distributor and ingester ingestion sharding limits

Prometheus default remote_write config

remote_write:
- url: "https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push"
  # [ default config ]
  # remote_timeout: 30s
  # queue_config:
  #   capacity: 2500
  #   max_shards: 200
  #   min_shards: 1
  #   max_samples_per_send: 500
  #   batch_send_deadline: 5s

With the above default configuration, Prometheus can send 5000 samples per shard within 10 seconds [refer to the calculations here: Remote write stop sending samples].
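If the Prometheus side also needs headroom, the queue_config can be tuned so each shard batches more samples per request. The snippet below is only a sketch; the values are assumptions and should be sized against the actual samples/second being produced, not copied as-is:

  remote_write:
  - url: "https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push"
    remote_timeout: 30s
    queue_config:
      capacity: 10000            # samples buffered per shard before reading from the WAL blocks (assumed value)
      max_shards: 50             # cap shard fan-out so batches stay large (assumed value)
      min_shards: 1
      max_samples_per_send: 2000 # bigger batches mean fewer POSTs to /api/v1/push (assumed value)
      batch_send_deadline: 5s    # still flush at least every 5s even if the batch is not full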

Cortex shard limits config: By default, no limit is set for the ingestion rate. Below are the default values:

  limits: [defaults]
    # Per-user ingestion rate limit in samples per second.
    ingestion_rate: 25000
    # Per-user allowed ingestion burst size (in number of samples).
    ingestion_burst_size: 50000
    # Ingestion tenant shard size set to 0 disables shuffle sharding.
    ingestion_tenant_shard_size: 0
    # Maximum number of split queries will be scheduled in parallel by the frontend.
    max_query_parallelism: 14
  distributor:
    instance_limits:
      max_ingestion_rate: 0  # unlimited ingestion
  ingester:
    instance_limits:
      max_ingestion_rate: 0  # unlimited ingestion

Since the shard configuration is not set, Prometheus intermittently fails to send data.

Solution: With the configuration below, Prometheus is able to send data most of the time; this still needs to be monitored. The changes below were applied on the ENG environment and are working fine.

  limits:
    ingestion_rate: 100000
    ingestion_burst_size: 100500
    ingestion_tenant_shard_size: 2500
    # Maximum number of split queries will be scheduled in parallel by the frontend.
    max_query_parallelism: 50
  distributor:
    sharding_strategy: "shuffle-sharding"
    shard_by_all_labels: true
    instance_limits:
      max_ingestion_rate: 100000
  ingester:
    instance_limits:
      max_ingestion_rate: 100000
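Because this still needs to be monitored, an alert on Prometheus' own remote-write metrics can flag the problem before dashboard gaps reappear. The sketch below follows the upstream Prometheus mixin's PrometheusRemoteWriteBehind alert; the 120-second threshold, the `for` duration, and the labels are assumptions to adjust per environment:

  groups:
  - name: remote-write-health              # hypothetical group name
    rules:
    - alert: PrometheusRemoteWriteBehind
      expr: |
        (
          max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
        - ignoring(remote_name, url) group_right
          max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
        ) > 120                            # remote end is more than 2 minutes behind the WAL
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Remote write to {{ $labels.url }} is falling behind"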

Kafka Cluster Prometheus Logs [after the change was implemented]:

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=62 to=16

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=16 to=11

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=11 to=7

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=7 to=4

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=4 to=2

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=2 to=1

Nginx-ingress controller logs:

10.240.3.54 - - [23/Dec/2022:11:58:56 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.202" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796680&query=sum%28avg_over_time%28nginx_ingress_controller_nginx_process_connections%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cstate%3D%22active%22%7D%5B2m%5D%29%29&start=1671785880&step=120 HTTP/1.1" 460 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query?query=nginx_ingress_controller_config_last_reload_successful%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2C+controller_class%3D~%22k8s.io%2Fingress-nginx%22%7D&time=1671796680 HTTP/1.1" 254 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796725&query=round%28sum%28irate%28nginx_ingress_controller_requests%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cingress%3D~%22.%2A%22%7D%5B2m%5D%29%29+by+%28ingress%29%2C+0.001%29&start=1671785925&step=15 HTTP/1.1" 2865 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796720&query=avg%28nginx_ingress_controller_nginx_process_resident_memory_bytes%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%7D%29+&start=1671785920&step=20 HTTP/1.1" 2238 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098

Outcome: With this configuration, the Cortex Nginx issue has been resolved.

Reference:

  1. Prometheus Remote Write tuning
  2. Prometheus remote write Config
  3. Troubleshoot remote write issues in Prometheus
