
Prometheus Debugging when Integrated with Cortex and Grafana


This document lists a few types of errors and provides information about the related configuration to update or investigate further.

Prometheus Error Logs

Error: Skipping resharding, last successful send was beyond threshold

Description: Prometheus, deployed on the Kafka cluster, was not able to send data to the Cortex instance and was logging the errors below:

Kafka Cluster Prometheus Logs:

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795732 minSendTimestamp=1671795741

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 
 minSendTimestamp=1671795771

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context deadline exceeded"

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 minSendTimestamp=1671795781

component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context canceled"

component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="non-recoverable error" count=6 exemplarCount=0 err="context canceled"

component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="non-recoverable error" count=5 exemplarCount=0 err="context canceled"

Verification: The Nginx-ingress dashboard shows gaps, which indicate that no data was arriving at the Cortex ingress.
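The same signal can also be captured outside the dashboard. Below is a minimal sketch of a recording rule over the ingress-controller request metric; the group name, record name, and label selectors are illustrative assumptions taken from the dashboard queries visible (URL-encoded) in the ingress logs further below. A sustained drop for the Cortex push ingress mirrors the "gaps" in the dashboard:

  groups:
  - name: cortex-push-visibility            # hypothetical group name
    rules:
    # Request rate per ingress/status as seen by ingress-nginx;
    # a drop to zero for the Cortex ingress means pushes are not getting through.
    - record: ingress:nginx_requests:rate2m
      expr: |
        sum by (ingress, status) (
          rate(nginx_ingress_controller_requests{controller_class=~"k8s.io/ingress-nginx",controller_namespace=~"ingress-nginx"}[2m])
        )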

Root causes:

  1. Prometheus default remote_write config
  2. Cortex default distributor and ingester ingestion sharding limits

Prometheus default remote_write config

remote_write:
- url: "https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push"
  # [ default config ]
  # remote_timeout: 30s
  # queue_config:
  #   capacity: 2500
  #   max_shards: 200
  #   min_shards: 1
  #   max_samples_per_send: 500
  #   batch_send_deadline: 5s

With the above default configuration, Prometheus can send 5000 samples per shard within 10 seconds [refer to the calculations here: Remote write stop sending samples].
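If the Prometheus side also needs headroom, the queue_config can be tuned so each shard batches more samples per request. The snippet below is only a sketch; the values are assumptions and should be sized against the actual samples/second being produced, not copied as-is:

  remote_write:
  - url: "https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push"
    remote_timeout: 30s
    queue_config:
      capacity: 10000            # samples buffered per shard before reading from the WAL blocks (assumed value)
      max_shards: 50             # cap shard fan-out so batches stay large (assumed value)
      min_shards: 1
      max_samples_per_send: 2000 # bigger batches mean fewer POSTs to /api/v1/push (assumed value)
      batch_send_deadline: 5s    # still flush at least every 5s even if the batch is not full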

Cortex shard limits config: By default, no limit is set for the ingestion rate. Below are the default values:

  limits: [defaults]
    # Per-user ingestion rate limit in samples per second.
    ingestion_rate: 25000
    # Per-user allowed ingestion burst size (in number of samples).
    ingestion_burst_size: 50000
    # Ingestion tenant shard size set to 0 disables shuffle sharding.
    ingestion_tenant_shard_size: 0
    # Maximum number of split queries will be scheduled in parallel by the frontend.
    max_query_parallelism: 14
  distributor:
    instance_limits:
      max_ingestion_rate: 0  # unlimited ingestion
  ingester:
    instance_limits:
      max_ingestion_rate: 0  # unlimited ingestion

Since the shard configuration is not set, Prometheus intermittently fails to send data.

Solution: With the configuration below, Prometheus is able to send data most of the time; this still needs to be monitored. The changes below were applied on the ENG environment and are working fine.

  limits:
    ingestion_rate: 100000
    ingestion_burst_size: 100500
    ingestion_tenant_shard_size: 2500
    # Maximum number of split queries will be scheduled in parallel by the frontend.
    max_query_parallelism: 50
  distributor:
    sharding_strategy: "shuffle-sharding"
    shard_by_all_labels: true
    instance_limits:
      max_ingestion_rate: 100000
  ingester:
    instance_limits:
      max_ingestion_rate: 100000
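Because this still needs to be monitored, an alert on Prometheus' own remote-write metrics can flag the problem before dashboard gaps reappear. The sketch below follows the upstream Prometheus mixin's PrometheusRemoteWriteBehind alert; the 120-second threshold, the `for` duration, and the labels are assumptions to adjust per environment:

  groups:
  - name: remote-write-health              # hypothetical group name
    rules:
    - alert: PrometheusRemoteWriteBehind
      expr: |
        (
          max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
        - ignoring(remote_name, url) group_right
          max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
        ) > 120                            # remote end is more than 2 minutes behind the WAL
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Remote write to {{ $labels.url }} is falling behind"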

Kafka Cluster Prometheus Logs [after the change was implemented]:

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=62 to=16

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=16 to=11

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=11 to=7

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=7 to=4

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=4 to=2

component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push 
msg="Remote storage resharding" from=2 to=1

Nginx-ingress controller logs:

10.240.3.54 - - [23/Dec/2022:11:58:56 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.202" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796680&query=sum%28avg_over_time%28nginx_ingress_controller_nginx_process_connections%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cstate%3D%22active%22%7D%5B2m%5D%29%29&start=1671785880&step=120 HTTP/1.1" 460 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query?query=nginx_ingress_controller_config_last_reload_successful%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2C+controller_class%3D~%22k8s.io%2Fingress-nginx%22%7D&time=1671796680 HTTP/1.1" 254 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796725&query=round%28sum%28irate%28nginx_ingress_controller_requests%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cingress%3D~%22.%2A%22%7D%5B2m%5D%29%29+by+%28ingress%29%2C+0.001%29&start=1671785925&step=15 HTTP/1.1" 2865 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000]  200 "GET /prometheus/api/v1/query_range?end=1671796720&query=avg%28nginx_ingress_controller_nginx_process_resident_memory_bytes%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%7D%29+&start=1671785920&step=20 HTTP/1.1" 2238 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000]  200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098

Outcome: With this configuration, the Cortex Nginx issue has been resolved.

Reference:

  1. Prometheus Remote Write tuning
  2. Prometheus remote write Config
  3. Troubleshoot remote write issues in Prometheus
