Prometheus Debugging when Integrated with Cortex and Grafana
This document lists a few common error types and provides information about the related configuration to update or investigate further.
Error: Skipping resharding, last successful send was beyond threshold
Description: Prometheus, deployed on the Kafka cluster, was unable to send data to the Cortex instance, logging the errors below:
Kafka Cluster Prometheus Logs:
component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795732 minSendTimestamp=1671795741
component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 minSendTimestamp=1671795771
component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context deadline exceeded"
component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Skipping resharding, last successful send was beyond threshold" lastSendTimestamp=1671795761 minSendTimestamp=1671795781
component=remote level=warn url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Failed to send batch, retrying" err="Post \"https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push\": context canceled"
component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="non-recoverable error" count=6 exemplarCount=0 err="context canceled"
component=remote level=error url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="non-recoverable error" count=5 exemplarCount=0 err="context canceled"
Verification: The Nginx-ingress dashboard shows "gaps", which represent periods in which no data reached the Cortex ingress.
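Beyond the dashboard, the lag can also be caught from Prometheus' own remote-write metrics. Below is a minimal alerting-rule sketch (not part of the original setup) built on the standard prometheus_remote_storage_highest_timestamp_in_seconds and prometheus_remote_storage_queue_highest_sent_timestamp_seconds metrics; the 120s threshold and rule names are illustrative:

# remote-write-lag.rules.yml (hypothetical rule file)
groups:
  - name: remote-write-health
    rules:
      - alert: RemoteWriteFallingBehind
        # How far the newest WAL sample is ahead of the newest sample
        # successfully sent to Cortex; sustained lag over 120s matches
        # the "Skipping resharding" symptom above.
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
          - ignoring(remote_name, url) group_right
            max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 120
        for: 15m
        labels:
          severity: warning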
Rationales:
- Prometheus default remote_write config
- Cortex default distributor and ingester ingestion/sharding limits
Prometheus default remote_write config:
remote_write:
  - url: "https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push"
    # [ default config ]
    # remote_timeout: 30s
    # queue_config:
    #   capacity: 2500
    #   max_shards: 200
    #   min_shards: 1
    #   max_samples_per_send: 500
    #   batch_send_deadline: 5s
With the above default configuration, Prometheus can send roughly 5000 samples per shard within 10 seconds [refer to the calculation here: Remote write stop sending samples].
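As a rough sketch of that calculation (an approximation that assumes each in-flight request takes about one second end to end):

# Back-of-the-envelope throughput per shard under the defaults above:
#   max_samples_per_send = 500 samples per request
#   ~1 request per second per shard (assumed round-trip time)
#   => ~500 samples/s per shard, i.e. ~5000 samples per shard over 10s
# Raising max_shards or max_samples_per_send raises this ceiling, at the
# cost of more concurrent load on the Cortex distributors.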
Cortex shard limits config: By default, no ingestion-rate limit is set; the default values are below:
limits:  # defaults
  # Per-user ingestion rate limit in samples per second.
  ingestion_rate: 25000
  # Per-user allowed ingestion burst size (in number of samples).
  ingestion_burst_size: 50000
  # Ingestion tenant shard size of 0 disables shuffle sharding.
  ingestion_tenant_shard_size: 0
  # Maximum number of split queries scheduled in parallel by the frontend.
  max_query_parallelism: 14
distributor:
  instance_limits:
    max_ingestion_rate: 0  # unlimited ingestion
ingester:
  instance_limits:
    max_ingestion_rate: 0  # unlimited ingestion
Because no shard configuration is set on the Cortex side, Prometheus intermittently fails to send data.
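To confirm on the Cortex side that samples are actually being throttled, the distributor's cortex_discarded_samples_total counter can be checked. A minimal rule sketch, assuming that metric is scraped and that rate-limited drops carry reason="rate_limited" (as in upstream Cortex):

groups:
  - name: cortex-ingest-health
    rules:
      - alert: CortexTenantRateLimited
        # Fires while any tenant's samples are dropped for rate limiting.
        expr: sum by (user) (rate(cortex_discarded_samples_total{reason="rate_limited"}[5m])) > 0
        for: 10m
        labels:
          severity: warning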
Solution: With the configuration below, Prometheus is able to send data most of the time; this still needs to be monitored. The changes were applied on the ENG environment and are working fine.
limits:
  ingestion_rate: 100000
  ingestion_burst_size: 100500
  ingestion_tenant_shard_size: 2500
  # Maximum number of split queries scheduled in parallel by the frontend.
  max_query_parallelism: 50
distributor:
  sharding_strategy: "shuffle-sharding"
  shard_by_all_labels: true
  instance_limits:
    max_ingestion_rate: 100000
ingester:
  instance_limits:
    max_ingestion_rate: 100000
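If raising the limits globally is too broad, Cortex also supports per-tenant overrides through its runtime configuration file (loaded with -runtime-config.file). A hedged sketch follows; the tenant ID AT43098 is taken from the access logs below and used purely as an illustration:

# runtime.yaml (illustrative values, not applied in this incident)
overrides:
  AT43098:
    ingestion_rate: 100000
    ingestion_burst_size: 100500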
Kafka Cluster Prometheus Logs [after the change was implemented]:
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=62 to=16
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=16 to=11
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=11 to=7
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=7 to=4
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=4 to=2
component=remote level=info remote_name=ac5e42 url=https://eng.cortex.copdev.azpriv-cloud.ubs.net/api/v1/push msg="Remote storage resharding" from=2 to=1
Nginx-ingress controller logs:
10.240.3.54 - - [23/Dec/2022:11:58:56 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.202" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000] 200 "GET /prometheus/api/v1/query_range?end=1671796680&query=sum%28avg_over_time%28nginx_ingress_controller_nginx_process_connections%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cstate%3D%22active%22%7D%5B2m%5D%29%29&start=1671785880&step=120 HTTP/1.1" 460 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000] 200 "GET /prometheus/api/v1/query?query=nginx_ingress_controller_config_last_reload_successful%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2C+controller_class%3D~%22k8s.io%2Fingress-nginx%22%7D&time=1671796680 HTTP/1.1" 254 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000] 200 "GET /prometheus/api/v1/query_range?end=1671796725&query=round%28sum%28irate%28nginx_ingress_controller_requests%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%2Cingress%3D~%22.%2A%22%7D%5B2m%5D%29%29+by+%28ingress%29%2C+0.001%29&start=1671785925&step=15 HTTP/1.1" 2865 "-" "Grafana/9.3.1" "-" AT48725
10.240.1.14 - - [23/Dec/2022:11:58:57 +0000] 200 "GET /prometheus/api/v1/query_range?end=1671796720&query=avg%28nginx_ingress_controller_nginx_process_resident_memory_bytes%7Bcontroller_pod%3D~%22.%2A%22%2Ccontroller_class%3D~%22k8s.io%2Fingress-nginx%22%2Ccontroller_namespace%3D~%22ingress-nginx%22%7D%29+&start=1671785920&step=20 HTTP/1.1" 2238 "-" "Grafana/9.3.1" "-" AT48725
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
10.240.3.54 - - [23/Dec/2022:11:58:57 +0000] 200 "POST /api/v1/push HTTP/1.1" 0 "-" "Prometheus/2.38.0" "10.190.68.200" AT43098
Outcome: With this configuration, the Cortex ingestion issue seen at the Nginx ingress has been resolved.