Cortex: Configuration & Debugging
This document lists a few types of errors and provides information about the related configuration or component to update or investigate further.
Error Type 1
level=error caller=push.go:51 org_id=fake
msg="push error"
err="rpc error: code = Code(400)
desc = series has too many labels (actual: 35, limit: 30)
series: 'container_blkio_device_usage_total{agentpool=\"monnpl1\", beta_kubernetes_io_arch=\"amd64\", beta_kubernetes_io_instance_type=\"Standard_D8s_v3\", beta_kubernetes_io_os=\"linux\", device=\"/dev/sda\", failure_domain_beta_kubernetes_io_region=\"eastus2\", failure_domain_beta_kubernetes_io_zone=\"eastus2-1\", id=\"/kubepods/podcd98135d-5ab0-4acf-b626-ec47209f5aa3\", instance=\"aks-monnpl1-89050294-vmss00000c\", job=\"kubernetes-nodes-cadvisor\", kubernetes_azure_com_agentpool=\"monnpl1\", kubernetes_azure_com_cluster=\"MC_AT43098_DEV_EUS2_AKS_kd2cd843098mon01_eastus2\", kubernetes_azure_com_kubelet_identity_client_id=\"182aca76-7667-4cba-a711-ee5027fcd2b1\", kubernetes_azure_com_mode=\"user\", kubernetes_azure_com_node_image_version=\"AKSUbuntu-1804gen2containerd-2022.07.11\", kubernetes_azure_com_os_sku=\"Ubuntu\", kubernetes_azure_com_role=\"agent\", kubernetes_azure_com_storageprofile=\"managed\", kubernetes_azure_com_storagetier=\"Premium_LRS\", kubernetes_io_arch=\"amd64\", kubernetes_io_hostname=\"aks-monnpl1-89050294-vmss00000c\", kubernetes_io_os=\"linux\", kubernetes_io_role=\"agent\", major=\"8\", minor=\"0\", namespace=\"ubs-system\", node_kubernetes_io_instance_type=\"Standard_D8s_v3\", operation=\"Total\", pod=\"ubstimesync-k4g7r\", storageprofile=\"managed\", storagetier=\"Premium_LRS\", topology_disk_csi_azure_com_zone=\"eastus2-1\", topology_kubernetes_io_region=\"eastus2\", topology_kubernetes_io_zone=\"eastus2-1\"}'"
caller=push.go:51 org_id=fake
msg="push error" err="rpc error: code = Code(429)
desc = ingestion rate limit (3.3333333333333335) exceeded while adding 1 samples and 0 metadata"
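The 400 error above is a per-series validation failure: the series carries 35 label names while the tenant limit is 30. In Cortex this limit lives in the limits configuration; a minimal sketch of raising it, assuming the extra labels are genuinely needed (the value 40 is only illustrative):

limits:
  # Maximum number of label names per series (the default of 30 matches the limit seen in the log above).
  # CLI flag: -validation.max-label-names-per-series
  max_label_names_per_series: 40

The 429 ingestion-rate error is addressed by the settings listed under "Configuration Update Required" below.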
Error Type 2
level=warn caller=logging.go:72 traceID=4ab91bb8b2f00197
msg="POST /api/prom/push (500) 108µs
Response: \"distributor's samples push rate limit reached\\n\" ws: false; Connection: close;
Content-Encoding: snappy; Content-Length: 2354; Content-Type: application/x-protobuf; User-Agent: Prometheus/2.25.0; X-Prometheus-Remote-Write-Version: 0.1.0; "
Error Type 3
level=error caller=dedupe.go:112 component=remote remote_name=955e57 url=http://10.191.124.201/api/prom/push
msg="non-recoverable error" count=136
err="server returned HTTP status 429 Too Many Requests: ingestion rate limit (8333.333333333334) exceeded while adding 136 samples and 0 metadata"
level=warn caller=dedupe.go:112 component=remote remote_name=955e57 url=http://10.191.124.201/api/prom/push
msg="Failed to send batch, retrying" err="server returned HTTP status 500 Internal Server Error: DoBatch: InstancesCount <= 0"
Configuration Update Required
- config.limits.ingestion_rate
- distributor.instance_limits.max_ingestion_rate
- ingester.instance_limits.max_ingestion_rate
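The options above sit in the Cortex YAML configuration as sketched below; the values are illustrative, not the ones applied in this environment:

limits:
  # Per-tenant ingestion rate limit in samples per second.
  # CLI flag: -distributor.ingestion-rate-limit
  ingestion_rate: 100000
distributor:
  instance_limits:
    # Per-distributor-instance cap; 0 means unlimited.
    # CLI flag: -distributor.instance-limits.max-ingestion-rate
    max_ingestion_rate: 200000
ingester:
  instance_limits:
    # Per-ingester-instance cap; 0 means unlimited.
    # CLI flag: -ingester.instance-limits.max-ingestion-rate
    max_ingestion_rate: 200000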
Issue: Cortex nginx reports connection timeouts, and Prometheus deployed on another AKS cluster is unable to keep the connection open.
Error logs:
connect() failed (110: Operation timed out) while connecting to upstream, client: 10.240.4.1,
server: eng.cortex.copdev.azpriv-cloud.ubs.net, request: "POST /api/v1/push HTTP/2.0", upstream: "http://10.240.2.58:80/api/v1/push", host: "eng.cortex.copdev.azpriv-cloud.ubs.net"
Solution: Implemented the nginx-specific timeout annotations shown below on the Ingress:
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
Result: The timeout has been mitigated with the above annotations. We now see the warnings below.
2022/12/14 06:08:55 [warn] 34#34: *602004 a client request body is buffered to a temporary file /tmp/client-body/0000027603, client: 10.240.4.1,
server: eng.cortex.copdev.azpriv-cloud.ubs.net,
request: "POST /api/v1/push HTTP/2.0", host: "eng.cortex.copdev.azpriv-cloud.ubs.net"
2022/12/14 06:09:23 [warn] 34#34: *602004 a client request body is buffered to a temporary file /tmp/client-body/0000027604, client: 10.240.4.1, server: eng.cortex.copdev.azpriv-cloud.ubs.net, request: "POST /api/v1/push HTTP/2.0", host: "eng.cortex.copdev.azpriv-cloud.ubs.net"
What's the meaning of the warning? It means that the uploaded request body was larger than the in-memory buffer reserved for uploads, so nginx buffered it to a temporary file on disk.
Suggestions from the web: If you can afford to have 1 GB of RAM always reserved for the occasional large upload, that's fine. Buffering the upload in RAM rather than in a temporary file on disk is a performance optimization, though with such large uploads a couple of extra seconds probably doesn't matter much. If most of the uploads are small, it's probably a waste. In the end, only we can really decide what the appropriate buffer size is.
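If the temporary-file buffering does become a concern, the in-memory buffer can be enlarged. With the ingress-nginx controller this is typically done through an annotation on the same Ingress; a sketch, where the 1m value is purely illustrative:

nginx.ingress.kubernetes.io/client-body-buffer-size: "1m"

Whether enlarging the buffer is worthwhile depends on how large the remote-write request bodies actually are; for occasional large bodies the extra disk round trip usually costs little, so the warning can often simply be ignored.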
Description: The querier needs to have an almost up-to-date view over the entire storage bucket in order to find the right blocks to look up at query time. The querier can keep the bucket view updated in two different ways:
- Periodically scanning the bucket (default)
- Periodically downloading the bucket index
When the bucket index is enabled
When the bucket index is enabled, queriers lazily download the bucket index upon the first query received for a given tenant, cache it in memory and periodically keep it updated. The bucket index contains the list of blocks and block deletion marks of a tenant, which is later used during query execution to find the set of blocks that need to be queried for a given query.
Given that the bucket index removes the need to scan the bucket, it brings a few benefits:
- The querier is expected to be ready shortly after startup.
- Lower volume of API calls to object storage.
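A sketch of how the bucket index is enabled in the blocks-storage configuration, assuming the standard layout; the staleness window shown is illustrative and relates to the max-staleness error discussed below:

blocks_storage:
  bucket_store:
    bucket_index:
      # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
      enabled: true
      # How old the cached bucket index may get before queries fail with a staleness error.
      # CLI flag: -blocks-storage.bucket-store.bucket-index.max-stale-period
      max_stale_period: 1h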
Reference Document: Cortex Querier Functionality Deep Dive
When did the issue occur: It occurred when we faced a CoreDNS internal connection issue. Because of this, the query-frontend was not able to get the latest bucket index from the querier and returned an error indicating that it was still looking to the querier for the latest data for the current query.
Pod Error Logs
Querier Frontend Pod
level=error ts=2023-01-12T07:12:26.259317318Z caller=retry.go:79 org_id=AT48725
msg="error processing request" try=9 err="rpc error: code = Code(500)
desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\"}"
level=warn ts=2023-01-12T07:12:26.26040895Z caller=logging.go:86 traceID=68043a8673d055bc
msg="GET /prometheus/api/v1/query_range?end=1673507400&query=sum%28kube_pod_container_resource_requests%7Bunit%3D%22core%22%7D%29+%2F+sum%28machine_cpu_cores%29&start=1672902600&step=1200 (500) 76.31334ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\
"expanding series: bucket index is too old and the last time it was updated exceeds the allowed max staleness\\\"}\" ws: false; Accept-Encoding: gzip; Connection: close; User-Agent: Grafana/9.3.1; X-Scope-Orgid: AT48725; "
Config Fix
Update the configuration as below:
[block storage config]
# True to enable TSDB WAL compression.
# CLI flag: -blocks-storage.tsdb.wal-compression-enabled
[wal_compression_enabled: <boolean> | default = false]
# True to flush blocks to storage on shutdown. If false, incomplete blocks
# will be reused after restart.
# CLI flag: -blocks-storage.tsdb.flush-blocks-on-shutdown
[flush_blocks_on_shutdown: <boolean> | default = false]
# True to enable snapshotting of in-memory TSDB data on disk when shutting
# down.
# CLI flag: -blocks-storage.tsdb.memory-snapshot-on-shutdown
[memory_snapshot_on_shutdown: <boolean> | default = false]
[Querier config]
# Maximum lookback beyond which queries are not sent to ingester. 0 means all
# queries are sent to ingester.
# CLI flag: -querier.query-ingesters-within
[query_ingesters_within: <duration> | default = 0s]
# Time since the last sample after which a time series is considered stale and
# ignored by expression evaluations.
# CLI flag: -querier.lookback-delta
[lookback_delta: <duration> | default = 5m]
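A sketch of how the options above could be set in the Cortex YAML configuration; the values are illustrative rather than the exact ones applied here:

blocks_storage:
  tsdb:
    wal_compression_enabled: true
    flush_blocks_on_shutdown: true
    memory_snapshot_on_shutdown: true
querier:
  # Illustrative value: only send queries to ingesters for data newer than this lookback.
  query_ingesters_within: 13h
  lookback_delta: 5m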
A compactor fix was also added to support autoscaling:
[Compactor]
# Time before a block marked for deletion is deleted from bucket. If not 0,
# blocks will be marked for deletion and compactor component will permanently
# delete blocks marked for deletion from the bucket. If 0, blocks will be
# deleted straight away. Note that deleting blocks immediately can cause query
# failures.
# CLI flag: -compactor.deletion-delay
[deletion_delay: <duration> | default = 12h]
# Shard tenants across multiple compactor instances. Sharding is required if you
# run multiple compactor instances, in order to coordinate compactions and avoid
# race conditions leading to the same tenant blocks simultaneously compacted by
# different instances.
# CLI flag: -compactor.sharding-enabled
[sharding_enabled: <boolean> | default = false]
# The sharding strategy to use. Supported values are: default, shuffle-sharding.
# CLI flag: -compactor.sharding-strategy
[sharding_strategy: <string> | default = "default"]
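A sketch of the compactor section with sharding turned on, which is required before the compactor can be autoscaled to multiple replicas; the values are illustrative:

compactor:
  deletion_delay: 12h
  # Required when running more than one compactor instance (e.g. with autoscaling).
  sharding_enabled: true
  sharding_strategy: default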
Logs After Config Update
Store-Gateway Logs
level=info ts=2023-01-30T13:34:59.303267316Z caller=gateway.go:322
msg="synchronizing TSDB blocks for all users" reason=ring-change
level=info ts=2023-01-30T13:34:59.305239349Z caller=bucket.go:544
org_id=AT48725 msg="dropped outdated block" block=01GP4H95D1KCAE2513VVTH5HAZ
level=info ts=2023-01-30T13:34:59.305505354Z caller=bucket.go:544
org_id=AT48725 msg="dropped outdated block" block=01GP4H949NZMXMK0KHWKZ1CYGE
level=info ts=2023-01-30T13:34:59.305778858Z caller=bucket.go:544
org_id=AT48725 msg="dropped outdated block" block=01GP4H96HR6TVNJ4PXHSPQWQZ4
level=info ts=2023-01-30T13:34:59.307414586Z caller=bucket.go:544
org_id=at48725-cop-eng msg="dropped outdated block" block=01GR1D85BC58FPETXZNDKS1291
level=info ts=2023-01-30T13:34:59.30768059Z caller=bucket.go:544
org_id=at48725-cop-eng msg="dropped outdated block" block=01GR1DXBEF7EW79FG7ZAWJ6VH3
level=info ts=2023-01-30T13:34:59.308610306Z caller=bucket_stores.go:537 msg="closed bucket store for user" user=AT48725
level=info ts=2023-01-30T13:34:59.308680907Z caller=bucket_stores.go:545
msg="deleted user sync directory" dir=/data/tsdb-sync/AT48725
level=info ts=2023-01-30T13:34:59.308695007Z caller=gateway.go:328
msg="successfully synchronized TSDB blocks for all users" reason=ring-change
Compactor Logs
level=info caller=compact.go:1319 component=compactor org_id=AT43098 msg="start of compactions"
level=info caller=compact.go:1355 component=compactor org_id=AT43098 msg="compaction iterations done"
level=info caller=compactor.go:701 component=compactor msg="successfully compacted user blocks" user=AT43098
level=info caller=compactor.go:724 component=compactor msg="deleted directory for user not owned by this shard" dir=data/compactor-meta-at48725-cop-eng
level=info caller=compactor.go:724 component=compactor msg="deleted directory for user not owned by this shard" dir=data/compactor-meta-tenant-2
level=info caller=compactor.go:724 component=compactor msg="deleted directory for user not owned by this shard" dir=data/compactor-meta-wma-cop-eng
Ingester Logs
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=tenant-2
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=tenant-2 dir=/data/tsdb/tenant-2
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=wma-cop-eng
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=wma-cop-eng dir=/data/tsdb/wma-cop-eng
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=tenant-1
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=tenant-1 dir=/data/tsdb/tenant-1
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=AT48725
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=AT48725 dir=/data/tsdb/AT48725
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=AT43098
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=AT43098 dir=/data/tsdb/AT43098
level=info caller=ingester.go:2311 msg="closed idle TSDB" user=fake
level=info caller=ingester.go:2346 msg="deleted local TSDB, due to being idle" user=fake dir=/data/tsdb/fake