CoreDNS Issue Debugging
Description: The pointers below briefly describe a rare scenario in which DNS resolution breaks across the AKS cluster.
- Backend Status: COP-specific Kubernetes deployments are in the undesired states listed below:
- Cortex: Internal components such as the Ingester, Compactor, and Store-Gateway pods went into CrashLoopBackOff.
- Grafana: Pods are running, but the logs show errors while resolving queries.
- Prometheus: Same as Grafana.
- Frontend Status: The Grafana dev/eng instance is not loading the required dashboards. After some time it returns a 502 gateway error response.
The initial debugging steps taken are described below:
Check Grafana pod logs
The Grafana pod logs show that it is unable to resolve DNS for the cortex-nginx service. cortex-nginx is the endpoint Grafana queries for data when running the queries written in each panel.
Grafana Data source container logs: Failed to establish a new connection
$ kubectl logs grafana-75595448f6-6cgnx -c grafana-sc-datasources
"msg": "Retrying (Retry(total=4, connect=9, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef279a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}
{"time": "2023-01-05T22:34:15.604344+00:00", "level": "WARNING",
"msg": "Retrying (Retry(total=3, connect=8, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef26c80>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}
{"time": "2023-01-05T22:34:20.008211+00:00", "level": "WARNING",
"msg": "Retrying (Retry(total=2, connect=7, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef26a70>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}
Grafana Pod logs: i/o timeout
$ kubectl logs grafana-75595448f6-6cgnx -c grafana
logger=ngalert.multiorg.alertmanager t=2023-01-06T06:36:45.431409844Z level=error
msg="error while synchronizing Alertmanager orgs" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"
logger=ngalert.sender.router t=2023-01-06T06:36:45.431524246Z level=error
msg="Unable to sync admin configuration" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"
logger=provisioning.dashboard type=file name=at48725 t=2023-01-06T06:36:45.431949654Z level=error
msg="failed to search for dashboards" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"
Cortex Pods Status
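The pod states below can be captured with a standard listing; the cortex namespace is an assumption based on the service FQDN cortex-nginx.cortex.svc.cluster.local referenced later:
$ kubectl get pods -n cortex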
NAME READY STATUS RESTARTS AGE
cortex-compactor-0 0/1 CrashLoopBackOff 33 (6m37s ago) 3h59m
cortex-distributor-b5b979db4-c7hj4 0/1 Running 1 (54s ago) 14m
...
cortex-ingester-2 0/1 CrashLoopBackOff 33 (5m31s ago) 3h59m
...
cortex-store-gateway-0 0/1 CrashLoopBackOff 33 (3m28s ago) 3h59m
Cortex compactor logs
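Assuming the compactor runs in the same cortex namespace, its logs can be pulled directly from the StatefulSet pod:
$ kubectl logs cortex-compactor-0 -n cortex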
level=error ts=2023-01-12T13:09:54.11880449Z caller=cortex.go:434 msg="module failed" module=compactor err="invalid service state: Failed,
expected: Running,
failure: failed to create bucket client: Azure API return unexpected error: *azblob.InternalError:
===== INTERNAL ERROR =====
Get \"https://strgcusdevcopcopd1.blob.core.windows.net/cortex?restype=container\":disappointed: dial tcp: lookup strgcusdevcopcopd1.blob.core.windows.net: i/o timeout"
Check cortex-nginx logs
cortex-nginx is a reverse proxy running in front of Cortex to serve GET/POST requests. It serves Grafana's queries by forwarding them to the Query-Frontend. Grafana must resolve the nginx service at cortex-nginx.cortex.svc.cluster.local to process a query.
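A quick way to check both the proxy and the exact name Grafana has to resolve (the deployment name cortex-nginx is an assumption; the FQDN comes from above):
$ kubectl logs deploy/cortex-nginx -n cortex
$ kubectl run dns-check -i --rm --restart=Never --image=busybox -- nslookup cortex-nginx.cortex.svc.cluster.local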
Cloud Manager pod [kube-system] logs
W1121 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W1210 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W1210 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Gateway_keeper controller logs
2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47860: EOF
2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47858: EOF
2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47876: EOF
Metrics server status and logs
metrics-server-7885cfdcc6-2pchb 2/2 Running 13 (2d11h ago) 2d12h
metrics-server-7885cfdcc6-mwdvk 1/2 CrashLoopBackOff 1037 (2m54s ago) 2d12h
#Logs
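Assuming the crashing replica is the one listed above, its container logs can be dumped with (kube-system is the usual namespace for metrics-server on AKS):
$ kubectl logs metrics-server-7885cfdcc6-mwdvk -n kube-system --all-containers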
panic: unable to load configmap based request-header-client-ca-file: Get "https://kdc9ae48725cope1k8s-920a0e7c.hcp.centralus.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp: i/o timeout
Microsoft Defender status and logs
microsoft-defender-publisher-ds-nlffd 1/1 Running 4 (2d12h ago) 2d12h
microsoft-defender-publisher-ds-zxpg7 0/1 CrashLoopBackOff 585 (54s ago) 2d12h
#Logs
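Assuming the Defender publisher DaemonSet runs in kube-system, as it normally does on AKS:
$ kubectl logs microsoft-defender-publisher-ds-zxpg7 -n kube-system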
....
{"azureResourceID":"/subscriptions/xxxxxx-xxxxxx-xxxxxxxx-xxxxxx/resourceGroups/RG-CUS-DEV-COP-ENG/providers/Microsoft.ContainerService/managedClusters/kdc9ae48725cope1","chartVersion":"Unknown","clusterDistribution":"AKS","componentName":"Publisher","componentVersion":"mcr.microsoft.com/azuredefender/stable/security-publisher:1.0.56","envTime":"2023-01-09T09:11:23Z","message":"error encountered during client initializationPost \"https://3fa27958-621a-494a-9c42-10d6c906dabf.oms.opinsights.azure.com/AgentService.svc/LinuxAgentTopologyRequest\": dial tcp: lookup 3fa27958-621a-494a-9c42-10d6c906dabf.oms.opinsights.azure.com on 10.242.0.10:53: read udp 10.240.5.2:59033-\u003e10.242.0.10:53: i/o timeout","nodeName":"aks-sysnpl-31615245-vmss00000q","region":"centralus","releaseTrain":"stable","traceLevel":"error","type":"Trace"}
/opt/microsoft/microsoft-defender-for-cloud/main.sh: line 22: 664 Aborted (core dumped) /opt/td-agent-bit/bin/td-agent-bit -c /opt/microsoft/microsoft-defender-for-cloud/td-agent-bit.conf -e /opt/microsoft/microsoft-defender-for-cloud/plugin_connector.so 2>
Perform DNS debugging
Deployed a dnsutils pod and used nslookup for services as shown below:
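The dnsutils pod can be deployed, for example, from the manifest published in the upstream Kubernetes DNS-debugging guide:
$ kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml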
$ kubectl exec -i -t dnsutils -- nslookup cortex/cortex-ingester
Unable to use a TTY - input is not a terminal or the right kind of file
;; connection timed out; no servers could be reached
command terminated with exit code 1
$ kubectl exec -i -t dnsutils -- nslookup 10.243.134.50
Unable to use a TTY - input is not a terminal or the right kind of file
;; connection timed out; no servers could be reached
command terminated with exit code 1
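Two further checks that help confirm the failure sits with CoreDNS itself rather than a single service (k8s-app=kube-dns is the standard label CoreDNS pods carry):
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50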
Where do these errors lead to?
These logs lead to the conclusion that AKS DNS resolution is not working as expected.
The following are possible remediations in this scenario:
- Restart the AKS cluster
- Restart the nodes
- If possible, upgrade the AKS cluster to the latest version
We chose to upgrade the AKS cluster to the latest version available in the region. The upgrade itself updates the node configuration and restarts the nodes, which restores correct CoreDNS functioning.
As a result, all the pods are now able to connect to each other, and the Grafana dashboards are fetching data from Cortex correctly.
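An upgrade along these lines can be driven with the Azure CLI; the resource group, cluster name, and version below are placeholders:
$ az aks get-upgrades --resource-group <resource-group> --name <cluster-name> --output table
$ az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version <target-version>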
- Need to prepare a mechanism that continuously checks for this error.
- The moment the issue is detected:
- New node pools should be created automatically.
- The workload should be drained and moved to the new node pools (a rough sketch follows below).
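A minimal sketch of such a mechanism, assuming the dnsutils probe pod from above stays deployed and that the resource group, cluster, and node pool names are placeholders; a production version would likely live in a controller or pipeline rather than a shell script:

#!/usr/bin/env bash
# Rough sketch: probe in-cluster DNS and, on failure, rotate the workload onto
# a fresh node pool. All names are placeholders.
set -euo pipefail

RG="<resource-group>"
CLUSTER="<cluster-name>"
OLD_POOL="<current-nodepool>"
NEW_POOL="np$(date +%m%d%H%M)"   # AKS pool names must stay short and alphanumeric

# 1. Probe DNS from inside the cluster (same check performed manually above).
if kubectl exec dnsutils -- nslookup kubernetes.default.svc.cluster.local >/dev/null 2>&1; then
  echo "DNS healthy, nothing to do."
  exit 0
fi
echo "DNS probe failed, rotating node pool ${OLD_POOL} -> ${NEW_POOL}"

# 2. Create a replacement node pool.
az aks nodepool add --resource-group "$RG" --cluster-name "$CLUSTER" \
  --name "$NEW_POOL" --node-count 3

# 3. Cordon and drain every node in the old pool so workloads reschedule.
for node in $(kubectl get nodes -l agentpool="$OLD_POOL" -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# 4. Once workloads are healthy again, the old pool can be removed:
# az aks nodepool delete --resource-group "$RG" --cluster-name "$CLUSTER" --name "$OLD_POOL"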