
CoreDNS Issue Debugging


Issue: Kubernetes CoreDNS is unable to resolve internal service names

Description: The pointers below briefly describe this rare scenario.

  • Backend Status: COP-specific Kubernetes deployments end up in the undesired states listed below (a quick status check is sketched after this list):
    • Cortex: Internal components such as the Ingester, Compactor, and Store-Gateway pods go into CrashLoopBackOff.
    • Grafana: Pods are running, but the logs show errors while resolving queries.
    • Prometheus: Same as Grafana.
  • Frontend Status: The Grafana dev/eng instance does not load the required dashboards. After some time it returns a 502 gateway error response.
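
To confirm the backend state, the pod status in the affected namespaces can be checked as below (a minimal sketch; the namespace names are assumptions and may differ per cluster):

# Namespace names are assumptions; adjust to the actual deployment.
$ kubectl get pods -n cortex
$ kubectl get pods -n grafana
$ kubectl get pods -n prometheus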

Debugging

The initial debugging steps taken are listed below.

Check Grafana pod logs

Grafana pod logs show that it is unable to resolve DNS for the cortex-nginx service. cortex-nginx is the endpoint where Grafana looks for data by running the queries written in each panel.

Grafana Data source container logs: Failed to establish a new connection

$ kubectl logs grafana-75595448f6-6cgnx -c grafana-sc-datasources

"msg": "Retrying (Retry(total=4, connect=9, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef279a0>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}
{"time": "2023-01-05T22:34:15.604344+00:00", "level": "WARNING", 
"msg": "Retrying (Retry(total=3, connect=8, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef26c80>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}
{"time": "2023-01-05T22:34:20.008211+00:00", "level": "WARNING", 
"msg": "Retrying (Retry(total=2, connect=7, read=5, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fcbfef26a70>: Failed to establish a new connection: [Errno 111] Connection refused')': /api/admin/provisioning/datasources/reload"}

Grafana Pod logs: i/o timeout

$ kubectl logs grafana-75595448f6-6cgnx -c grafana

logger=ngalert.multiorg.alertmanager t=2023-01-06T06:36:45.431409844Z level=error 
msg="error while synchronizing Alertmanager orgs" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"
logger=ngalert.sender.router t=2023-01-06T06:36:45.431524246Z level=error 
msg="Unable to sync admin configuration" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"
logger=provisioning.dashboard type=file name=at48725 t=2023-01-06T06:36:45.431949654Z level=error 
msg="failed to search for dashboards" error="dial tcp: lookup at48725-pgscuscopcope1-dev.postgres.database.azure.com on 10.242.0.10:53: read udp 10.240.0.96:59110->10.242.0.10:53: i/o timeout"

Cortex Pods Status

NAME                                     READY   STATUS             RESTARTS         AGE
cortex-compactor-0                       0/1     CrashLoopBackOff   33 (6m37s ago)   3h59m
cortex-distributor-b5b979db4-c7hj4       0/1     Running            1 (54s ago)      14m
...
cortex-ingester-2                        0/1     CrashLoopBackOff   33 (5m31s ago)   3h59m
...
cortex-store-gateway-0                   0/1     CrashLoopBackOff   33 (3m28s ago)   3h59m

Cortex compactor logs

level=error ts=2023-01-12T13:09:54.11880449Z caller=cortex.go:434 msg="module failed" module=compactor err="invalid service state: Failed, 
expected: Running, 
failure: failed to create bucket client: Azure API return unexpected error: *azblob.InternalError: 
===== INTERNAL ERROR =====
Get \"https://strgcusdevcopcopd1.blob.core.windows.net/cortex?restype=container\":disappointed: dial tcp: lookup strgcusdevcopcopd1.blob.core.windows.net: i/o timeout"

Check cortex-nginx logs

cortex-nginx is a reverse proxy running in front of Cortex to serve GET/POST requests. It serves Grafana's queries by forwarding them to the query-frontend, so Grafana must be able to resolve the nginx service at cortex-nginx.cortex.svc.cluster.local to process a query. Its logs and service resolution can be checked as sketched below.
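
A minimal sketch of that check, assuming cortex-nginx runs as a Deployment in the cortex namespace and that a dnsutils pod is available (see the DNS debugging section below); the workload and namespace names are assumptions:

# Workload/namespace names are assumptions; adjust to the actual deployment.
$ kubectl -n cortex logs deploy/cortex-nginx --tail=100
$ kubectl exec -i -t dnsutils -- nslookup cortex-nginx.cortex.svc.cluster.local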

Cloud Manager pod [kube-system] logs

W1121 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server 
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W1210 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server 
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W1210 reflector.go:442] k8s.io/client-go/informers/factory.go:134: watch of *v1.Node ended with: an error on the server 
("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding

Gateway_keeper controller logs

2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47860: EOF
2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47858: EOF
2023/01/12 14:59:00 http: TLS handshake error from 10.240.3.29:47876: EOF

Metrics server status and logs

metrics-server-7885cfdcc6-2pchb                        2/2     Running            13 (2d11h ago)     2d12h
metrics-server-7885cfdcc6-mwdvk                        1/2     CrashLoopBackOff   1037 (2m54s ago)   2d12h

Logs

panic: unable to load configmap based request-header-client-ca-file: Get "https://kdc9ae48725cope1k8s-920a0e7c.hcp.centralus.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp: i/o timeout

Microsoft Defender status and logs

microsoft-defender-publisher-ds-nlffd                  1/1     Running            4 (2d12h ago)      2d12h
microsoft-defender-publisher-ds-zxpg7                  0/1     CrashLoopBackOff   585 (54s ago)      2d12h

Logs

....

{"azureResourceID":"/subscriptions/xxxxxx-xxxxxx-xxxxxxxx-xxxxxx/resourceGroups/RG-CUS-DEV-COP-ENG/providers/Microsoft.ContainerService/managedClusters/kdc9ae48725cope1","chartVersion":"Unknown","clusterDistribution":"AKS","componentName":"Publisher","componentVersion":"mcr.microsoft.com/azuredefender/stable/security-publisher:1.0.56","envTime":"2023-01-09T09:11:23Z","message":"error encountered during client initializationPost \"https://3fa27958-621a-494a-9c42-10d6c906dabf.oms.opinsights.azure.com/AgentService.svc/LinuxAgentTopologyRequest\": dial tcp: lookup 3fa27958-621a-494a-9c42-10d6c906dabf.oms.opinsights.azure.com on 10.242.0.10:53: read udp 10.240.5.2:59033-\u003e10.242.0.10:53: i/o timeout","nodeName":"aks-sysnpl-31615245-vmss00000q","region":"centralus","releaseTrain":"stable","traceLevel":"error","type":"Trace"}
/opt/microsoft/microsoft-defender-for-cloud/main.sh: line 22:   664 Aborted                 (core dumped) /opt/td-agent-bit/bin/td-agent-bit -c /opt/microsoft/microsoft-defender-for-cloud/td-agent-bit.conf -e /opt/microsoft/microsoft-defender-for-cloud/plugin_connector.so 2> 

Perform DNS debugging

Deployed a dnsutils pod (deployment sketched first) and used nslookup against the services, as shown below:
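
The dnsutils pod can be created from the example manifest referenced in the Kubernetes DNS-debugging documentation (a minimal sketch; assumes the cluster can reach the manifest URL and pull the image):

$ kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
$ kubectl get pods dnsutils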

$ kubectl exec -i -t dnsutils -- nslookup cortex/cortex-ingester
Unable to use a TTY - input is not a terminal or the right kind of file
;; connection timed out; no servers could be reached

command terminated with exit code 1

$ kubectl exec -i -t dnsutils -- nslookup 10.243.134.50
Unable to use a TTY - input is not a terminal or the right kind of file
;; connection timed out; no servers could be reached

command terminated with exit code 1
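
Since even the in-cluster lookups time out, the CoreDNS pods themselves can be inspected next. A minimal sketch using the standard k8s-app=kube-dns label that AKS applies to the CoreDNS pods:

$ kubectl get pods -n kube-system -l k8s-app=kube-dns
$ kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
$ kubectl get endpoints kube-dns -n kube-system
$ kubectl exec -i -t dnsutils -- cat /etc/resolv.conf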

Where do these errors lead?

These logs lead to the conclusion that AKS DNS resolution is not working as expected.

Troubleshooting

The solutions below are possible candidates in this scenario (an Azure CLI sketch follows the list):

  • Restart the AKS cluster
  • Restart the nodes
  • If possible, upgrade the AKS cluster to the latest version
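
A rough Azure CLI sketch of these options; the cluster and resource group names are taken from the Defender log above and are assumptions, and both a stop/start and an upgrade are disruptive, so verify before running:

# Cluster/resource-group names are assumptions taken from the logs above.
$ az aks stop --name kdc9ae48725cope1 --resource-group RG-CUS-DEV-COP-ENG
$ az aks start --name kdc9ae48725cope1 --resource-group RG-CUS-DEV-COP-ENG
$ az aks get-upgrades --name kdc9ae48725cope1 --resource-group RG-CUS-DEV-COP-ENG --output table
$ az aks upgrade --name kdc9ae48725cope1 --resource-group RG-CUS-DEV-COP-ENG --kubernetes-version <target-version>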

Fix Applied to the Dev and ENG Clusters

We chose to upgrade the AKS cluster to the latest version available in the region. The upgrade updates the node configuration and restarts the nodes, which restores correct CoreDNS functioning.

As a result, all the pods are now able to connect to each other, and the Grafana dashboards show data properly fetched from Cortex.

Future Plans

  • Prepare a mechanism that continuously checks for this error
  • The moment this issue occurs:
    • New node pools should be created automatically
    • The workload should be drained and moved to the new node pools (a manual sketch of these steps follows this list)
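
Until that automation exists, the equivalent manual steps look roughly like the sketch below; the node pool name, VM size, and node name are placeholders:

# Node pool name, VM size, and node name are placeholders.
$ az aks nodepool add --cluster-name kdc9ae48725cope1 --resource-group RG-CUS-DEV-COP-ENG --name newnp1 --node-count 3 --node-vm-size Standard_D4s_v3
$ kubectl cordon aks-sysnpl-31615245-vmss00000q
$ kubectl drain aks-sysnpl-31615245-vmss00000q --ignore-daemonsets --delete-emptydir-data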
