Solved

Deployment issue with Karbon 2.3 and Calico CNI

  • 17 October 2021
  • 5 replies
  • 499 views


Hi there,

It seems I'm hitting a failure near the very end of the deployment. Here is the output from karbon_core.out:

2021-10-17T11:57:31.052Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:33.093Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:35.135Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.806Z calico.go:552: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Failed to verify calico addon
2021-10-17T11:57:36.806Z k8s_deploy.go:1478: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Failed to deploy calico/flannel: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.806Z k8s_deploy.go:155: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.832Z k8s_lib_deploy_task.go:112: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.832Z k8s_lib_deploy_task.go:78: [INFO] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] token refresher received stopRefresh
2021-10-17T11:57:36.844Z deploy_k8s_task.go:364: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Cluster RGS-PA-K8-CLUSTER-STAGING:failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.844Z deploy_k8s_task.go:370: [INFO] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Wait for subtasks to finish before completing parent task

 

If anyone has any insight into how to resolve this, it would be much appreciated!


Best answer by JoseNutanix 18 October 2021, 10:28


This topic has been closed for comments

5 replies


Forgot to post the pod status earlier:

NAMESPACE     NAME                                                                  READY   STATUS             RESTARTS   AGE
kube-system   calico-kube-controllers-7f66766f7f-nd8sx                              1/1     Running            1          74m
kube-system   calico-node-2ctb4                                                     1/1     Running            0          74m
kube-system   calico-node-7fx7n                                                     1/1     Running            0          74m
kube-system   calico-node-bvct7                                                     1/1     Running            1          74m
kube-system   calico-node-fjwjp                                                     0/1     CrashLoopBackOff   23         74m
kube-system   calico-node-xth2k                                                     1/1     Running            0          74m
kube-system   calico-typha-6bfd55df7-ptc7d                                          1/1     Running            0          74m
kube-system   kube-apiserver-karbon-rgs-pa-k8-cluster-staging-e77682-k8s-master-0   3/3     Running            0          77m
kube-system   kube-apiserver-karbon-rgs-pa-k8-cluster-staging-e77682-k8s-master-1   3/3     Running            0          77m
kube-system   kube-proxy-ds-dsd5v                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-gnng4                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-ph68q                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-tf4ml                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-whbpl
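
To see which worker the crashing pod landed on, something like the following should work (a sketch only; the k8s-app=calico-node label comes from the standard Calico manifest):

kubectl -n kube-system get daemonset calico-node
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide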

 


Hi Igor,

The operation is timing out. You’ll have to check if there is enough bandwidth between sites to pull the images.
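
For instance, to check whether the images actually made it onto the affected worker, something like this could be run there (a sketch, assuming SSH access to the node and the Docker runtime Karbon uses):

docker images | grep quay.io/karbon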

Also, check the logs for the pod calico-node-fjwjp to see whether it downloaded the image and, if it did, why Calico is crashing.
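
For example (a rough sketch of the usual commands, not output from this cluster):

kubectl -n kube-system describe pod calico-node-fjwjp
kubectl -n kube-system logs calico-node-fjwjp -c calico-node --previous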


Hi,

Yes, bandwidth is just fine … I did some basic testing and all the K8s-based VMs initialised just fine. It's just weird that this particular pod can't initialise the Calico network, which is why the Karbon deployment fails. The Karbon cluster is not removed automatically, though, so there is a chance to look around.

For the pod calico-node-fjwjp:

kube-system   calico-node-fjwjp                                                     0/1     CrashLoopBackOff   327        19h
 

It’s constantly restarting, as one would expect, since the readiness state is never reached.

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  12m (x2224 over 19h)    kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  BackOff    2m46s (x3945 over 19h)  kubelet  Back-off restarting failed container

 

Full output from pod describe:

Name:                 calico-node-fjwjp
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 karbon-rgs-pa-k8-cluster-staging-e77682-k8s-worker-0/10.20.25.73
Start Time:           Sun, 17 Oct 2021 11:47:36 +0000
Labels:               controller-revision-hash=547955649b
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Running
IP:                   10.20.25.73
IPs:
  IP:  10.20.25.73
Controlled By:  DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://025878de4f3ab420bdc8d572c1037ff591c892f32b1607c1f60f523c398db8de
    Image:         quay.io/karbon/cni:v3.14.0
    Image ID:      docker-pullable://quay.io/karbon/cni@sha256:cc951ccd15aa8c94b1b3eec673e434853f3bf8c2deb83bdb4a3f934c68e0e8ae
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:45 +0000
      Finished:     Sun, 17 Oct 2021 11:47:45 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:       (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
  install-cni:
    Container ID:  docker://455ed002c1d8450e362fca773854f54000022d29a11401c3943d00d691060827
    Image:         quay.io/karbon/cni:v3.14.0
    Image ID:      docker-pullable://quay.io/karbon/cni@sha256:cc951ccd15aa8c94b1b3eec673e434853f3bf8c2deb83bdb4a3f934c68e0e8ae
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:47 +0000
      Finished:     Sun, 17 Oct 2021 11:47:47 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:  (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
  flexvol-driver:
    Container ID:   docker://68f392f6d3bde62f14185fb50c6b4109982bd63ac060ccbadc18522e84fdc60b
    Image:          quay.io/karbon/pod2daemon-flexvol:v3.14.0
    Image ID:       docker-pullable://quay.io/karbon/pod2daemon-flexvol@sha256:e5f2c2b9e67ec463ef5b538b8bf10453cc6a6538f7288a4760ee925c51498e7d
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:51 +0000
      Finished:     Sun, 17 Oct 2021 11:47:51 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
Containers:
  calico-node:
    Container ID:  docker://96fa1881578bd5bae774a6f25ffc108882413ef44acb6c8e450cf6b38345aa8d
    Image:         quay.io/karbon/node:v3.14.0
    Image ID:      docker-pullable://quay.io/karbon/node@sha256:1a643541c4d76ea412dde19454bfada5a7e03e7cbb51ddf76def9baf84bdad7c
    Port:          <none>
    Host Port:     <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 18 Oct 2021 07:41:15 +0000
      Finished:     Mon, 18 Oct 2021 07:42:24 +0000
    Ready:          False
    Restart Count:  327
    Requests:
      cpu:      250m
    Liveness:   exec [/bin/calico-node -felix-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      FELIX_TYPHAK8SSERVICENAME:          <set to the key 'typha_service_name' of config map 'calico-config'>  Optional: false
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                           (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Never
      IP_AUTODETECTION_METHOD:            interface=eth.*
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               172.20.0.0/16
      CALICO_ADVERTISE_CLUSTER_IPS:       172.19.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
      FELIX_PROMETHEUSGOMETRICSENABLED:   false
      FELIX_PROMETHEUSMETRICSENABLED:     true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/hyperkube/opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-x5lvc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-x5lvc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  14m (x2224 over 19h)    kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  BackOff    4m24s (x3945 over 19h)  kubelet  Back-off restarting failed container

 


Hi Igor,

I suggest you open a ticket with support so they can investigate why this pod is crashing (I saw calico-node-bvct7 crash once too). Two of the three containers in this pod are alive, with calico-node crashing because it cannot find /var/lib/calico/nodename. Usually this sort of issue is related to network/performance problems.
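
If you want to dig a little before the ticket, a rough check of that path directly on the worker could look like this (the SSH user is a placeholder; /var/lib/calico is the var-lib-calico host-path mount shown in the describe output):

ssh <user>@10.20.25.73
ls -l /var/lib/calico/
cat /var/lib/calico/nodename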


Hi Jose,

Yes, that’s fine - I just need to figure out how to raise a ticket with support, as I’ve never had the pleasure of using it in the past :blush:

Yes, it seems something is off with this particular worker node (10.20.25.73): pods scheduled there have problems communicating via the kubelet, and not just the Calico ones:

igor.stankovic@rgs-pa-bastion-1:~$ kubectl -n kube-system logs -f kube-proxy-ds-whbpl 
Error from server: Get "https://10.20.25.73:10250/containerLogs/kube-system/kube-proxy-ds-whbpl/kube-proxy?follow=true": dial tcp 10.20.25.73:10250: i/o timeout
igor.stankovic@rgs-pa-bastion-1:~$ 
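
That i/o timeout points at the kubelet port itself being unreachable; a quick check from the bastion or another node (a sketch only) would be:

nc -vz 10.20.25.73 10250
curl -k -m 5 https://10.20.25.73:10250/    # any HTTP error response still proves the port is reachable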

 

We tried restarting the kubelet and Docker, and then did a full recycle of the VM node, but the result is still the same.
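
In case it is host-level filtering rather than the kubelet itself, a couple of checks on the worker might help (unit names below are the usual defaults, not confirmed for Karbon node images):

sudo systemctl status kubelet docker
sudo ss -tlnp | grep 10250     # is the kubelet actually listening?
sudo iptables -L -n            # anything dropping traffic to 10250?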

It would be interesting to hear back from support.