Deployment issue with Karbon 2.3 and Calico CNI | Nutanix Community
Solved

Deployment issue with Karbon 2.3 and Calico CNI



Hi there,

It seems that I have a failure near the very end of the deployment. Here is the output from karbon_core.out:

2021-10-17T11:57:31.052Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:33.093Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:35.135Z kube_prometheus.go:1016: [DEBUG] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.806Z calico.go:552: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Failed to verify calico addon
2021-10-17T11:57:36.806Z k8s_deploy.go:1478: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Failed to deploy calico/flannel: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.806Z k8s_deploy.go:155: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.832Z k8s_lib_deploy_task.go:112: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.832Z k8s_lib_deploy_task.go:78: [INFO] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] token refresher received stopRefresh
2021-10-17T11:57:36.844Z deploy_k8s_task.go:364: [ERROR] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Cluster RGS-PA-K8-CLUSTER-STAGING:failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 4
2021-10-17T11:57:36.844Z deploy_k8s_task.go:370: [INFO] [k8s_cluster=RGS-PA-K8-CLUSTER-STAGING] Wait for subtasks to finish before completing parent task

 

If anyone has any insight into how to resolve this, it would be much appreciated!
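For context, that counter in the log appears to just track running calico-node DaemonSet pods, so it can be checked directly against the half-deployed cluster (a sketch, assuming the cluster's kubeconfig has already been pulled):

# The deployment seems to wait until DESIRED == READY on this DaemonSet
kubectl -n kube-system get daemonset calico-node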

Best answer by JoseNutanix

Hi Igor,

I suggest you open a ticket with support so they can investigate why this pod is crashing (I saw calico-node-bvct7 crash once too). Two of the three containers in this pod are alive, with calico-node crashing because it can't find /var/lib/calico/nodename. Usually this sort of issue is related to network/performance problems.

This topic has been closed for comments

5 replies

  • Author
  • Adventurer
  • 5 replies
  • October 17, 2021

Forgot to post the pod status:

NAMESPACE     NAME                                                                  READY   STATUS             RESTARTS   AGE
kube-system   calico-kube-controllers-7f66766f7f-nd8sx                              1/1     Running            1          74m
kube-system   calico-node-2ctb4                                                     1/1     Running            0          74m
kube-system   calico-node-7fx7n                                                     1/1     Running            0          74m
kube-system   calico-node-bvct7                                                     1/1     Running            1          74m
kube-system   calico-node-fjwjp                                                     0/1     CrashLoopBackOff   23         74m
kube-system   calico-node-xth2k                                                     1/1     Running            0          74m
kube-system   calico-typha-6bfd55df7-ptc7d                                          1/1     Running            0          74m
kube-system   kube-apiserver-karbon-rgs-pa-k8-cluster-staging-e77682-k8s-master-0   3/3     Running            0          77m
kube-system   kube-apiserver-karbon-rgs-pa-k8-cluster-staging-e77682-k8s-master-1   3/3     Running            0          77m
kube-system   kube-proxy-ds-dsd5v                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-gnng4                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-ph68q                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-tf4ml                                                   1/1     Running            0          74m
kube-system   kube-proxy-ds-whbpl 
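As a side note, mapping the crashing pod to the worker node hosting it is just a matter of adding -o wide (a sketch; k8s-app=calico-node is the label the DaemonSet puts on these pods):

# Show which node each calico-node pod landed on
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide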

 


JoseNutanix
Nutanix Employee
  • Nutanix Employee
  • 150 replies
  • October 18, 2021

Hi Igor,

The operation is timing out. You’ll have to check if there is enough bandwidth between sites to pull the images.

Also, you can check the logs for the pod calico-node-fjwjp and see if it downloaded the image, and if it did, then why Calico is crashing. 
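Something along these lines should show whether the image made it onto the node and what the container logs before it dies (just a sketch; the pod name is taken from your output above):

# Logs from the crashing calico-node container, including the previous attempt
kubectl -n kube-system logs calico-node-fjwjp -c calico-node --previous

# Recent events for the pod; image pull problems would show up here
kubectl -n kube-system describe pod calico-node-fjwjp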


  • Author
  • Adventurer
  • 5 replies
  • October 18, 2021

Hi,

Yes, bandwidth is just fine … I did some basic testing and all the K8s-based VMs initialised just fine. It's just weird that this particular pod can't initialise the Calico network, so the Karbon deployment fails. The Karbon cluster is not removed automatically though, so there is a chance to look around.
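For anyone else hitting this, a rough way to sanity-check image pulls from the affected worker (a sketch; the nutanix SSH user is an assumption - use whatever user the Karbon SSH access script provides):

# Time a pull of the calico image directly on the worker to rule out slow registry access
ssh nutanix@10.20.25.73 'time docker pull quay.io/karbon/node:v3.14.0'

# Confirm the karbon/calico images are already present on the node
ssh nutanix@10.20.25.73 'docker images | grep karbon'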

For the pod calico-node-fjwjp:

kube-system   calico-node-fjwjp                                                     0/1     CrashLoopBackOff   327        19h
 

It's constantly restarting, as one would expect, since the readiness state is never reached.

Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  12m (x2224 over 19h)    kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  BackOff    2m46s (x3945 over 19h)  kubelet  Back-off restarting failed container
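In case it points anywhere, the file the readiness probe complains about can be checked directly on the worker (a sketch; same SSH caveat as above, and the path comes straight from the probe message):

# /var/lib/calico/nodename is written by calico-node itself at startup;
# if it never appears, the container is dying before initialisation completes
ssh nutanix@10.20.25.73 'ls -l /var/lib/calico/'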

 

Full output from pod describe:

Name:                 calico-node-fjwjp
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 karbon-rgs-pa-k8-cluster-staging-e77682-k8s-worker-0/10.20.25.73
Start Time:           Sun, 17 Oct 2021 11:47:36 +0000
Labels:               controller-revision-hash=547955649b
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod: 
Status:               Running
IP:                   10.20.25.73
IPs:
  IP:           10.20.25.73
Controlled By:  DaemonSet/calico-node
Init Containers:
  upgrade-ipam:
    Container ID:  docker://025878de4f3ab420bdc8d572c1037ff591c892f32b1607c1f60f523c398db8de
    Image:         quay.io/karbon/cni:v3.14.0
    Image ID:      docker-pullable://quay.io/karbon/cni@sha256:cc951ccd15aa8c94b1b3eec673e434853f3bf8c2deb83bdb4a3f934c68e0e8ae
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/calico-ipam
      -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:45 +0000
      Finished:     Sun, 17 Oct 2021 11:47:45 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      KUBERNETES_NODE_NAME:        (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
  install-cni:
    Container ID:  docker://455ed002c1d8450e362fca773854f54000022d29a11401c3943d00d691060827
    Image:         quay.io/karbon/cni:v3.14.0
    Image ID:      docker-pullable://quay.io/karbon/cni@sha256:cc951ccd15aa8c94b1b3eec673e434853f3bf8c2deb83bdb4a3f934c68e0e8ae
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:47 +0000
      Finished:     Sun, 17 Oct 2021 11:47:47 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:   (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
  flexvol-driver:
    Container ID:   docker://68f392f6d3bde62f14185fb50c6b4109982bd63ac060ccbadc18522e84fdc60b
    Image:          quay.io/karbon/pod2daemon-flexvol:v3.14.0
    Image ID:       docker-pullable://quay.io/karbon/pod2daemon-flexvol@sha256:e5f2c2b9e67ec463ef5b538b8bf10453cc6a6538f7288a4760ee925c51498e7d
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Oct 2021 11:47:51 +0000
      Finished:     Sun, 17 Oct 2021 11:47:51 +0000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
Containers:
  calico-node:
    Container ID:   docker://96fa1881578bd5bae774a6f25ffc108882413ef44acb6c8e450cf6b38345aa8d
    Image:          quay.io/karbon/node:v3.14.0
    Image ID:       docker-pullable://quay.io/karbon/node@sha256:1a643541c4d76ea412dde19454bfada5a7e03e7cbb51ddf76def9baf84bdad7c
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Mon, 18 Oct 2021 07:41:15 +0000
      Finished:     Mon, 18 Oct 2021 07:42:24 +0000
    Ready:          False
    Restart Count:  327
    Requests:
      cpu:      250m
    Liveness:   exec [/bin/calico-node -felix-live] delay=10s timeout=1s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready -bird-ready] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                     kubernetes
      FELIX_TYPHAK8SSERVICENAME:          <set to the key 'typha_service_name' of config map 'calico-config'>  Optional: false
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                            (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Never
      IP_AUTODETECTION_METHOD:            interface=eth.*
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_IPV4POOL_CIDR:               172.20.0.0/16
      CALICO_ADVERTISE_CLUSTER_IPS:       172.19.0.0/16
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_LOGSEVERITYSCREEN:            info
      FELIX_HEALTHENABLED:                true
      FELIX_PROMETHEUSGOMETRICSENABLED:   false
      FELIX_PROMETHEUSMETRICSENABLED:     true
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from calico-node-token-x5lvc (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/hyperkube/opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:  
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  calico-node-token-x5lvc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  calico-node-token-x5lvc
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  14m (x2224 over 19h)    kubelet  Readiness probe failed: calico/node is not ready: BIRD is not ready: Failed to stat() nodename file: stat /var/lib/calico/nodename: no such file or directory
  Warning  BackOff    4m24s (x3945 over 19h)  kubelet  Back-off restarting failed container

 


JoseNutanix
Nutanix Employee
  • Nutanix Employee
  • 150 replies
  • Answer
  • October 18, 2021

Hi Igor,

I suggest you open a ticket with support so they can investigate why this pod is crashing (I saw calico-node-bvct7 crash once too). Two of the three containers in this pod are alive, with calico-node crashing because it can't find /var/lib/calico/nodename. Usually this sort of issue is related to network/performance problems.


  • Author
  • Adventurer
  • 5 replies
  • October 18, 2021

Hi Jose,

Yes, that's fine - I just need to figure out how to raise a support ticket, as I've never had the pleasure of using it in the past :blush:

Yes, it seems something is off with that particular worker node (10.20.25.73) and the pods running on it - the problem also shows up when talking to the kubelet, not just with the calico-node pods:

igor.stankovic@rgs-pa-bastion-1:~$ kubectl -n kube-system logs -f kube-proxy-ds-whbpl 
Error from server: Get "https://10.20.25.73:10250/containerLogs/kube-system/kube-proxy-ds-whbpl/kube-proxy?follow=true": dial tcp 10.20.25.73:10250: i/o timeout
igor.stankovic@rgs-pa-bastion-1:~$ 
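A couple of quick checks from the bastion to see whether the kubelet port on that node is reachable at all (a sketch; 10250 is the standard kubelet API port, and the second IP is a placeholder for any healthy worker):

# Raw TCP connectivity to the kubelet on the broken worker
nc -vz -w 5 10.20.25.73 10250

# Compare against a healthy worker to rule out a subnet-wide firewall rule
nc -vz -w 5 <healthy-worker-ip> 10250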

 

We tried restarting kubelet and Docker, then fully recycling the VM node, but the result is still the same.
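For reference, restarting those services on the worker looks roughly like this (a sketch; service names assume the standard systemd units on the Karbon node image):

# Restart the container runtime and kubelet on the affected worker
sudo systemctl restart docker
sudo systemctl restart kubelet

# Verify both services came back up
sudo systemctl status docker kubelet --no-pager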

It will be interesting to hear back from support.