Question

Karbon 2.3 DVP deployment issues (centralised Prism, Site2Site VPN)

  • 14 October 2021
  • 4 replies
  • 97 views


Hi there,

We are currently deploying (and testing) Karbon as the K8s orchestration platform for all our Nutanix platforms worldwide. So far, my installation attempts from Prism Central have failed.

As a side note, we use a site-to-site VPN from our centralised Prism Control, and I can reach the remote K8s VLAN on the distant Nutanix deployment. The reachability tests were bidirectional: they were run with a small test VM belonging to the K8s VLAN and via the private CIDR, using the same encrypted VPN channel.
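
A basic TCP probe along the lines of the reachability tests described above could look like the sketch below. It assumes Python 3 is available on the test VM; the function name `tcp_reachable` is made up for illustration, and port 22 is the default only because the Karbon log in this post shows the nodes being driven over SSH on that port. A successful connect shows only that a TCP handshake completes, not that every flow between the two sites crosses the VPN unchanged.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable addresses.
        return False
```

Calling `tcp_reachable("10.20.25.130")` from the Prism side, and the equivalent call towards the Prism address from the remote VLAN, would cover both directions.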

This is what I get in karbon_core.out (PCVM):

2021-10-13T21:17:12.687Z ssh.go:153: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] On 10.20.25.130:22 executing: docker plugin inspect nutanix
2021-10-13T21:17:12.825Z ssh.go:166: [WARN] [k8s_cluster=RGS-PA-K8-STAGING] Run cmd failed: Failed to run command: on host(10.20.25.130:22) cmd(docker plugin inspect nutanix) error: "Process exited with status 1", output: "Error: No such plugin: nutanix\n[]\n"
2021-10-13T21:17:12.825Z sshutils.go:44: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Unable to run commands [docker plugin inspect nutanix] on 10.20.25.130: "Failed to configure with SSH: Failed to run command: on host(10.20.25.130:22) cmd(docker plugin inspect nutanix) error: \"Process exited with status 1\""
2021-10-13T21:17:12.825Z install.go:55: [INFO] [k8s_cluster=RGS-PA-K8-STAGING] Unable to check if the dvp is enabled on: 10.20.25.130 failed to check the dvp status: Failed to configure with SSH: Failed to run command: on host(10.20.25.130:22) cmd(docker plugin inspect nutanix) error: "Process exited with status 1"
2021-10-13T21:17:12.825Z install.go:67: [INFO] [k8s_cluster=RGS-PA-K8-STAGING] Installing ntnx dvp on host: 10.20.25.130
2021-10-13T21:17:12.825Z install.go:150: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Command: mkdir -p /etc/docker-plugin-certs && /home/nutanix/docker_plugin/create_plugin_from_tar.sh '/home/nutanix/docker_plugin/dvp.tar.gz' 'nutanix' '10.20.1.10' '10.20.1.50' '' '' 'RGS-PA-K8'
2021-10-13T21:17:12.993Z ssh.go:138: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Copying /etc/docker-plugin-certs/key to 10.20.25.130:22
2021-10-13T21:17:13.157Z ssh.go:138: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Copying /etc/docker-plugin-certs/cert to 10.20.25.130:22
2021-10-13T21:17:13.262Z ssh.go:138: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Copying /etc/docker-plugin-certs/ca.pem to 10.20.25.130:22
2021-10-13T21:17:13.367Z ssh.go:138: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Copying /var/nutanix/host_upgrade/preupgrade-docker-plugin-certs.sh to 10.20.25.130:22
I1013 14:21:57.379644       1 forwarder.go:328] Forwarding MetricDataSampleList to CFS: [timestamp_usecs:1634159700000000 entity_type_name:"acs_stats_table" metric_list:"pc_cluster_uuid" metric_list:"k8s_cluster_uuid" metric_list:"last_val_karbon_version" metric_list:"last_val_cluster_name" metric_list:"last_val_cluster_prefix" metric_list:"last_val_k8s_version" metric_list:"last_val_os_flavor" metric_list:"last_val_etcd_cluster_uuid" metric_list:"last_val_etcd_members_count" metric_list:"last_val_per_etcd_cpu" metric_list:"last_val_per_etcd_mem" metric_list:"last_val_master_deploy_type" metric_list:"last_val_masters_count" metric_list:"last_val_per_master_cpu" metric_list:"last_val_per_master_mem" metric_list:"last_val_workers_count" metric_list:"last_val_per_worker_cpu" metric_list:"last_val_per_worker_mem" metric_list:"last_val_logging_state" metric_list:"last_val_logging_version" metric_list:"last_val_fluentbit_version" metric_list:"last_val_elasticsearch_version" metric_list:"last_val_elasticsearh_image" metric_list:"last_val_kibana_version" metric_list:"last_val_kibana_image" metric_list:"last_val_proxy_used" num_dimensions:2 ]
I1013 14:21:57.385927       1 forwarder.go:339] Received response from PutMetricDataArg: 
2021-10-13T21:22:08.925Z sshutils.go:44: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Unable to run commands [mkdir -p /etc/docker-plugin-certs && /home/nutanix/docker_plugin/create_plugin_from_tar.sh '/home/nutanix/docker_plugin/dvp.tar.gz' 'nutanix' '10.20.1.10' '10.20.1.50' '' '' 'RGS-PA-K8'] on 10.20.25.15: "Operation timed out"
2021-10-13T21:22:08.925Z install.go:177: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Error installing ntnx dvp with err: Operation timed out []
2021-10-13T21:22:08.925Z etcd_scale.go:40: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Failed to install the ntnx dvp on etcd node: 10.20.25.15 with err: failed to deploy the ntnx dvp: Operation timed out
2021-10-13T21:22:08.925Z node_pool_create.go:337: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Failure in init vm callback: failed to deploy the ntnx dvp: Operation timed out
2021-10-13T21:22:08.925Z node_pool_create.go:251: [DEBUG] [k8s_cluster=RGS-PA-K8-STAGING] Cleaning up failed VM "rgs-pa-k8-staging-d2d032-etcd-2" and its entities


It’s a bit unclear to me where the failure happened. If the DVP plug-in is not present, the next command sequence is part of its installation, but it’s not clear what made the installation fail.
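
One way to narrow this down is to filter the excerpt to its WARN and ERROR entries and look at the timing. The sketch below is an assumption-laden helper, not a Nutanix tool: the regular expression is derived only from the line format visible in the log above, and the names `parse_line` and `summarize` are hypothetical. In the excerpt, the plugin install is logged at 21:17:12.825 and the timeout at 21:22:08.925, roughly 296 seconds later, which is consistent with a command that hung rather than one that returned an error. Note that the excerpt names two hosts (10.20.25.130 and 10.20.25.15), so these lines may belong to different nodes.

```python
import re
from datetime import datetime

# Matches karbon_core.out lines of the form seen in the excerpt:
# <timestamp>Z <file.go:line>: [LEVEL] [k8s_cluster=NAME] message
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z "
    r"(?P<src>\S+): \[(?P<level>[A-Z]+)\] "
    r"\[k8s_cluster=(?P<cluster>[^\]]+)\] (?P<msg>.*)$"
)


def parse_line(line):
    """Parse one log line; return a dict, or None for lines in another format."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%S.%f")
    return entry


def summarize(lines):
    """Return (WARN/ERROR entries, seconds between first and last parsed entry)."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    problems = [e for e in entries if e["level"] in ("WARN", "ERROR")]
    span = (entries[-1]["ts"] - entries[0]["ts"]).total_seconds() if entries else 0.0
    return problems, span
```

Lines in other formats, such as the interleaved `forwarder.go` entries, are skipped rather than treated as errors.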

As a note, I did try the DVP plugin from a test VM and could mount the storage container as a volume without problems. That VM belongs to the same K8s VLAN where the installation failed.

Thanks for any feedback!!

Igor

4 replies


Hi Igor,

Do you have any firewall rules between PC and the remote K8s network that do not apply to the computer from which you connected to the test VM to install the DVP?

Also, if possible, I’d suggest running a test in a local VLAN to make sure the deployment succeeds there, and then focusing on the connection between PC and the remote site.

Badge

Hi Jose! Many thanks for reaching out.

I found the culprit. In essence, we run a centralised Prism control that is linked to the other Nutanix platforms through firewall-based site-to-site VPN tunnels, so that management is central.

However, I had to fully NAT-exempt the private CIDR between the interfaces for the encrypted S2S channel, in particular for routing between the Nutanix interfaces. The return packets did not carry the original source IP addresses, so communication broke down. This is now fixed.
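
The rule behind that fix can be sketched in a few lines: a flow between the two private ranges, in either direction, should bypass NAT so that replies keep their original source address. This is only an illustration of the logic, not firewall configuration. The function name `needs_nat_exemption` is hypothetical, and the two /24 ranges are assumptions built around addresses that appear in the log earlier in the thread; the real prefixes are not stated here.

```python
import ipaddress

# Assumed ranges for illustration only; the thread does not give the prefixes.
SITE_A_CIDR = "10.20.1.0/24"
SITE_B_CIDR = "10.20.25.0/24"


def needs_nat_exemption(src, dst, site_a_cidr=SITE_A_CIDR, site_b_cidr=SITE_B_CIDR):
    """Return True if the src -> dst flow runs between the two private CIDRs.

    The check is symmetric, so return traffic is covered as well as the
    forward direction.
    """
    a = ipaddress.ip_network(site_a_cidr)
    b = ipaddress.ip_network(site_b_cidr)
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return (s in a and d in b) or (s in b and d in a)
```

With these assumed ranges, `needs_nat_exemption("10.20.25.130", "10.20.1.10")` is true for the return direction, while a flow from either site to a public address is not exempt.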

Out of curiosity, are you planning to support Kubernetes 2.1+ for Karbon anytime soon?

So, the Karbon deployment progressed almost to the end, but now I have a different problem with Calico, and it failed again:

2021-10-16T11:51:32.407Z calico.go:552: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Failed to verify calico addon
2021-10-16T11:51:32.407Z k8s_deploy.go:1478: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Failed to deploy calico/flannel: Failed to deploy calico: Failed to verify calico: [ Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 2, Operation timed out: expecting 1 available replica of calico-kube-controllers deployment in kube-system namespace. Currently running: 0 ]
2021-10-16T11:51:32.407Z k8s_deploy.go:155: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: [ Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 2, Operation timed out: expecting 1 available replica of calico-kube-controllers deployment in kube-system namespace. Currently running: 0 ]
2021-10-16T11:51:32.432Z k8s_lib_deploy_task.go:112: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: [ Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 2, Operation timed out: expecting 1 available replica of calico-kube-controllers deployment in kube-system namespace. Currently running: 0 ]
2021-10-16T11:51:32.432Z k8s_lib_deploy_task.go:78: [INFO] [k8s_cluster=RGS-PA-K8-STAGING] token refresher received stopRefresh
2021-10-16T11:51:32.444Z deploy_k8s_task.go:364: [ERROR] [k8s_cluster=RGS-PA-K8-STAGING] Cluster RGS-PA-K8-STAGING:failed to deploy K8s cluster: failed to deploy cluster addons: failed to deploy K8s cluster addon: Failed to deploy calico: Failed to verify calico: [ Operation timed out: expecting 5 nodes to be running calico-node daemon pod in kube-system namespace. Currently running: 2, Operation timed out: expecting 1 available replica of calico-kube-controllers deployment in kube-system namespace. Currently running: 0 ]
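
The repeated error text boils down to two unmet expectations: 2 of 5 nodes run a calico-node pod, and 0 of 1 calico-kube-controllers replicas are available. A small sketch for pulling those numbers out of the message is shown below. The pattern is derived only from the wording in the lines above, and the function name `pending_components` is made up for illustration.

```python
import re

# Matches "expecting <N> <description>. Currently running: <M>" as worded above.
EXPECT_RE = re.compile(r"expecting (\d+) (.+?)\. Currently running: (\d+)")


def pending_components(message):
    """Return (description, expected, running) for every unmet expectation."""
    result = []
    for expected, what, running in EXPECT_RE.findall(message):
        if int(running) < int(expected):
            result.append((what, int(expected), int(running)))
    return result
```

Applied to any of the ERROR lines above, this returns one tuple for the calico-node daemon pods and one for the calico-kube-controllers deployment.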


Hi Igor,

I guess you mean Kubernetes 1.21+. There is a plan to support this. If you want to know more about the timeframe, please reach out to your account team.

About the Calico issue, let’s continue in the other post you opened.

Badge

Yes, 1.21+, sorry, it was a long day :)

I will mark this now as resolved. Thanks!!
