Question

AOS upgrade stalled


Userlevel 1
Badge +2
Hey there, I ran the pre-upgrade tests and they passed. I started the upgrade process to 2018.01.31 but it seems to have stalled out. If I open a new window I can still connect to PC, but I don't get the option to cancel/delete the upgrade. Any way to fix this?


This topic has been closed for comments

25 replies

Userlevel 1
Badge +2
Seems that some of the services are down and Zookeeper is flapping?

Output from one (of three) nodes:
CVM: 10.0.24.32 Up
Zeus UP [2888, 2917, 2918, 2922, 2981, 2997]
Scavenger UP [3540, 3568, 3569, 3571]
SSLTerminator UP [4824, 4856, 4857, 4858]
SecureFileSync UP [4830, 4880, 4881, 4882]
Medusa UP [5052, 5080, 5081, 5082, 5125]
DynamicRingChanger UP [5235, 5314, 5315, 5322]
Pithos UP [5239, 5306, 5307, 5349]
Mantle DOWN []
Hera UP [5244, 5318, 5319, 5320]
Stargate UP [5446, 5478, 5479, 5669, 5673]
InsightsDB UP [5455, 5508, 5509, 5614]
InsightsDataTransfer UP [5484, 5536, 5537, 5599, 5601]
Ergon UP [5514, 5560, 5561, 5562]
Cerebro UP [5557, 5587, 5588, 5859]
Chronos UP [5590, 5636, 5637, 5802]
Curator UP [5595, 5664, 5665, 6007]
Prism UP [5650, 5710, 5711, 5931, 6635, 6637]
CIM UP [5676, 5743, 5744, 5828]
AlertManager UP [2320, 2352, 2353, 2354]
Arithmos UP [5759, 5867, 5868, 5995]
Catalog DOWN []
Acropolis UP [5801, 5901, 5902, 5903]
Uhura UP [5849, 5945, 5946, 5948]
Snmp UP [5891, 5983, 5984, 5985]
SysStatCollector UP [5944, 6079, 6080, 6081]
Tunnel UP [5972, 6046, 6047]
Janus UP [6146, 6219, 6220]
NutanixGuestTools UP [6216, 6263, 6264, 6291]
MinervaCVM UP [6347, 6387, 6388, 6391, 6902]
ClusterConfig UP [6359, 6477, 6478, 6496]
APLOSEngine UP [6411, 6573, 6574, 6575]
APLOS UP [6663, 6732, 6733, 6734, 6922, 6927]
Lazan UP [6668, 6746, 6747, 6757]
Delphi DOWN []
ClusterHealth UP [1300, 1301, 1352, 1353, 1354, 1355, 1360, 1361, 2156, 2157, 31498, 31527, 31528, 31555, 31577, 31578]

nutanix@NTNX-ad5c91ae-A-CVM:10.0.24.30:~$ upgrade_status
2018-03-21 13:45:02 INFO zookeeper_session.py:102 upgrade_status is attempting to connect to Zookeeper
2018-03-21 13:45:02 INFO upgrade_status:38 Target release version: el7.3-release-ce-2018.01.31-stable-c3b9964290bf2f28799481fed5cf32f92ab3dadc
2018-03-21 13:45:02 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2018-03-21 13:45:02 INFO upgrade_status:96 SVM 10.0.24.30 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-03-21 13:45:02 INFO upgrade_status:96 SVM 10.0.24.31 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-03-21 13:45:02 INFO upgrade_status:96 SVM 10.0.24.32 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
Userlevel 1
Badge +2
Diving into more logs. In install.out there's a message about not being able to format the boot disk?

Upgrading SVM and Nutanix packages
2018-03-21 16:32:10 INFO zookeeper_session.py:110 svm_upgrade is attempting to connect to Zookeeper
2018-03-21 16:32:10 INFO svm_upgrade:164 Looking for another disk to mirror current boot disk /dev/nvme0n1
2018-03-21 16:32:10 INFO svm_upgrade:207 No eligible mirror disks found, proceeding with regular upgrade.
2018-03-21 16:32:10 INFO svm_upgrade:1054 Formatting the boot partition /dev/nvme0n12
2018-03-21 16:32:10 ERROR svm_upgrade:218 Failed to format device /dev/nvme0n12 with ext4, ret 1, stdout , stderr mke2fs 1.41.12 (17-May-2010)
mkfs.ext4: No such file or directory while trying to determine filesystem size
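For what it's worth, that device name looks suspicious: on Linux, partitions on an NVMe disk carry a 'p' separator (e.g. /dev/nvme0n1p2), whereas SATA/SAS disks do not (/dev/sda2). If the upgrade script builds the partition path by simply appending the partition number to the boot disk name, /dev/nvme0n12 would not exist, which would line up with the mkfs "No such file or directory" error above. A quick hedged check on the affected CVM (the exact partition layout here is an assumption):

ls -l /dev/nvme0n1*     # expect /dev/nvme0n1, /dev/nvme0n1p1, /dev/nvme0n1p2, ...
ls -l /dev/nvme0n12     # the path the installer tried; likely reports "No such file or directory"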
Userlevel 5
Badge +9
That looks problematic. What does progress_monitor_cli --fetchall return?
You could kill the task with progress_monitor_cli --entity_id="" --delete and retry, but given that several services are down I don't know what will happen; the result might be even worse...
Userlevel 1
Badge +2
2018-03-22 10:47:32,214:29462(0x7fd37d0b3940):ZOO_INFO@log_env@918: Client environment:zookeeper.version=zookeeper C client 3.4.3
2018-03-22 10:47:32,214:29462(0x7fd37d0b3940):ZOO_INFO@log_env@922: Client environment:host.name=NTNX-7fab96d0-A-CVM
2018-03-22 10:47:32,214:29462(0x7fd37d0b3940):ZOO_INFO@log_env@929: Client environment:os.name=Linux
2018-03-22 10:47:32,214:29462(0x7fd37d0b3940):ZOO_INFO@log_env@930: Client environment:os.arch=3.10.0-229.46.1.el6.nutanix.20170428.cvm.x86_64
2018-03-22 10:47:32,215:29462(0x7fd37d0b3940):ZOO_INFO@log_env@931: Client environment:os.version=#1 SMP Fri Apr 28 22:43:05 UTC 2017
2018-03-22 10:47:32,215:29462(0x7fd37d0b3940):ZOO_INFO@zookeeper_init@966: Initiating client connection, host=zk2:9876,zk1:9876,zk3:9876 sessionTimeout=20000 watcher=0x683300 sessionId=0 sessionPasswd= context=0x17e0040 flags=0
2018-03-22 10:47:32,227:29462(0x7fd377e9e700):ZOO_INFO@zookeeper_interest@1878: Connecting to server 10.0.24.30:9876
2018-03-22 10:47:32,228:29462(0x7fd377e9e700):ZOO_INFO@zookeeper_interest@1915: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket [10.0.24.30:9876]
2018-03-22 10:47:32,228:29462(0x7fd377e9e700):ZOO_INFO@check_events@2089: initiated connection to server [10.0.24.30:9876]
2018-03-22 10:47:32,232:29462(0x7fd377e9e700):ZOO_INFO@check_events@2136: session establishment complete on server [10.0.24.30:9876], sessionId=0x2624bd354554a43, negotiated timeout=20000
================== Proto Start =========================
logical_timestamp: 4353
progress_info_id {
operation: kUpgrade
entity_type: kNode
entity_id: "6"
}
title_message: "Upgrading CVM 10.0.24.32"
start_time_secs: 1521674930
progress_task_list {
component: kGenesis
task_tag: "Downloading from peer"
start_time_secs: 1521674930
end_time_secs: 1521674930
last_updated_time_secs: 1521674930
task_message: "Downloading from peer"
percentage_complete: 100
progress_status: kSucceeded
}
progress_task_list {
component: kGenesis
task_tag: "Installing Acropolis"
start_time_secs: 1521674930
end_time_secs: 1521737244
last_updated_time_secs: 1521737244
task_message: "Installing Acropolis"
percentage_complete: 3
progress_status: kFailed
}
progress_task_list {
component: kGenesis
task_tag: "Waiting for reboot and upgrade completion"
start_time_secs: 1521737244
last_updated_time_secs: 1521737244
task_message: "Waiting for reboot and upgrade completion"
percentage_complete: 0
progress_status: kAborted
}
time_to_live_secs: 600
=================== Proto End ==========================
================== Proto Start =========================
logical_timestamp: 4572
progress_info_id {
operation: kUpgrade
entity_type: kNode
entity_id: "4"
}
title_message: "Upgrading CVM 10.0.24.30"
start_time_secs: 1521656734
progress_task_list {
component: kGenesis
task_tag: "Downloading from peer"
start_time_secs: 1521656734
end_time_secs: 1521656776
last_updated_time_secs: 1521656776
task_message: "Downloading from peer"
percentage_complete: 100
progress_status: kSucceeded
}
progress_task_list {
component: kGenesis
task_tag: "Installing Acropolis"
start_time_secs: 1521656734
end_time_secs: 1521737242
last_updated_time_secs: 1521737242
task_message: "Installing Acropolis"
percentage_complete: 3
progress_status: kFailed
}
progress_task_list {
component: kGenesis
task_tag: "Waiting for reboot and upgrade completion"
start_time_secs: 1521737242
last_updated_time_secs: 1521737242
task_message: "Waiting for reboot and upgrade completion"
percentage_complete: 0
progress_status: kAborted
}
time_to_live_secs: 900
=================== Proto End ==========================
================== Proto Start =========================
logical_timestamp: 4572
progress_info_id {
operation: kUpgrade_Foundation
entity_type: kCluster
entity_id: "890037914372176388"
}
title_message: "Foundation Preupgrade Steps"
start_time_secs: 1521656772
progress_task_list {
component: kGenesis
task_tag: "Foundation Preupgrade Steps"
start_time_secs: 1521656772
end_time_secs: 1521737239
last_updated_time_secs: 1521737239
task_message: "Foundation Preupgrade Steps"
percentage_complete: 10
progress_status: kFailed
}
=================== Proto End ==========================
Userlevel 1
Badge +2
Just getting errors when trying to remove the tasks; I think it's fubar. Can it be recovered by doing a repair install? The VMs are up and running and I don't want to lose them, or I'd need to export them out somehow...

Error output below.


nutanix@NTNX-7fab96d0-A-CVM:10.0.24.32:~$ progress_monitor_cli --entity_id="5" --delete
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@log_env@918: Client environment:zookeeper.version=zookeeper C client 3.4.3
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@log_env@922: Client environment:host.name=NTNX-7fab96d0-A-CVM
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@log_env@929: Client environment:os.name=Linux
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@log_env@930: Client environment:os.arch=3.10.0-229.46.1.el6.nutanix.20170428.cvm.x86_64
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@log_env@931: Client environment:os.version=#1 SMP Fri Apr 28 22:43:05 UTC 2017
2018-03-22 10:57:31,588:24695(0x7f2845eb0940):ZOO_INFO@zookeeper_init@966: Initiating client connection, host=zk2:9876,zk1:9876,zk3:9876 sessionTimeout=20000 watcher=0x683300 sessionId=0 sessionPasswd= context=0x2a30040 flags=0
2018-03-22 10:57:31,604:24695(0x7f2840c9b700):ZOO_INFO@zookeeper_interest@1878: Connecting to server 10.0.24.30:9876
2018-03-22 10:57:31,605:24695(0x7f2840c9b700):ZOO_INFO@zookeeper_interest@1915: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket [10.0.24.30:9876]
2018-03-22 10:57:31,605:24695(0x7f2840c9b700):ZOO_INFO@check_events@2089: initiated connection to server [10.0.24.30:9876]
2018-03-22 10:57:31,608:24695(0x7f2840c9b700):ZOO_INFO@check_events@2136: session establishment complete on server [10.0.24.30:9876], sessionId=0x2624bd354554b04, negotiated timeout=20000
F0322 10:57:31.608556 24695 progress_monitor_cli.cc:554] Check failed: info_id.has_operation()
*** Check failure stack trace: ***
Stack traces are generated at /home/nutanix/data/cores/progress_monito.24695.20180322-105731.stack_trace.txt
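The CHECK failure on info_id.has_operation() suggests the delete call needs the operation (and presumably the entity type) from the proto, not just the entity_id. A hedged sketch of what that might look like, taking the values from the --fetchall output above (the flag names and the lowercase, no-'k' spellings are assumptions; confirm with progress_monitor_cli --help first):

progress_monitor_cli --entity_id="6" --operation=upgrade --entity_type=node --delete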
Userlevel 5
Badge +9
Hi,
I'm not sure how well a repair install preserves data/VMs, but you do get the following options:


Disclaimer: I did not test the repair host option yet, only the other two...
You probably already know this, but if not: to run a repair, log in as root, run ./cleanup.sh, then log out and log back in as the install user.
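A minimal sketch of that flow on the CE host console, assuming cleanup.sh sits in root's home directory (it may live elsewhere on your build):

# on the AHV host console (not the CVM), logged in as root
./cleanup.sh      # resets the installer state
logout
# log back in as the 'install' user to get the installer menu with the repair options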
Userlevel 1
Badge +2
That was fun, kind of 🙂 Yeah, I had to completely redo all the nodes. Still in a wonky state and not all the services are up, but it's usable. I'll keep trawling the logs, but I think I may just need to wait for the new version. I've been seeing a couple of threads about Samsung 960s working on 2017 and then not on 2018.
Userlevel 2
Badge +3
That was fun, kind of 🙂 Yeah, I had to completely redo all the nodes. Still in a wonky state and not all the services are up, but it's usable. I'll keep trawling the logs, but I think I may just need to wait for the new version. I've been seeing a couple of threads about Samsung 960s working on 2017 and then not on 2018.
Did you save your data? I have the same problem... only my cluster is down so I don't have access to the VMs.
Userlevel 1
Badge +2
Hey, yes. Doing the repair host (all data preserved) left the VMs untouched. Once everything was reinstalled I was able to power up the VMs. There are still remnants of the upgrade failure, but at least you can access the VMs.
Userlevel 2
Badge +3
Hey, yes. Doing the repair host (all data preserved) left the VMs untouched. Once everything was reinstalled I was able to power up the VMs. There are still remnants of the upgrade failure, but at least you can access the VMs.
Thanks!
Userlevel 1
Badge +2
Hmm, the new version seems to have the same issue: a stalled upgrade... I thought this was supposed to be resolved in the new 2018.05.01 version. Thoughts?

nutanix@NTNX-6b80bdda-A-CVM:10.0.24.32:~$ progress_monitor_cli --fetchall
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@log_env@918: Client environment:zookeeper.version=zookeeper C client 3.4.3
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@log_env@922: Client environment:host.name=NTNX-6b80bdda-A-CVM
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@log_env@929: Client environment:os.name=Linux
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@log_env@930: Client environment:os.arch=3.10.0-229.46.1.el6.nutanix.20170428.cvm.x86_64
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@log_env@931: Client environment:os.version=#1 SMP Fri Apr 28 22:43:05 UTC 2017
2018-06-13 17:55:27,267:12338(0x7fdae4b52940):ZOO_INFO@zookeeper_init@966: Initiating client connection, host=zk2:9876,zk3:9876,zk1:9876 sessionTimeout=20000 watcher=0x683300 sessionId=0 sessionPasswd= context=0x2750040 flags=0
2018-06-13 17:55:27,291:12338(0x7fdadf93d700):ZOO_INFO@zookeeper_interest@1878: Connecting to server 10.0.24.30:9876
2018-06-13 17:55:27,292:12338(0x7fdadf93d700):ZOO_INFO@zookeeper_interest@1915: Zookeeper handle state changed to ZOO_CONNECTING_STATE for socket [10.0.24.30:9876]
2018-06-13 17:55:27,292:12338(0x7fdadf93d700):ZOO_INFO@check_events@2089: initiated connection to server [10.0.24.30:9876]
2018-06-13 17:55:27,294:12338(0x7fdadf93d700):ZOO_INFO@check_events@2136: session establishment complete on server [10.0.24.30:9876], sessionId=0x2633177b91feb6f, negotiated timeout=20000
================== Proto Start =========================
logical_timestamp: 265
progress_info_id {
operation: kUpgrade
entity_type: kNode
entity_id: "6"
}
title_message: "Upgrading CVM 10.0.24.32"
start_time_secs: 1528933140
progress_task_list {
component: kGenesis
task_tag: "Downloading from peer"
start_time_secs: 1528933140
end_time_secs: 1528933195
last_updated_time_secs: 1528933195
task_message: "Downloading from peer"
percentage_complete: 100
progress_status: kSucceeded
}
progress_task_list {
component: kGenesis
task_tag: "Installing Acropolis"
start_time_secs: 1528933195
end_time_secs: 1528937695
last_updated_time_secs: 1528937695
task_message: "Installing Acropolis"
percentage_complete: 3
progress_status: kFailed
}
progress_task_list {
component: kGenesis
task_tag: "Waiting for reboot and upgrade completion"
start_time_secs: 1528937695
last_updated_time_secs: 1528937695
task_message: "Waiting for reboot and upgrade completion"
percentage_complete: 0
progress_status: kAborted
}
time_to_live_secs: 900
=================== Proto End ==========================
================== Proto Start =========================
logical_timestamp: 270
progress_info_id {
operation: kUpgrade
entity_type: kNode
entity_id: "4"
}
title_message: "Upgrading CVM 10.0.24.30"
start_time_secs: 1528933140
progress_task_list {
component: kGenesis
task_tag: "Downloading from peer"
start_time_secs: 1528933140
end_time_secs: 1528933184
last_updated_time_secs: 1528933184
task_message: "Downloading from peer"
percentage_complete: 100
progress_status: kSucceeded
}
progress_task_list {
component: kGenesis
task_tag: "Installing Acropolis"
start_time_secs: 1528933140
end_time_secs: 1528937718
last_updated_time_secs: 1528937718
task_message: "Installing Acropolis"
percentage_complete: 3
progress_status: kFailed
}
progress_task_list {
component: kGenesis
task_tag: "Waiting for reboot and upgrade completion"
start_time_secs: 1528937718
last_updated_time_secs: 1528937718
task_message: "Waiting for reboot and upgrade completion"
percentage_complete: 0
progress_status: kAborted
}
time_to_live_secs: 900
=================== Proto End ==========================
================== Proto Start =========================
logical_timestamp: 246
progress_info_id {
operation: kUpgrade
entity_type: kNode
entity_id: "5"
}
title_message: "Upgrading CVM 10.0.24.31"
start_time_secs: 1528933140
progress_task_list {
component: kGenesis
task_tag: "Downloading from peer"
start_time_secs: 1528933140
end_time_secs: 1528933187
last_updated_time_secs: 1528933187
task_message: "Downloading from peer"
percentage_complete: 100
progress_status: kSucceeded
}
progress_task_list {
component: kGenesis
task_tag: "Installing Acropolis"
start_time_secs: 1528933140
end_time_secs: 1528937706
last_updated_time_secs: 1528937706
task_message: "Installing Acropolis"
percentage_complete: 3
progress_status: kFailed
}
progress_task_list {
component: kGenesis
task_tag: "Waiting for reboot and upgrade completion"
start_time_secs: 1528937706
last_updated_time_secs: 1528937706
task_message: "Waiting for reboot and upgrade completion"
percentage_complete: 0
progress_status: kAborted
}
time_to_live_secs: 900
=================== Proto End ==========================
Userlevel 1
Badge +2
Any suggestions on this? It's a bit frustrating that this issue has happened again; there was no resolution for the previous version beyond "it will be fixed in the next release", and now I'm back in the same degraded state: Prism is down, a bunch of services are down, and things seem to be flapping. The VMs are up but that's about it.
Userlevel 2
Could you log into a CVM and give the output of:

tail -F /home/nutanix/data/logs/genesis.out
Userlevel 1
Badge +2
Could you log into a CVM and give the output of:

tail -F /home/nutanix/data/logs/genesis.out

2018-06-14 16:05:15 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.10
2018-06-14 16:05:16 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.11
2018-06-14 16:05:16 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.12
2018-06-14 16:05:20 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:05:20 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:05:20 INFO time_manager.py:534 Starting the NTP daemon
2018-06-14 16:05:20 INFO time_manager.py:232 Set my time is good
2018-06-14 16:05:20 INFO time_manager.py:235 Set ntp_setup_done
2018-06-14 16:05:20 INFO time_manager.py:561 Querying upstream NTP servers: 10.0.24.99 ca.pool.ntp.org
2018-06-14 16:05:21 INFO time_manager.py:581 NTP offset: -0.003 seconds
2018-06-14 16:05:33 INFO zookeeper_service.py:573 Zookeeper is running as follower
2018-06-14 16:05:33 INFO zookeeper_service.py:573 Zookeeper is running as follower
2018-06-14 16:05:34 INFO utils.py:391 Waiting for ErgonService to come up
2018-06-14 16:05:48 INFO ha_service.py:859 Alive Stargates: ['10.0.24.31', '10.0.24.32', '10.0.24.30']
2018-06-14 16:05:48 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.10
2018-06-14 16:05:48 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.11
2018-06-14 16:05:49 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.12
2018-06-14 16:06:04 ERROR cluster_conversion.py:1010 Morphos and Ergon services are not up, cannot check for convert cluster validation/start tasks
2018-06-14 16:06:20 INFO ha_service.py:859 Alive Stargates: ['10.0.24.31', '10.0.24.32', '10.0.24.30']
2018-06-14 16:06:20 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.10
2018-06-14 16:06:21 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.11
2018-06-14 16:06:21 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.12
2018-06-14 16:06:47 INFO service_mgmt_utils.py:120 zk node not created: /appliance/logical/cluster_disabled_services
2018-06-14 16:06:48 INFO zookeeper_service.py:573 Zookeeper is running as follower
2018-06-14 16:06:48 INFO zookeeper_service.py:573 Zookeeper is running as follower
2018-06-14 16:06:50 INFO node_manager.py:1928 Certificate signing request data is not available in Zeus configuration
2018-06-14 16:06:50 INFO node_manager.py:1841 No CA certificates found in the Zeus configuration
2018-06-14 16:06:50 INFO node_manager.py:1844 No Svm certificates found in the Zeus configuration
2018-06-14 16:06:50 INFO node_manager.py:1847 No Svm certificate maps found in the Zeus configuration
2018-06-14 16:06:50 INFO node_manager.py:2038 Certificate cache in sync
2018-06-14 16:06:50 INFO node_manager.py:4641 Checking if we need to sync the local SVM and Hypervisor DNS configuration with Zookeeper
2018-06-14 16:06:50 INFO salt_helper.py:53 Verifying CVM salt states
2018-06-14 16:06:50 INFO salt_helper.py:117 Verifying Hypervisor salt states
2018-06-14 16:06:51 INFO salt_helper.py:187 Set hypervisor cron schedule to daily
2018-06-14 16:06:51 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 16:06:51 INFO helper.py:177 Using salt firewall framework
2018-06-14 16:06:51 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:06:51 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:06:51 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:06:51 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:06:51 INFO helper.py:366 No change in proto-based rules required
2018-06-14 16:06:52 INFO ha_service.py:859 Alive Stargates: ['10.0.24.31', '10.0.24.32', '10.0.24.30']
2018-06-14 16:06:52 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.10
2018-06-14 16:06:53 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.11
2018-06-14 16:06:54 INFO ha_service.py:329 Discovering whether storage traffic is currently being forwarded on host 10.0.24.12
2018-06-14 16:07:07 CRITICAL node_manager.py:2241 Failed to run the installer, ret 1
tail: `/home/nutanix/data/logs/genesis.out' has been replaced; following end of new file
2018-06-14 16:07:10 rolled over log file
2018-06-14 16:07:12 INFO server.py:120 GENESIS START
2018-06-14 16:07:12 INFO server.py:125 Factory config file is found
2018-06-14 16:07:12 INFO server.py:132 Starting the serve_http thread
2018-06-14 16:07:12 INFO layout_updates.py:428 Inspecting hardware layout file for updates.
2018-06-14 16:07:12 INFO node_manager.py:1441 Assigning IP address 192.168.5.2 to eth1
2018-06-14 16:07:12 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1.pid: No such file or directory

2018-06-14 16:07:13 INFO node_manager.py:1441 Assigning IP address 192.168.5.254 to eth1:1
2018-06-14 16:07:13 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1:1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1:1.pid: No such file or directory

2018-06-14 16:07:14 INFO node_manager.py:1786 No CA certificates found in the local filesystem cache
2018-06-14 16:07:14 INFO node_manager.py:1788 Filesystem has 0 CA certificates in the cache
2018-06-14 16:07:14 INFO node_manager.py:1799 No Svm certificates found in the local filesystem cache
2018-06-14 16:07:14 INFO node_manager.py:1801 Filesystem has 0 Svm certificates in the cache
2018-06-14 16:07:14 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:14 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:14 INFO node_manager.py:4693 Svm has configured ip 10.0.24.30 and device eth0 has ip 10.0.24.30
2018-06-14 16:07:14 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:14 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:15 ERROR ipv4config.py:1595 Unable to get the KVM device configuration, ret 1, stdout , stderr br0-backplane: error fetching interface information: Device not found

2018-06-14 16:07:15 INFO node_manager.py:4798 Could not discover the hypervisor backplane IP configuration
2018-06-14 16:07:16 ERROR kvm.py:407 /usr/local/nutanix/cluster/lib/esx5/Menu.dat does not exist on the CVM
2018-06-14 16:07:16 WARNING node_manager.py:4827 Unable to transfer hypervisor-specific tools to the hypervisor
2018-06-14 16:07:16 INFO kvm_utils.py:141 Interface with mac address 50:6b:8d:94:88:8a does not have vlan id
2018-06-14 16:07:16 WARNING node_manager.py:4844 Unable to discover node identity using IPMI
2018-06-14 16:07:16 INFO node_manager.py:4848 Current installed software:
nutanix_release_version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-06-14 16:07:16 INFO node_manager.py:4849 Current svm ip: 10.0.24.30
2018-06-14 16:07:16 INFO node_manager.py:4850 Current zookeeper mapping: {'10.0.24.30': 2, '10.0.24.31': 3, '10.0.24.32': 1}
2018-06-14 16:07:16 INFO node_manager.py:4852 Current zookeeper config version: 1
2018-06-14 16:07:17 INFO kvm_utils.py:141 Interface with mac address 50:6b:8d:94:88:8a does not have vlan id
2018-06-14 16:07:18 INFO kvm_upgrade_helper.py:448 Found release marker: el7.nutanix.20170627
2018-06-14 16:07:18 INFO kvm_upgrade_helper.py:169 Current hypervisor version: el7.nutanix.20170627
2018-06-14 16:07:19 INFO kvm_upgrade_helper.py:448 Found release marker: el7.nutanix.20170627
2018-06-14 16:07:19 INFO kvm_upgrade_helper.py:169 Current hypervisor version: el7.nutanix.20170627
2018-06-14 16:07:19 INFO zookeeper_service.py:218 Zookeeper ensemble from node service is, ['10.0.24.30', '10.0.24.31', '10.0.24.32']
2018-06-14 16:07:19 INFO zookeeper_service.py:219 Last known zookeeper ensemble is, {'10.0.24.30': 2, '10.0.24.31': 3, '10.0.24.32': 1}
2018-06-14 16:07:19 INFO node_manager.py:3384 Reading the cached Zeus config to get the cluster type
2018-06-14 16:07:19 INFO lifecycle_manager.py:40 LCM framework module not found
2018-06-14 16:07:19 INFO interfaces.py:39 Starting ClusterManger
2018-06-14 16:07:19 INFO interfaces.py:44 Starting SnapshotManger
2018-06-14 16:07:19 INFO interfaces.py:34 Starting GenesisRpcService
2018-06-14 16:07:19 INFO node_manager.py:5121 Stopping Foundation service
2018-06-14 16:07:19 INFO node_manager.py:3338 Removing tagged interfaces []
2018-06-14 16:07:19 WARNING node_manager.py:3356 Route change failed: 0
2018-06-14 16:07:19 INFO node_manager.py:5929 Zookeeper start is unnecessary
2018-06-14 16:07:19 INFO node_manager.py:2452 Starting NodeManager
2018-06-14 16:07:19 INFO node_manager.py:1441 Assigning IP address 192.168.5.2 to eth1
2018-06-14 16:07:19 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1.pid: No such file or directory

2018-06-14 16:07:20 INFO node_manager.py:1441 Assigning IP address 192.168.5.254 to eth1:1
2018-06-14 16:07:20 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1:1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1:1.pid: No such file or directory

2018-06-14 16:07:20 INFO node_manager.py:5829 Trying to sync hosts file
2018-06-14 16:07:20 INFO node_manager.py:5866 /etc/hosts updation is unnecessary
2018-06-14 16:07:20 INFO node_manager.py:2468 Node is found configured with ip:10.0.24.30 zkmap:{'10.0.24.30': 2, '10.0.24.31': 3, '10.0.24.32': 1}
2018-06-14 16:07:20 INFO cpu_unblock_utils.py:66 Spawned cpu_unblock with PID 21428
2018-06-14 16:07:20 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 16:07:20 INFO node_manager.py:3384 Reading the cached Zeus config to get the cluster type
2018-06-14 16:07:20 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 16:07:20 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:20 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:20 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:20 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:20 INFO zookeeper_session.py:110 genesis is attempting to connect to Zookeeper
2018-06-14 16:07:20 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:20 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:20 INFO utils.py:500 Detected current state as kBaseConfig for firewall config
2018-06-14 16:07:21 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:21 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:21 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:21 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:21 INFO salt.py:268 Executing salt call for CVM
2018-06-14 16:07:23 INFO salt.py:277 Successfully executed salt command
2018-06-14 16:07:23 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:23 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:23 INFO helper.py:366 No change in proto-based rules required
2018-06-14 16:07:23 INFO node_manager.py:4619 iptables reapplied successfully
2018-06-14 16:07:23 INFO node_manager.py:2477 Firewall initialized
2018-06-14 16:07:23 INFO node_manager.py:5921 Configuring zookeeper
2018-06-14 16:07:23 INFO node_manager.py:3384 Reading the cached Zeus config to get the cluster type
2018-06-14 16:07:23 INFO zookeeper_service.py:299 Environment variables are modified; current shell may need to be updated
2018-06-14 16:07:23 INFO node_manager.py:5929 Zookeeper start is unnecessary
2018-06-14 16:07:23 INFO node_manager.py:2493 Trying to join the configured cluster with zookeepers ['10.0.24.30', '10.0.24.31', '10.0.24.32']...
2018-06-14 16:07:23 INFO zookeeper_session.py:110 genesis is attempting to connect to Zookeeper
2018-06-14 16:07:23 INFO node_manager.py:2515 Successfully joined.
2018-06-14 16:07:23 INFO kvm_utils.py:141 Interface with mac address 50:6b:8d:94:88:8a does not have vlan id
2018-06-14 16:07:24 INFO kvm_upgrade_helper.py:448 Found release marker: el7.nutanix.20170627
2018-06-14 16:07:24 INFO kvm_upgrade_helper.py:169 Current hypervisor version: el7.nutanix.20170627
2018-06-14 16:07:24 INFO zookeeper_service.py:590 Creating ZookeeperSession for the first time
2018-06-14 16:07:24 INFO zookeeper_session.py:110 genesis is attempting to connect to Zookeeper
2018-06-14 16:07:24 INFO zookeeper_session.py:110 genesis is attempting to connect to Zookeeper
2018-06-14 16:07:24 INFO zookeeper_session.py:453 Closing old zookeeper session: KANA1H0AZANHYJzWyx0SWO34Zrgq740E
2018-06-14 16:07:24 INFO zookeeper_service.py:600 Enabling Zookeeper connection watchdog
2018-06-14 16:07:24 INFO node_manager.py:1841 No CA certificates found in the Zeus configuration
2018-06-14 16:07:24 INFO node_manager.py:1844 No Svm certificates found in the Zeus configuration
2018-06-14 16:07:24 INFO node_manager.py:1847 No Svm certificate maps found in the Zeus configuration
2018-06-14 16:07:24 INFO node_manager.py:2106 Copying ca.pem file to the host
2018-06-14 16:07:24 INFO node_manager.py:2122 CA certificate bundle /home/nutanix/certs/ca.pem is updated
2018-06-14 16:07:24 INFO node_manager.py:2531 Configuring Hades
2018-06-14 16:07:26 ERROR node_manager.py:2538 Failed to configure Hades
2018-06-14 16:07:26 INFO node_manager.py:2564 Starting the trigger_firewall_redeploy thread
2018-06-14 16:07:26 INFO node_manager.py:2587 Starting the sync_configuration_thr
2018-06-14 16:07:26 INFO node_manager.py:1928 Certificate signing request data is not available in Zeus configuration
2018-06-14 16:07:26 INFO node_manager.py:1841 No CA certificates found in the Zeus configuration
2018-06-14 16:07:26 INFO node_manager.py:1844 No Svm certificates found in the Zeus configuration
2018-06-14 16:07:26 INFO node_manager.py:1847 No Svm certificate maps found in the Zeus configuration
2018-06-14 16:07:26 INFO node_manager.py:2038 Certificate cache in sync
2018-06-14 16:07:26 INFO node_manager.py:3943 Syncing NTP settings on Acropolis managed host
2018-06-14 16:07:26 INFO node_manager.py:6652 Configuring the hypervisor with NTP servers
2018-06-14 16:07:27 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:27 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:28 INFO node_manager.py:4641 Checking if we need to sync the local SVM and Hypervisor DNS configuration with Zookeeper
2018-06-14 16:07:28 INFO node_manager.py:4655 Dns settings on the svm None are not in-sync with zeus's set([u'10.0.24.22', u'10.0.24.21', u'1.1.1.1'])
2018-06-14 16:07:28 INFO node_manager.py:4661 Name server list without duplicates: 1.1.1.1 10.0.24.21 10.0.24.22
2018-06-14 16:07:28 INFO node_manager.py:4672 Sycning DNS settings for Acropolis managed host
2018-06-14 16:07:28 INFO node_manager.py:6676 Configuring the hypervisor with DNS servers set([u'10.0.24.22', u'10.0.24.21', u'1.1.1.1'])
2018-06-14 16:07:28 INFO node_manager.py:6688 No DNS configuration change on the hypervisor required
2018-06-14 16:07:28 INFO salt_helper.py:53 Verifying CVM salt states
2018-06-14 16:07:28 INFO salt_helper.py:117 Verifying Hypervisor salt states
2018-06-14 16:07:29 INFO salt_helper.py:187 Set hypervisor cron schedule to daily
2018-06-14 16:07:29 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 16:07:29 INFO helper.py:177 Using salt firewall framework
2018-06-14 16:07:29 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:29 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:29 INFO ipv4config.py:809 Netmask is 255.255.255.0
2018-06-14 16:07:29 INFO ipv4config.py:843 Discovered network information: hwaddr 50:6b:8d:94:88:8a, address 10.0.24.30, netmask 255.255.255.0, gateway 10.0.24.1, vlan None
2018-06-14 16:07:29 INFO helper.py:366 No change in proto-based rules required
2018-06-14 16:07:29 INFO node_manager.py:4519 Adding keys to authorized keys file
2018-06-14 16:07:29 INFO sshkeys_helper.py:396 Removing keys from admin authorized_keys
2018-06-14 16:07:29 ERROR sudo.py:158 Failed to run cmd sudo mv /tmp/tmpK1lqbS /home/admin/.ssh/authorized_keys2 as with error mv: cannot move `/tmp/tmpK1lqbS' to `/home/admin/.ssh/authorized_keys2': No such file or directory

2018-06-14 16:07:29 INFO node_manager.py:4523 Adding certs to authorized certs file
2018-06-14 16:07:30 INFO node_manager.py:6772 Changing sshd password authentication to True
2018-06-14 16:07:30 INFO node_manager.py:6781 Restarting sshd
2018-06-14 16:07:31 INFO kvm.py:707 Updating the password authentication on the local KVM host
2018-06-14 16:07:31 INFO salt_helper.py:237 Invoking salt lockdown
Userlevel 1
Badge +2
Everything is down now... all the VMs died about an hour ago... 😕 Very frustrating.
Userlevel 2
So, back to the chief complaint: you are doing an upgrade and it's hung. Are any of the nodes at 99 or 100%?

SSH/PuTTY into a CVM and post the output of:

upgrade_status

The reason I ask is that we upgrade the first node, reboot it, and then start the upgrade on the other nodes. There is a token involved that gets passed off as each node is upgraded and completes. If something gets bungled up, that token doesn't transfer.

This could be caused by several things, but the most notable one I have seen is when the nodes aren't all time-synced.

If they are all synced, then we have to find the node that is holding the token and restart it with:

"genesis restart"

If you could look through the logs for errors that look something like this:

2014-12-18 15:51:24 ERROR node_manager.py:1728 Failed to start ArithmosService. Fix the problem and start again.

or, more importantly in your case, this one, so we can ascertain what the hang-up is.

2014-12-18 15:51:43 INFO cluster_manager.py:2054 Not releasing token since services are not up and running
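A hedged way to check both of those things from any CVM (allssh and cluster status are standard CVM utilities; the genesis log path is the usual default):

allssh date                                                             # CVM clocks should agree to within a second or two
cluster status | grep -v UP                                             # quick look at any services that are not up
allssh 'grep -i token /home/nutanix/data/logs/genesis.out | tail -5'    # see which node last held or released the token
genesis restart                                                         # on the node holding the token, once time is in sync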
Userlevel 1
Badge +2
So, back to the chief complaint: you are doing an upgrade and it's hung. Are any of the nodes at 99 or 100%?

SSH/PuTTY into a CVM and post the output of:

upgrade_status

The reason I ask is that we upgrade the first node, reboot it, and then start the upgrade on the other nodes. There is a token involved that gets passed off as each node is upgraded and completes. If something gets bungled up, that token doesn't transfer.

This could be caused by several things, but the most notable one I have seen is when the nodes aren't all time-synced.

If they are all synced, then we have to find the node that is holding the token and restart it with:

"genesis restart"

If you could look through the logs for errors that look something like this:

2014-12-18 15:51:24 ERROR node_manager.py:1728 Failed to start ArithmosService. Fix the problem and start again.

or, more importantly in your case, this one, so we can ascertain what the hang-up is.

2014-12-18 15:51:43 INFO cluster_manager.py:2054 Not releasing token since services are not up and running


I did the pre-upgrade check and it passed. I started the process and in the GUI it only showed that it was working on the first node, and then Prism timed out; the percentage was very similar to the only screenshot I took for that Jan release.

Here's the output of upgrade_status:

2018-06-14 17:55:18 INFO zookeeper_session.py:102 upgrade_status is attempting to connect to Zookeeper
2018-06-14 17:55:18 INFO upgrade_status:38 Target release version: el7.3-release-ce-2018.05.01-stable-bd83b08c5cc713fa717482dc27275389d5bd16ee
2018-06-14 17:55:18 INFO upgrade_status:43 Cluster upgrade method is set to: automatic rolling upgrade
2018-06-14 17:55:18 INFO upgrade_status:96 SVM 10.0.24.30 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-06-14 17:55:18 INFO upgrade_status:96 SVM 10.0.24.31 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-06-14 17:55:18 INFO upgrade_status:96 SVM 10.0.24.32 still needs to be upgraded. Installed release version: el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a


When grepping the logs, the only thing that comes up with a reference to a token is about Cassandra:

2018-06-14 17:51:42 INFO cassandra_service.py:449 total nodes 3, new_node_flag false, cassandra state 0, cassandra token fKfKfKfKgQwKOhuTvJ26KmyHO8cx0dv2DsTj094iB2TeW3J4pStBnYEiTh4G

Another error:

2018-06-14 18:02:22 ERROR kvm_upgrade_helper.py:1394 Unable to find acropolis bundle, cannot populate upgrade info
2018-06-14 18:02:22 ERROR cluster_manager.py:3951 Zknode /appliance/logical/genesis/firmware_upgrade_params is empty, skipping firmware upgrad e
Userlevel 1
Badge +2
Noticed some more errors after I did the 'genesis restart':

2018-06-14 18:09:33 ERROR snmp_service.py:86 Failed to start alert_manager
2018-06-14 18:09:33 ERROR node_manager.py:2828 Failed to start SnmpService. Fix the problem and start again.

2018-06-14 18:09:34 ERROR secure_file_sync_service.py:99 Failed to start secure_file_sync
2018-06-14 18:09:34 ERROR node_manager.py:2828 Failed to start SecureFileSyncService. Fix the problem and start again.
Userlevel 1
Badge +2
Grep'd the log for ERROR:

2018-06-14 18:22:04 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1.pid: No such file or directory
2018-06-14 18:22:04 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1:1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1:1.pid: No such file or directory
2018-06-14 18:22:06 ERROR ipv4config.py:1595 Unable to get the KVM device configuration, ret 1, stdout , stderr br0-backplane: error fetching interface information: Device not found
2018-06-14 18:22:07 ERROR kvm.py:407 /usr/local/nutanix/cluster/lib/esx5/Menu.dat does not exist on the CVM
2018-06-14 18:22:10 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1.pid: No such file or directory
2018-06-14 18:22:11 ERROR sudo.py:25 Failed to load file /var/run/dhclient-eth1:1.pid, ret 1, stdout , stderr cat: /var/run/dhclient-eth1:1.pid: No such file or directory
2018-06-14 18:22:12 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:22:12 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:22:17 ERROR node_manager.py:2538 Failed to configure Hades
2018-06-14 18:22:20 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:22:21 ERROR sudo.py:158 Failed to run cmd sudo mv /tmp/tmp1YjuZs /home/admin/.ssh/authorized_keys2 as with error mv: cannot move `/tmp/tmp1YjuZs' to `/home/admin/.ssh/authorized_keys2': No such file or directory
2018-06-14 18:22:55 ERROR salt_helper.py:261 Error executing /srv/salt/statechange lockdown off command on host. ret: -1, stdout: stderr:
2018-06-14 18:22:56 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:22:56 ERROR service_mgmt_utils.py:142 File /home/nutanix/config/genesis/service_profiles/default_disabled_services.json does not exist
2018-06-14 18:22:56 ERROR cluster_manager.py:548 Error in setting default services in zookeeper
2018-06-14 18:22:56 ERROR cluster_manager.py:3835 Could not initialize disabled services.
2018-06-14 18:22:56 ERROR kvm_upgrade_helper.py:1394 Unable to find acropolis bundle, cannot populate upgrade info
2018-06-14 18:22:56 ERROR cluster_manager.py:3951 Zknode /appliance/logical/genesis/firmware_upgrade_params is empty, skipping firmware upgrade
2018-06-14 18:22:59 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:23:00 ERROR utils.py:505 kvm_host_bundle not found in /home/nutanix/data/installer/el6-release-ce-2017.07.20-stable-ade616dc5b2ab1cf01b7f42208bbd164ae7a5a3a
2018-06-14 18:23:00 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:23:00 ERROR notify.py:103 notification=StargateStatus available=True ip_address=10.0.24.31 service_vm_id=5 downtime=0
2018-06-14 18:23:00 ERROR notify.py:103 notification=StargateStatus available=True ip_address=10.0.24.30 service_vm_id=4 downtime=0
2018-06-14 18:23:00 ERROR notify.py:103 notification=StargateStatus available=True ip_address=10.0.24.32 service_vm_id=6 downtime=0
2018-06-14 18:23:04 ERROR command.py:155 Failed to execute systemctl is-active iptables: [Errno 2] No such file or directory
2018-06-14 18:23:11 ERROR secure_file_sync_service.py:99 Failed to start secure_file_sync
2018-06-14 18:23:11 ERROR node_manager.py:2828 Failed to start SecureFileSyncService. Fix the problem and start again.
Userlevel 1
Badge +2
Any help is welcome. Acropolis is down so I can't even get into acli to try to export the VMs. This is not good.
Userlevel 7
Badge +34
Hi @jantjo

If you find an answer that helps - please consider clicking the 'like' and 'best answer' link as that will help others in the community find the answer much quicker. Thanks for being part of the community! 👍
Userlevel 6
Badge +16
What is the current status @jantjo?
Badge +3
What is the current status @jantjo?
Unfortunately I lost everything and went a different direction for now. As my setup has NVMe for the boot/hot tier, which is not supported (at least in the 2018 stream for some reason), that was the cause of the issue.
Userlevel 6
Badge +16
Below is what I've determined; the only version that supports NVMe boot (hot tier) is the 2017 version. I wish they would enable/fix this, but I'm not too sure why the change was made.

SSD=Hot Tier/HDD=Cold Tier - This is supported
SSD=Hot Tier/SSD=Cold Tier - This is supported
SSD=Hot Tier/NVMe=Cold Tier - This is supported (*)
NVMe=Hot Tier/SSD=Cold Tier - This is not supported (*)
NVMe=Hot Tier/HDD=Cold Tier - This is not supported.

(*) From the release notes: due to current limitations of the installer you cannot select the hot tier target, so the hot tier needs to be smaller in capacity than the cold tier.
Userlevel 1
Badge +2
Below is what I've determined; the only version that supports NVMe boot (hot tier) is the 2017 version. I wish they would enable/fix this, but I'm not too sure why the change was made.

SSD=Hot Tier/HDD=Cold Tier - This is supported
SSD=Hot Tier/SSD=Cold Tier - This is supported
SSD=Hot Tier/NVMe=Cold Tier - This is supported (*)
NVMe=Hot Tier/SSD=Cold Tier - This is not supported (*)
NVMe=Hot Tier/HDD=Cold Tier - This is not supported.

(*) From the release notes: due to current limitations of the installer you cannot select the hot tier target, so the hot tier needs to be smaller in capacity than the cold tier.


Thanks for quoting me :)

https://next.nutanix.com/discussion-forum-14/ce-nuc-cluster-hardware-question-28276/index1.html#post31374