Question

adding back to metadata - stuck 98%


Badge +1
Hi All,

We had an issue with one of our 4 nodes: catastrophic hard drive failures took out the RAID. Before the RAID fully died, we had started the "add back to metadata store" operation through the GUI, and it got all the way to 98% and then hung...

Task Name : Adding node x.x.x.3
Operation : add
Entity : [node]
Entity Id : [6]
Status : running
Percentage Completed : 98
Start Time : 06/30/2018 06:31:45 UTC
End Time :
Last Updated Time : 06/30/2018 06:37:32 UTC

Subtask Message : Transferring metadata from replicas
Component : medusa
Task Tag : Metadata Transfer Phase
Status : running
Percentage Completed : 98
Start Time : 06/30/2018 06:31:45 UTC
End Time : 01/01/1970 00:00:00 UTC
Last Updated Time : 06/30/2018 06:37:32 UTC

This is about when the RAID actually quit on node 3, so we could not do anything with it. We then tried to run the removal command for node 3:
host delete id=000xxxxx-xxxx-xxxx-xxxx-000xxxxxxxxx::x
Error: Cannot mark node for removal. Node 3 (x.x.x.3) is to be added back to metadata store

In the meantime we replaced the disks, rebuilt the RAID and tried to re-add the node, hoping that somehow the process would detect the node and cancel/restart itself.

Sadly, that did not work.

When looking at the token ring, I can see the following:
Address Status State Load Owns Token
xx
x.x.x.2 Up Normal 30.92 GB 25.00% xx1
x.x.x.3 Down Limbo ? bytes 25.00% FV0
x.x.x.1 Up Normal 29.1 GB 25.00% xx3
x.x.x.4 Up Normal 29.89 GB 25.00% xx4

Node 3 is down, in limbo, and has an odd token of FV000000.....
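
For reference, this ring output comes from running nodetool against the local Cassandra instance on a CVM, along the lines of:
# list the Cassandra token ring from any CVM (run as the nutanix user)
nodetool -h localhost ring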

So currently we have a rebuilt node 3, but we can't get the old node 3 config cleared and have no way of actually cancelling the task stuck at 98%, which means we can't fully remove the old node 3 in order to add the new node 3 back in.

This topic has been closed for comments

24 replies

Badge +1
List tasks using progress_monitor_cli:
progress_monitor_cli -fetchall
Remove a task using progress_monitor_cli:
progress_monitor_cli --entity_id= --operation= --entity_type= --delete
# NOTE: operation and entity_type should be all lowercase with k removed from the beginning
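
A quick way to find the values to plug into the delete command is to filter the -fetchall dump, e.g. (field names as they appear in the dump; adjust if yours differ):
progress_monitor_cli -fetchall | egrep "entity_id|operation|entity_type"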

It was really helpful to find the notes above...

This finally removed the task!!:
progress_monitor_cli --entity_id=6 --operation=add --entity_type=node --delete

But sadly, it only deleted the task entry, it did not actually cancel the underlying operation!! I'm still getting the same errors..
Userlevel 3
Badge +7
I would suggest involving Nutanix Support.
Badge +1
CE edition - is there support available other than the forums, or am I missing something? I thought this is where I needed to post this..
Userlevel 6
Badge +16
On a CVM check acli task.list, then task.cancel the removing task. This is the correct forum to post issues regarding CE.
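
For example (the task UUID below is just a placeholder, and whether task.cancel is available can vary by version):
acli task.list
acli task.cancel <task-uuid>
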
The only tasks we see in acli task.list show as succeeded, but when we try to remove the node, the error states it is still trying to be added to the metadata store.
Sorry, we have two accounts: one for CE and one for our two paid clusters... wish we could call support for this...
Badge +1
I have even tried to remove the two disks for that node; doing so created two more tasks, stuck at 0 percent, that won't continue, and of course there is no way to cancel them and no visibility through acli task.list or ncli task list.
Badge +1
If we try to re-add the node to the metadata store, I get this:
ncli> host enable-metadata-store id=000xxxxx-xxxx-xxxx-xxxx-000xxxxxxxxx::x
Error: Metadata store is already enabled on host 000xxxxx-xxxx-xxxx-xxxx-000xxxxxxxxx.

Yet if we try to remove the host through the GUI, we get this error:
Cannot mark node for removal. Node 3 (x.x.x.3) is to be added back to metadata store
Badge +1
found this:
Error: Failed to add node, node with x.x.x.3 is not in normal cassandra status, its cassandra status is kToBeAddedToRing
Badge +1
OK, so I think we need to look at removing the token from the ring, and then we might be able to move forward. I found this article:
http://nutanix.blogspot.com/2013/09/what-node-removal-process-does-in.html
but it's quite old and most of it does not match the commands I see.. and I don't know how to get into a host that no longer exists...
.3 has been formatted and is waiting to be rejoined, so I can't log into .3 and remove it from there; the other nodes see node 3's token in the ring as in limbo, but I don't see a removal option.
Badge +1
was able to get this going:
allssh nodetool -h localhost removetoken FV0000000000000000000000000000000000000000000000000000000000
So the broken token has now been removed from all nodes... trying next steps.
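
To double-check, re-listing the ring on every CVM (same allssh/nodetool pattern as above) should show the limbo entry gone:
# the FV0... limbo entry should no longer appear on any node
allssh "nodetool -h localhost ring"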
Badge +1
ncli> host remove-start id=3
Error: Cannot mark node for removal. Node 3 (x.x.x.3) is to be added back to metadata store
Badge +1
ncli> host add-node node-uuid=111xxxxx-xxxx-xxxx-xxxx-111xxxxxxxxx
Error: Failed to add node, node with x.x.x.3 is not in normal cassandra status, its cassandra status is kToBeAddedToRing
Badge +1
found this: ncli host ls|grep "Metadata store status"

Metadata store status : Metadata store enabled on the node
Metadata store status : Metadata store enabled on the node
Metadata store status : Node marked to be added back to metadata store
Metadata store status : Metadata store enabled on the node
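
To tie each of those status lines back to a specific host, widening the grep a little helps (Id and Name are the field names ncli host ls prints on our version):
ncli host ls | egrep "Id|Name|Metadata store status"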
Badge +1
Found this: zeus_config_printer
It shows the following, which needs to be removed or cleaned up:

rackable_unit_list {
  rackable_unit_id: xx
  rackable_unit_serial: "cxxxxxx2"
  rackable_unit_model: kUseLayout
  rackable_unit_model_name: "CommunityEdition"
  rackable_unit_uuid: "000xxxxx-xxxx-xxxx-xxxx-000xxxxxxxxx"
}
dyn_ring_change_info {
  service_vm_id_being_removed: -1
  service_vm_id_doing_ring_change: -1
  service_vm_id_being_added: 6
  service_vm_id_disk_being_replaced: -1
  disk_being_replaced: -1
  service_vm_id_doing_rf_migration: -1
  ring_change_start_time: 1529789670037995
  ring_change_progress {
    ring_change_phase: "Metadata Transfer Phase"
    ring_change_percent_complete: 98.484848484848484
    last_update_time: 1530341205883705
  }
  ring_change_op_id: 10640
}
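
For anyone else chasing this, the stuck block can be pulled out on its own without scrolling the whole config:
# show just the ring-change section of the Zeus config
zeus_config_printer | grep -A15 dyn_ring_change_info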
Badge +1
Even though this task has been removed from progress monitor, I think it's still running somewhere within medusa:
Subtask Message : Transferring metadata from replicas
Component : medusa
Task Tag : Metadata Transfer Phase
Status : running
Percentage Completed : 98
Start Time : 06/30/2018 06:31:45 UTC
End Time : 01/01/1970 00:00:00 UTC
Last Updated Time : 06/30/2018 06:37:32 UTC

and doing a cluster stop and start does not clear it out either..
Userlevel 6
Badge +16
Maybe check this thread: https://next.nutanix.com/discussion-forum-14/cassandra-in-forwarding-mode-28209
Badge +1
Hi Primzy,

I'm not really sure how that thread helps me. The issue is that we have a node stuck being added to the metadata store in the Cassandra services, and we want to remove it, not add the node back in - the node would need to be added to Prism first, but we cannot do that for two reasons:
1) we already have 4 nodes in CE edition, one of which is the one we want to remove and re-add.
2) we have already formatted the node we are trying to remove, and cannot run commands on that node to remove it, as it is now effectively a new node.
Badge +1
found this article:
https://next.nutanix.com/discussion-forum-14/unable-to-upgrade-ring-change-event-in-progress-cassandra-status-for-node-is-not-normal-20299

showing this command:
ls -ltr ~/data/logs/*FATAL | tail

which is showing me the following:
/home/nutanix/data/logs/cassandra_monitor.FATAL -> cassandra_monitor.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180706-045055.2823
/home/nutanix/data/logs/chronos_node_main.FATAL -> chronos_node_main.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-020401.25397
/home/nutanix/data/logs/insights_collector.FATAL -> insights_collector.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-132922.15209
/home/nutanix/data/logs/cerebro.FATAL -> cerebro.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-134050.10317
/home/nutanix/data/logs/curator.FATAL -> curator.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-142136.28671
/home/nutanix/data/logs/medusa_printer.FATAL -> medusa_printer.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-142259.23781
/home/nutanix/data/logs/stargate.FATAL -> stargate.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-142411.25277
/home/nutanix/data/logs/pithos.FATAL -> pithos.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-142415.13697
/home/nutanix/data/logs/progress_monitor_cli.FATAL -> progress_monitor_cli.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-155924.25274
/home/nutanix/data/logs/alert_manager.FATAL -> alert_manager.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-161017.9957
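
Each of these is a symlink to the most recent FATAL for that service, so reading one is just, e.g.:
# follow the symlink and show the last few lines
tail -n 20 ~/data/logs/pithos.FATAL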
Badge +1
looked through some of the logs and found this one:
pithos.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180709-142415.13697

Log file created at: 2018/07/09 14:24:15
Running on machine: ntnx-[nodeid]-a-cvm
Log line format: [iwef]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0709 14:24:15.086067 13698 pithos_server.cc:749] Hung pithos operation detected with type: update, client_id: 1531160589495951, finished: 0, op_start_walltime_usecs: 1531160594707612, alarm_walltime_usecs: 1531160655082735, master_handle: x.x.x.1:2016, suiciding...

However, I don't see any FATAL logs associated with the original issue, the stuck adding to the metadata store.
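
To hunt for anything logged about the stuck add itself, a broad grep across the CVM logs seems like the next step (plain grep, nothing Nutanix-specific):
# list log files mentioning the cassandra status from the earlier error
grep -ril "ToBeAddedToRing" ~/data/logs | tail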
Badge +1
found this:
zeus_config_printer | grep -A15 disk_list

and sure enough the two disks for the node we need to remove are listed there...
Badge +1
disk_list {
disk_id: 54
service_vm_id: 3
mount_path: "/home/nutanix/data/stargate-storage/disks/drive-scsi0-0-0-1"
disk_size: 9306537423668
statfs_disk_size: 9756079063040
storage_tier: "DAS-SATA"
data_dir_sublevels: 2
data_dir_sublevel_dirs: 20
to_remove: true
data_migration_status: 4096
disk_location: 2
contains_metadata: true
oplog_disk_size: 233693798400
disk_serial_id: "drive-scsi0-0-0-1"
disk_uuid: "6305af6b-fff6-4e21-b755-678167f17be6"
--
disk_list {
disk_id: 55
service_vm_id: 3
mount_path: "/home/nutanix/data/stargate-storage/disks/drive-scsi0-0-0-0"
disk_size: 697560743936
statfs_disk_size: 972356157440
storage_tier: "SSD-SATA"
data_dir_sublevels: 2
data_dir_sublevel_dirs: 20
to_remove: true
data_migration_status: 4096
disk_location: 1
contains_metadata: true
oplog_disk_size: 233693798400
ccache_disk_size: 21474836480
disk_serial_id: "drive-scsi0-0-0-0"
--
Badge +1
Stopped the cluster, restarted, and checked for failures:
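Roughly, that was just the standard CVM commands (the FATAL check is the same one as earlier):
cluster stop
cluster start
ls -ltr ~/data/logs/*FATAL | tail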

one file for today so far:
alert_manager.ntnx-[nodeid]-a-cvm.nutanix.log.FATAL.20180710-123007.21884

Log file created at: 2018/07/10 12:30:07
Running on machine: ntnx-[nodeid]-a-cvm
Log line format: [iwef]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0710 12:30:07.109616 21885 manage_alerts_rpc_op.cc:58] Check failed: query.entity_list(ii).has_entity_id()
Badge +1
SOLVED:

After much frustration we decided to destroy the cluster and then rebuild, yet all of our issues were still present: destroying the cluster somehow still left the tasks behind on the AHV hosts, and there was no way to properly clear them out.

So we decided to format all the nodes remotely, recreate the bootable USB sticks through IPMI (using dmesg / dd) to get them back to a clean slate, and wiped all the RAID disk sets.
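
The dd part was nothing fancy; per stick it was along these lines (image name and device are placeholders, double-check the target device from dmesg before writing):
# find the target device after plugging in the USB stick
dmesg | tail
# write the CE installer image to it (this overwrites the device!)
dd if=ce-installer.img of=/dev/sdX bs=4M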

On the new install with our hardware, all disks were being seen as SSDs and no HDDs, so we had to do the following:
1) boot off of the USB stick
2) at the login screen, log in as the root AHV account
3) find the HDDs by using either dmesg or fdisk -l, and get the sd? value
4) place a 0 into the rotational file for that sd? (see the sketch after this list):
echo 0 > /sys/block/sdb/queue/rotational
5) exit the root user and go back to the login screen
6) type install and go through the install.
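
A rough loop version of steps 3 and 4, in case you have several disks (device names are examples; in the kernel's eyes 0 = non-rotational/SSD, 1 = rotational/HDD):
# show the current rotational flag for every disk
for d in /sys/block/sd*/queue/rotational; do echo "$d: $(cat $d)"; done
# then set the flag for the disk the installer misclassifies, e.g. sdb, as in step 4
echo 0 > /sys/block/sdb/queue/rotational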


Sadly, we never figured out how to clear up the original problems, and we had to nuke the dev environment again to fix these issues.