Solved

Disk removal task stuck for weeks.

  • 24 October 2023

Hi,

I have a single-node Dell PowerEdge CE cluster with all SSDs. The tier 1 SSDs are SAS and the tier 2 SSDs are SATA. CE decided to remove one of the SATA SSDs, even though the node’s iDRAC shows all disks are good.

I need to get the disk removal task stopped.

Please let me know what I need to do to resolve this issue.

Thanks for any help.


Best answer by JeroenTielen 24 October 2023, 09:03


5 replies


Here is a nice guide from Manish on how to stop/kill stuck tasks: https://hyperhci.com/2020/02/14/how-to-kill-nutanix-stuck-hung-task-via-command/

 

Remember: There is no disk resiliency in a single-node cluster. ;)

I tried all of the steps in Manish’s guide and was able to remove the task, but the task reappeared seconds later. Any further help is much appreciated.


Try running this command again first:

 

NTNX-A-CVM::~$ ergon_update_task --task_uuid='<Task UUID>' --task_status=aborted

 

If the status doesn’t change, try again with:

NTNX-A-CVM::~$ ergon_update_task --task_uuid='<Task UUID>' --task_status=succeeded

If the task comes back after that, then it really is still running. In that case I would investigate the running processes and logs to see where the process is getting stuck (and post here for more help).
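As a rough illustration of the order suggested above, here is a small Python dry run that only builds and prints the two `ergon_update_task` invocations (first `aborted`, then `succeeded`). It does not execute anything on the cluster. The helper name `ergon_update_commands` is made up for this sketch, and the flags are simply the ones quoted in this thread.

```python
import shlex


def ergon_update_commands(task_uuid, statuses=("aborted", "succeeded")):
    """Build one ergon_update_task argv per status, in the order to try them.

    Hypothetical helper: it only assembles the command lines quoted in this
    thread and never runs them.
    """
    return [
        ["ergon_update_task", f"--task_uuid={task_uuid}", f"--task_status={status}"]
        for status in statuses
    ]


if __name__ == "__main__":
    # "<Task UUID>" is the same placeholder used in the commands above.
    for argv in ergon_update_commands("<Task UUID>"):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to double-check the UUID before pasting them into a CVM session.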

@jesseCR 

The task is not showing in the ecli command below.

NTNX-A-CVM::~$ ecli task.list include_completed=false
Task UUID  Parent Task UUID  Component  Sequence-id  Type  Status
 

The following command did show the task:

NTNX-A-CVM::~$ progress_monitor_cli --fetchall
================== Proto Start =========================
logical_timestamp: 25318
progress_info_id {
  operation: kRemove
  entity_type: kDisk
  entity_id: "19"
}
title_message: "Removing disk 19 from node x.x.x.x"
start_time_secs: 1696293640
progress_task_list {
  component: kCurator
  task_tag: "Last submitted task count:191443"
  start_time_secs: 1696293640
  last_updated_time_secs: 1699603839
  task_message: "Extent Store Replication"
  percentage_complete: 0
  progress_status: kRunning
  attribute_list {
    attribute_name: "NumSubmittedTasks"
    attribute_value: 191443
  }
  attribute_list {
    attribute_name: "NumFinishedTasks"
    attribute_value: 0
  }
  attribute_list {
    attribute_name: "NumZeroCounts"
    attribute_value: 0
  }
}
time_to_live_secs: 900
=================== Proto End ==========================
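The counters and timestamps in that output can be read programmatically. The following is a minimal Python sketch, assuming the text layout stays exactly as pasted above; the helper names (`first_int`, `attributes`, `summarize`) are made up for illustration and are not part of any Nutanix tooling. It reports when the removal started, how many days passed between the start and the last update, and the submitted versus finished task counts.

```python
import re
from datetime import datetime, timezone

# Excerpt of the progress_monitor_cli output quoted in this thread.
SAMPLE = """\
start_time_secs: 1696293640
progress_task_list {
  component: kCurator
  start_time_secs: 1696293640
  last_updated_time_secs: 1699603839
  percentage_complete: 0
  progress_status: kRunning
  attribute_list {
    attribute_name: "NumSubmittedTasks"
    attribute_value: 191443
  }
  attribute_list {
    attribute_name: "NumFinishedTasks"
    attribute_value: 0
  }
}
"""


def first_int(text, field):
    """Return the first integer found on a `field: <int>` line, or None."""
    pattern = rf"^\s*{re.escape(field)}:\s*(\d+)\s*$"
    match = re.search(pattern, text, re.MULTILINE)
    return int(match.group(1)) if match else None


def attributes(text):
    """Map each attribute_name to its integer attribute_value."""
    pairs = re.findall(
        r'attribute_name:\s*"([^"]+)"\s*attribute_value:\s*(\d+)', text
    )
    return {name: int(value) for name, value in pairs}


def summarize(text):
    """Collect start time, elapsed days, and task counters from the output."""
    start = first_int(text, "start_time_secs")
    updated = first_int(text, "last_updated_time_secs")
    attrs = attributes(text)
    return {
        "started_utc": datetime.fromtimestamp(start, tz=timezone.utc),
        "days_running": (updated - start) / 86400,
        "submitted": attrs.get("NumSubmittedTasks"),
        "finished": attrs.get("NumFinishedTasks"),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

A large submitted count next to a finished count of zero, over a span of weeks, is consistent with the removal making no progress rather than just running slowly.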

Thank you for the help.


If you can dig that task’s UUID out of anything, this should at least change the status for you. I had a similar issue on ESXi earlier this year, where systems would not finish going into or out of maintenance mode even though they actually had, and this worked for me:

 

~/bin/ergon_update_task --task_uuid=xxxxx --task_status=succeeded

 

I’m surprised that you’re getting the message and not seeing anything in ergon… If I run a similar command in my environment I definitely get some output pretty much any time:

 
