Disk removal task stuck for weeks.

  • 24 October 2023
  • 5 replies


I have a Dell PowerEdge CE single-node cluster with all SSDs. The tier 1 SSDs are SAS and the tier 2 SSDs are SATA. CE decided to remove one of the SATA SSDs, but the node's iDRAC shows all disks as good.

I need to get the Disk removal task stopped.

Please let me know what I need to do to resolve this issue.

Thanks for any help.


Best answer by JeroenTielen 24 October 2023, 09:03



Here is a nice guide from Manish on how to stop/kill stuck tasks:


Remember: There is no disk resiliency in a single node cluster. ;) 

I tried all of the steps in Manish’s guide and was able to remove the task, but the task reappeared seconds later. Any further help is much appreciated.


Try running this command again first:


NTNX-A-CVM::~$ ergon_update_task --task_uuid='<Task UUID>' --task_status=aborted


If the status doesn’t change, try again with:

NTNX-A-CVM::~$ ergon_update_task --task_uuid='<Task UUID>' --task_status=succeeded

If the task comes back after that, then it really is still running, and I would investigate the running processes and logs to see where it is getting stuck (and post here for more help).


The task does not show up in the output of the ecli command below.

NTNX-A-CVM::~$ ecli task.list include_completed=false
Task UUID  Parent Task UUID  Component  Sequence-id  Type  Status

The following command did show the task:

NTNX-A-CVM::~$ progress_monitor_cli --fetchall
================== Proto Start =========================
logical_timestamp: 25318
progress_info_id {
  operation: kRemove
  entity_type: kDisk
  entity_id: "19"
}
title_message: "Removing disk 19 from node x.x.x.x"
start_time_secs: 1696293640
progress_task_list {
  component: kCurator
  task_tag: "Last submitted task count:191443"
  start_time_secs: 1696293640
  last_updated_time_secs: 1699603839
  task_message: "Extent Store Replication"
  percentage_complete: 0
  progress_status: kRunning
  attribute_list {
    attribute_name: "NumSubmittedTasks"
    attribute_value: 191443
  }
  attribute_list {
    attribute_name: "NumFinishedTasks"
    attribute_value: 0
  }
  attribute_list {
    attribute_name: "NumZeroCounts"
    attribute_value: 0
  }
}
time_to_live_secs: 900
=================== Proto End ==========================

Thank you for the help.
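
For anyone reading the progress_monitor_cli output above, here is a rough Python sketch that decodes the numbers in it. It assumes the `*_secs` fields are Unix epoch seconds (the field names suggest this, but it is not confirmed here), and the variable names are just illustrative; the values are copied by hand from the output rather than parsed from it.

```python
from datetime import datetime, timezone

# Values copied by hand from the progress_monitor_cli output above.
start_time_secs = 1696293640
last_updated_time_secs = 1699603839
num_submitted_tasks = 191443
num_finished_tasks = 0


def to_utc(epoch_secs):
    """Convert epoch seconds to a timezone-aware UTC datetime.

    Assumption: the *_secs fields are Unix epoch seconds.
    """
    return datetime.fromtimestamp(epoch_secs, tz=timezone.utc)


started = to_utc(start_time_secs)
last_updated = to_utc(last_updated_time_secs)
elapsed = last_updated - started

# Share of submitted Curator tasks that are reported as finished.
percent_finished = 100.0 * num_finished_tasks / num_submitted_tasks

print("started:          ", started.isoformat())
print("last updated:     ", last_updated.isoformat())
print("elapsed (days):   ", elapsed.days)
print("percent finished: ", percent_finished)
```

If the epoch assumption holds, the removal started on 3 October 2023 and was still being updated about 38 days later with none of the 191443 submitted tasks reported as finished. That would fit a removal that keeps getting refreshed but never makes progress, which is what you might expect on a single node with nowhere to replicate the data.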


If you can dig that task’s UUID out of anything, this should at least change the status for you. I had a similar issue on ESXi earlier this year, where systems would not finish going into or out of maintenance mode even though they actually had, and this worked for me:


~/bin/ergon_update_task --task_uuid=xxxxx --task_status=succeeded


I’m surprised that you’re getting the message and not seeing anything in ergon… If I run a similar command in my environment I definitely get some output pretty much any time: