I have a problem with a CVM that won’t boot. This is on a semi-retired production cluster (not CE) that has no workloads running on it.
I found the console output in /tmp/NTNX.serial.out.0. It shows svmboot assembling the RAID devices, scanning them for a UUID marker, finding two partitions that carry it, then aborting and unloading the mpt3sas kernel module before trying again in 5 seconds. This repeats a few times before the hypervisor resets the CVM and it starts booting again.
The most relevant sections of the log (copious kernel taint messages removed) are:
9.543553] sd 2:0:3:0: [sdd] Attached SCSI disk
svmboot: === SVMBOOT
mdadm main: failed to get exclusive lock on mapfile
9.790075] md: md127 stopped.
mdadm: ignoring /dev/sdb3 as it reports /dev/sda3 as failed
9.794087] md/raid1:md127: active with 1 out of 2 mirrors
9.796034] md127: detected capacity change from 0 to 42915069952
mdadm: /dev/md/phoenix:2 has been started with 1 drive (out of 2).
9.808602] md: md126 stopped.
9.813330] md/raid1:md126: active with 2 out of 2 mirrors
9.815279] md126: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:1 has been started with 2 drives.
9.832111] md: md125 stopped.
mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
9.840436] md/raid1:md125: active with 1 out of 2 mirrors
9.842341] md125: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:0 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:2 exists - ignoring
9.887613] md: md124 stopped.
9.896418] md/raid1:md124: active with 1 out of 2 mirrors
9.898373] md124: detected capacity change from 0 to 42915069952
mdadm: /dev/md124 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:0 exists - ignoring
9.926863] md: md123 stopped.
9.937962] md/raid1:md123: active with 1 out of 2 mirrors
9.939950] md123: detected capacity change from 0 to 10727981056
mdadm: /dev/md123 has been started with 1 drive (out of 2).
svmboot: Checking /dev/md for /.nutanix_active_svm_partition
svmboot: Checking /dev/md123 for /.nutanix_active_svm_partition
9.994541] EXT4-fs (md123): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md123
10.009251] EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md125
svmboot: Checking /dev/nvme?*p?* for /.nutanix_active_svm_partition
svmboot: error: too many partitions with valid cvm_uuid: /dev/md123 /dev/md125
sh: missing ]
svmboot: Trying again in 5 seconds.
10.430316] md123: detected capacity change from 10727981056 to 0
10.432058] md: md123 stopped.
mdadm: stopped /dev/md123
10.467498] md124: detected capacity change from 42915069952 to 0
10.469245] md: md124 stopped.
mdadm: stopped /dev/md124
10.507492] md125: detected capacity change from 10727981056 to 0
10.509276] md: md125 stopped.
mdadm: stopped /dev/md125
10.547497] md126: detected capacity change from 10727981056 to 0
10.549243] md: md126 stopped.
mdadm: stopped /dev/md126
10.577498] md127: detected capacity change from 42915069952 to 0
10.579245] md: md127 stopped.
mdadm: stopped /dev/md127
10.586750] ata2.00: disabled
modprobe: remove 'virtio_pci': No such file or directory
10.673882] mpt3sas version 14.101.00.00 unloading
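To condense the excerpt above, here is a rough Python sketch that pulls out three things from the serial log: which md arrays came up degraded, which members mdadm ignored, and which devices svmboot accepted as boot partition candidates. The regular expressions are assumptions based only on the lines quoted in this post, the function name `summarize` is hypothetical, and the script only reads text. It does not touch the arrays or repair anything.

```python
import re

# Patterns are guesses derived from the log lines quoted in this post.
MIRRORS = re.compile(r"md/raid1:(md\d+): active with (\d+) out of (\d+) mirrors")
CANDIDATE = re.compile(r"svmboot: Appropriate boot partition with /\.cvm_uuid at (\S+)")
IGNORED = re.compile(r"mdadm: ignoring (\S+) as it reports (\S+) as failed")


def summarize(lines):
    """Return (degraded arrays, boot candidates, ignored members) from log lines."""
    degraded = {}    # array name -> (active mirrors, expected mirrors)
    candidates = []  # devices svmboot accepted, in first-seen order
    ignored = []     # (ignored member, member it reports as failed)
    for line in lines:
        m = MIRRORS.search(line)
        if m:
            active, total = int(m.group(2)), int(m.group(3))
            if active < total:
                degraded[m.group(1)] = (active, total)
            continue
        m = CANDIDATE.search(line)
        if m:
            if m.group(1) not in candidates:
                candidates.append(m.group(1))
            continue
        m = IGNORED.search(line)
        if m:
            pair = (m.group(1), m.group(2))
            if pair not in ignored:
                ignored.append(pair)
    return degraded, candidates, ignored


# A few lines copied from the excerpt, used here as sample input.
SAMPLE = [
    "md/raid1:md127: active with 1 out of 2 mirrors",
    "md/raid1:md126: active with 2 out of 2 mirrors",
    "mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed",
    "svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md123",
    "svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md125",
]

degraded, candidates, ignored = summarize(SAMPLE)
print("degraded arrays:", degraded)
print("boot candidates:", candidates)
print("ignored members:", ignored)
```

To run it against the real log, pass the open file instead of `SAMPLE`, for example `summarize(open("/tmp/NTNX.serial.out.0", errors="replace"))`. Because the boot loop repeats, duplicates are collapsed so each array and device is reported once.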
Because this failure occurs before networking has started, and the hypervisor then resets the CVM, I have no way of interacting with the VM.
How can this be resolved?