CVM fails to boot due to disk errors | Nutanix Community

I have a problem with a CVM that won’t boot. This is on a semi-retired production cluster (not CE) that has no workloads running on it.

I found the console output in /tmp/NTNX.serial.out.0, where I can see it trying to enable the RAID devices, scanning for a UUID marker (and finding two of them), then aborting and unloading the mpt3sas kernel module before trying again in 5 seconds. This repeats a few times before the hypervisor resets the VM and it starts booting again.
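As a side note, the retry loop can be followed live from the hypervisor with a plain tail of that file (the hostname in the prompt below is just a placeholder):

[root@hypervisor ~]# tail -f /tmp/NTNX.serial.out.0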

The most relevant sections of the log (copious kernel taint messages removed) are:

[    9.543553] sd 2:0:3:0: [sdd] Attached SCSI disk
svmboot: === SVMBOOT
mdadm main: failed to get exclusive lock on mapfile
[    9.790075] md: md127 stopped.
mdadm: ignoring /dev/sdb3 as it reports /dev/sda3 as failed
[    9.794087] md/raid1:md127: active with 1 out of 2 mirrors
[    9.796034] md127: detected capacity change from 0 to 42915069952
mdadm: /dev/md/phoenix:2 has been started with 1 drive (out of 2).
[    9.808602] md: md126 stopped.
[    9.813330] md/raid1:md126: active with 2 out of 2 mirrors
[    9.815279] md126: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:1 has been started with 2 drives.
[    9.832111] md: md125 stopped.
mdadm: ignoring /dev/sdb1 as it reports /dev/sda1 as failed
[    9.840436] md/raid1:md125: active with 1 out of 2 mirrors
[    9.842341] md125: detected capacity change from 0 to 10727981056
mdadm: /dev/md/phoenix:0 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:2 exists - ignoring
[    9.887613] md: md124 stopped.
[    9.896418] md/raid1:md124: active with 1 out of 2 mirrors
[    9.898373] md124: detected capacity change from 0 to 42915069952
mdadm: /dev/md124 has been started with 1 drive (out of 2).
mdadm: /dev/md/phoenix:0 exists - ignoring
[    9.926863] md: md123 stopped.
[    9.937962] md/raid1:md123: active with 1 out of 2 mirrors
[    9.939950] md123: detected capacity change from 0 to 10727981056
mdadm: /dev/md123 has been started with 1 drive (out of 2).
svmboot: Checking /dev/md for /.nutanix_active_svm_partition
svmboot: Checking /dev/md123 for /.nutanix_active_svm_partition

[    9.994541] EXT4-fs (md123): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md123

[   10.009251] EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
svmboot: Appropriate boot partition with /.cvm_uuid at /dev/md125

svmboot: Checking /dev/nvme?*p?* for /.nutanix_active_svm_partition
svmboot: error: too many partitions with valid cvm_uuid: /dev/md123 /dev/md125
sh: missing ]
svmboot: Trying again in 5 seconds.

[   10.430316] md123: detected capacity change from 10727981056 to 0
[   10.432058] md: md123 stopped.
mdadm: stopped /dev/md123
[   10.467498] md124: detected capacity change from 42915069952 to 0
[   10.469245] md: md124 stopped.
mdadm: stopped /dev/md124
[   10.507492] md125: detected capacity change from 10727981056 to 0
[   10.509276] md: md125 stopped.
mdadm: stopped /dev/md125
[   10.547497] md126: detected capacity change from 10727981056 to 0
[   10.549243] md: md126 stopped.
mdadm: stopped /dev/md126
[   10.577498] md127: detected capacity change from 42915069952 to 0
[   10.579245] md: md127 stopped.
mdadm: stopped /dev/md127
[   10.586750] ata2.00: disabled
modprobe: remove 'virtio_pci': No such file or directory
[   10.673882] mpt3sas version 14.101.00.00 unloading

As this occurs before networking has started, and the hypervisor then resets the VM, I do not have any way of interacting with it.

How can this be resolved?

After much mucking around, I was finally able to boot a System Rescue CD which had access to the RAID disks so I could fix it.

FYI - the hypervisor boots from the SATADOM, but it does not have a device driver for the SAS HBA, so it cannot normally see the storage disks. The hypervisor boots the CVM, which does have a SAS driver (mpt3sas), so all disk access is done through the CVM. The CVM itself boots off software RAID1 devices built on the first 3 partitions of the SSDs.

In my case, 2 of the software RAID devices had lost sync.

[root@sysresccd ~]# lsscsi
[0:0:0:0] disk ATA INTEL SSDSC2BX80 0140 /dev/sdb
[0:0:1:0] disk ATA ST2000NX0253 SN05 /dev/sda
[0:0:2:0] disk ATA ST2000NX0253 SN05 /dev/sdc
[0:0:3:0] disk ATA ST2000NX0253 SN05 /dev/sde
[0:0:4:0] disk ATA ST2000NX0253 SN05 /dev/sdd
[0:0:5:0] disk ATA INTEL SSDSC2BX80 0140 /dev/sdg
[4:0:0:0] disk ATA SATADOM-SL 3ME 119 /dev/sdf
[11:0:0:0] cd/dvd ATEN Virtual CDROM YS0J /dev/sr0
[root@sysresccd ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 632.2M 1 loop /run/archiso/sfs/airootfs
sda 8:0 0 1.8T 0 disk
└─sda1 8:1 0 1.8T 0 part
sdb 8:16 0 745.2G 0 disk
├─sdb1 8:17 0 10G 0 part
│ └─md127 9:127 0 10G 0 raid1
├─sdb2 8:18 0 10G 0 part
│ └─md125 9:125 0 10G 0 raid1
├─sdb3 8:19 0 40G 0 part
│ └─md126 9:126 0 40G 0 raid1
└─sdb4 8:20 0 610.6G 0 part
sdc 8:32 0 1.8T 0 disk
└─sdc1 8:33 0 1.8T 0 part
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part
sde 8:64 0 1.8T 0 disk
└─sde1 8:65 0 1.8T 0 part
sdf 8:80 0 59.6G 0 disk
└─sdf1 8:81 0 59.6G 0 part
sdg 8:96 0 745.2G 0 disk
├─sdg1 8:97 0 10G 0 part
├─sdg2 8:98 0 10G 0 part
│ └─md125 9:125 0 10G 0 raid1
├─sdg3 8:99 0 40G 0 part
└─sdg4 8:100 0 610.6G 0 part
sr0 11:0 1 693M 0 rom /run/archiso/bootmnt
[root@sysresccd ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
10476544 blocks super 1.1 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md126 : active (auto-read-only) raid1 sdb3[2]
41909248 blocks super 1.1 [2/1] [U_]
bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdb1[2]
10476544 blocks super 1.1 [2/1] [U_]
bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

I could see the two SSDs probed as sdb and sdg, with partitions 1, 2 and 3 configured as RAID members but only partition 2 correctly in sync. The 4th partition is used for NFS in the CVM (i.e. fast storage for the cluster).
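Before re-adding anything, it is worth confirming which member of each mirror is actually stale. A minimal sketch (assuming the same device names as above): compare the event counters and states that mdadm records in each member's superblock; the member with the lower Events count is the stale one.

[root@sysresccd ~]# mdadm --examine /dev/sdb3 /dev/sdg3 | grep -E '/dev|Events|State'
[root@sysresccd ~]# mdadm --examine /dev/sdb1 /dev/sdg1 | grep -E '/dev|Events|State'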

So my solution was:

  1. Set the devices I needed to modify back to writable mode:
    [root@sysresccd ~]# mdadm --readwrite md126
    [root@sysresccd ~]# mdadm --readwrite md127
    [root@sysresccd ~]# cat /proc/mdstat
    Personalities : [raid1]
    md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
    10476544 blocks super 1.1 [2/2] [UU]
    bitmap: 0/1 pages [0KB], 65536KB chunk

    md126 : active raid1 sdb3[2]
    41909248 blocks super 1.1 [2/1] [U_]
    bitmap: 1/1 pages [4KB], 65536KB chunk

    md127 : active raid1 sdb1[2]
    10476544 blocks super 1.1 [2/1] [U_]
    bitmap: 1/1 pages [4KB], 65536KB chunk

    unused devices: <none>

  2. Rejoin the devices back into the RAID1 mirrors and let them resync (a progress-monitoring sketch follows after these steps):
    [root@sysresccd ~]# mdadm /dev/md126 -a /dev/sdg3
    mdadm: re-added /dev/sdg3
    [root@sysresccd ~]# mdadm /dev/md127 -a /dev/sdg1
    mdadm: re-added /dev/sdg1
    [root@sysresccd ~]# cat /proc/mdstat
    Personalities : [raid1]
    md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
    10476544 blocks super 1.1 [2/2] [UU]
    bitmap: 0/1 pages [0KB], 65536KB chunk

    md126 : active raid1 sdg3[1] sdb3[2]
    41909248 blocks super 1.1 [2/1] [U_]
    [=========>...........] recovery = 48.5% (20361856/41909248) finish=1.7min speed=200123K/sec
    bitmap: 1/1 pages [4KB], 65536KB chunk

    md127 : active raid1 sdg1[1] sdb1[2]
    10476544 blocks super 1.1 [2/1] [U_]
    resync=DELAYED
    bitmap: 1/1 pages [4KB], 65536KB chunk

    unused devices: <none>
    [root@sysresccd ~]# cat /proc/mdstat
    Personalities : [raid1]
    md125 : active (auto-read-only) raid1 sdg2[1] sdb2[2]
    10476544 blocks super 1.1 [2/2] [UU]
    bitmap: 0/1 pages [0KB], 65536KB chunk

    md126 : active raid1 sdg3[1] sdb3[2]
    41909248 blocks super 1.1 [2/2] [UU]
    bitmap: 0/1 pages [0KB], 65536KB chunk

    md127 : active raid1 sdg1[1] sdb1[2]
    10476544 blocks super 1.1 [2/2] [UU]
    bitmap: 0/1 pages [0KB], 65536KB chunk

    unused devices: <none>
  3. As an added check, run fsck on the volumes:
    [root@sysresccd ~]# fsck /dev/md125
    fsck from util-linux 2.36
    e2fsck 1.45.6 (20-Mar-2020)
    /dev/md125 has gone 230 days without being checked, check forced.
    Pass 1: Checking inodes, blocks, and sizes
    Pass 2: Checking directory structure
    Pass 3: Checking directory connectivity
    Pass 4: Checking reference counts
    Pass 5: Checking group summary information
    /dev/md125: 62842/655360 files (0.2% non-contiguous), 1912185/2619136 blocks

    [root@sysresccd ~]# fsck /dev/md126
    fsck from util-linux 2.36
    e2fsck 1.45.6 (20-Mar-2020)
    /dev/md126: clean, 20006/2621440 files, 5177194/10477312 blocks

    [root@sysresccd ~]# fsck /dev/md127
    fsck from util-linux 2.36
    e2fsck 1.45.6 (20-Mar-2020)
    /dev/md127: clean, 66951/655360 files, 1866042/2619136 blocks
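
To keep an eye on the rebuild without re-running cat /proc/mdstat by hand, here is a minimal sketch using standard mdadm and watch commands (device names as above; adjust to your system):

[root@sysresccd ~]# watch -n 5 cat /proc/mdstat        # refresh recovery progress every 5 seconds
[root@sysresccd ~]# mdadm --wait /dev/md126 /dev/md127 # block until the resyncs complete
[root@sysresccd ~]# mdadm --detail /dev/md126 | grep -E 'State|Rebuild'   # expect "State : clean"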

After rebooting back into the hypervisor, the CVM came up normally.


Hello,

Thank you for this share, it was very helpful for me. Just for information: I used a Phoenix image downloaded from Prism (the boot repair section) and the commands below didn't work.

[root@sysresccd ~]# mdadm --readwrite md126
[root@sysresccd ~]# mdadm --readwrite md127

I went ahead and applied the rest of the changes and that worked (the arrays were already in read-write status).
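For reference, whether an array is still read-only or auto-read-only can be checked via sysfs before deciding if --readwrite is needed. A minimal sketch, using md126 as the example device:

[root@sysresccd ~]# cat /sys/block/md126/md/array_state   # e.g. read-auto vs clean/active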


Best regards & thank you.



If there is access to the CVM console:

  1. Reboot the CVM;
  2. At the GRUB menu select “Debug Shell”;
  3. At the shell prompt run “. modules.sh” (that loads the required drivers);
  4. Assemble the RAID arrays: “mdadm --assemble --scan --run”;
  5. Check that the RAID arrays have two drives each (“cat /proc/mdstat”);

If the RAID arrays are assembled incorrectly, reassemble them as described in the previous comments (e.g. mdadm /dev/md127 -a /dev/sdg1).
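Putting those steps together, the debug-shell session looks roughly like this (a sketch only; the md device and partition names come from the comments above and will differ per system):

. modules.sh                   # load the required drivers (including mpt3sas)
mdadm --assemble --scan --run  # assemble and start all arrays, even if degraded
cat /proc/mdstat               # each healthy mirror shows [UU]; [U_] means a member is missing
mdadm /dev/md127 -a /dev/sdg1  # re-add a missing member so it resyncs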

Then reboot the CVM (from the hypervisor).


There are some external links on rebuilding mdadm arrays, like https://www.thomas-krenn.com/en/wiki/Mdadm_recovery_and_resync