Solved

CVM Crashed

  • 18 March 2020
  • 34 replies
  • 15536 views

Badge +2

All,

I have a 6-node 1065 system in 2 blocks.  Recently one of the CVMs (node 2, block A) crashed.  When rebooted, it just went into a boot loop.  Diagnosis pointed to a failed SSD (not the SATADOM), which we replaced.  When we try to boot the CVM, it still just loops.

We were told to boot that node with Phoenix, which the cluster provided for download.  When I do that, Phoenix doesn’t load and throws errors instead.

I’m looking for a suggestion on how to get the node back to 100%.  Throughout all of this, ESXi on the SATADOM has booted fine, so if I didn’t care about the storage side I could just ignore this, but I’d like the system to be fully healthy.

Any suggestion about how to get the CVM working again would be appreciated.

Thank you

Johan


Best answer by JohanITG 28 April 2020, 18:10


This topic has been closed for comments

34 replies

Userlevel 3
Badge +13

OK, that seems to be the problem. The SSD model detected is not supported as a boot drive. We have a list of SSDs qualified for use as boot drives, and I can’t find this model in it. I’m not able to see any Western Digital drives in our supported list.

Badge +2

I’m more than willing to try something else.  Can you send me the list?  Or is there a way to bypass the “supported” drive check?

 

Userlevel 3
Badge +13

We don’t support bypassing the compatibility list, since a node with an unqualified SSD can cause performance issues for the other nodes. Also, in the initial picture you shared with us, the model says Intel SSD. Was that the old drive? Your sales engineer should be able to provide you with the list of supported SSDs.

Badge +2

Hi there,

So I got the new SSD from sales and installed it.  I ran through the setup and it ended here (see photo).

 

What should I do now?

 

 

Badge +2

Final update: I managed to get the SSD seen by the host and did a full wipe and reinstall.  The host is now back to being part of the cluster.

 

Thanks for the help!

Badge +2

I take it back… I thought I was done, but cluster expansion failed:

 

Help?

Badge +2

All, there was an issue with RAM allocation on the CVM that was throwing a hidden error.  Once the RAM was increased, the CVM joined the cluster and we’re good.

 

Thanks

Userlevel 3
Badge +4

Hello @JohanITG 

I believe it’s already clear by now, but up to the last update here, the basic problem was that the CVM boot disk failed, and installing the CVM onto the replacement SSD also failed.

I noted from your earlier Phoenix install attempt screenshot that the error “No suitable SVM boot disk found” was reported.

Typically, when this error comes up, it is the result of checking the detected SSD model against the list of supported SSD hardware models. In other words, this error shows the Phoenix software refusing to install a CVM on a disk that we have not qualified for use as a CVM boot disk.

From your “lsscsi” output in the last image I’m seeing WDS100T1R0A, which a brief search identifies as a Western Digital Red SA500 SSD. Although a few Western Digital models have passed qualification, I don’t think this one has been qualified for use as a CVM boot disk; I do not see this model string in the supported drive list.
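
For anyone hitting the same “No suitable SVM boot disk found” message, a quick way to double-check the model string a node actually reports (the same string lsscsi shows and the qualification check compares) is to read it from sysfs in any Linux environment where the disks are visible. The sketch below is purely illustrative and is not Nutanix’s qualification code; the QUALIFIED set is a made-up placeholder, and the real list has to come from your sales engineer.

```python
# Illustrative only: print each SATA/SAS disk's reported model string and
# flag anything not on a placeholder "qualified" list.
from pathlib import Path

QUALIFIED = {"EXAMPLE_QUALIFIED_MODEL"}  # placeholder entries, not the real Nutanix list

for dev in sorted(Path("/sys/block").glob("sd*")):
    model_file = dev / "device" / "model"
    if not model_file.exists():
        continue
    model = model_file.read_text().strip()
    status = "qualified" if model in QUALIFIED else "NOT on the qualified list"
    print(f"{dev.name}: {model} -> {status}")
```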

If it’s the drive model I mentioned above, I don’t believe it could be a qualified drive, because the spec I found for this drive shows 0.33 DWPD (drive writes per day, a standard SSD endurance metric), and I believe 3 DWPD is a requirement for a CVM boot disk.
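
As a back-of-the-envelope check, assuming the published spec for the 1 TB SA500 of roughly 600 TBW over a 5-year warranty (verify against the datasheet for your exact part), the DWPD works out like this:

```python
# Rough DWPD estimate: total rated writes divided by (capacity * warranty days).
ENDURANCE_TBW = 600      # assumed rated endurance in terabytes written
CAPACITY_TB = 1.0        # 1 TB drive
WARRANTY_DAYS = 5 * 365  # assumed 5-year warranty period

dwpd = ENDURANCE_TBW / (CAPACITY_TB * WARRANTY_DAYS)
print(f"DWPD = {dwpd:.2f}")  # about 0.33, well short of the ~3 DWPD expected of a CVM boot disk
```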

Maybe not the answer you were hoping for, but unless I’ve missed something, that’s why the Phoenix process isn’t working here.

Badge +2

Hi Jeremy, thanks for the note.  That’s what Chandru had said at the end of the day as well.  We got a new drive from sales (which took a while), got it installed, and things started to move forward.  The posts in the last 48 hours have been with the new drive installed.  Things are good and the cluster is back to full strength.

 

Thanks