Question

create file server failed Error creating: Invalid CVM subnet configuration, possibly due to incomplete CVM re-ip workflow

  • 11 April 2020
  • 12 replies
  • 1261 views

Badge +2

Hi All,

 

I’ve been testing an old Nutanix NX demo box with no maintenance agreement (MA) support. The AOS version installed is 5.10.6.

 

I tried to create a file server using the wizard in Prism Element.

In the last step of the wizard, when I click Create, the following error message is shown:

 

Error creating: Invalid CVM subnet configuration, possibly due to incomplete CVM re-ip workflow

 

I’ve run an NCC check and didn’t find any subnet mismatch issue.

 

Please advise how to fix this.

         

12 replies

Userlevel 2
Badge +4

Hi kovitking

 

If you see this error, the cause is that manually updating the CVM’s netmask does not update the value of 'external_subnet' in Zeus. This prevents the Data Services IP from communicating with the FSVMs, which in turn prevents mounting the zpools.

Note: The proper way to update CVM IP and/or subnet mask configuration is documented here

 

Please gather the following information:


1) Verify the CVM IP and subnet via ifconfig.
 

allssh "ifconfig eth0 | grep inet\ "


Example:
 

nutanix@cvm:~$ allssh "ifconfig eth0 | grep inet\ "
================== 10.58.85.39 =================
inet 10.58.85.39 netmask 255.255.255.0 broadcast 10.58.85.255
================== 10.58.85.40 =================
inet 10.58.85.40 netmask 255.255.255.0 broadcast 10.58.85.255
================== 10.58.85.41 =================
inet 10.58.85.41 netmask 255.255.255.0 broadcast 10.58.85.255


2) Verify what zookeeper has for the external_subnet:

zeus_config_printer | grep ^external_subnet


Example:
 

nutanix@cvm:~$ zeus_config_printer | grep ^external_subnet
external_subnet: "10.58.84.0/255.255.252.0"


3) If the values are different, please open a case with Nutanix Support and have a Senior SRE correct the problem.
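As an aside, the comparison in steps 1–3 can be sketched with Python’s standard ipaddress module. This is just an illustration, not Nutanix tooling; the helper name is made up, and the sample values come from the example outputs above:

```python
import ipaddress

def subnets_match(cvm_ip, cvm_netmask, external_subnet):
    """Check a CVM interface against the Zeus 'external_subnet' value.

    external_subnet comes in "network/netmask" form, e.g. the
    "10.58.84.0/255.255.252.0" string printed by zeus_config_printer.
    """
    net_addr, net_mask = external_subnet.split("/")
    # strict=False tolerates host bits set in the stored network address
    zeus_net = ipaddress.ip_network(f"{net_addr}/{net_mask}", strict=False)
    cvm_iface = ipaddress.ip_interface(f"{cvm_ip}/{cvm_netmask}")
    # A netmask mismatch makes the derived networks differ
    return cvm_iface.network == zeus_net

# Values from the example outputs above: /24 on the CVM vs /22 in Zeus
print(subnets_match("10.58.85.39", "255.255.255.0", "10.58.84.0/255.255.252.0"))  # False
```

If this returns False for any CVM, the interface configuration and Zeus disagree, which is the condition described in step 3.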

 

Hope this helps

 

Regs.

 

Badge +2


nutanix@CVM:172.16.1.32:~$ allssh "ifconfig eth0 | grep inet\ "
================== 172.16.1.33 =================
        inet 172.16.1.33  netmask 255.255.254.0  broadcast 172.16.1.255
================== 172.16.1.32 =================
        inet 172.16.1.32  netmask 255.255.254.0  broadcast 172.16.1.255
nutanix@CVM:172.16.1.32:~$ zeus_config_printer | grep ^external_subnet
external_subnet: "172.16.1.0/255.255.254.0"
 

Hi, thanks for your suggestion.

The values are the same.

For this node I’m not able to open a case because the MA has expired. This is a PoC asset.

Userlevel 2
Badge +4

I would recommend upgrading the cluster to AOS 5.10.7 or 5.11.1, because restarting Genesis on a two-node cluster running AOS 5.10.6 could lead to cluster instability.

After the upgrade, try to re-deploy the file server using Prism and let us know.

Meanwhile I am searching for an alternate solution.

Regs.

 

Badge +2


Hi @AntonioG 

 

I’m following KB https://portal.nutanix.com/page/documents/kbs/details/?targetId=kA00e000000CrNuCAK to upgrade the two-node cluster.

 

However, the manual upgrade fails with the error message below.

 

2020-04-14 12:03:24 WARNING preupgrade_checks.py:815 Skipping replication factor check since cluster is stopped
2020-04-14 12:03:25 INFO multihome_utils.py:146 Cluster does not have multi homed CVMs
2020-04-14 12:03:25 ERROR preupgrade_checks.py:163 Cannot upgrade two node cluster when cluster has a leader fixed. Current leader svm id: 4. Try again after some time , Please refer KB 6396
2020-04-14 12:03:25 INFO preupgrade_checks.py:978 Cluster is stopped, skipping under-replication test
2020-04-14 12:03:25 INFO preupgrade_checks.py:1849 Skipping version compatibility test
2020-04-14 12:03:25 WARNING preupgrade_checks.py:772 Cluster has less than 3 nodes. Downtime possible
2020-04-14 12:03:25 ERROR cluster_upgrade.py:352 Failure in pre-upgrade tests, errors Cannot upgrade two node cluster when cluster has a leader fixed. Current leader svm id: 4. Try again after some time , Please refer KB 6396
Signature validation Error for version 5.10.7 on svm 172.16.1.32. Error: Failed to verify NOS installer signature on svm 172.16.1.32, Please refer KB 6108
2020-04-14 12:03:25 ERROR cluster:1867 Failed to perform cluster upgrade
2020-04-14 12:03:25 ERROR cluster:2815 Operation failed
 

 

 

I’ve checked the MD5; it’s correct.

Userlevel 2
Badge +4

The ERROR message:

Cannot upgrade two node cluster when cluster has a leader fixed…

means that the cluster is under-replicated.

Curator is responsible for kicking off replication for all extent groups that are not adequately replicated. A Curator full scan is needed to replicate the under-replicated data.

Solution:

  • Refer to KB 2826. Wait for cluster data to be rebalanced across nodes and Current Fault Tolerance to show 1.
  • Once the Curator scan has completed, run the pre-upgrade check again. It could take a couple of scans, depending on the number of under-replicated egroups.

Regs.

Antonio

Badge +2


Hi @AntonioG 

Sorry, it just upgraded successfully.

 

But the file server deployment still fails with the same error.

Userlevel 2
Badge +4

I need more specific information regarding your cluster; could you please provide the following:

  1. Screenshot of the Create File Server Screen from Prism
  2. From any CVM, please provide the output from the following commands:
  • ncli cluster info

  • ncli host ls

  • ncli alert ls

  • cluster status | grep -v UP

  • nodetool -h 0 ring

  • ncli cluster get-domain-fault-tolerance-status type=node

  • ncc health_checks run_all

Note: On a two-node cluster it is only possible to deploy a single Files FSVM, with no distributed shares.

Please also review the prerequisites for Files.

Regs.

Antonio

Badge +2

Hi @AntonioG 

As you requested.

 

  1. File server creation step
  2. Command results
  3. NCC output
Userlevel 3
Badge +4

Have you tried with a mathematically valid subnet?

You specified your network as 172.16.1.0 / 255.255.254.0 

The network address cannot be x.x.1.0 with this subnet mask. The “1” in your third octet sets the last bit (00000001), but that bit is a host bit in this configuration, since your subnet mask is functionally aaaaaaaa.bbbbbbbb.cccccccX.XXXXXXXX (X being the masked host bits).

172.16.1.0 can only be the network address of a 255.255.255.0 subnet or smaller. 

I recognize that correcting this configuration would mean shutting down the cluster and running the IP reconfiguration script. It could also mean some other adjustments to the network; I’m not sure what’s going on there.

I wouldn’t be at all surprised if this is why your network validation fails in the creation process.
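A quick way to double-check this subnet math is with Python’s standard ipaddress module (just an illustration of the point above, not Nutanix tooling):

```python
import ipaddress

# 172.16.1.0 has a host bit set under a 255.255.254.0 (/23) mask,
# so it cannot be the network address of that subnet:
try:
    ipaddress.ip_network("172.16.1.0/255.255.254.0")
except ValueError as err:
    print(err)  # the strict parser rejects it: host bits set

# The valid /23 network containing these addresses is 172.16.0.0/23:
print(ipaddress.ip_network("172.16.1.0/255.255.254.0", strict=False))  # 172.16.0.0/23
```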

Userlevel 2
Badge +4

Please also correct the following FATAL alarm that was reported:

 

FAIL: CVM is not uplinked to any 10Gbps nics on bridge/vSwitch br0.
Node 172.16.1.32:
FAIL: CVM is not uplinked to any 10Gbps nics on bridge/vSwitch br0.
Refer to KB 1584 (http://portal.nutanix.com/kb/1584) for details on 10gbe_check

 

Follow the AHV section of KB 1584.

This will provide 10 GbE uplinks on the vSwitch of the affected host (172.16.1.32), as the AHV networking best practices recommend.

 

Regs,

Antonio

Badge +2


Hi @JeremyJ, you’re right. This is wrong.

 

Originally, this node was configured with a 255.255.255.0 subnet. Then our office changed the network to a /23 mask.

 

So I ran the cluster reconfiguration and changed the Zeus network address from 172.16.1.0/255.255.255.0 to 172.16.1.0/255.255.254.0.

 

I’ll find a maintenance window and change the Zeus external network address to 172.16.0.0/255.255.254.0 later.

 

Thank you.

         
Badge +2

@AntonioG As I mentioned, this is a non-production cluster, just for PoC and internal testing in my office, so I only used 2x 1 GbE interfaces for connectivity.
