
Hi there,

While trying to deploy Prism Central on Nutanix Community Edition, I'm seeing the same symptoms as this earlier thread: Error during deploy Prism Central
 

Encountered Exception in post_deployment step: Failed to enable micro services infrastructure on PC: deploy msp:Error deploying addons: failed to deploy monitoring addon: failed to deploy and verify kube-prometheus: failed to verify kube-prometheus: Operation timed out: failed to verify kube-prometheus: expecting 1 available replica of k8s prometheus in ntnx-system namespace. Currently running: 0
 


I have tried multiple versions of PC with the same issue.

Looking inside the appliance, I can see that Prometheus is failing because it's waiting for its persistent volume to be provisioned.

# Check the STS status

kubectl get sts -n ntnx-system prometheus-k8s

NAME             READY   AGE
prometheus-k8s   0/1     8h

# Find the Pod name

kubectl get po -n ntnx-system | grep prometheus-k8s

prometheus-k8s-0 0/2 Pending 0 8h

# Describe the pod (snipped)

kubectl describe po -n ntnx-system prometheus-k8s-0

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x496 over 8h) default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.

# Check the PV Claims

kubectl get pvc -n ntnx-system

NAME                                 STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Pending                                      silver         8h

# Describe the PVC (snipped)

kubectl describe pvc -n ntnx-system prometheus-k8s-db-prometheus-k8s-0

Name: prometheus-k8s-db-prometheus-k8s-0
Namespace: ntnx-system
StorageClass: silver
Status: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 20m (x125 over 8h) csi.nutanix.com_ntnx-10-10-200-31-a-pcvm_e2ef0343-d205-47d8-beb2-cd277479e2c5 failed to provision volume with StorageClass "silver": rpc error: code = Internal desc = NutanixVolumes: failed to create volume: pvc-26a1d40b-85cb-4c37-888f-389c0b7f0c66, err: NutanixVolumes: failed to create REST client, error: Max retries done: Failed to authenticate GetVersions(): 401 Authorization Error - HTTP Response Code : 401
Normal Provisioning 4m3s (x129 over 8h) csi.nutanix.com_ntnx-10-10-200-31-a-pcvm_e2ef0343-d205-47d8-beb2-cd277479e2c5 External provisioner is provisioning volume for claim "ntnx-system/prometheus-k8s-db-prometheus-k8s-0"
Normal ExternalProvisioning 83s (x2102 over 8h) persistentvolume-controller waiting for a volume to be created, either by external provisioner "csi.nutanix.com" or manually created by system administrator

It looks like the CSI controller is having authentication issues:

Failed to authenticate GetVersions(): 401 Authorization Error - HTTP Response Code : 401
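One way to localise the 401 is to read the CSI controller's own logs from inside the appliance; the pod selection below is an assumption (list anything CSI-related first and substitute the real name):

```shell
# List CSI pods first -- the controller pod name is an assumption here.
kubectl get po -n ntnx-system -o name | grep -i csi

# Tail the first matching pod's logs and filter for the failing auth call.
POD=$(kubectl get po -n ntnx-system -o name | grep -i csi | head -1)
kubectl logs -n ntnx-system "${POD#pod/}" --all-containers --tail=200 \
  | grep -Ei 'authenticat|401'
```

The log lines around the 401 usually include the full URL being called, which answers whether the rejection comes from the local endpoint or the remote one.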


If I look at the storage class configuration I can see the secret in use,

kubectl get sc silver -o yaml

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    ntnxClusterRef: 00063c6c-198a-2907-510e-00a09802c300
  creationTimestamp: "2025-08-18T14:33:47Z"
  name: silver
  resourceVersion: "943"
  uid: d57479a3-eaf5-4231-8263-ab590d4b6978
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/controller-expand-secret-namespace: ntnx-system
  csi.storage.k8s.io/controller-publish-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/controller-publish-secret-namespace: ntnx-system
  csi.storage.k8s.io/fstype: ""
  csi.storage.k8s.io/node-publish-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/node-publish-secret-namespace: ntnx-system
  csi.storage.k8s.io/provisioner-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/provisioner-secret-namespace: ntnx-system
  dataServiceEndPoint: dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:3260
  description: ""
  flashMode: DISABLED
  isSegmentedIscsiNetwork: "false"
  storageContainer: NutanixManagementShare
  storageType: NutanixVolumes
provisioner: csi.nutanix.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

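For reference, both in-cluster service names embed the PE cluster UUID from the ntnxClusterRef annotation above. The dsip./peip. naming is inferred from this output rather than from documentation: dsip.* appears to carry the iSCSI data path on 3260, while peip.* is the management API on 9440.

```shell
# Assumption: dsip.* maps to the Data Services IP and peip.* to the
# management endpoint; the UUID comes from the ntnxClusterRef annotation.
UUID='00063c6c-198a-2907-510e-00a09802c300'
echo "dsip.${UUID}.prism-central.cluster.local:3260"
echo "peip.${UUID}.prism-central.cluster.local:9440"
```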
Resolving the Data Services endpoint from within Prism Central correctly returns my Prism Element Data Services IP:

ping dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local

PING dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local (10.10.200.10) 56(84) bytes of data.
64 bytes from prism-data-services.<mydomain_was_here> (10.10.200.10): icmp_seq=1 ttl=64 time=0.921 ms

The secret contains a certificate and an endpoint.

# Get the secret (snipped)

kubectl get secret -n ntnx-system ntnx-csi-secret-sbwhh -o yaml

apiVersion: v1
data:
  cert: <base64 cert was here>
  endpoint: cGVpcC4wMDA2M2M2Yy0xOThhLTI5MDctNTEwZS0wMGEwOTgwMmMzMDAucHJpc20tY2VudHJhbC5jbHVzdGVyLmxvY2FsOjk0NDA=
kind: Secret
metadata:
  creationTimestamp: "2025-08-18T14:33:47Z"
  name: ntnx-csi-secret-sbwhh
  namespace: ntnx-system
  resourceVersion: "942"
  uid: 45701a4c-8052-46d7-af3c-c69436fdd392
type: Opaque

The endpoint is resolving to the local Prism Central IP

# Decode the base64 endpoint

echo cGVpcC4wMDA2M2M2Yy0xOThhLTI5MDctNTEwZS0wMGEwOTgwMmMzMDAucHJpc20tY2VudHJhbC5jbHVzdGVyLmxvY2FsOjk0NDA= | base64 -d

peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:9440

# Resolve it

nslookup peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local
Address: 10.10.200.20

Where is the authentication error coming from? When the driver accesses the local endpoint, or the remote Data Services IP?
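One way to narrow this down from the PCVM is to probe the decoded management endpoint directly with known-good Prism credentials. The v2 REST path below is an assumption, and -k skips certificate checks:

```shell
# A 401 here (with credentials that work in the Prism UI) suggests the
# endpoint itself rejects the account; a 200 points at the CSI secret.
# (v2 API path and admin user are assumptions; -k skips cert checks.)
curl -sk -u admin -o /dev/null -w '%{http_code}\n' \
  "https://peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:9440/PrismGateway/services/rest/v2.0/cluster"
```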

 

As the secret is generated automatically, I must be missing something, or have something misconfigured that is preventing authentication.
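Since the secret is derived from the PE-PC registration, verifying that trust relationship from the PCVM may be worth a try; the nuclei health check below is referenced in Nutanix support KBs and should flag a broken registration (treat the exact command as an assumption and check the KB for your AOS version):

```shell
# On the PCVM: check the health of all PE-PC remote connections.
nuclei remote_connection.health_check_all
```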

Any ideas?

Not sure what it was, but I ended up rebuilding the entire PE cluster and re-deploying, and the error is gone.

