Solved

Prism Central Deployment fails (Prometheus)

  • August 18, 2025
  • 1 reply
  • 177 views

Hi there,

Trying to deploy Prism Central on Nutanix Community Edition, I'm seeing the same symptoms as this thread: Error during deploy Prism Central
 

Encountered Exception in post_deployment step: Failed to enable micro services infrastructure on PC: deploy msp:Error deploying addons: failed to deploy monitoring addon: failed to deploy and verify kube-prometheus: failed to verify kube-prometheus: Operation timed out: failed to verify kube-prometheus: expecting 1 available replica of k8s prometheus in ntnx-system namespace. Currently running: 0
 


I have tried multiple versions of PC with the same issue.

Looking inside the appliance, I can see that Prometheus is failing because it's waiting for its persistent volume to be provisioned.

# Check the STS status

kubectl get sts -n ntnx-system prometheus-k8s

NAME             READY   AGE
prometheus-k8s   0/1     8h

# Find the Pod name

kubectl get po -n ntnx-system | grep prometheus-k8s

prometheus-k8s-0 0/2 Pending 0 8h

# Describe the pod (snipped)

kubectl describe po -n ntnx-system prometheus-k8s-0

Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 20m (x496 over 8h) default-scheduler 0/1 nodes are available: 1 pod has unbound immediate PersistentVolumeClaims.
nutanix@NTNX-10-10-200-31-A-PCVM:~$

# Check the PV Claims

kubectl get pvc -n ntnx-system

NAME                                 STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-k8s-db-prometheus-k8s-0   Pending                                      silver         8h

# Describe the PVC (snipped)

kubectl describe pvc -n ntnx-system prometheus-k8s-db-prometheus-k8s-0

Name: prometheus-k8s-db-prometheus-k8s-0
Namespace: ntnx-system
StorageClass: silver
Status: Pending
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 20m (x125 over 8h) csi.nutanix.com_ntnx-10-10-200-31-a-pcvm_e2ef0343-d205-47d8-beb2-cd277479e2c5 failed to provision volume with StorageClass "silver": rpc error: code = Internal desc = NutanixVolumes: failed to create volume: pvc-26a1d40b-85cb-4c37-888f-389c0b7f0c66, err: NutanixVolumes: failed to create REST client, error: Max retries done: Failed to authenticate GetVersions(): 401 Authorization Error - HTTP Response Code : 401
Normal Provisioning 4m3s (x129 over 8h) csi.nutanix.com_ntnx-10-10-200-31-a-pcvm_e2ef0343-d205-47d8-beb2-cd277479e2c5 External provisioner is provisioning volume for claim "ntnx-system/prometheus-k8s-db-prometheus-k8s-0"
Normal ExternalProvisioning 83s (x2102 over 8h) persistentvolume-controller waiting for a volume to be created, either by external provisioner "csi.nutanix.com" or manually created by system administrator

It looks like the CSI controller is having authentication issues:

Failed to authenticate GetVersions(): 401 Authorization Error - HTTP Response Code : 401
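
For anyone digging further, the CSI controller logs should show the same 401 in more detail. A rough way to find them on the PCVM (the exact pod and container names are my assumption and may differ between releases):

# Find the CSI controller pod (name varies by release)
kubectl get po -n ntnx-system | grep -i csi

# Tail its logs and search for the 401 (substitute the pod name found above)
kubectl logs -n ntnx-system <csi-controller-pod> --all-containers --tail=200 | grep -i "401"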


If I look at the storage class configuration, I can see the secret in use:

kubectl get sc silver -o yaml

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    ntnxClusterRef: 00063c6c-198a-2907-510e-00a09802c300
  creationTimestamp: "2025-08-18T14:33:47Z"
  name: silver
  resourceVersion: "943"
  uid: d57479a3-eaf5-4231-8263-ab590d4b6978
parameters:
  csi.storage.k8s.io/controller-expand-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/controller-expand-secret-namespace: ntnx-system
  csi.storage.k8s.io/controller-publish-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/controller-publish-secret-namespace: ntnx-system
  csi.storage.k8s.io/fstype: ""
  csi.storage.k8s.io/node-publish-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/node-publish-secret-namespace: ntnx-system
  csi.storage.k8s.io/provisioner-secret-name: ntnx-csi-secret-sbwhh
  csi.storage.k8s.io/provisioner-secret-namespace: ntnx-system
  dataServiceEndPoint: dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:3260
  description: ""
  flashMode: DISABLED
  isSegmentedIscsiNetwork: "false"
  storageContainer: NutanixManagementShare
  storageType: NutanixVolumes
provisioner: csi.nutanix.com
reclaimPolicy: Delete
volumeBindingMode: Immediate

Resolving the Data Services endpoint from within Prism Central correctly returns my Prism Element Data Services IP:

ping dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local

PING dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local (10.10.200.10) 56(84) bytes of data.
64 bytes from prism-data-services.<mydomain_was_here> (10.10.200.10): icmp_seq=1 ttl=64 time=0.921 ms
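
ICMP alone doesn't prove the iSCSI port is reachable, so as an extra check (assuming nc is available on the PCVM) the port from the StorageClass can be probed too:

# Check the iSCSI data services port from the StorageClass (assumes nc is installed)
nc -zv dsip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local 3260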

The secret contains a cert and an endpoint:

# Get the secret (snipped)

kubectl get secret -n ntnx-system ntnx-csi-secret-sbwhh -o yaml

apiVersion: v1
data:
  cert: <base64 cert was here>
  endpoint: cGVpcC4wMDA2M2M2Yy0xOThhLTI5MDctNTEwZS0wMGEwOTgwMmMzMDAucHJpc20tY2VudHJhbC5jbHVzdGVyLmxvY2FsOjk0NDA=
kind: Secret
metadata:
  creationTimestamp: "2025-08-18T14:33:47Z"
  name: ntnx-csi-secret-sbwhh
  namespace: ntnx-system
  resourceVersion: "942"
  uid: 45701a4c-8052-46d7-af3c-c69436fdd392
type: Opaque
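
The cert field can be inspected as well; assuming it is a PEM-encoded X.509 certificate (which I haven't confirmed), openssl can show its subject and validity:

# Decode the cert field and show subject/expiry (assumes it is a PEM X.509 cert)
kubectl get secret -n ntnx-system ntnx-csi-secret-sbwhh -o jsonpath='{.data.cert}' | base64 -d | openssl x509 -noout -subject -dates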

The endpoint resolves to the local Prism Central IP:

# Decode the base64 endpoint

echo cGVpcC4wMDA2M2M2Yy0xOThhLTI5MDctNTEwZS0wMGEwOTgwMmMzMDAucHJpc20tY2VudHJhbC5jbHVzdGVyLmxvY2FsOjk0NDA= | base64 -d

peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:9440

# Resolve it

nslookup peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local
Server: 127.0.0.1
Address: 127.0.0.1#53

Name: peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local
Address: 10.10.200.20

Where is the authentication error coming from? Is it when the CSI controller accesses the local endpoint, or when it reaches the remote Data Services IP?
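
One rough way to narrow that down is to probe the endpoint from the secret over HTTPS and see what answers; this only confirms which service responds on 9440 and is not necessarily the call the CSI driver makes:

# Probe the endpoint from the secret (illustrative only, not the CSI driver's actual API call)
curl -k -s -o /dev/null -w '%{http_code}\n' https://peip.00063c6c-198a-2907-510e-00a09802c300.prism-central.cluster.local:9440/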

 

As the secret is generated automatically, I must be missing something or have something misconfigured that is preventing authentication.

Any ideas?

Best answer by MAHDTech

Not sure what it was, but I ended up rebuilding the entire PE cluster and re-deploying, and the error is gone.

This topic has been closed for replies.

1 reply

  • Author
  • Voyager
  • 1 reply
  • Answer
  • August 19, 2025

Not sure what it was, but I ended up rebuilding the entire PE cluster and re-deploying, and the error is gone.