For the last few days our X-Ray tests have appeared to run to completion but never actually finish. As a result, the additional tests do not get run.
The curie log indicates "exiting _cluster_results_loop", but the tests hang open at 0% and have to be cancelled.
Thanks for reporting this.
@Roberteo We will probably need to engage the X-Ray developers. In the meantime, can you tell us which version of X-Ray you are running, what sort of targets you are running against (vCenter or Prism), and which tests are having issues? In particular, are they customized or one of the standard X-Ray scenarios?
Our test lab currently has four (4) clusters. Two (2) are ESXi and two (2) are AHV. Our development cluster (ESXi) where we installed X-Ray is currently three (3) nodes. From there we target the other three (3) clusters of four (4) nodes and when we need to target the development cluster we have X-Ray installed on the other ESXi cluster.
We have currently run ~400 tests and are only now getting into the IPMI-related tests.
Initially we had issues that we have since smoothed over, and we were having fairly good success. Since then we VLAN-tagged the switch interfaces and started having issues that for the most part have been addressed (cross-rack traffic still needs work).
The three tests from last night that 'hung' were VDI Simulator: Power Worker (50), Task Worker (50), and Knowledge Worker (100). They were targeted at the two (2) AHV clusters. In the past we had successfully run the three (3) hour tests in two (2) hours. After ~15 hours running overnight, we were forced to cancel these tests this morning.
The test logs can be exported from the Actions dropdown for the hung tests, and the server logs are exported from the gear icon dropdown menu in the top-right of the X-Ray UI.
I'd like to reiterate, just to make sure I understand the issue correctly: is this the same X-Ray 3.3 VM that has been used to run many tests successfully, but on which tests have recently started to hang (specifically the VDI Simulator tests)?
Thank you very much!
Thank you for your reply.
I will forward you the logs in a few minutes (they are exported then deleted in the morning after review).
Yes, this is the same VM that successfully ran them before. We have switched over to the other X-Ray VM for now to continue testing...
Thanks for sending the logs. I see that one of the test threads is hanging, which is causing the thread join to block. The hung thread is responsible for updating the Prometheus configuration during the test. That thread also queries Prometheus to update the line plots.
This is enough information to identify a few places where timeouts need to be added to prevent the test from hanging altogether. If we can get closer to the root cause of what appears to be a performance issue, we may be able to make further fixes.
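To illustrate the kind of timeout described above, here is a minimal Python sketch of joining a worker thread with a time limit so that a hung thread cannot block teardown forever. It is not X-Ray or curie code: `join_with_timeout`, `hung_updater`, and the 0.2 second limit are hypothetical names and values chosen for the example, and the worker is only a stand-in for a thread that never returns on its own.

```python
import threading


def join_with_timeout(thread, timeout_secs):
    """Wait up to timeout_secs for the thread; return True if it finished."""
    thread.join(timeout=timeout_secs)
    # join() returns None whether or not the timeout expired, so the
    # thread's liveness has to be checked explicitly afterwards.
    return not thread.is_alive()


def hung_updater(stop_event):
    # Hypothetical stand-in for an updater thread that never returns by itself.
    stop_event.wait()


stop_event = threading.Event()
worker = threading.Thread(target=hung_updater, args=(stop_event,), daemon=True)
worker.start()

finished = join_with_timeout(worker, timeout_secs=0.2)
if not finished:
    print("updater thread did not finish in time; continuing teardown")

# Release the stand-in thread so the example exits cleanly.
stop_event.set()
worker.join()
```

An unbounded `join()` on the same worker would block indefinitely, which matches the symptom of a test that stays open until it is cancelled. The worker is marked as a daemon thread here so that a truly stuck thread would not keep the process alive.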
If you have a chance, would you please log in to the X-Ray VM with the failing tests and run a few commands? You can connect to the X-Ray VM via SSH with the username `nutanix` and password `nutanix/4u`. If you could please run the commands below and send the output, that would be very useful:
Is the X-Ray VM low on disk space?