
TOPIC: XCP-NG 3 Node Pool 1 Node fails quorum once a day.

XCP-NG 3 Node Pool 1 Node fails quorum once a day. 2 months 1 week ago #1638

We are currently experiencing an issue where only one of our nodes keeps being fenced by HA-Lizard. The log shows that the quorum check failed and the host was then fenced; however, 1 second later, when HA-Lizard checks quorum again, the node that was fenced passes the check just fine. This keeps causing VMs to be moved to the other 2 nodes in the pool. We have checked all of the settings on the affected host and can't find anything that would cause it not to respond during the quorum check. Does anyone know what might be causing this issue?

Thanks

Mike

XCP-NG 3 Node Pool 1 Node fails quorum once a day. 2 months 1 week ago #1639

Default HA parameters should tolerate health check failures for about 20 seconds. It's odd that you are seeing a 1 second failover. Have any of the default HA timers been changed?

Have you checked system level logs in case the interface goes down each time this happens?

Log snippets from the master and affected host would be helpful if you are able to post them.

XCP-NG 3 Node Pool 1 Node fails quorum once a day. 2 months 1 week ago #1640

Salvatore,

Thank you for your response. I have been looking at the logs but have not seen anything showing that an interface was down. I will look further on Monday and will try to add a screenshot of the logs then. Thanks

Mike

XCP-NG 3 Node Pool 1 Node fails quorum once a day. 2 months 5 days ago #1641

Salvatore,

Sorry for the delay; an emergency popped up that I have been working on. I have added the logs from the master and from the slave that is having issues. It does appear that the slave, xen3, is having a network issue, but the drops never show up anywhere except in the HA-Lizard logs. Let me know what you think and whether you need any more logs. Thanks

XCP-NG 3 Node Pool 1 Node fails quorum once a day. 2 months 5 days ago #1642

Thanks for the logs. There is definitely a network issue going on when this occurs, and it seems to be isolated to the host "xen3", which is unreachable. xen3 also cannot communicate with the master, as evidenced by the IP check failure and by its failure to update its configuration (which requires XAPI) just before the errors.

Have you checked xensource.log around the same timestamp to see if there are any network errors reported on xen3?
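One way to narrow that search is to pull only the lines logged within a minute or so of the fencing event. The following is a minimal Python sketch, not a tested tool. It assumes the log lines begin with a syslog-style timestamp such as "Mar  3 10:15:01" (your release may use a different format), and the keyword list is only a guess at what a network error might look like.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp prefix, e.g. "Mar  3 10:15:01"; adjust if your log differs.
TS_RE = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")
# Hypothetical keywords; tune them to whatever your hosts actually log.
KEYWORDS = ("link", "carrier", "timeout", "unreachable", "connection refused")


def lines_near(lines, event, window_s=60, year=None):
    """Yield keyword-matching lines stamped within window_s seconds of event."""
    year = year or event.year  # syslog-style stamps carry no year
    window = timedelta(seconds=window_s)
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines without a recognizable timestamp
        stamp = datetime.strptime(
            f"{year} {m.group(1)} {m.group(2)} {m.group(3)}",
            "%Y %b %d %H:%M:%S",
        )
        lowered = line.lower()
        if abs(stamp - event) <= window and any(k in lowered for k in KEYWORDS):
            yield line.rstrip("\n")


# Example use (path and event time are placeholders, not taken from this thread):
# with open("/var/log/xensource.log") as f:
#     for hit in lines_near(f, datetime(2024, 3, 3, 10, 15, 0)):
#         print(hit)
```

Running the same filter against the system log on xen3 and on the master for the same timestamp makes it easier to see whether both sides noticed the outage.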

Since the network issue appears to clear in less than about 20 seconds, you can give the HA process more time by setting xapi_count to 5, which would give you 30 additional seconds for the issue to clear and avoid disruption of your running VMs. This is not a solution, but it would give you time to sort out the network issue. From the CLI, run "ha-cfg set xapi_count 5". This can be run from any of the 3 hosts, and it will update all the hosts in the pool with the new value.
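The arithmetic behind those figures can be sketched as follows. This assumes the tolerance window is roughly xapi_count multiplied by a per-check delay of 10 seconds, and that the default xapi_count is 2. Both values are assumptions, chosen only because they reproduce the 20-second and 30-second figures quoted in this thread, so check the actual settings on your pool before relying on them.

```python
# Assumed values inferred from the figures in this thread; verify on your pool.
ASSUMED_DELAY_S = 10        # seconds between failed-check retries (assumption)
ASSUMED_DEFAULT_COUNT = 2   # default xapi_count (assumption)


def tolerance_seconds(xapi_count, delay_s=ASSUMED_DELAY_S):
    """Approximate time a host may keep failing checks before HA acts."""
    return xapi_count * delay_s


default_window = tolerance_seconds(ASSUMED_DEFAULT_COUNT)
new_window = tolerance_seconds(5)
print(f"default ~{default_window}s, xapi_count=5 ~{new_window}s, "
      f"extra {new_window - default_window}s")
```

Under those assumptions the window grows from about 20 seconds to about 50 seconds, which is where the 30 additional seconds come from.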

While you are at it, if you can, check switch logs in case a switch port is having an issue.