TOPIC:

Split-brain after network interruption 10 years 5 months ago #157

  • pielala
  • Posts: 14
Hello,

After shutting down the switch port of one of our two nodes, we ended up in a split-brain: both nodes were running the same VMs. I had to reset DRBD, among other things, to recover.

This is what "ha-cfg get" looks like:
www.fpaste.org/55118/74948138/raw/

I thought the logic was: if the other node can't be pinged, try to ping 8.8.8.8 (the heuristic IP); if that fails too, shut everything down. Am I right?


Split-brain after network interruption 10 years 5 months ago #158

Can you provide a few more details so that we can try to reproduce the scenario?

- Is there an external SAN, or is this using local storage with the iscsi-ha module?
- Which port was shut down?
- If iscsi-ha, was the DRBD link pulled? If so, was it a bonded direct link between hosts?
- Was the management interface shut down?

In a 2-node pool, there are the following scenarios:

- From the Master's perspective: if the Slave is not reachable and the heuristic IP is reachable, assume the Slave is down and start all of its VMs.

- From the Slave's perspective: if the Master is not reachable and the heuristic IP is reachable, assume the Master is down, become Master, and start VMs. If the Master is not reachable and the heuristic IP is not reachable, do nothing.
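
To make the scenarios above easier to reason about, here is a minimal Python sketch of the decision table. It is only an illustration: the names `Action` and `decide` are hypothetical and not taken from the actual HA scripts, and the Master case with an unreachable heuristic IP is not described above, so "do nothing" is assumed for it.

```python
from enum import Enum


class Action(Enum):
    NONE = "do nothing"
    START_SLAVE_VMS = "assume slave is down, start its VMs"
    PROMOTE_AND_START = "assume master is down, become master, start VMs"


def decide(role: str, peer_reachable: bool, heuristic_reachable: bool) -> Action:
    """Return what a node does, given what it can currently reach.

    role is "master" or "slave" (hypothetical labels for this sketch).
    """
    if role not in ("master", "slave"):
        raise ValueError(f"unknown role: {role!r}")

    # The peer answers: nothing to fail over.
    if peer_reachable:
        return Action.NONE

    if role == "master":
        # ASSUMPTION: the post does not say what the master does when the
        # heuristic IP is also unreachable; "do nothing" is assumed here.
        return Action.START_SLAVE_VMS if heuristic_reachable else Action.NONE

    # Slave: take over only if the heuristic IP proves our own network is up.
    return Action.PROMOTE_AND_START if heuristic_reachable else Action.NONE


if __name__ == "__main__":
    for role in ("master", "slave"):
        for heuristic in (True, False):
            print(role, "peer down, heuristic up:" if heuristic
                  else "peer down, heuristic down:",
                  decide(role, False, heuristic).value)
```

Note that in this table an isolated Slave keeps running its own VMs while a Master that can still reach the heuristic IP starts copies of them.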


Split-brain after network interruption 10 years 5 months ago #159

  • pielala
  • Posts: 14
Hi,

- There is no SAN involved, just iscsi-ha with local storage.
- There is only one link, not bonded, for both management and DRBD. We shut down the switch port of one of the servers (not sure whether it was the slave or the master at the time).

Regarding the Slave behaviour, "If the Master is not reachable and the heuristic IP is not reachable, do nothing": shouldn't that be changed to the following?
- If the master is unreachable and the heuristic IP is unreachable, then kill everything, because otherwise you will have a split-brain (same VMs running on both machines) once the link with the master is re-established.


Split-brain after network interruption 10 years 5 months ago #160

Thanks for the details. We will look at the logic and possibly update it to avert this situation.

FYI - a production environment should not share a single interface between DRBD and management traffic. The reference design we published uses a direct bonded link for DRBD so that there are never any switch ports in between. In that design, if the management network is lost, the DRBD link stays active and prevents both hosts from becoming Primary. This practically eliminates the possibility of split-brain.


Split-brain after network interruption 10 years 4 months ago #161

  • pielala
  • Posts: 14
Thanks.

This is a testing setup, but even in a production setup we would have the hypervisors in different cabinets, on different PDUs, and behind different edge routers for maximum redundancy. We might look at using the native XenServer HA based on a minimum of 3 hypervisors, or maybe just pull a direct cable, but this might be messy.


Split-brain after network interruption 10 years 4 months ago #162

We were able to replicate your scenario. As a result, a small bug was identified and fixed, and the logic was updated to avoid this situation when all communications are lost between hosts. Slaves will now hard reboot in this situation. When they return, their VMs will remain in the off state.
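
As a rough illustration of the updated Slave behaviour, here is a small Python sketch. It assumes that "all communications are lost" means the Slave can reach neither the Master nor the heuristic IP; the function name `slave_action` and the returned strings are hypothetical and are not the names used in the 1.7.6 code.

```python
def slave_action(master_reachable: bool, heuristic_reachable: bool) -> str:
    """Sketch of the Slave decision after the fix (names are illustrative)."""
    if master_reachable:
        # Normal operation: the pool master is still visible.
        return "none"
    if heuristic_reachable:
        # Our network is up but the master is gone: take over.
        return "promote_and_start_vms"
    # ASSUMPTION: neither the master nor the heuristic IP answers, which is
    # read here as "all communications lost". The slave fences itself with a
    # hard reboot, and its VMs stay off when it comes back.
    return "hard_reboot"


if __name__ == "__main__":
    for master in (True, False):
        for heuristic in (True, False):
            print(f"master={master!s:5} heuristic={heuristic!s:5} ->",
                  slave_action(master, heuristic))
```

The point of the reboot is that an isolated Slave stops running its VMs, so any copies started on the other host no longer collide with them when the link returns.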

We have a patch release ready, version 1.7.6. It will be tested over the next few days and released early in the week if testing is positive.
Thanks for the feedback and for reporting your test results.
The following user(s) said Thank You: pielala


Last edit: by Pulse Supply. Reason: resolved