Single node randomly going unresponsive

Single node randomly going unresponsive 10 years 1 week ago #717

Matt Low
Topic Author
Offline
Posts: 6

Hi
I'm fairly new to virtual machines. I have successfully installed HA-lizard using the noSAN installer script on a 2-node pool and have tested various HA features with varying levels of success. I started the work last Friday on two new Dell PowerEdge R230 servers.

The original master of the pool had a bit of a roundabout installation process as I figured out what was and wasn't possible with the Dell software RAID controller. I ended up going with mdadm to implement a RAID 10 array across our 4-bay servers, and used that device for replication storage.

All was working peachy except that 3 times now, the original master of the pool has suddenly gone unresponsive (twice while it was pool master and once while it was pool slave), and there are no logs from the time it goes unresponsive to show what is going on.

iDRAC reports the server is online, but I can't ping either the replication IP or the management IP from the newly promoted master. HA functions correctly, migrating the VMs to the new master.

My main question before I start further troubleshooting is whether it's possible to perform a fresh install of XS on the troubled node, join it to the original, and run the noSAN installer script on it with all of the originally used parameters. Will HA-lizard on the other side pick it up and start drbd syncing?

Thanks for the support and thanks for the software!


Single node randomly going unresponsive 10 years 1 week ago #718

Matt Low
Topic Author
Offline
Posts: 6

An example of the varying success: After the node went unresponsive last, I cold booted it through iDRAC and it was picked up by the pool again. I can't ping the shared replication IP on it. I can if I add a static route to the master.

EDIT: after running xe-toolstack-restart on the master, I was able to ping the shared IP again.

My installation seems buggered. I had installed XenServer 6.5 SP1 but no other hotfixes before installing HA-lizard.


Last edit: by Matt Low. Reason: more info

Single node randomly going unresponsive 10 years 1 week ago #720

Salvatore Costantino
Offline
Posts: 728

The installer is really designed for new installations only where there is no chance of destroying data. If you want to give it a try, you should be able to rebuild one of the nodes from scratch with a few small changes:

- Make sure to join the new server to the pool before running the installer; otherwise the installer will exit because the host is not part of a 2-node pool. If you have trouble joining it to the pool before running the installer, you will need to comment out the following lines in the installer before running it:
######################################
# Check required hosts in pool
######################################
POOL_HOSTS=`xe host-list --minimal`
if [ ${#POOL_HOSTS} -ne 73 ]
then
echo "Installer requires a pool with 2 hosts. Exiting.."
exit 1
fi

- MAKE SURE the new host is a slave; otherwise it will try to overwrite the contents of the peer server. To be safe, you should also comment out the following lines from the installer:
if [ $STATE = "master" ]
then
echo -n "Synchronizing storage with peer/slave host"
drbdadm -- --overwrite-data-of-peer primary iscsi1
fi

Once the installation is complete, you should manually start the replication of your data to the new server, making sure it is done in the correct direction:
drbdadm -- --overwrite-data-of-peer primary iscsi1
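
For anyone following these steps, the two installer checks quoted above can be approximated by hand before running the installer. This is only a rough sketch and is not part of the HA-Lizard installer: the helper names `count_hosts` and `pool_role` are made up for illustration, and it assumes the host's role can be read from `/etc/xensource/pool.conf` (which holds `master` or `slave:<master-ip>` on XenServer). The installer compares the length of the `xe host-list --minimal` output to 73 because two 36-character UUIDs joined by a comma are 73 characters long; counting the entries is an equivalent test that is a little easier to read.

```shell
#!/bin/bash
# Hypothetical pre-flight check; not taken from the HA-Lizard installer.

# Count entries in a comma-separated UUID list (the format printed by
# `xe host-list --minimal`).
count_hosts() {
    if [ -z "$1" ]; then
        echo 0
    else
        echo "$1" | tr ',' '\n' | grep -c .
    fi
}

# Print the pool role stored in a pool.conf-style file:
# "master", or "slave" when the file reads "slave:<master-ip>".
pool_role() {
    cut -d: -f1 "$1"
}

# Only run the live checks on a host that actually has the xe CLI.
if command -v xe >/dev/null 2>&1; then
    HOSTS=$(count_hosts "$(xe host-list --minimal)")
    ROLE=$(pool_role /etc/xensource/pool.conf)
    echo "hosts in pool: $HOSTS, this host is: $ROLE"
    if [ "$HOSTS" -ne 2 ] || [ "$ROLE" != "slave" ]; then
        echo "Do not run the installer on this host yet."
    fi
fi
```

Run it on the rebuilt host after joining it to the pool. If it reports anything other than 2 hosts and `slave`, stop and sort that out before going further.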



Single node randomly going unresponsive 10 years 1 week ago #721

Matt Low
Topic Author
Offline
Posts: 6

sc wrote: Once the installation is complete, you should manually start the replication of your data to the new server making sure it is done in the correct direction.
drbdadm -- --overwrite-data-of-peer primary iscsi1

Just to clarify, that should be run from the existing master once the slave is in the pool and has HA-lizard installed? I'm a little cautious, since that command sounds like the primary storage is being rewritten.

And after that, replication will happen automatically?

I'll give this a try, with backups on hand

Thanks for the fast response!


Single node randomly going unresponsive 10 years 1 week ago #722

Salvatore Costantino
Offline
Posts: 728

Correct, that should be done on the master. You may want to check to ensure that the DRBD resource is in the connected state before attempting to sync.
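
One way to do that check, as a minimal sketch: `drbdadm cstate` prints the connection state of a resource, and the sync should only be started when it reads `Connected`. The resource name `iscsi1` comes from the commands earlier in this thread; the `is_connected` helper is just an illustrative name, and the overwrite command is left commented out so that nothing is synced by accident.

```shell
#!/bin/bash
# Sketch: confirm the DRBD resource is connected before starting a sync.
RESOURCE="iscsi1"   # resource name used earlier in this thread

# Succeeds only when the given connection state string is "Connected".
is_connected() {
    [ "$1" = "Connected" ]
}

# Only query DRBD on a host where drbdadm is installed.
if command -v drbdadm >/dev/null 2>&1; then
    CSTATE=$(drbdadm cstate "$RESOURCE" 2>/dev/null)
    if is_connected "$CSTATE"; then
        echo "$RESOURCE is Connected; OK to start the sync from the master."
        # Run by hand, on the master only:
        # drbdadm -- --overwrite-data-of-peer primary "$RESOURCE"
    else
        echo "$RESOURCE is in state '${CSTATE:-unknown}'; do not sync yet."
    fi
fi
```

On DRBD 8.x, `cat /proc/drbd` shows the same connection state (the `cs:` field) along with sync progress once replication is running.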

Single node randomly going unresponsive 10 years 4 days ago #724

Matt Low
Topic Author
Offline
Posts: 6

Hi, I reinstalled XS on the problem server and rejoined it to the pool. The bonding interface and management interface are set with the same IPs as before, and it's automatically connected to the shared iSCSI storage of the master server. I haven't installed HA-lizard yet and thus drbd hasn't synced yet. I'm wondering if it's still safe to do so at this point.. thanks!
