Forum

TOPIC:

URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1386

  • Mauritz
  • Topic Author
  • Posts: 43
Twelve hours in and we've not been able to recover any of the lost VMs. With little to no sleep, I've narrowed the most logical cause down to a DRBD split brain, which I honestly cannot understand, as HA-Lizard was running, HA was enabled, and DRBD reported Connected with both Primary and Secondary UpToDate. The initially transferred VMs ran perfectly for an hour, so I had no indication that anything would go wrong at a later stage, but it did, in the most horrible fashion, and now our production website, Asterisk server, 2 nameservers, and our entire code repository are down.

I have noted that the majority of the VMs that failed did so on the master, and after a reboot they would start up on the slave. Moving a VM back to the master later showed that there was no VM storage (the VM would try to boot but fail to find a disk); on the slave the storage was corrupted to the point that the VM's filesystem could not boot, with no obvious fix.

I completely understand the risk involved in using third-party software, and more than ever understand that HAL is free of charge. I wish I had understood the various risks involved, as this is now affecting our production environment. I watched the YouTube video probably 10 times during installation (everything went perfectly) and read the documentation multiple times to make sure I understood the underlying technologies, but never in my wildest dreams would I have predicted that this was even a possibility, and on such a large scale: not just 1 VM, but practically 6 of the 10 VMs that were moved are now corrupted in some way or form. The downside I'm faced with now is that it's impossible for me to tell whether there is any hope of fixing the VMs, and paid support works out to $500, which is obviously outside our financial scope. We made a financial investment to set up our environment according to the design outlined in the installation guide, so even the idea of moving away from HA-Lizard now leaves us with hardware that is effectively pointless, as the only other option to achieve HA is dedicated file storage, and if we run the hosts standalone we are no better off than before we started.

I'm pleading for anyone to please assist me, as I am completely in the dark on this and not even sure whether I should keep trying to fix and recover or wait until someone responds. I know the forum is not a dedicated support channel, but it is the ONLY support channel apart from paying for support. I've been very active on the forum over the last couple of weeks and have really tried my best to ensure that everything works the HAL way, so I'm really hoping someone can assist urgently.


URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1387

  • Mauritz
  • Topic Author
  • Posts: 43
I decided to take a chance and run xfs_repair on one of the virtual machines with filesystem corruption. The good news is that it appears to have fixed the issue on that VM. I will do the same on the other VMs now, in the hope that it fixes those too, and will wait until I have better clarity on why this entire ordeal happened.
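For anyone hitting the same thing, this is roughly the idea (from a rescue environment, or a helper VM with the corrupted disk attached); the device name /dev/xvdb1 is just a placeholder and will differ per VM:

    # find the partition holding the damaged XFS filesystem
    lsblk
    # xfs_repair must never run on a mounted filesystem, so unmount it first
    umount /dev/xvdb1 2>/dev/null
    # dry run: report what would be fixed without writing anything
    xfs_repair -n /dev/xvdb1
    # actual repair
    xfs_repair /dev/xvdb1
    # mount and spot-check the data afterwards
    mount /dev/xvdb1 /mnt && ls /mnt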


URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1388

  • Mauritz
  • Topic Author
  • Posts: 43
After 24 hours I am happy to report that I have been able to restart all of the XFS VMs by running xfs_repair on each of them. I was still unable to start the VMs on the master server, with some appearing to have no storage. I then went ahead and turned on manual mode, rebooted the slave, made the slave the master, and rebooted the old master. After that, all of the VMs were able to start on both hosts. For now I am leaving the slave as primary/master.

I am also again able to migrate a VM from the pool back to the standalone host. The VM moves without any issues and no broken configuration is left behind.

This is troubling, as I am not sure what caused this to happen in the first place. Although I now know how to fix it, I fear there may have been data loss. I understand the issue to be something related to DRBD, but I have no idea what caused the original problem, as ha-cfg status displayed no issue (the master was master, DRBD was connected, and both Primary and Secondary were up to date).
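For reference, these are the sorts of checks I run to confirm replication health before and after a migration; the DRBD resource name and exact commands depend on the DRBD version HA-Lizard set up, so treat this as a sketch:

    # replication state; look for cs:Connected and ds:UpToDate/UpToDate
    cat /proc/drbd
    # per-resource checks on DRBD 8.4 (resource name "r0" is a placeholder)
    drbdadm cstate r0
    drbdadm dstate r0
    drbdadm role r0
    # HA-Lizard's own view of the pool and replication
    ha-cfg status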


URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1389

  • Mauritz
  • Topic Author
  • Posts: 43
As Murphy would have it (and thankfully so), I am able to replicate what happened. I created a test VM on the old standalone server and migrated it to the pool.

Environment:
Everything is in order: HAL is running and HA is enabled. I migrated a test VM created on the standalone server to the HAL pool. The VM booted up on the slave server (which was the master until yesterday) and within seconds booted up on the master (the previous slave).


On the login screen the console shows errors like:

    blk_update_request: I/O error, dev xvda, sector ...
    Buffer I/O error on dev dm-0, logical block 95556, lost async page write

This repeats with different blocks until a final:

    metadata I/O error: block ... error 5 numblks 65
    Log I/O Error Detected. Shutting down filesystem
    Please umount the filesystem and rectify the problems.


I had to retype the above (so sorry for any inconsistencies), but essentially this is the same thing that happened on the other VMs.

After forcefully rebooting the VM (it will not reboot the normal way), the screen reports that there are no bootable devices. My only option now is to boot the VM on the slave, where it comes up in maintenance mode.

It's also important to note that while the above is happening, no other VMs are impacted.

Resolution:
This time around the root filesystem was mounted, but /dev/xvda1 could not be mounted to /boot, and journalctl -xb after the emergency login showed a chain of other errors. I took a chance and tried to mount /dev/xvda1 manually, which told me the structure needs cleaning, and I then ran xfs_repair -L /dev/xvda1, which fixed the issue!
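In case it helps someone later, the rough sequence from the emergency shell was along these lines (partition names are from my VM and will differ):

    # see what failed during boot
    journalctl -xb
    # try the mount by hand; "Structure needs cleaning" points at XFS corruption
    mount /dev/xvda1 /boot
    # try an ordinary repair first
    xfs_repair /dev/xvda1
    # only if it refuses because of a dirty log: -L zeroes the log, which can
    # throw away the most recent metadata changes, so it is very much a last resort
    xfs_repair -L /dev/xvda1
    # reboot and verify
    reboot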

My best guess at the above is:

1. The VM is migrated to the HAL pool and some corruption of its disk takes place (worth mentioning that the current standalone server is an HP whereas the new servers are Dells).
2. The VM gets rebooted because of this and starts on the master server.
3. The master server (this is the part I don't quite understand) then does not have access to the disk to boot from. Maybe the disk is still locked on the other server, I don't know.
4. Booting the VM on the other pool member (the one with a working virtual disk) then requires a filesystem repair (because of whatever originally crashed the VM); you need to work out which partition cannot mount, and hopefully the xfs_repair steps above resolve the filesystem issue.
5. Once the above is done, the VM can be migrated to either of the 2 pool members without issues.

This has been quite a learning curve (one I wish never to have to undergo again). I hope this gives better insight into my rambling over the last 24 hours, and I hope you can give me some insight into the above for future reference. We still have around 15 VMs to migrate over, so I would like to leave the surprises for birthday parties!


URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1390

  • Salvatore Costantino
  • Posts: 722
Looks like quite an ordeal. Exporting backups of your VMs would have been a good idea before starting the migration process. Regardless, it does not appear that the data corruption is related to HAL or DRBD, given that you experienced the same issue moving from a HAL environment to a standalone host. There is a similar issue reported here that could be related:

bugs.xenserver.org/browse/XSO-365?page=c...abpanel&showAll=true
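Something along these lines would have captured a restorable copy of each VM before migrating (the VM name and target path below are placeholders):

    # find the VM's uuid
    xe vm-list name-label="my-production-vm" params=uuid
    # shut the VM down (or snapshot it) so the export is consistent
    xe vm-shutdown uuid=<vm-uuid>
    # export to an .xva file that can later be restored with xe vm-import
    xe vm-export vm=<vm-uuid> filename=/mnt/backup/my-production-vm.xva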


URGENT: VMs corrupted after being transferred to HAL 6 years 9 months ago #1391

  • Mauritz
  • Topic Author
  • Posts: 43
Haha, you have no idea the mixed emotions I've experienced over the last 24 hours. I was so excited to get HAL set up and working, and then to see VMs crash like that, production ones as well! Hard times! And realising the mistake of not making backups was probably the final blow. Anyway...

I've now been able to reproduce this a couple of times and have narrowed the behaviour down to moving a running VM to a pool member. If I shut the VM down before transferring it, I have not had any issues; it is purely a running VM (live migration) that triggers it. In this case there were 2 separate environments (HP to Dell), so I assume there is something there, but I honestly don't know the stack well enough to know whether the manufacturer or configuration could have such an effect.
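After each test migration I plan to run a quick sanity check inside the guest before calling it good; a minimal sketch, assuming a standard Linux guest:

    # look for block-layer or XFS errors since boot
    dmesg | grep -iE 'blk_update_request|Buffer I/O error|XFS'
    # confirm the root filesystem is still mounted read-write
    grep ' / ' /proc/mounts
    # push some writes all the way to the virtual disk and make sure nothing errors
    dd if=/dev/zero of=/var/tmp/writetest bs=1M count=100 oflag=direct && sync && rm /var/tmp/writetest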

I'm going to run this about 20 times to make sure I have it sorted before I move further VMs to the pool. I'll keep this post updated; you never know, it may help someone else in the future.
