
TOPIC: DRBD / Multipathing issues on restart

DRBD / Multipathing issues on restart 4 months 1 week ago #1405

It truly feels like the worst of experiences is following me to our datacenter.

We installed a new switch last night at the datacenter. I put both hosts into manual mode with HA disabled, shut down all of the VMs, shut down the slave and then shut down the master.

Afterwards both came up perfectly, iSCSI connected, and everything was what I would assume to be a perfect system. I booted up a couple of VMs and, after they worked, I attempted (and I believe this may have been the issue) to boot up about 10 VMs at once by selecting them in XenServer and clicking Start...

Important to note that at this stage iSCSI was still in manual mode (with HA disabled). I could see that I had put a lot more pressure on the two hosts than expected: the VMs that were already running stopped responding, some even showing disk errors, and soon the iSCSI path was broken, starting with a multipathing failure. I could not shut down any VMs nor the hosts. The hosts themselves were fine, I could get to the console, but the slave indicated it had lost access to the iSCSI SR, which I believe was the bulk of the issue.
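(For anyone hitting something similar: before starting any VMs it's worth a quick read-only sanity check that every path is live. This is just a sketch assuming the stock device-mapper-multipath tooling on the hosts; `check_paths` is a hypothetical helper of mine, not a XenServer command.)

```shell
# Hypothetical helper: grep `multipath -ll` output for failed/faulty paths.
# Prints the topology and returns non-zero if any path looks degraded.
# Assumes device-mapper-multipath is installed on the host.
check_paths() {
    local out
    out=$(multipath -ll 2>/dev/null)
    printf '%s\n' "$out"
    if printf '%s\n' "$out" | grep -Eq 'failed|faulty'; then
        echo "WARNING: degraded multipath path detected" >&2
        return 1
    fi
}
```

Run it on each host after boot and only start VMs once it comes back clean.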

I was unable to shut down any of the hosts through XenServer, as it appeared both hosts had lost their connection to the iSCSI SR, and I was eventually forced into doing a shutdown from the console. The master host shut down fine, but the slave reported tapdisk errors (it was forcefully shut down due to the failure).
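(Side note for diagnosis: tapdisk problems usually leave a trail in the storage-manager log, which on XenServer normally lives at /var/log/SMlog. A small sketch; `scan_smlog` is a hypothetical helper, and the log path may differ per release.)

```shell
# Hypothetical helper: show the last few tapdisk/error lines from the
# storage-manager log (default /var/log/SMlog; pass another path to override).
scan_smlog() {
    grep -iE 'tapdisk|error' "${1:-/var/log/SMlog}" | tail -n 20
}
```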

The second time around I booted the servers until both connected to the SR. I then switched off manual mode (I figured the original issue may have been booting the VMs on the wrong server; please correct me if I am wrong). I was able to start up the VMs individually this time around, but perhaps still too quickly after one another, and the same thing happened. The ones that booted up worked for about 5 minutes, then multipathing stopped and all of the VMs stopped responding with fatal kernel errors. I suspect this is because the VMs no longer had access to the SR. I had to apply the same resolution (forcefully shutting down the hosts, as I could not resolve them manually).

This morning I booted up the master, then the slave, got them loaded and booted up the VMs one after the other, but gave about 5 minutes in between... All but one VM has booted so far. Of the lot, 2 VMs (both Windows DCs) have corruption: one does not recognise an operating system and the other is stuck in an automatic repair loop. Fortunately we have backups, but it's not the most ideal situation to be in.
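(In case it helps anyone, the "5 minutes in between" step can be scripted so nobody is tempted to multi-select and start everything at once again. A sketch assuming the xe CLI on the pool master; `staggered_start` and the delay value are my own, hypothetical names.)

```shell
# Hypothetical helper: start each VM in the list, waiting DELAY seconds
# between starts so the iSCSI SR isn't hit by ten boot storms at once.
staggered_start() {
    local delay="$1"
    shift
    for uuid in "$@"; do
        echo "Starting VM $uuid"
        xe vm-start uuid="$uuid"
        sleep "$delay"
    done
}

# Example (run on the pool master, 300 s between starts):
# staggered_start 300 $(xe vm-list is-control-domain=false params=uuid --minimal | tr ',' ' ')
```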

My actual question here is: is the behaviour seen above on par with what would happen if the hosts / SR are under heavy load (like 10 VMs booting up at the same time), or is there another critical step I missed? I can confirm that from what I can see there was no data corruption (although it looked like it on screen). It really just looks like an underlying technology failed (multipathing looks like the main culprit). I want to prepare myself for such instances (and obviously learn from it). If this isn't the case I need to investigate further, but I would like your opinion on the matter.
Last Edit: 4 months 1 week ago by Mauritz.