TOPIC: iscsi transition to Primary failed: exit code 17

iscsi transition to Primary failed: exit code 17 3 weeks 6 days ago #1433

Hi,

We have a simple two-node cluster; each node has two NICs and two HDDs.
One HDD holds XenServer itself, and the second holds the DRBD-backed iSCSI storage, which is replicated over the second NICs (DRBD bond). These are pretty much the default values from the ha-lizard installer.

XenServer 7.0
ha-lizard Version: 2.1.3
iSCSI-HA Version IHA_2.1.4_29881

Both iscsi states were UpToDate and HA was enabled, 5 VMs running fine.

The problem occurred when I tested the HA functionality by killing the master with this command:
echo b > /proc/sysrq-trigger

Our custom fencing script worked perfectly: it immediately rebooted the failed host into a rescue PXE image and wrote its logs.

The former slave transitioned into the master but wasn't able to start the VMs because the iscsi storage failed to start.
iscsi-cfg status showed the following:
Nov 18 09:11:19 xenhacl02h02 iscsi-ha:  Checking if this host is a Pool Master or Slave
Nov 18 09:11:19 xenhacl02h02 iscsi-ha:  This host's pool status = master
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 service_execute: Execute [ status ] on [ iscsi-ha ]
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 service_execute: System V mode detected
Nov 18 09:11:19 xenhacl02h02 iscsi-ha:  auto_plug_pbd: Found LVMoISCSI SR List: 93413ddf-30d7-9189-fa2f-4cd83227e82d
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 service_execute: [  OK  ]#015iscsi-ha running: 3333
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 service_execute: Returning exit status [ 0 ]
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 DRBD Running on this host: version: 8.4.3 (api:1/proto:86-101) srcversion: FB3AC7056350AC64629E395 1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r----- ns:0 nr:255914536 dw:255914536 dr:0 al:0 bm:271 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1992704
Nov 18 09:11:19 xenhacl02h02 iscsi-ha-NOTICE-/etc/iscsi-ha/init/iscsi-ha.mon: Scanning for Volume Group -> iscsi-sr: 93413ddf-30d7-9189-fa2f-4cd83227e82d
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 check_drbd_resource_state: DRBD Resource: iscsi1 in Secondary mode, expected in Primary mode
Nov 18 09:11:19 xenhacl02h02 iscsi-ha-ERROR-/etc/iscsi-ha/init/iscsi-ha.mon: 1: State change failed: (-2) Need access to UpToDate data
Nov 18 09:11:19 xenhacl02h02 iscsi-ha-ERROR-/etc/iscsi-ha/init/iscsi-ha.mon: Command 'drbdsetup-84 primary 1' terminated with exit code 17
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 DRBD Resource: iscsi1 failed transition to Primary
Nov 18 09:11:19 xenhacl02h02 iscsi-ha: 28407 Aborting promote to primary

So when the former master failed, the slave's iSCSI storage went from UpToDate to Inconsistent.
The transition to primary then terminated with exit code 17.
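
For anyone who wants to check the same fields before a promotion attempt, here is a minimal sketch that pulls the connection state (cs), roles (ro) and disk states (ds) out of a status line like the one logged above. The helper name drbd_field is made up for illustration and is not part of iscsi-ha or DRBD; the sample line is shortened from the log.

```shell
# Hypothetical helper: extract one field (cs, ro or ds) from a
# /proc/drbd-style status line. $1 = field name, $2 = status line.
drbd_field() {
    # Put each space-separated token on its own line, then keep the
    # first token that starts with "<field>:" and strip that prefix.
    printf '%s\n' "$2" | tr ' ' '\n' | sed -n "s/^$1://p" | head -n 1
}

# Sample line, shortened from the log in this post.
line='1: cs:WFConnection ro:Secondary/Unknown ds:Inconsistent/DUnknown C r-----'

ds=$(drbd_field ds "$line")
local_ds=${ds%%/*}   # part before the slash is the local disk state

if [ "$local_ds" != "UpToDate" ]; then
    echo "local disk state is $local_ds, so a promote would likely be refused"
fi
```

On a live host the line would come from /proc/drbd instead of a fixed string.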

I was able to bring the iSCSI storage back up with the following commands:
# on the new master:
drbdadm -- --overwrite-data-of-peer primary iscsi1
# on the failed host after recovery:
drbdadm -- --discard-my-data connect all
After that, DRBD synced for a few minutes and then was UpToDate/UpToDate again.
All VMs were running again and I enabled HA.
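
Because the two commands are destructive in opposite directions, it may help to keep them behind a small wrapper that only prints the right one for each node. This is just a sketch: the function name recovery_command and the labels "survivor" and "recovered" are invented for illustration, and it names the resource explicitly where I used "all" above.

```shell
# Hypothetical helper: print (do not run) the recovery command for a node.
# $1 = "survivor"  (node whose data is kept) or
#      "recovered" (node whose data is discarded)
# $2 = DRBD resource name
recovery_command() {
    case "$1" in
        survivor)  echo "drbdadm -- --overwrite-data-of-peer primary $2" ;;
        recovered) echo "drbdadm -- --discard-my-data connect $2" ;;
        *)         echo "usage: recovery_command survivor|recovered RESOURCE" >&2
                   return 1 ;;
    esac
}

# Show what would be run on the new master for resource iscsi1.
recovery_command survivor iscsi1
```

Printing instead of executing leaves the decision about which host's data survives with the operator.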

My question is: how can I prevent the storage from going into the Inconsistent state next time?
What could have caused the inconsistency and the exit code 17?

Kind Regards
Last Edit: 3 weeks 6 days ago by Robert Schuh.

iscsi transition to Primary failed: exit code 17 3 weeks 5 days ago #1435

This is purely DRBD. It will not allow promoting a host if it thinks the peer's data is more up to date. I can see this happening if the failed host rebooted very quickly, became the slave, yet had more up to date data. Without having the timing of events at hand it is hard to tell for sure.

Regardless, DRBD will do this from time to time, and it does so to preserve data integrity. We have also intentionally decided not to put any recovery logic in place to deal with this, since the user should really decide which host's data should survive such an event. It would be good to know whether your crashed host rebooted in less than a minute or so, since that is what I am speculating is the root cause, and it could be handled quite easily with a patch.
The following user(s) said Thank You: Robert Schuh

iscsi transition to Primary failed: exit code 17 3 weeks 4 days ago #1438

Thanks for your quick reply.

The former master rebooted immediately into a rescue Linux system where the DRBD software is not installed. The second NIC, which is used for DRBD sync, is not configured in this rescue system.
But the mgmt-interface is configured and reachable via ping.
Is that enough for the new Master to think the old Master is up?

I think DRBD uses the 10.10.10.x addresses on the second NIC to see whether the primary is up.

I kept the rescue system running for about 10 minutes before I rebooted it into Xen again.

Letting the user decide which host's data should be kept makes sense.

I'll run another test on Friday to see if the problem occurs again and let you know about the outcome.